Hey folks, in this episode of Code Club, I'm going to share with you how I would use the tidyverse and ggplot2 to generate what's called a biplot. A biplot is an ordination of ecological dissimilarity values, overlaid with correlation values describing which features of those communities, which taxa, are responsible for the separation that we see in the ordination. Now, you can build a biplot in vegan. My experience, however, is that the appearance of that ordination isn't so great, and by the time I start tweaking things, I'd just as soon do it from scratch using ggplot2 and the tidyverse, right? So that's what we're going to do today. I think another benefit of doing it outside of vegan is that we get to play with these tools a little bit more and get a little more proficient with them.

I've created a bit of an agenda for today's episode. We're going to use the tidyverse, and we're going to use vegan to calculate those ecological dissimilarity values. I'm setting a seed because we'll probably do some random number generation along the way. And I've got a bit of an outline of the different steps I want to work through with you today. Of course, plans are great until you're right in the middle of things, and then plans might change. But I think this is where we'll start, and we'll see how far we get today. So I'm going to go ahead and load these libraries and set the seed.

We'll get going by reading in our shared data frame with read_tsv("data/mice.shared"). This is a data frame that has the samples in the rows, and the columns are the different OTUs. One thing I noticed is that it added a column at the end; I think that's because the shared file has an extra tab at the very end. To help clean this up, we can do a select on group and starts_with("Otu") — and that needs to be in parentheses, of course. So now we've got group and all of our OTUs.

We want to filter this down to only the samples that come from days 0 through 9 and 141 through 150, so I'll create a vector for that: days_wanted <- c(0:9, 141:150). Then I need to create a days column. To do that, we'll do a mutate on days, which will equal str_replace applied to that group column. The pattern we want to look for is everything up to and including the D, so ".*D" will match the "F3D" or the "M12D" or whatever we have, and we'll replace it with nothing, leaving just the number. Around this, I want to wrap as.numeric, which converts the character into a double. Then I'll pipe this to another select, where we put days first and then everything else. And sure enough, we have days as the first column, so we're in good shape.

Then we can do a filter where we say days %in% days_wanted. This returns TRUE if the value in the days column is found in the days_wanted vector. If we look at the result, that first column now only has days 0 through 9 and 141 through 150. Previously, back up here, we had other days like 11, 12, 13, and so forth, right? So this really does filter for just the days we're interested in. I'm actually going to move my select down here to now get group and everything that starts with Otu.
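Here's a rough sketch of that first chunk of the pipeline, saved under the shared_tbl name we settle on below. The column name group and the "Otu" prefix are assumptions about the shared file, and the seed value is a placeholder (the episode sets one, but the value isn't shown here), so check these against your own data:

```r
library(tidyverse)
library(vegan)

set.seed(1234)  # placeholder; any fixed seed works for reproducibility

days_wanted <- c(0:9, 141:150)

shared_tbl <- read_tsv("data/mice.shared") %>%
  select(group, starts_with("Otu")) %>%     # drops the empty column from the trailing tab
  mutate(days = as.numeric(str_replace(group, ".*D", ""))) %>%  # e.g. "F3D141" -> 141
  select(days, everything()) %>%            # check that days landed up front
  filter(days %in% days_wanted) %>%
  select(group, starts_with("Otu"))         # drop days again once we've filtered
```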
Moving the select down also drops that days column, which I'm not really interested in keeping on its own. So now we should be in good shape with just that group column and all the OTUs that follow. Somebody commented on a previous episode, where I've been doing these types of manipulations, that I don't have to convert this to a data frame and then set the row names; there's actually a function built in, column_to_rownames, where I can say that the values in group are what I want as the row names. Sure enough, this converts to a data frame and makes the group values the row names. So what I'm going to do is take this part of the pipeline and call it shared_tbl, and then take shared_tbl, pipe it to column_to_rownames, and call that shared_df. That way I have my shared tibble as well as my shared data frame to use as input for generating my distance matrix with avgdist on shared_df.

avgdist needs a value for sampling, and I think the value we want is 1828, but I don't totally remember. So we'll take shared_tbl, do a pivot_longer on everything but the group column, then a group_by on group, and then count the number of sequences with summarize(n_seqs = sum(value)). Let's do an ascending sort with arrange(n_seqs), and then print(n = 15). And yeah, 1828 was the value we saw before that we wanted to use. So I'll come down here and set sample = 1828, and I'll call the output of the distance calculation mice_dist. We'll run that, and then we'll have our distance matrix. It tells us that about 12 samples were removed because they had fewer than 1828 sequences.

Great, we now have our mice_dist. To generate the ordination, we'll do metaMDS on mice_dist. As we've seen before, when we run this we might not get convergence, and if that happens we need to set a new seed. So let's do set.seed, and I'll try 1... let's try 2. It converges if we use 2 as our seed. I'm going to call the output nmds, so I'll rerun it and save it to nmds. If we look at nmds, we see the output about how the solution converged. We can also get the scores, which are the positions of our samples in the NMDS. If we do scores on nmds, we get two columns, NMDS1 and NMDS2, giving the positions of all of our samples on the x and y axes; there are quite a few here. I'm going to convert this to a tibble with as_tibble, saying rownames = "group" — and I misspelled group, of course. So now we have a tibble with the positions from our ordination.

We can then pipe this into ggplot with aes(x = NMDS1, y = NMDS2), add geom_point, and what we see is a cloud of points. As we've discussed before, we have the early and the late samples. So I'll come back here, and after the as_tibble I'll bring down that code we had for creating the day from the group name, right? Let's just double check that these three lines work. That works; we've got the days column there. Then I can add early as a new column: days < 10.
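Pulling together the distance and ordination steps just described, a sketch might look like the following. The sample = 1828 depth and the set.seed(2) rerun come from the episode; avgdist averages Bray-Curtis distances over repeated subsamplings by default:

```r
# data frame version with group as row names, for vegan
shared_df <- shared_tbl %>%
  column_to_rownames("group")

# double-check the smallest library sizes to pick the subsampling depth
shared_tbl %>%
  pivot_longer(-group) %>%
  group_by(group) %>%
  summarize(n_seqs = sum(value)) %>%
  arrange(n_seqs) %>%
  print(n = 15)

# average Bray-Curtis distances over repeated subsamples of 1828 sequences
mice_dist <- avgdist(shared_df, sample = 1828)

set.seed(2)  # reseed and rerun until metaMDS converges
nmds <- metaMDS(mice_dist)

# position of each sample on the two NMDS axes, as a tibble
nmds_positions <- scores(nmds) %>%
  as_tibble(rownames = "group")
```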
So if days is less than 10, early will be TRUE; otherwise it'll be FALSE. Then I can add color = early, and now I've got my two clouds of points: the early points are teal, and the late points are red. Okay.

So now what we want to know is which OTUs, which taxa, are responsible for the position of each sample on this ordination. To do that, we can go back to the shared data we read in, that shared_tbl, and try to correlate each OTU column with the positions on the x and y axes from our NMDS ordination. The first thing we need to do is subsample shared_tbl so that we get 1828 sequences from each of our samples. We're only going to do this once, so it's not true rarefaction. If you're worried about getting lucky or unlucky with which sequences we pull, you could always run this multiple times and see if the results change; I find they don't really change a whole lot.

So again, we'll do pivot_longer on everything but group, then a group_by on group, and then a mutate for total = sum(value). So that's the total number of sequences in each sample, right? Now I want to filter for total > 1828, which keeps the samples that have more than 1828 sequences. And now, from each sample, I can grab 1828 sequences. First I want to unroll, or uncount, my samples, which we can do with uncount(value). Then, to get 1828 sequences from each sample, we can do slice_sample(n = 1828). And so now we have 1828 rows from each of our 226 different groups. We can then count these to get the number of times we saw each OTU in each group by doing count(group, name), which also gets rid of that total column. Now we've got a subsampled data frame with our group, our name (the OTU), and n, the number of observations. I'll ungroup the data, and now we have a tidy version of our data frame. Let's call this subsample_shared_tbl.

Great. With that saved, we're ready to look at the correlations between the frequencies in our subsampled data frame and the positions on the x and y axes. That's going to need the scores from the NMDS, so I'll do nmds_positions <- scores(nmds), and then I can use nmds_positions both as the input to the pipeline that creates the figure and as what I'm calculating the correlations against. Okay, so we have subsample_shared_tbl and nmds_positions, and now we want to put them together and see how they correlate with each other. I'm going to join those two data frames with an inner_join of subsample_shared_tbl and nmds_positions. I think they have a column in common... nope, they don't join. There's group, but this is not a tibble, right? So what did I do? Let's come back up to nmds_positions, and let's bring that as_tibble part down and feed it in there. Now nmds_positions is a tibble, and when we come back down here, everything joins together nicely.

Great. So now I have my group, my OTUs, the counts, and the positions on the x and y axes. And I want to know, for each OTU, how does n correlate with NMDS1, and how does n correlate with NMDS2?
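Here's roughly what those steps look like in code — the colored ordination, the one-off subsampling to 1828 sequences per sample, and the join to the NMDS positions. The variable names follow the episode; the rest is a sketch:

```r
# ordination colored by early (days 0-9) vs. late (days 141-150)
nmds_positions %>%
  mutate(days = as.numeric(str_replace(group, ".*D", "")),
         early = days < 10) %>%
  ggplot(aes(x = NMDS1, y = NMDS2, color = early)) +
  geom_point()

# one round of subsampling to 1828 sequences per sample
subsample_shared_tbl <- shared_tbl %>%
  pivot_longer(-group) %>%
  group_by(group) %>%
  mutate(total = sum(value)) %>%
  filter(total > 1828) %>%
  uncount(value) %>%           # unroll to one row per sequence
  slice_sample(n = 1828) %>%   # draw 1828 sequences within each group
  count(group, name) %>%       # recount each OTU; note zero counts get dropped here
  ungroup()

# pair each OTU count with that sample's position on the two axes
nmds_shared <- inner_join(subsample_shared_tbl, nmds_positions, by = "group")
```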
Because then I can draw a vector from (0, 0), the middle of the ordination, out to that correlation value to indicate the strength of association between each OTU and the positions of the samples. Okay, so we can take this join and call it nmds_shared.

Now I want to calculate, for each OTU, the correlation between the value column and NMDS1. We'll take nmds_shared and pipe it to nest. I'm using nest because I'm going to do a nest, a mutate with map, and then an unnest, since the cor.test function I'm going to use outputs multiple values. If I were just interested in the correlation value — which maybe I should be, I don't know — then I could perhaps get away with a group_by and summarize, because cor would only return one value. So we'll nest and say data = -name, and that way we have a separate tibble for each of our OTUs.

When I look at the output of nesting the data, I notice that my nested data frames all have different numbers of rows; they should all have 226 rows. This variation reminds me that back up here, when I was generating that subsampled shared data frame, I ran count on group and name, and count only returns the combinations where there were observations — it doesn't include zeros. To get those zeros back, I'm going to make the data wider, fill in zeros where there's missing data, and then pivot longer again. To show you how, I'll do a pivot_wider with names_from = name, values_from = n, and values_fill = 0, and then feed that back into a pivot_longer on everything but group. If I look at subsample_shared_tbl now, it has all of that. Let's try the join and the nest again with these two rows added, and now all of my nested data frames have 226 rows in them.

Now we can pipe this into a mutate. I'm going to call the new column cor_x, and we'll use a map function to apply cor.test to each of these separate data frames — we've been doing a lot of this recently, so hopefully it's sinking in a little bit. We'll map over the data column, using the tilde shorthand for the anonymous function that calls cor.test. To cor.test we give .x, which stands in for each data frame: so .x$value, comma, .x$NMDS1, and method = "spearman". We'll then pipe that into tidy, which comes from the broom package, so we'll want to make sure we have that installed. When we run this, I see there are more than 50 warnings. Let's look at warnings(); it's complaining, "problem while computing cor_x... cannot compute exact p-value with ties". If that warning annoys us, we can always come in here and add exact = FALSE. I'm going to put this across a couple of lines so it's easier to see without things scrolling off the side of the screen. So that's good. I'll pipe this to a select of name and cor_x, and then unnest(cor_x). Yep, now I've got my name, my estimate, and the p-value — those are really the three columns I'm most interested in. So I'm actually going to grab this select and move it down here to get name, estimate, and p.value, right?
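Putting that together — the zero-fill fix plus the nested cor.test — a sketch of the x-axis correlations might look like this. One detail worth flagging: after the pivot_wider/pivot_longer round trip, the count column is named value rather than n, which is why the map uses .x$value:

```r
library(broom)

# add the zero counts back in: widen, fill with 0, then lengthen again
subsample_shared_tbl <- subsample_shared_tbl %>%
  pivot_wider(names_from = name, values_from = n, values_fill = 0) %>%
  pivot_longer(-group)

nmds_shared <- inner_join(subsample_shared_tbl, nmds_positions, by = "group")

# Spearman correlation between each OTU's counts and the NMDS1 positions
cor_x <- nmds_shared %>%
  nest(data = -name) %>%   # one nested tibble per OTU, each with 226 rows
  mutate(cor_x = map(data,
                     ~ cor.test(.x$value, .x$NMDS1,
                                method = "spearman",
                                exact = FALSE) %>%  # silences the ties warning
                       tidy())) %>%
  unnest(cor_x) %>%
  select(name, estimate, p.value)
```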
And I'll go ahead and call this cor_x. If I look at cor_x, yeah, I get those three columns — great. Now I'm going to copy cor_x down and change it to be cor_y, so we get the same correlations but for the y axis, right? So cor_y there, cor_y here, and here I want NMDS2. If we look at cor_y, we again get those same columns.

Now I want to bring cor_x and cor_y together. I'll do an inner_join of cor_x and cor_y with by = "name". I need that by = "name", otherwise it'll try to join on all three columns, since the two data frames have all three column names in common. And now I get name, estimate.x, p.value.x, estimate.y, and p.value.y; inner_join tacks on those .x and .y suffixes because estimate and p.value are shared between the two data frames. That's helpful, because those suffixes happen to correspond with our x and y axes. Wonderful — we now have this inner joined, and I'll call it correlations.

If I take correlations, something I might think about is how to filter it down. There are 737 OTUs in here with correlations, which is way too much data to show on a biplot. One thing we could do is filter for only those p-values that are less than 0.05 — filter(p.value.x < 0.05 | p.value.y < 0.05) — and that gets us down to about 222. That's better, but not a whole lot better. What if we go to 0.01? That gets us to 181, right? We could keep playing this game, shrinking the significance threshold, but that's not totally helpful, because we'd still have 146 here. Alternatively, we could take correlations and pipe it to a filter on the estimates. We could say abs(estimate.x) > 0.5 — I want the absolute value because we have both negative and positive correlations — or abs(estimate.y) > 0.5. That gets us to 28, right? So that's pretty good. Maybe we go up to 0.75, which gets us to 6 OTUs with an absolute correlation greater than 0.75. And we see that the estimates that got us here all have significant p-values; for some, like OTU 21, the p-value on the y axis is not significant, but there the estimate is also pretty small, right?

So what we might think about is taking this data frame and plotting it on top of our ordination. Let's see how we can do that. I'll call these high_cor, and we'll now plot those segments on our ordination, right? Of course, we could take high_cor, pipe it to ggplot with aes(x = estimate.x, y = estimate.y), and do geom_point as a starter. We get our 5 OTUs in the lower left corner and our 1 up in the upper right corner. Alternatively, we might like to draw rays going out from (0, 0) to these different positions. To do that, we could use geom_segment. For geom_segment we need 4 aesthetics: x and xend, and y and yend. I'm going to make x 0, xend estimate.x, y 0, and yend estimate.y. And now we can see those 5 rays coming out of (0, 0) off to the left, and the 1 ray going up to 0.8.
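In code, the y-axis version, the join, and the threshold filtering come out something like this sketch, with the ray plot at the end:

```r
# same correlation, but against the NMDS2 (y axis) positions
cor_y <- nmds_shared %>%
  nest(data = -name) %>%
  mutate(cor_y = map(data,
                     ~ cor.test(.x$value, .x$NMDS2,
                                method = "spearman",
                                exact = FALSE) %>%
                       tidy())) %>%
  unnest(cor_y) %>%
  select(name, estimate, p.value)

# estimate and p.value pick up .x / .y suffixes for the two axes
correlations <- inner_join(cor_x, cor_y, by = "name")

# keep OTUs with a strong correlation on either axis
high_cor <- correlations %>%
  filter(abs(estimate.x) > 0.75 | abs(estimate.y) > 0.75)

# rays from the origin out to each OTU's correlation coordinates
high_cor %>%
  ggplot(aes(x = 0, xend = estimate.x, y = 0, yend = estimate.y)) +
  geom_segment()
```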
So in other words, the samples that are high on the y axis are being driven by that one OTU, whereas everything to the left is being driven by high values of those 5 OTUs pointing to the left. And of course, if samples are to the right, they have lower values of those OTUs, and if samples are down below, they have lower values of the OTU that points up, right?

So let's think about how we can merge this all together. I'm going to come back up here and get the plot of my NMDS positions, and we can of course add a geom_segment to it, with data = high_cor and an aes — and I'll grab all this good stuff. Let's run that... it's complaining that early is not found. What I need to add is inherit.aes = FALSE. Now we can see those rays going off to the left and that ray going up, and we can see how that explains that the samples to the left are primarily late samples, from days 141 to 150. There's also a fair amount of vertical separation, so it would be good to know which OTU is driving things up and down.

To do that, we can think about adding text onto our plot. Let's grab this and do a geom_text. Again, I'm going to take a lot of this same stuff and pull it down here, but we'll go back to having x and y be estimate.x and estimate.y, and then we'll do label = name. So we see the names at the ends of our vectors. Unfortunately, there's a bit of overlap down here on the left. Something we might think about is geom_text_repel, and for that we need library(ggrepel). Now we see that our labels aren't overlapping with each other, but they are kind of overlapping with the rays, and the lines between each label and the tip of its ray kind of overlap each other too. It just gets a little messy, right?

One thing we might think about is turning off the vectors entirely. Let's give that a shot to see if it helps clean things up. But before I do that, I want to make sure I have a line from each OTU label to the tip of its ray; OTU 9 here, for example, doesn't have one. What we can do is set min.segment.length, and I'm going to use 0.01. Yeah, now I have a line from each label going to the tip of its vector, and I can think about turning off geom_segment. Now we can kind of see what each label is pointing to. Maybe, to help highlight that it's pointing at a point, let's add a geom_point. I'm going to grab some of the stuff I had up here before, we'll also do inherit.aes = FALSE, I'm going to set color = "black", and I don't want that label = name.

I think that looks a lot better. I don't know what you think, but I like how the black points indicate the positions of the OTUs, whereas the colored points are the actual samples, colored early or late. We've got a nice clean presentation of the OTU labels with their positions. There are all sorts of arguments you can play around with in that geom_text_repel function; we've talked about those in previous episodes — basically how you might move things around to make it look even cleaner. Of course, we probably wouldn't want to label these with names like Otu001 and Otu005; we'd probably want the actual taxonomic names.
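Before wrapping up, here's a sketch of that final figure as described — samples colored early/late, black points at each OTU's correlation coordinates, and repelled labels, with the geom_segment rays left out:

```r
library(ggrepel)

nmds_positions %>%
  mutate(days = as.numeric(str_replace(group, ".*D", "")),
         early = days < 10) %>%
  ggplot(aes(x = NMDS1, y = NMDS2, color = early)) +
  geom_point() +
  # black points mark each OTU's correlation with the two axes
  geom_point(data = high_cor,
             aes(x = estimate.x, y = estimate.y),
             inherit.aes = FALSE,
             color = "black") +
  # repelled labels, with a leader line to every point
  geom_text_repel(data = high_cor,
                  aes(x = estimate.x, y = estimate.y, label = name),
                  inherit.aes = FALSE,
                  min.segment.length = 0.01)
```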
But that's a little bit beyond what I wanted to demonstrate in today's episode. So again, this is a biplot, which shows how the abundance of each OTU correlates with the position of each sample on the x and y axes of the ordination. This is a really handy tool for understanding some of the drivers of the differences we see between communities. OTU 9, for example, we would expect to have a higher relative abundance in samples above zero on the y axis and a lower relative abundance below it. These other OTUs, like OTUs 1, 3, 5, 10, and 40, are highest in the later samples and lowest in the earlier samples. And again, we could come back and play with the thresholds we used for deciding which taxa we were most interested in; I think we were looking at correlations greater than 0.75 in absolute value. You could lower that, you could raise it, whatever. This is all part of our toolkit for exploring ecological data.

I hope you found this useful. It was a little bit longer than normal, but I think we really got to see a lot of the techniques we've been talking about over this series of videos: ggplot2 and dplyr, and even ggrepel and broom got in there. Oh yeah, and vegan, right? All these great tools coming together to help us make a really informative figure so we can better understand what's going on in our community data. Well, I encourage you to play with this. As always, please try it with your own data, and we'll talk to you next time for another episode of Code Club.