Hey, folks, I have a profound and important scientific question for you: can you publish a microbiome study without including an ordination? Yes, actually, you can. In today's episode and the next, I'm going to show you two different ways to generate an ordination in R using a distance matrix. In this episode, I'll show you how to run PCoA within R. PCoA is short for principal coordinates analysis. It's a lot like principal components analysis (PCA), but there's an important difference: PCoA lets you give it any distance matrix. PCA is basically PCoA where you use Euclidean distances to calculate the differences between your samples. Euclidean distances are generally considered a poor way to represent the dissimilarity between communities because of the way they handle zeros. I don't want to go into all that here, so don't use PCA; instead, use PCoA with an ecological distance such as Bray-Curtis. My mental model for how PCA works is that you have a cloud of points. I'm illustrating it here in three dimensions, but it's potentially in hundreds of dimensions. You want to draw a line through that cloud that has the highest R-squared, the line that explains the most variation in the data. That line becomes principal coordinate axis one. Next, you draw a line perpendicular to the first that explains the next most variation, the perpendicular line with the next highest R-squared. You repeat that: what's the line perpendicular to the second axis that explains the most remaining variation? And so on. So PCoA uses linear combinations of the data to represent descending amounts of variation in the data.
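To make that PCA-versus-PCoA relationship concrete, here's a quick sketch, using R's built-in USArrests data purely as a stand-in since we haven't loaded any microbiome data yet. It shows that PCoA run on Euclidean distances recovers the same coordinates as PCA, up to arbitrary sign flips of the axes:

```r
# PCoA (cmdscale) on Euclidean distances reproduces PCA scores,
# up to a reflection of each axis. USArrests is a stand-in dataset.
scaled <- scale(USArrests)

pca_scores  <- prcomp(scaled)$x[, 1:2]        # PCA: first two component scores
pcoa_points <- cmdscale(dist(scaled), k = 2)  # PCoA on Euclidean distances

# Each pair of axes is perfectly correlated in magnitude
abs(cor(pca_scores[, 1], pcoa_points[, 1]))
abs(cor(pca_scores[, 2], pcoa_points[, 2]))
```

That equivalence is exactly why PCoA with an ecological distance like Bray-Curtis is the more general tool.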
We're generally only concerned with the first two principal coordinate axes: the first becomes our x-axis and the second becomes our y-axis. Some oddball people insist on publishing the third dimension as well on a 2D page. That doesn't make sense to me, yet people do it anyway; I think it's the default in a rival software package. Anyway, you're not going to do that, because your papers only have two dimensions, so you're only going to represent two dimensions. There are R packages out there that let you interactively visualize 3D; I've actually talked about one of those in the past, called RGL. I think that's really cool when you're working on your own computer, but again, if you're publishing a paper, you only have two dimensions, so only use two dimensions. Enough said, let's head over to RStudio and I'll show you how we can make our own principal coordinates analysis plot in R. We'll use some base R as well as some ggplot. I've taken the code that was in my analysis.R script from the previous episode and copied it into a new R script called pcoa.R. If you want to get analysis.R and all the other files and data going into this analysis, there's a link down below in the description to get caught up. I'll also put a video up here that shows how you can do this by interacting with GitHub and RStudio. Anyway, let's go ahead and run all this, and it will load our distance matrix, which has been filtered to the samples we want. The distance matrix indicates the dissimilarity between different stool samples that I collected from mice in our laboratory's breeding colony. We collected daily samples from about a dozen mice, trying to understand the stability of the murine microbiota. So now we have our distance matrix loaded. We also have our sample lookup data frame, which tells us the animal, the sex, and the day post-weaning for each sample represented in that distance matrix.
The command I'm going to use to build the ordination comes from base R; it's called cmdscale(). We'll give it dist_matrix. This produces a lot of output, so I'll go ahead and pipe it into str(), and we see that it's a matrix, a 227-by-2 matrix of the positions in the ordination space. By default, cmdscale() outputs only two dimensions; in a moment, I'll show you how to get three if you want to use it with the RGL package. Another thing we can do is pipe this to plot(). Here, I think, is the very first base R figure that I've shown on Code Club. If somebody out there would like to learn more about doing base R graphics, let me know, and maybe I'll think about how we can add more of that type of visualization. I think it may be helpful to know how to go back and forth between base R and the tidyverse's ggplot. That being said, I've been using ggplot for the past four years, have really had no need to go back, and don't miss base R graphics. Anyway, this is a helpful quick diagnostic of what the ordination looks like. So let's see how we can use ggplot2 instead of base R to plot our data. Again, we have this matrix, and we need to convert it into a tibble. We've seen recently how to do this: we can use as_tibble() with rownames = "samples". This gives a warning message that the x argument of as_tibble() for matrices must have unique column names. Our cmdscale() matrix output, as we saw up above when I ran head(), doesn't have any column names. If I run this again, the warning goes away, which is always nice, but what we should do instead is give our ordination matrix some column names. So I'll assign the cmdscale() output to pcoa, and then we can call colnames() on pcoa and name the first column pcoa1 and the second pcoa2.
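Here's a minimal, self-contained version of those steps. Since the episode's dist_matrix comes from the files linked in the description, I'm using a stand-in Euclidean distance matrix built from USArrests; the workflow is identical:

```r
# Stand-in for the Bray-Curtis distance matrix loaded by pcoa.R
dist_matrix <- dist(scale(USArrests))

pcoa <- cmdscale(dist_matrix)  # default: k = 2 ordination axes
str(pcoa)                      # a numeric matrix, no column names
plot(pcoa)                     # quick base R diagnostic scatter plot

# Name the columns so downstream code can refer to them
colnames(pcoa) <- c("pcoa1", "pcoa2")
head(pcoa)
```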
Then we can take pcoa and pipe it into as_tibble(). Let's make sure those column headings took; sure enough, we have pcoa1 and pcoa2 now. When we pipe it into as_tibble(), it takes the row names and puts them in a column called samples, and sure enough, we now have a tibble that's 227 rows by three columns. Hopefully we're becoming pros at this. Now we can feed this into ggplot2's ggplot() function with aes(x = pcoa1, y = pcoa2), and then we can add geom_point(). What we get is the ggplot2 version of the figure. If you kind of cross your eyes, you'll see the points are in the same places; clearly the visual representation of the points is a little different. Before I go much further, I'd like to show you how to get more columns of data out of the ordination. If I come back up to cmdscale(), I can add the argument k = 3, and now if I look at pcoa and send it to head(), I see that I've got three columns. If I say k = 5, I get five columns, and I would need to update my column names accordingly. Like I said, there's a package called RGL that I've covered in a previous episode that allows you to make interactive 3D plots; I think you can also use plotly, which I've talked about in other episodes, to do the same. I'm going to keep this at two, because again, I'm representing this on my 2D YouTube screen, and we'll roll with that. Now, something people often like to do is indicate the percent of the variation explained by each of the axes being represented in the ordination. Typically, the x-axis label might read something like "PCo Axis 1" followed, in parentheses, by the percent of the variation explained, with the same thing on the y-axis. So how can we do that? It turns out it's not that hard. We can come back up to cmdscale(), add the argument eig = TRUE, and look at pcoa.
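Putting those pieces together, again with the stand-in distance matrix rather than the episode's mouse data, the tibble conversion and ggplot2 version look like this:

```r
library(tidyverse)

dist_matrix <- dist(scale(USArrests))  # stand-in distance matrix

pcoa <- cmdscale(dist_matrix, k = 3)   # k sets how many axes come back
colnames(pcoa) <- c("pcoa1", "pcoa2", "pcoa3")

# Row names become a "samples" column; no warning now that columns are named
positions <- as_tibble(pcoa, rownames = "samples")

ordination <- positions %>%
  ggplot(aes(x = pcoa1, y = pcoa2)) +
  geom_point()
```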
We now see that pcoa comes back to us as a list, where we have this object within pcoa called $eig. These are the eigenvalues. The little bit that I hid from you in describing PCoA is that it takes your matrix and decomposes it into eigenvalues and eigenvectors. If you've taken linear algebra before, this might be the only time since then that you've heard the words eigenvalue or eigenvector. The eigenvectors give the positions on the x- and y-axes; that's this data up here. The eigenvalues indicate something about the proportion of the variation explained by the different axes. So I'm going to use this $eig data to get the proportion explained by each of the axes. Now, looking at these eigenvalues, one thing I notice is that about halfway through, at about position 127, the values become negative. That's not so good: we want our eigenvalues to all be positive, because to calculate the percent explained, we're going to divide each individual eigenvalue by the total of all the eigenvalues. To rescale those values, we're going to add another argument to cmdscale() called add. If we do add = TRUE, this will correct the data so that our eigenvalues are all at least zero. So again, if we look at pcoa and scroll up, we see that our smallest eigenvalue is basically zero: negative 1.6 times 10 to the minus 16 is effectively zero. Now we can extract those eigenvalues and calculate the percent explained. But as I mentioned a little earlier, everything in this pcoa object is now in a list. So we'll call the x and y positions "positions", and that will be pcoa$points. Again, if I look at positions and send it to head(), we see that we've got that two-column data. I then need to update colnames() on positions to pcoa1 and pcoa2, run that, and then the downstream plotting code should use positions as well.
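With eig = TRUE and add = TRUE, cmdscale() returns a list instead of a bare matrix. Here's that step on the stand-in data:

```r
dist_matrix <- dist(scale(USArrests))  # stand-in distance matrix

pcoa <- cmdscale(dist_matrix, eig = TRUE, add = TRUE)

str(pcoa, max.level = 1)  # a list with $points, $eig, and friends
min(pcoa$eig)             # with add = TRUE, nothing meaningfully negative

# The coordinates now live in $points
positions <- pcoa$points
colnames(positions) <- c("pcoa1", "pcoa2")
```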
And we get back the ordination we had, so that's good. Now we need to get out the eigenvalue data. If I do pcoa$eig, the percent explained by the first axis is this value, 11.95954, divided by the total of all the eigenvalues, times 100. I can get that by doing pcoa$eig / sum(pcoa$eig); the sum() function adds up all the eigenvalues, and then I can use the vectorization built into R to get the fraction explained by each axis. So we have 0.16017 for the first axis and 0.0826 for the second. If I want those to be percentages, I simply multiply by 100, and now I get 16.0 for the first axis and 8.3 for the second. We'll go ahead and call this percent_explained. If I want percent_explained on, say, the first two axes, I have my 16.0 and my 8.3. Now I'm going to embed this information into the axis labels of my ordination. To do that, I'm going to use the labs() function, and I'm going to hard-code my labels to show you where the different pieces of information will go. We'll do x equals "PCo Axis 1" and then, in parentheses, 16.0%, so principal coordinate axis 1 with the percentage, and for y, "PCo Axis 2" with 8.3%. Looking at my ordination again, I now see those values in the reformatted, relabeled axis labels, and that looks good. Now, I do not like hard-coding numbers like this into my R code; I would far rather have these values generated from percent_explained. If the data changes upstream of this script, the 16.0 might not be correct anymore; if I include or remove samples from the distance matrix, the ordination might change. So I like to have these types of values defined dynamically by the data and the code. Let's take percent_explained, and I'll show you how we can dynamically define these labels.
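The percent-explained calculation is one line, thanks to R's vectorization (same stand-in data as above, so the actual percentages will differ from the episode's 16.0 and 8.3):

```r
dist_matrix <- dist(scale(USArrests))  # stand-in distance matrix
pcoa <- cmdscale(dist_matrix, eig = TRUE, add = TRUE)

# Each eigenvalue's share of the total, as a percentage
percent_explained <- 100 * pcoa$eig / sum(pcoa$eig)

percent_explained[1:2]  # the values that go into the axis labels
```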
To do this, we're going to lean on an old friend that I talked about many episodes ago called glue. glue() is a lot like paste(). I'll create a vector called labs. The first value will be that "PCo Axis 1" text, but in the parentheses, instead of the hard-coded 16.0, I'm going to put percent_explained[1]. The way glue() works is that it's a function we wrap around the text we want to glue together, and that percent_explained[1] needs to be wrapped in curly braces; the curly braces tell glue() to plop in the actual value of percent_explained. We can then do the same thing for axis two, changing the index to percent_explained[2]. Now if we look at labs, we see that we've got PCo Axis 1 with 16.01... and PCo Axis 2 with 8.26.... We don't want those really long numbers, but it gets us started. So in labs(), let's replace the hard-coded text with x = labs[1] and y = labs[2]. Now our x- and y-axis labels are updated, but they've still got that large number of digits to the right of the decimal point, which I now want to clean up. If we take percent_explained and look at the first two values, I want to round them so they read 16.0 and 8.3. To do that, we can wrap percent_explained in the round() function and say digits = 1, which will round our numbers to one digit to the right of the decimal point. I'm going to call this pretty_pe and put pretty_pe in place of percent_explained. We now see that our x- and y-axis labels no longer have all those extra digits. Unfortunately, though, the 16.0 was truncated to 16, without the decimal place. I want that decimal point in there. To get it, we can wrap the round() call in another function called format().
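Here's how the glue() step fits together, again on the stand-in data (the actual percentages in the episode were 16.0 and 8.3):

```r
library(glue)

dist_matrix <- dist(scale(USArrests))  # stand-in distance matrix
pcoa <- cmdscale(dist_matrix, eig = TRUE, add = TRUE)
percent_explained <- 100 * pcoa$eig / sum(pcoa$eig)

pretty_pe <- round(percent_explained, digits = 1)

# Curly braces tell glue() to substitute in the value of the expression
labs <- c(glue("PCo Axis 1 ({pretty_pe[1]}%)"),
          glue("PCo Axis 2 ({pretty_pe[2]}%)"))
labs
```

These strings then go into the plot with labs(x = labs[1], y = labs[2]).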
Now, this wrapping or nesting of functions is one of the things that gives base R a bad rap, and ultimately why people prefer the pipeline notation. Anyway, we'll roll with it, and it'll be fine. I'll add the argument nsmall = 1, which ensures there's at least one digit to the right of the decimal point. Running that, we now see that we have our 16.0. But if you look at the PCo Axis 2 label, you'll notice that between the parenthesis and the 8 there appears to be a bit of a gap. If I look at pretty_pe, sure enough, there is a space between the quote indicating the string and the 8.3. I can come back up to my format() call and add the argument trim = TRUE, which trims the white space from the front of the string. Now if I look at pretty_pe, I no longer have that space in front of the 8.3, and when I rerun everything, the gap between the parenthesis and the 8 is gone. So we're in good shape. You could, of course, take this ggplot figure and gussy it up, coloring the points by all sorts of different things; we'll save that for another episode. But one final thing I want to do with you before we leave is show you how to make what's called a scree plot. A scree plot indicates the percent explained by each axis. We can come back up to our percent_explained, create a tibble out of it, and feed that into ggplot2. We'll use the tibble() function, calling the percent-explained column pe, and then we'll have a column for the axis, which will be 1:length(percent_explained).
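The full format(round(...)) nesting looks like this on the stand-in data; the exact numbers differ from the episode, but the formatting behavior is the same:

```r
dist_matrix <- dist(scale(USArrests))  # stand-in distance matrix
pcoa <- cmdscale(dist_matrix, eig = TRUE, add = TRUE)
percent_explained <- 100 * pcoa$eig / sum(pcoa$eig)

pretty_pe <- format(round(percent_explained, digits = 1),
                    nsmall = 1,  # always show one digit after the decimal
                    trim = TRUE) # strip the left-padding format() adds

pretty_pe[1:2]  # character strings, one decimal digit, no leading spaces
```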
Now we have the tibble with the percent explained and the axis. Typically, you might think of these columns as being in the other order, but really ggplot doesn't care what order they're in, so we'll let that ride. We'll pipe this into ggplot(), where the x aesthetic will be axis and the y will be the percent explained, and let's go ahead and add a geom_line(). Now we see this L-shaped curve, with the percent explained on the y-axis and the axis number on the x-axis. Going out to 250 dimensions is a bit extreme, so let's zoom in. We can zoom in by adding coord_cartesian() with xlim = c(1, 10). Oh, I misspelled coord. And now we're zoomed in between positions one and ten. One thing to notice is that we already know the first two axes only represent about 25% of the variation in the data, and as you go out, you're not getting that much more variation explained by adding a third, fourth, or fifth axis. You can get a better sense of that from the cumulative percent explained: if we wrap percent_explained in the cumsum() function, we'll see the cumulative percent explained by the first axis, the first two, the first three, and so on. If we wanted to explain, say, 50% of the variation in the data, we'd have to go out to the 17th axis. This is a hallmark of microbiome data: it's really noisy and highly dimensional. My mental model of what's going on in an ordination is that you have a 3D object, say a globe, and you smush it down into two dimensions. We know all sorts of distortion happens in that smushing process, and this reduced percent explained is one way of thinking about it. So let's go ahead and plot the cumulative percent explained.
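Here's a sketch of the scree plot and its cumulative version, again on the stand-in data (for the episode's mouse data, the cumulative curve crossed 50% around the 17th axis):

```r
library(tidyverse)

dist_matrix <- dist(scale(USArrests))  # stand-in distance matrix
pcoa <- cmdscale(dist_matrix, eig = TRUE, add = TRUE)
percent_explained <- 100 * pcoa$eig / sum(pcoa$eig)

# Scree plot: percent explained by each successive axis
scree <- tibble(pe = percent_explained,
                axis = 1:length(percent_explained)) %>%
  ggplot(aes(x = axis, y = pe)) +
  geom_line() +
  coord_cartesian(xlim = c(1, 10))  # zoom in without dropping data

# Cumulative version: how much do the first N axes explain together?
cumulative <- tibble(pe = cumsum(percent_explained),
                     axis = 1:length(percent_explained)) %>%
  ggplot(aes(x = axis, y = pe)) +
  geom_line() +
  coord_cartesian(xlim = c(1, 10))
```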
Now we see that cumulative curve, and we can change the y limit to go from, say, 0 to 50. This gives a better indication that if we go out to, say, 10 axes, we're only explaining maybe about 43% of the variation in the data. Of course, we'd normally be limited to two axes on a printed page, or three if you're using one of those interactive tools. In the next episode, I'm going to show you non-metric multidimensional scaling and how we can do that within R; PCoA, by contrast, is called metric multidimensional scaling. So we'll learn NMDS and how to do that in R. So that you don't miss that episode, please, please, please make sure that you've subscribed to the Riffomonas channel and that you click the bell icon so you're notified when the episode drops. Keep practicing with this, and we'll see you next time for another episode of Code Club.