Hey folks! As I have pointed out in recent episodes, I am not the biggest fan of ordinations. My biggest concern with ordinations is that it's never quite clear what the point is: what do I need to see as a reader in this ordination? It often just looks like a big mess of points. That gets us to my second concern, which is that as humans we have evolved to see patterns, because we want to know if there's a predator over there in the tree that we need to be aware of, and so we tend to overreact to patterns. I think of an ordination as being a bit like an ecological Rorschach test: do I see a pattern? Is it real or not? A third problem I have with ordinations is that they ultimately remove information from the data. We've talked about this before: ordinations take highly dimensional data, say 200 dimensions, and flatten it down to two. We see this commonly with three dimensions, where you can think of the globe, a big sphere, being smushed down into two dimensions. We know there are all sorts of distortions in the shapes and sizes of the countries on a flat map of the earth, and that's an even bigger problem when we go up to the higher-dimensional data we have in ecology. So in today's episode I'm going to cover two different methods that are available in the vegan R package. One is called adonis, which has also been called AMOVA or PERMANOVA; it's got a lot of different names. It allows us to test whether or not two clouds of points have the same centroid, whether or not they overlap in the same spot. The second tool is called PERMDISP, betadisper, or HOMOVA, depending on where you're getting it from, and it allows us to test whether or not the amount of variation in the data differs between sampling groups.
We'll head over to our studio and get playing with those functions using the data we've been working with in recent episodes. As always, if you want the data that I'm working with, or a final copy of the script that I'll be generating today, down below in the description is a link to a blog post for today's episode. I've also got a video up here that'll show you all you need to do to get caught up. As always, I'm going to start with library(tidyverse), and we'll also do library(vegan), so we have the tools from both of those packages at our disposal. Now I'm going to read in a shared file generated by mothur using read_tsv; the file is in data, so it's data/mice.shared. This data comes from a study where I sampled laboratory mice over the course of about six months. I intensively sampled them between days 0 and 9 post-weaning and days 141 and 150 post-weaning, and those are the time points I'm going to focus in on. There's other data in here, but this will get us going. This is a big data frame: it's got something like 2,067 columns and 360 rows. I'm going to get it into a tidy format so that it's easier to work with. I'll do a select on group and anything that starts_with "Otu" (capital O, lowercase t-u), and then I'll do a pivot_longer on everything but the group column, so -group. Now we've got a column for our group, a column for the OTU (the taxonomic name), and a value column with the number of times we saw that OTU in that sample. The next thing I want to do is filter on two different characteristics. First, I want to keep the samples that come from those intensive sampling time points, and second, I want to remove any samples that have fewer than some threshold number of sequences. Let's start by removing samples that don't occur in those more heavily sampled time periods.
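The steps so far might look like the sketch below, assuming the file and column names used in the episode (a mothur shared file at data/mice.shared with a group column and Otu columns):

```r
# Read and tidy the shared file; assumes the column names from the episode
library(tidyverse)
library(vegan)

shared <- read_tsv("data/mice.shared") %>%
  select(group, starts_with("Otu")) %>%
  pivot_longer(-group)   # gives columns: group, name (the OTU), value (count)
```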
To do that I'm going to use a function called separate, which we've seen in some recent episodes. I'll separate group into, as a vector, animal and day, and the separator I'm going to use is the capital "D". That would turn a group name like "F3D0" into two columns, F3 and 0, but I also want to keep that group column, so I'll do remove = FALSE, and now I've got columns for my group, my animal, and my day. Next I'm going to create a variable that contains the days I actually want: days_wanted, which is c(0:9, 141:150) to get days 0 through 9 and days 141 through 150. Now I can filter for day %in% days_wanted: if the value of the day column is in that vector, the row comes through in the output. Next I want to find the number of sequences in each of the samples so I can get to that threshold. I think it was 1,828, but I want to show you again how I got there. We'll do a group_by on group and then a summarize with n = sum(value), because value is the name of that last column, and so for each sample we get the total number of sequences. I'll then do an arrange, which sorts the data by the n column, and a print with n = 20 so we get the first 20 rows of the sorted data frame. Sure enough, 1,828 is the value I want to use as a threshold; that's what I've been using in all of the previous episodes as the minimum number of sequences that I rarefy my data to.
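As a sketch, those steps could look like this; note I've included the convert = TRUE argument that comes up later in the episode, which makes the day column numeric:

```r
# Split group names like "F3D0" on the capital "D" into animal and day,
# keep the original group column, and keep only the intensive time points
days_wanted <- c(0:9, 141:150)

shared_days <- shared %>%
  separate(group,
           into = c("animal", "day"),
           sep = "D",
           remove = FALSE,
           convert = TRUE) %>%   # converts day from character to numeric
  filter(day %in% days_wanted)

# Total sequences per sample, sorted, to find the rarefaction threshold
shared_days %>%
  group_by(group) %>%
  summarize(n = sum(value)) %>%
  arrange(n) %>%
  print(n = 20)
```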
So I'm going to come back in here and edit this: I'll remove the summarize and instead put in a mutate, and I'll add an ungroup to remove that grouping. Now what I've got is an extra column, n, and what I can do is filter for n >= 1828. I want greater than or equal to, rather than strictly greater than, because I want to keep the sample that has exactly 1,828 sequences in it. Now that I've removed the samples with fewer than 1,828 sequences, I can pivot_wider to get back to having a wide, shared-style tibble: I'll do names_from = name (by default pivot_longer calls that column name), values_from = value, and values_fill = 0. I'm also going to come back up and add a select(-n) to remove the n column, because I don't need it once I've filtered the data. Wonderful, we now have our wide data frame that we can use to pull out information for our metadata as well as the information that needs to go into vegan to calculate our ecological distances.
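A sketch of that filtering and widening step, building on the shared_days object from above (the episode goes on to call the result early_late_tbl):

```r
# Drop samples with fewer than 1,828 sequences, then pivot back to a
# wide, shared-style tibble
early_late_tbl <- shared_days %>%
  group_by(group) %>%
  mutate(n = sum(value)) %>%   # per-sample total, kept on every row
  ungroup() %>%
  filter(n >= 1828) %>%
  select(-n) %>%               # drop n before widening
  pivot_wider(names_from = name,
              values_from = value,
              values_fill = 0)
```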
I'm going to call this early_late_tbl and save it. Then I'll take early_late_tbl and do a select on group, animal, and day, which gets me three columns for my 227 samples. This will be the seed of the metadata data frame that I'll use for my later tests and analyses, so I'll call it early_late_metadata. Very good. Next we want an early/late distance matrix, so we'll take early_late_tbl and do a select to remove the animal and day columns, so -animal, -day. Then I'll do a column_to_rownames on the group column; note that "group" needs to be in quotes. This creates a data frame rather than a tibble, which I can use as input to avgdist from the vegan package to get my distances. avgdist does a rarefied calculation of the Bray-Curtis distances: if I give it a sample value of 1828, it will subsample 1,828 sequences from every sample, calculate the Bray-Curtis distance matrix, repeat that a hundred or so times, and then return the average to me. Great, that all worked, and I'm going to call the result early_late_dist. Now I want to use this distance matrix to make an ordination. I'll use NMDS, so I'll do metaMDS on early_late_dist. I'll also do a set.seed; we've seen in the past that the NMDS procedure can be sensitive to the seed we use, and with the wrong seed it might not converge. It converged using a seed of 1. I can use the scores function to get the position on the two NMDS axes, and this outputs a matrix that has row names. I'd rather have a tibble that I can easily feed into ggplot, so I'll feed this into as_tibble with rownames = "group". That gives me a tibble as output, and we're in good shape. I'll call this early_late_nmds, so now I have my NMDS data for making the ordination as well
as my metadata. I'll go ahead and join those together with an inner_join of early_late_metadata and early_late_nmds; both of those have a group column, so it'll automatically join by group. I'll call this metadata_nmds, and we can now use it as input for making an ordination plot. We'll copy that down and do ggplot with an aes where x is NMDS1, y is NMDS2, and the color is period. But I don't have a period column just yet, so I want to come back up to my early_late_metadata and do a mutate to create one: period = if_else(day < 10, "early", "late"). While I'm here, I'm also going to make a column for sex, because that's another variable I want to look at: sex = if_else(str_detect(animal, "f"), "female", "male"), so if we detect the letter "f" in the animal column we'll say that animal is a female, otherwise male. Let's double-check that that all worked: we'll take early_late_metadata and count sex, and that looks right, a pretty even number of samples from male and female mice. Then we'll also count period, and here we get a weird skew: 24 earlys and 203 lates. If I had to guess, that's because in early_late_metadata the day column is of type character rather than numeric, so the comparison day < 10 isn't doing what we expect. If we come back up to where we did that separate, I can add the argument convert = TRUE, which will try to convert the split-apart strings to numeric. Now if I redo the count on period, I get an even representation of early and late, and that's good. All right, now we can redo our join and make our plot with geom_point, and we see this cloud of points. I think we've seen something like this in previous episodes: the earlier points are more diffuse, more dispersed, and the later ones are more compact.
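Pulling the ordination pipeline together, a sketch under the same assumptions (early_late_tbl as built above, and column_to_rownames from the tibble package, which loads with the tidyverse):

```r
# Metadata with the derived period and sex columns
early_late_metadata <- early_late_tbl %>%
  select(group, animal, day) %>%
  mutate(period = if_else(day < 10, "early", "late"),
         sex = if_else(str_detect(animal, "f"), "female", "male"))

# Rarefied Bray-Curtis distances: avgdist subsamples 1,828 sequences per
# sample, computes the distance matrix, and averages over repetitions
early_late_dist <- early_late_tbl %>%
  select(-animal, -day) %>%
  column_to_rownames("group") %>%
  avgdist(sample = 1828)

# NMDS ordination; the seed matters for convergence
set.seed(1)
early_late_nmds <- metaMDS(early_late_dist) %>%
  scores() %>%
  as_tibble(rownames = "group")

# Join metadata and ordination coordinates by the shared group column
metadata_nmds <- inner_join(early_late_metadata, early_late_nmds)

# Basic ordination plot, colored by period
metadata_nmds %>%
  ggplot(aes(x = NMDS1, y = NMDS2, color = period)) +
  geom_point()
```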
They're clearly separated, at least in this ordination, and that's pretty good. We might want to add ellipses around these clouds, so we'll do stat_ellipse, which draws those circles around the points. Maybe we'd also like to add a centroid, so I'll take metadata_nmds, group_by period, and then summarize with NMDS1 = mean(NMDS1) and NMDS2 = mean(NMDS2). This gives me coordinates for early and late, and I'll call it centroid. Then down here I'll add another geom_point where the data is centroid. I can't really see what happened, because it's adding a point that's the same size as everything else, so I'll do size = 5, and that gives me a big circle at the middle of each cloud. I can see it fairly clearly for the reddish one, but not so well for the green, because those points are so jam-packed. So let me show you a trick to make those centroids easier to see: I'll do shape = 21. Shape 21 is in that range, I think 21 through 25, where you can set the fill color of a symbol to be different from the border color. I want the border to be black, and that's set with the color aesthetic, so I'll say color = "black", and then in an aes I'll say fill = period. That gives me a centroid that is distinct and kind of pops out of the points. I'm not totally thrilled with having those changes show up over in the legend as well, so I'll add show.legend = FALSE to stat_ellipse and also to that second geom_point. This isn't exactly a publication-quality figure, but you get the idea; you could take it further and clean it up to make it look a bit more attractive. So let's repeat this now but using sex as a variable, so I'm going to go ahead and
copy this down. Instead of grouping by period I'll group by sex, and here instead of coloring by period I'll color by sex; I think I also have the fill to change. Great, let's rerun these, and now I can see the ordination colored, and ellipsed so to speak, by sex. Whereas previously we had early to the right and late to the left, now we have female to the right and male to the left. So a question we might ask is: are the centroids of these groups, as I've colored them, significantly different from each other? You might also ask whether they're significantly different if we account for both the period and the sex. We can answer that type of question using a tool called adonis from the vegan package. I'm going to do the single-factor approaches first, and then we'll build up to something more complicated as we go along. We can do adonis(early_late_dist ~ early_late_metadata$period). What's important to note in this formula is that early_late_dist needs to be a distance matrix. If I do str on early_late_dist, I can see that it is in fact of class dist. If it's not of class dist, then adonis is going to try to calculate ecological distances from that data: if you give it a regular matrix rather than a distance matrix, it will calculate Bray-Curtis distances from the values in your matrix, which is not what you want. So just be sure that this is in fact a distance matrix, and if you're not sure, you can always wrap the variable in as.dist, as in as.dist(early_late_dist). Let's go ahead and run this. What we see in the ANOVA table is that the effect of period, the top line of the table, has a p-value of 0.001, one in a thousand. That's a significant p-value because it's less than 0.05, and we're getting this value because it's the most extreme value possible with the default number of permutations. I could increase the precision on this by
doing permutations = 1e4, so 10,000. That takes ten times longer to run, and we again get a very small p-value. In general a thousand permutations is sufficient, which is roughly the default, but if you need greater precision on your p-value, say you're right around 0.05, then increasing the number of permutations will give you better precision. Excellent. I can also save this as a variable, which I'll call test, and if I want to get the p-value out of the test, I can do str(test) to see all the structural information about this list called test. There's an element in there called aov.tab, which contains a vector of my p-values, so I can do test$aov.tab$`Pr(>F)` (note the backticks around Pr(>F)) to get that vector, and then [1] to pull out my p-value. If I wanted to report this in a publication, say in an R Markdown document, this is how I would extract the p-value. Let's take what we had here and now look at sex as a variable. I'm going to remove that permutations argument; the default is actually not a thousand, it's 999, which is why I was getting those funky numbers like 0.000999. Anyway, let's replace period with sex, and we again see a very small p-value, and of course if we increased the number of permutations to 9,999 we could get down to 10^-4. Also remember, I don't need as.dist here because early_late_dist is in fact already a distance matrix, but like I said, if you want to be sure, go ahead and wrap that variable in as.dist so funny things don't happen. So we've seen that there's a significant effect of sex as well as of period. The next question is: what if we put these together? Let's do that. I can take that statement and copy this variable, early_late_metadata$sex, and
combine the two together: we'll do early_late_metadata$period * early_late_metadata$sex. That star is going to test both period and sex as main effects as well as their interaction term. What we get is that period and sex are both significant, as is the interaction term. I don't really know what a significant interaction term means in an experiment like this; interaction terms even under normal circumstances are a bit funny to work with, but just be aware that it's something you'd probably want to investigate further. If for some reason you don't want the interaction term to be tested, you can use the same syntax but with a plus instead of the star, and you get basically the same ANOVA table without the interaction. Now, in this data set the samples aren't independent of each other: we've taken up to 20 samples from each of the different animals, so I need to control for that repeated sampling of the same individual. What I have seen people do in various tutorials and help forums is to use the argument strata. strata controls how samples are permuted when adonis runs the test, so we can do strata = early_late_metadata$animal, which makes sure the permutation is done within each animal. We basically get the same result we had before, and if I repeat this with just period, without sex, we again see a significant result, but now we've controlled for the repeated sampling of the same individual. Again, this data set is a little bit funny; every data set is a bit different. This is how I would do it for this particular data set, where I'm controlling for repeated measures, but if you have, say, a cohort of patients in two different treatment groups, then you probably wouldn't have to worry about strata. But if
you've got repeated measures, you probably would. This is certainly how I have used adonis in the past, and when we published these data, this was the model we used. Now, if you look at the literature on adonis, they will warn you that if you have uneven dispersion in your data, like we appear to have here, you can get significant results even when the centroids aren't really different. I have run simulations in the past where I don't see that effect, but the authors who originally developed the method behind adonis pointed it out, so I'm a bit conflicted. People ask about it frequently, though, so I'm going to show you how we would test for differences in the dispersion of the data here in vegan. For this particular data set I think dispersion is actually pretty interesting to consider as a variable in its own right, because my eyes at least want to see that the earlier time points are less stable than the later time points, and if you go back to some of the earlier videos where I looked at alternatives to ordination, we certainly saw that those earlier time points had more day-to-day variation than the later ones. OK, so we're going to use a tool called betadisper. Again, this comes from vegan. We'll give it our early_late_dist and then the groupings we want, so early_late_metadata$period. That gives us some output that's not super helpful on its own, so I'm going to save it as bd. What we can do with bd is give it to anova, as anova(bd), and this tests whether or not the dispersion differs significantly between the early and late groups. Here, of course, we see that it's highly significant. Another way to do it is permutest(bd), which gives essentially the same result, but based on a permutation test. So again, this shows us that there is a significant difference in dispersion between early and late, the early samples being much more
variable than the late. We can do the same type of thing investigating by sex, where we again get significant p-values, indicating that the two sexes differ significantly in their dispersion. Here, if we look at the average distance to the median, the females are further from their median than the males are on average, and if I come back to my plot for female and male, we certainly see that the ellipse for the females is broader than it is for the males. In terms of interpreting these results, again, there is that bit of caution about how we interpret the results of adonis when there is a significant difference in dispersion. I could perhaps see that being a concern with the sex-based effect; I'm less inclined to worry about it for the period-based effect, because there is such a clear separation in the data that I'm pretty willing to accept that result. One of the other things they say in the documentation for the adonis function is that of all the tests of this type, adonis is the least susceptible to problems with this kind of difference in dispersion, and there aren't many alternatives out there. I think this is another good reminder that we need to not check our brains out when we run a test. If it doesn't look like there's a difference, but we see a difference by a statistical test and we have big differences in dispersion, then maybe that's a good case to say, well, I'm not going to believe this; maybe the test is being sensitive to that difference in dispersion. Whereas if we have a case like this, where things are really far apart and also differ in dispersion, I'm willing to accept that as being statistically significantly different, not a false positive. So again, don't check your brain out when you run a test; always interpret in light of what you see in the data. Now you might ask how we would combine the two variables together, the early/late period as well as the sex. I'll come back up to my early_late_metadata, so I'm
going to create another variable called period_sex, which I'll make with paste0 on period and sex. If I look at early_late_metadata, I now get a column that's early/female, late/female, and so forth. Then I can redo my inner_join and repeat the betadisper, but instead of early_late_metadata$sex I'll use early_late_metadata$period_sex, and here I get a significant difference: at least one of the four groups (we have two sexes and two periods) is significantly different from the others. If I run permutest again, it gives me the same thing, and then I can say pairwise = TRUE, which gives me the pairwise comparisons of those four groups. What we see is that the late males and the late females are about as stable as each other, their dispersion doesn't differ significantly, but the other comparisons certainly are significantly different; even the early male to early female comparison is below a threshold of 0.05/6, dividing by six for the number of pairwise comparisons we have here. So again, two of the problems with ordination are that we tend to see patterns with our eyes that aren't necessarily there, and adonis and betadisper are both great tests to help us understand whether the patterns we see are real; and on the other hand, when we do an ordination we remove a lot of the signal from the data as we go from many dimensions down to two, and by having a statistical test we can bring that information back in and help us see whether or not there are actual differences in the data. As always, I strongly encourage you to go back to the beginning of the video and rewatch it, running the code with the data that I've given you. You'll get the most out of it if you can take what I've talked about today and go through these steps with your own data. All right, well, if you have any problems, please leave a comment down below in the comment section. We'd really appreciate it
if you gave me a thumbs up, and if you haven't subscribed yet, go ahead; it would just make me happy to see yet another subscriber. Even better if you told your friends what we're doing here on Code Club!
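If you just want the plotting and testing code from this episode in one place, here's a sketch, assuming the early_late_dist, early_late_metadata, and metadata_nmds objects built along the way. Two caveats: recent versions of vegan have deprecated adonis() in favor of adonis2(), which returns a data frame rather than a list with an aov.tab element, and the underscore separator in paste0 below is my own choice of label, not necessarily the one used in the video.

```r
# Ordination with ellipses and centroid markers (shape 21 allows a fill
# color distinct from the black border)
centroid <- metadata_nmds %>%
  group_by(period) %>%
  summarize(NMDS1 = mean(NMDS1), NMDS2 = mean(NMDS2))

metadata_nmds %>%
  ggplot(aes(x = NMDS1, y = NMDS2, color = period)) +
  geom_point() +
  stat_ellipse(show.legend = FALSE) +
  geom_point(data = centroid, aes(fill = period),
             shape = 21, size = 5, color = "black",
             show.legend = FALSE)

# Do the groups have different centroids? (PERMANOVA)
test <- adonis(early_late_dist ~ early_late_metadata$period)
test$aov.tab$`Pr(>F)`[1]   # extract the p-value for the period effect

# Both main effects plus their interaction, permuting within each animal
# to control for repeated sampling of the same mouse
adonis(early_late_dist ~ early_late_metadata$period * early_late_metadata$sex,
       strata = early_late_metadata$animal)

# Do the groups differ in their dispersion?
bd <- betadisper(early_late_dist, early_late_metadata$period)
anova(bd)       # ANOVA-based test
permutest(bd)   # permutation-based version of the same test

# Pairwise comparisons across the four period/sex combinations
early_late_metadata <- early_late_metadata %>%
  mutate(period_sex = paste0(period, "_", sex))   # "_" separator is my assumption
bd4 <- betadisper(early_late_dist, early_late_metadata$period_sex)
permutest(bd4, pairwise = TRUE)
```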