Hi, my name's Pat Schloss, and I'm a hater. Well, that might not be too surprising to many of you, but up until about four or five years ago I absolutely refused to use dplyr, ggplot2, or other packages that are now part of what we call the tidyverse. Why? Well, because base R worked for me, right? I was able to use the functions and functionality built into base R really effectively, and I thought it was really empowering to know how to do things using the built-in features of the language without having to add other packages. But as I've learned, it's a hell of a lot easier to use things in the tidyverse. Every now and then, though, I do need to go back and use things from base R, and so today I'm going to show you how we can dip our toes back into base R to do some pretty cool things, including using other packages that just aren't part of the tidyverse. Hey folks, I'm Pat Schloss, and really, I'm a lover, not a hater.

Anyway, for the past few episodes of Code Club we've been making an ordination, trying to improve its appearance and make it easier for our audience to read and understand. As we've been going through that, I've been implicitly — or explicitly, I guess — saying that the three different treatment groups that we see in the visual are significantly different from each other. Maybe I didn't say "significantly," but I've been saying they're different from each other, right? And perhaps you've been taking me at my word on that. Well, someone in the comments on a recent episode actually called me on it and asked: you said there's a way to test whether or not those three groups are significantly different from each other — how would we go about doing that?
Well, you're in the right place, because today I'm going to show you how we can do that. I like to go back to the data — back to that distance matrix — and ask the question: are the distances between samples from the same group smaller, shorter, closer together, than the distances between samples from different groups? So if we look at our ordination, we see at the bottom we have samples from healthy individuals, and up at the top we have samples from people with diarrhea and samples from people with diarrhea who were also positive for C. difficile. It looks like they're different — but are they?

So again, we're going to use that distance matrix approach of trying to understand whether the distances within a group are smaller than the distances between samples from different groups. To do that, we can use a package in R called vegan that has a really useful function called adonis. Now, adonis goes by many, many different names. In the mothur package we have a command called amova; in other programs it's called PERMANOVA.
It's called npANOVA or npMANOVA in still other places. It's all the same method, and it comes from a paper by McArdle and Anderson. It is wonderful because it allows us, again, to compare the distances within groups to the distances between groups. So I'm going to show you how we do this with adonis, and along the way I'll point out different places where it pays to know a little bit about base R functionality and how it works with elements of the tidyverse.

What I'd like to do today is create a script that runs the adonis analysis to tell us whether any of the three groups differ from each other, then does a pairwise analysis comparing each pair of those three groups to confirm that they're significantly different from each other. We'll also correct for multiple comparisons, to see whether our p-values are still all less than 0.05 after that correction.

Coming over to RStudio, I'm going to create a new R script. We'll start with library(tidyverse), because we will use some elements of the tidyverse, plus library(readxl), and we'll also do library(vegan). If you don't have vegan installed, go ahead and install it, because it's important for today's analysis. Coming back to the Schubert data, I'm going to copy the line for reading in the metadata. I also want the distance matrix.
So I'll create a variable called distance, and here it's important to know that the distance file isn't really well formed. If I open up schubert.braycurtis.dist, we see it's got kind of a funky format. One other thing I want to mention is that I updated this file in the repository you all originally installed from, so be sure you have version 0.3 to get this to work. This is what's called a square matrix — a PHYLIP-formatted square matrix. The first value, on the first row, is the number of samples, so 338 samples, and then come all the distances. I could make my life a little easier by deleting that 338, but I don't want to touch it — I want to leave my raw data raw. That's a really fundamental part of reproducible research. So we'll keep it as is, and I'll do read_tsv("raw_data/schubert.braycurtis.dist"). I don't want that first row — I don't care about the number of samples, because read_tsv can figure that out for me — so I'll add skip = 1. I'll also set col_names = FALSE, because there aren't any column names in the file, right? If you look at it, there are no column names.

And if we run that — I think I forgot to actually run all this stuff first. Okay. And it's col_names with an underscore. Yeah, and so we see that it reads it in: the default column specification is double, but column X1 is character. Let's look at what distance looks like now. We've got a whole bunch of columns, and X1 holds our sample IDs. What I'd like to do now is fix my column names. How do we do this? Well, this is where a little bit of base R knowledge comes in handy. If I do colnames(distance) and run that, I get all of the column names from my distance file as a vector, and what I can do then is assign values to that expression to write over the column names. It's pretty slick.
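To make the file-reading step concrete, here's a minimal base R sketch of the same idea — writing a tiny PHYLIP-style square distance file to a temporary location and reading it back while skipping the sample-count line. The sample names and distances are made up for illustration, and I'm using base R's read.table() here as a stand-in for read_tsv():

```r
# Write a toy PHYLIP-style square distance file: the first line is the
# number of samples, then one row per sample (name followed by distances)
phylip <- c("3",
            "sampleA\t0.0\t0.3\t0.5",
            "sampleB\t0.3\t0.0\t0.4",
            "sampleC\t0.5\t0.4\t0.0")
tmp <- tempfile(fileext = ".dist")
writeLines(phylip, tmp)

# skip = 1 drops the sample-count line; there is no header row
distance <- read.table(tmp, skip = 1, header = FALSE, sep = "\t",
                       stringsAsFactors = FALSE)

# The first column holds the sample IDs
distance$V1  # "sampleA" "sampleB" "sampleC"
```

Note that read.table() auto-names the columns V1, V2, ..., while read_tsv() with col_names = FALSE uses X1, X2, ...; the structure is otherwise the same.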
So what I could do would be c(): my first column name is "group", and then I need the names of all the other samples. Well, how do I get those? Let me step back a step. If I look at distance again, the column names I want for columns X2 through X339, or whatever it is, are the values in column X1. I can get at X1 in several different ways. One way we've perhaps seen in the past would be select: I can do select(X1). That actually creates a data frame with a single column called X1 — not exactly what I want. I want X1 as a vector, right? To get a vector, I could instead do pull(X1); pull returns X1 as a vector. The same thing can be done without the pipe, so I could do pull(distance, X1), and that gets me the same result.

Okay, so that syntax is easy, convenient, and intuitive, but it's a little verbose. How else can we get the same information? Well, I can do distance["X1"] with single square brackets. This square bracket notation is basically the same thing as select: it pulls a column out of the data frame as a one-column data frame. Again, that's not what I want — I want a vector. For the vector, I use the double bracket notation, distance[["X1"]], and that gets it for me. I can also do distance$X1, and that dollar sign is the same thing as the double brackets. Which one you use really depends on the context and what you're comfortable with.

So what we can do here is assign colnames(distance) <- c("group", distance$X1). If I run just the right-hand side of that expression, I see I get all the names of the samples, and the first element is "group". Now if I run the assignment and then look at distance, I see that the column names match the sample names and the first column is my group column. So we're in good shape.
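The select/pull versus bracket/dollar-sign parallel can be seen with a toy data frame in pure base R (the data frame here is made up for illustration):

```r
df <- data.frame(X1 = c("a", "b", "c"),
                 X2 = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Single brackets keep the data.frame structure (like dplyr::select)
sub <- df["X1"]
class(sub)         # "data.frame"

# Double brackets and $ both return the bare vector (like dplyr::pull)
v1 <- df[["X1"]]
v2 <- df$X1
identical(v1, v2)  # TRUE
```

One practical difference: [[ takes a name stored in a variable (col <- "X1"; df[[col]]), while $ only works with a literal name, which is why the bracket form shows up again when we subset programmatically later on.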
So I'm going to leave that, and we now have column names for all of our samples. The next thing I want to do — which I've already shown you, although I haven't talked about it explicitly — is an inner join. So I'll do inner_join() on distance and metadata, with by = set to the column I'm joining on. And you know, let me rename that first column "sample_id" instead of "group" — I don't know why I didn't do that initially — because if I use sample_id, then the same column name exists in both metadata and distance, and that makes it easier to join the two data frames. So I'll use by = "sample_id". I have to rerun everything, and then it joins the data frames, and at the very end sit all of my metadata variables that I might be interested in using. That's good. Why don't I go ahead and switch the order — inner_join(metadata, distance, by = "sample_id") — so now I see my metadata first, along with my sample IDs. We're in good shape. I'll call this meta_distance.

To run adonis from vegan, I need my distance matrix, and I also need a column for the explanatory variables, and those need to be in separate objects. Something I should point out is that adonis also lets you give it your feature table — what we call in mothur a shared file. I find that running that through adonis is very slow. mothur has some features, such as being able to rarefy the data so all your samples have the same number of reads, that are now built into vegan too, but it's quite slow. So I did the distance matrix calculation in mothur, and we're bringing that into adonis. Again, though, I need a distance matrix, and I also need a column, or a data frame, that has the explanatory variables — like my patients' health status. Okay?
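Since this episode is about dipping back into base R: the inner_join() step has a base R analog in merge(), which by default keeps only the rows whose key appears in both data frames. A minimal sketch with made-up sample IDs and distances:

```r
metadata <- data.frame(sample_id = c("s1", "s2", "s3"),
                       disease_stat = c("Case", "DiarrhealControl",
                                        "NonDiarrhealControl"),
                       stringsAsFactors = FALSE)
distance <- data.frame(sample_id = c("s1", "s2", "s3"),
                       s1 = c(0.0, 0.3, 0.5),
                       s2 = c(0.3, 0.0, 0.4),
                       s3 = c(0.5, 0.4, 0.0),
                       stringsAsFactors = FALSE)

# merge() is the base R analog of dplyr::inner_join(): rows are matched
# on the shared key column, and unmatched rows are dropped
meta_distance <- merge(metadata, distance, by = "sample_id")
```

Unlike inner_join(), merge() reorders rows by the key by default, which is one reason the tidyverse verbs are often the safer habit for this kind of work.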
So to get the distance matrix, I'll call this all_dist, and I'll build it from meta_distance. I want to select the columns that correspond to the rows — the samples — that I have, so I'll do a select() and say all_of() with the values from the sample column. I'm inclined to just type the bare column name, but that isn't going to work — object not found — because all_of() needs a vector of names. I could do meta_distance$group — well, I've got to spell it right, and it should be sample_id anyway, right? So that works, and it gives me all_dist, a data frame of 338 by 338. I can see it's a symmetric data frame, which is nice.

But to motivate things we're going to do in a moment, I'm going to switch back to that square bracket notation. Perhaps you've seen this before: you can use a period to indicate the data flowing through the pipeline. So I take the data flowing through the pipeline and use double square brackets around the name of the column I want, .[["sample_id"]], and this again gives me the same thing I had before. As we go on and think about subsetting for the pairwise comparisons, this structure will be much easier to work with. I can then pipe to as.dist(), and that turns all_dist into a distance object, which is what the input to adonis needs to be.

Now we want to run adonis, so we'll do adonis(all_dist ~ ...) — this is formula notation: all_dist explained by some other variable. I can never keep these variable names straight, so looking at the different columns, the one I want is disease_stat, right, this one here. So I write all_dist explained by disease_stat, comma, and then say: take disease_stat from whatever I give the data argument, so data = meta_distance, and I can then run it. This is a permutation-based
test; by default it does 999 permutations, and it gives me my ANOVA table. I see that I get a p-value of one in a thousand — I only did a thousand iterations. So I could update this with, say, permutations = 10000, and that will take ten times longer to run but will give me greater precision on my p-value. That took a few moments, but we can see that we get an even smaller p-value, because again, we're testing with a higher number of permutations. Ultimately, though, the relative size of your p-value is not significant or meaningful — I shouldn't use "significant" so freely. I'm interested in whether or not it's less than 0.05; it's either less than 0.05 or it isn't, and that's the model we're using.

So what I'll do as we're testing here is create a variable up top that I'll call permutations, set to 1000 for now, and then use that variable as the argument. That way, when I run this for the actual analysis, I can jack it up to 10,000 or 100,000 for greater precision, but while I'm testing I don't really want to mess around with long run times.

Now, you might think about how we can extract that p-value. Well, I can save the result as all_test — and let's put these arguments on separate lines so they don't scroll off the side of the screen. So again, all_test outputs this ANOVA table, but I want, say, this one value here, right? How do I get that?
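As a minimal sketch of the as.dist() step, with made-up Bray-Curtis-style values (the adonis() call itself needs the vegan package, so it's shown as a comment):

```r
# A small symmetric matrix of distances (values made up for illustration)
m <- matrix(c(0.0, 0.3, 0.5,
              0.3, 0.0, 0.4,
              0.5, 0.4, 0.0),
            nrow = 3,
            dimnames = list(c("s1", "s2", "s3"), c("s1", "s2", "s3")))

# as.dist() (from base R's stats package) converts the square matrix into
# the lower-triangle "dist" object that vegan::adonis() expects
all_dist <- as.dist(m)
class(all_dist)  # "dist"

# With vegan loaded and meta_distance defined, the call would look like:
# adonis(all_dist ~ disease_stat, data = meta_distance, permutations = 1000)
```

The dist object stores only the lower triangle, so the 338-by-338 square data frame collapses to 338*337/2 pairwise distances.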
Well, a useful tool for parsing things like this is called str. I can do str(all_test), and this shows you the structure — which is what str is short for — of the output. The ANOVA table is one piece of all this data, and the whole thing is formatted as a list. A list is kind of like a vector where each element can be a different type of data; really, a vector is a list where everything is the same type. Anyway, next to aov.tab I see that convenient dollar sign, right? So I could perhaps do all_test$aov.tab$`Pr(>F)`. Let's give that a shot and see if it works — and you'll see RStudio puts the name in those nice backticks for me. If I run that, I get out a vector of p-values, and looking back up here, those are these three values. I want the first one, so I can add [1], and that gets me my p-value.

Another way we could write that — we've already seen this with the double bracket versus dollar sign notation — would be all_test[["aov.tab"]][["Pr(>F)"]]. Run that, we get our vector again, and then [1] gets the first element. These two get you the same result. I think I'll use the double bracket notation, because it's a little easier to read — I think, I don't know. We'll assign that to all_p, and now we have all_p stored as a variable.

Before I move on, I want to show you a couple of other things we can do with adonis. This is a fairly simple experimental design as we're running it: we've got a y variable, our distance matrix, and an x variable, the explanatory variable, which is the three different disease statuses — healthy, diarrhea, and diarrhea plus C. diff.
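Backing up to that extraction step for a moment: the pattern can be seen with a mock list that mimics the shape of the adonis() output — a list holding an ANOVA table whose `Pr(>F)` column is a vector of p-values. The structure and values here are made up for illustration, not the real adonis() result:

```r
# A mock result: a list with an aov.tab element whose p-value column has
# the non-syntactic name "Pr(>F)" (check.names = FALSE preserves it)
all_test <- list(aov.tab = data.frame(Df = c(2, 335, 337),
                                      `Pr(>F)` = c(0.001, NA, NA),
                                      check.names = FALSE))

# $ and [[ are interchangeable; backticks (or quotes) are needed because
# "Pr(>F)" is not a syntactic R name
p_dollar  <- all_test$aov.tab$`Pr(>F)`[1]
p_bracket <- all_test[["aov.tab"]][["Pr(>F)"]][1]
identical(p_dollar, p_bracket)  # TRUE
```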
Well, if I wanted to, say, add gender, I could do * gender, run that, and that will include gender as a covariate. We then see that gender itself is not significant — its p-value is something like 0.26 — but that the interaction term actually is 0.04. Again, this is a permutation-based test, so we should be setting the random number generator seed. Maybe up here I'll do set.seed() with my birth date, 19760620, and that way whenever we run this script we'll get the same output every time. So that's one way. We could put in gender; we could put in race — and race is not significant, and there's no significant interaction. You could do race and gender; you can do all sorts of things with the model as you define it here.

Great — our p-value comparing all three treatment groups is significant. All that means is that at least one of the three groups differs from the others. That raises the question of which pairs are different from each other. If this p-value had been greater than 0.05, we'd be done, and we'd say there's no difference between the three groups. But because our p-value is significant, we can take the next step and ask which pairs of groups are significantly different from each other.

Now, this is where it starts getting a little hairy — how we're going to subset meta_distance to get the different comparisons. Before I do that, I want to make sure I know what the treatment groups are called, so I'll take meta_distance and count() on disease_stat. I see I have Case, DiarrhealControl, and NonDiarrhealControl. So we want to subset for the three comparisons: Case versus DiarrhealControl, Case versus NonDiarrhealControl, and DiarrhealControl versus NonDiarrhealControl. I'm going to copy this code, because I'm going to build my subsetting off of it.
So I'll call this one case_diarrhea, for Case versus DiarrhealControl; we'll build it out and then replicate it for the other two. So: meta_distance, then filter(disease_stat == "Case" | disease_stat == "DiarrhealControl"). I could have said disease_stat != "NonDiarrhealControl", but I like being explicit about what I want rather than what I don't want. If I run these two lines, this gives me 183 rows — again, my Case and DiarrhealControl samples. But I need to make sure the columns I'm pulling correspond to the rows I have, and that's what this select with all_of and the sample IDs does — again, this is why I did that renaming up ahead. Looking at this, we've got 183 rows in our distance matrix.

So I'm going to call this case_diarrhea — and I realize I need both a metadata data frame and a distance matrix, so I'll assign case_diarrhea for the metadata and case_diarrhea_dist for the distance matrix. That way, when I call adonis — let me go ahead and load these — I can do adonis(case_diarrhea_dist ~ disease_stat, data = case_diarrhea, permutations = permutations). Let's give it a run, and here we see that the Case samples — the C. diff positives — and the DiarrhealControl samples are significantly different from each other, at a p-value of about 0.002. Excellent. I'll call this case_diarrhea_test — and again, let me put the arguments on separate lines so you can see them on the screen — and if I want case_diarrhea_p, I can extract it from case_diarrhea_test.
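The filtering step above has a direct base R analog using logical subsetting with single brackets. A toy sketch with made-up rows:

```r
meta <- data.frame(sample_id = c("s1", "s2", "s3", "s4"),
                   disease_stat = c("Case", "DiarrhealControl",
                                    "NonDiarrhealControl", "Case"),
                   stringsAsFactors = FALSE)

# Base R analog of filter(disease_stat == "Case" |
#                        disease_stat == "DiarrhealControl"):
# a logical vector selects the rows to keep
case_diarrhea <- meta[meta$disease_stat == "Case" |
                      meta$disease_stat == "DiarrhealControl", ]
nrow(case_diarrhea)  # 3
```

The %in% operator gives a tidier version of the same condition: meta[meta$disease_stat %in% c("Case", "DiarrhealControl"), ].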
Coming back up to here, I can copy that extraction down, adapt it, and run it so we see what the p-value is. So I've got to rerun this — when I ran it before, I didn't assign it to the test variable. Now if I look at case_diarrhea_p, I get the p-value we saw previously.

I don't want to save this as an individual p-value, though; I want to build a vector that holds the three p-values for the three pairwise tests. So I'll call it pairwise_p, and I need to initialize it as an empty numeric vector — again, a little bit of base R knowledge. Then I can assign into pairwise_p by putting the name "case_diarrhea" in quotes inside square brackets, dropping the separate _p variable. Now if I look at pairwise_p, it's a vector with a name for each element.

What I need to do now is replicate this for the other two comparisons, so I'll copy this down. If you're familiar with my other work talking about reproducibility, this is not what we'd call DRY — don't repeat yourself. I could create a function, but again, that's not what I want to emphasize in today's episode. So now I'll do Case versus NonDiarrhealControl, and I'll call it case_health, because the non-diarrheal controls are the healthy people who don't have diarrhea. I can then put case_health here, case_health there — this should be case_health_test, case_health_dist, case_health — that's good. And this again should be case_health_test. This is why it pays to write a function for all this: you don't have to worry about hunting down all the variable names. So if I run this and look at pairwise_p, I now have both p-values, and case_health is less than one in a thousand. Good.
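The initialize-then-fill pattern for the named vector looks like this in base R (the p-values are made up for illustration):

```r
# Initialize an empty numeric vector, then grow it with named elements,
# one per pairwise comparison
pairwise_p <- numeric()
pairwise_p["case_diarrhea"]   <- 0.002
pairwise_p["case_health"]     <- 0.001
pairwise_p["diarrhea_health"] <- 0.001

names(pairwise_p)  # "case_diarrhea" "case_health" "diarrhea_health"
```

Assigning to a name that doesn't exist yet extends the vector, which is what lets each pairwise test drop its p-value into the same collection.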
Now I want to do the third comparison, my diarrheal controls versus my non-diarrheal controls. So I'll do diarrhea_health and copy this through — again, this is DiarrhealControl versus NonDiarrhealControl — and wherever I see case, I'll put in diarrhea_health. I really hope all this talk of diarrhea doesn't scare you all off. Maybe I should have thought of that before I picked this data set. Anyway, if the diarrhea bugs you, let me know down below in the comments; maybe I'll find a different data set if it annoys people. But hey, this is science — this is real life, right?

Okay, so now if we look at pairwise_p, we've got our three p-values, and they're all less than 0.05 — actually all about one or two in a thousand. The question then is: because we're doing multiple tests, multiple comparisons, we're increasing the probability that we would falsely reject the null hypothesis, so we need to correct our p-values for these repeated comparisons. There's a function in base R called p.adjust, and we can give it pairwise_p. Running that, we see our corrected p-values. The default method for p.adjust is the Holm method, which I don't really know; the method I like to use is "BH", for Benjamini-Hochberg, and using that method you get that correction. Benjamini-Hochberg is a little less conservative than the default method and a lot less conservative than, say, a Bonferroni. You can look at the documentation for p.adjust to see how you might go about choosing.

Now I want to know: are these values all less than 0.05? I can append < 0.05, run that — they're all TRUE. Okay. What I'd like is for this script to get called by my schubert_nmds.R file, and if the p-values aren't all significant, I want the analysis to stop and alert me. So what I can do is put all() around that — what all() does is it asks: are all of the values
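The p.adjust() step can be seen with made-up p-values; p.adjust is in the stats package that ships with base R:

```r
# Three pairwise p-values (made up for illustration)
pairwise_p <- c(case_diarrhea = 0.01, case_health = 0.02,
                diarrhea_health = 0.03)

# Benjamini-Hochberg (false discovery rate) correction
p.adjust(pairwise_p, method = "BH")    # all three become 0.03

# The default Holm method is more conservative
p.adjust(pairwise_p, method = "holm")  # 0.03 0.04 0.04
```

With BH, each sorted p-value is scaled by n/rank (0.01*3/1, 0.02*3/2, 0.03*3/3), which is why all three land on 0.03 here; Holm multiplies by the number of remaining comparisons instead, giving the larger values.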
given to it TRUE, right? In this case, yes, they're all TRUE. What I can then do is wrap that in stopifnot(). If the value inside stopifnot() is FALSE, it stops; if it's TRUE, it keeps going, and you see nothing, right? Let me illustrate. If I add an element to pairwise_p — let me create something called garbage and set it equal to 0.5 — now if I look at pairwise_p, I've got garbage, and I've got at least one value that's not significant. If I run p.adjust, I still have one value that's not significant; if I run all(), they are not all TRUE, so that's FALSE; and so stopifnot() throws an error — "Error: ... is not TRUE" — which would stop the analysis. We don't actually have that garbage variable, but this is a test we can use before proceeding to generate the figure, to make sure that what we've claimed in our title and other places in the analysis is actually true.

All right, so I'll go ahead and save this. I'll call it — not schubert_nmds — schubert_adonis.R. Now I can call it from my other file: I can do source("schubert_adonis.R"), and running that runs everything in the script, and it all went through just hunky-dory, like everything was fine. And again, if one of those p-values were too large, it would stop before going on to make the rest of the figure. Pretty slick, right?

If you recall, the code in the schubert_nmds.R script generates this figure, and maybe what I could do is add a caption at the bottom indicating that all p-values were less than 0.05. I don't know what this is going to do to our design aesthetic, but it's okay. So I can say caption = "All pairwise comparisons were significant using adonis with Benjamini-Hochberg correction for multiple comparisons". And I think I had a typo somewhere.
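Putting the guard together, here's a self-contained sketch of the all()/stopifnot() pattern with made-up p-values:

```r
# A guard that halts a sourced script when any adjusted p-value
# fails the 0.05 cutoff (values made up for illustration)
pairwise_p <- c(case_diarrhea = 0.002, case_health = 0.001,
                diarrhea_health = 0.001)
adjusted <- p.adjust(pairwise_p, method = "BH")

all(adjusted < 0.05)             # TRUE, so the next line passes silently
stopifnot(all(adjusted < 0.05))

# With a non-significant value added, the same guard would throw:
pairwise_p["garbage"] <- 0.5
adjusted <- p.adjust(pairwise_p, method = "BH")
all(adjusted < 0.05)             # FALSE
# stopifnot(all(adjusted < 0.05))  # would stop the script with an error
```

Because stopifnot() is silent on success, it costs nothing in the normal case but prevents the downstream figure code from ever running on a claim the data no longer support.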
Yeah — "significant." Okay, good. If we run all that, we see the caption kind of runs off the side of the screen. That's okay; we can of course put in a line break with \n. It's also right-justifying the caption, so maybe down in the theme I'll do plot.caption = element_text(hjust = 0) — we saw that last time, right? — and now it's left-justified, and maybe I'll put the line break in after "using". So again, it doesn't totally fit our aesthetic and how things look, but it gives you the idea of what's going on. And I believe we can do plot.caption.position = "plot", and that will justify the caption to the left edge of the whole plot rather than the panel.

So I hope you found this tutorial useful in talking about how sometimes we do need to dip our toes back into base R. There are certainly analogs between select and pull and the single bracket and double bracket notation. We do appreciate select and all the helper functions that come with it, but sometimes, when we're interfacing with things like adonis that live outside the tidyverse, it pays to know a little bit of base R and the functions that are there. There are other things you can do in vegan in terms of ordinations and the like. There's another useful test that lets you look at the dispersion — whether the variation in the data is the same across your three treatment groups — but I'll leave that for you to play with as an exercise. I believe you don't have to change much in those adonis calls to get it, and I believe the function is called betadisper, if you want to check that out.
Anyway, I hope you liked this. If you've got questions, or if there are other things you'd like to learn about or see, by all means please let me know. I'm grateful to the commenter who raised this point, and I'm happy to give you all the solutions. So anyway, keep practicing with this and playing around. Please tell your friends about these episodes, and be sure to hit the like button — it gives me all sorts of warm fuzzies and puts me in a far better mood than I normally am. You don't want me to be a hater; we want me to be a lover. Anyway, keep practicing, tell your friends about Code Club, and we'll see you next time for another episode.