Hey folks, if you've been following along in recent episodes, you know that we've been building out some R code to read in a lower triangular distance matrix. That distance matrix indicates the pairwise dissimilarity between fecal samples that I collected from different sets of mice. In the most recent episodes, we talked about using regular expressions to take the sample name and extract different bits of information. We were able to extract an identifier for each animal, the sex of the animal, and the number of days post-weaning for each of the samples that we collected.

Well, I was looking back through my distance matrix and realized that I had some extra samples in there that I don't really want to include in my downstream analysis. So what I'd like to do is see how we can filter out the rows and columns of that distance matrix that correspond to samples that I'm just not that interested in. I thought that'd be a great opportunity today to talk to you about a few concepts that I don't think we've covered very well in previous episodes. The main concept is how we can reshape a data frame to be longer or wider using pivot_longer() and pivot_wider(), respectively. In addition, we will see inner_join(), which we saw a number of episodes back. We'll use it to bring in the data about each sample for the sample names in the rows and in the columns, which we can then use to filter out the samples we don't want. Then we'll pivot wider again to make a square table and convert it into a distance matrix that we can use in downstream analyses like ordination or any other type of ecological analysis we might want to do with a set of distances.

So today we're going to be working within this analysis.R script.
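For anyone reading along rather than watching, here is a minimal sketch of the reshaping idea just mentioned: pivot a small wide table of distances longer, then pivot it wider again. It assumes the tidyr, dplyr, and tibble packages are installed. The table toy_tbl and its values are made up for illustration and are not the project's data; the column names samples, b, and distances follow the ones used later in the episode.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical 3 x 3 table of distances with a samples column,
# standing in for the much larger table used in the episode
toy_tbl <- tibble(
  samples = c("F3D0", "F3D1", "F3D2"),
  F3D0 = c(0.0, 0.2, 0.5),
  F3D1 = c(0.2, 0.0, 0.4),
  F3D2 = c(0.5, 0.4, 0.0)
)

# Longer: every column except samples is stacked, so each row holds one
# pair of sample names (samples, b) and the distance between them
toy_long <- toy_tbl %>%
  pivot_longer(cols = -samples, names_to = "b", values_to = "distances")

# Wider: the b column becomes column names again and the distances
# fill the cells, which brings back the original layout
toy_wide <- toy_long %>%
  pivot_wider(names_from = b, values_from = distances)

print(toy_long)
print(toy_wide)
```

The long form is the one that is convenient to join and filter, which is why the round trip matters for the rest of the episode.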
If you'd like to get your own copy of this script, as well as all the other scripts in the project along with the data, go down below in the description. There's a link to a blog post that will get you the links and everything you need. You can use that with the link up here in the upper right corner, which will show you how you can get a copy of the project so you can follow along. I always encourage people to follow along because that is the best way to learn what we're doing. I'm just not entertaining enough to watch without you typing along with me.

So anyway, we have our R script here. I'll go ahead and run this and load all the code. This will generate dist_tbl, which is a tibble form, or data frame form, of my distance matrix: 348 rows by 349 columns. It's got that extra column because it has the samples column. And if we look at sample_lookup, we get a data frame that has the sample name along with the sex, the animal ID, and the number of days post-weaning at which the sample was collected.

So let's get started with this tibble. Again, this is that 348 by 349 tibble, a special data frame in R that lives within the tidyverse. What I'd like to do is convert this into a three-column data frame: one column that has the sample names from the rows, another column that has the column names, and a third column that has the distance values. I'll do that with pivot_longer(), and the first argument I will use is cols. cols is a listing of the columns that I want to pivot longer. For my purposes, the new column will hold the column names from this data frame, so I want all of the columns except for my samples column. Instead of typing out 348 column names, what I can do is write -samples. That says take all the columns except for samples and pivot them longer, so the names of these
columns will go to a new column using the names_to argument. Here I will say b, but really any name will work. Then values_to will be distances. Now what we see is that we do have that three-column data frame: samples, b, and distances.

To complete the loop, let's go ahead and do pivot_wider(). pivot_wider() will take our three columns and put them back into that 348 by 349 tibble. Here we can say names_from, and I'll say from b, because the column names are in this b column, and then our values will come from the distances column. Now we see we went right back to a 348 by 349 data frame, and we're in good shape.

One thing we might think about doing would be to filter on our column names after this first step. If we remind ourselves what we had for samples, b, and distances in the three-column version, I only want one of each pair of distances. I have both F3D0 against F3D1 as well as F3D1 against F3D0, and I only need one of those two. I also don't necessarily need F3D0 against F3D0. So what I might do is filter on samples < b. Now I see I no longer have F3D0 against F3D0, and I do have F3D0 against F3D1, but if I went further down I'd see I don't have F3D1 against F3D0.

Now you might be saying, how can samples be less than b? samples is a character and b is a character. Well, it makes the comparison on an alphanumeric basis, so F3D0 is less than F3E0, and F3D0 is certainly less than F3D1. Let's see what happens when we pipe this into pivot_wider(). Now we get a data frame with a whole bunch of NA values. Why do we get those NA values? That's because we got rid of all these values: those are all the values where samples was greater than b. But we also kept the values where samples was less than b. And sure enough, here we have that F3D1 row. We don't have F3D1 against F3D1, because those were equal to each other, not one less
than the other, so we get that NA value. We could also get the lower triangle by flipping the sign of that comparison to samples greater than b, and now we see that we have a lower triangular matrix. Of course, we don't have the diagonal showing F3D0 against F3D0, but this is just for illustration. It's mainly to illustrate how pivot_longer() and pivot_wider() work, and what pivot_wider() does when it's missing some of the information that was in the original dist_tbl: it plugs in these NA values to make the data frame rectangular.

I'm going to go ahead and interrupt this pipeline, because I'm going to add some more lines of code that will allow us to remove distances from samples that I'm just not that interested in. If we go ahead and pipe dist_tbl into pivot_longer(), we again get a three-column data frame. What I'd like to do now is pipe this into an inner_join(). I want to inner join the data coming through the pipeline with sample_lookup. To remind ourselves, sample_lookup has four columns, and the first column is samples. So what I'd like to do is take sample_lookup and join it with the three-column version using the samples column, so we can say by = "samples". Now what we see is that we have a six-column data frame, where we have the sex, animal, and day for the animal indicated in the samples column. For the first 10 distances, the samples column is all F3D0, and we can see that the metadata we have here is for F3D0. Cool.

Well, the next thing we want to do is bring in all this data for the sample in the b column. How do we do that? Well, we do another inner join. We'll do inner_join() again with the stuff coming through the pipeline, and we're going to join it right back to the same sample_lookup data frame. Here we'll do something special with the by argument: we'll use
that c() function to indicate that from the data frame on the left we want to use the b column, and we're going to set that equal to the samples column from sample_lookup. You might remember that we can use the by argument this way to join two data frames where the columns we want to join on have different names. I'm using b with samples because I want to take the b column and bring in the metadata from sample_lookup, joining on its samples column. So it's kind of cool that we're joining the same data frame twice to our data, but doing it on different columns from the dist_tbl data frame.

We now see something cool. We've got samples, we've got b, we've got our distances, but the sex, animal, and day columns that we originally had are now sex.x, animal.x, and day.x, and the new data coming in has .y added on as the suffix. When inner_join() sees two columns with the same name coming together in a new data frame, it adds these suffixes. You can set the suffixes to whatever you want; the defaults are .x and .y.

So this is the data frame that I can now do a filter on. Let's take sample_lookup and count the different days. It tells me there are 35 different days in this data frame, and I actually want to see all 35 to know what they are. So I'm going to repeat that, but pipe it to print() with n = Inf, which will print all the rows from that data frame. Here now we can see all of the days as well as the number of samples we have from each. What I want to focus on is that we have samples from days 0 through 9, so the first 10 days post-weaning, as well as days 141 to 150 post-weaning. I want samples from these two time periods.

We could write a filter statement. We could pop this up so we can remind ourselves what the column names are: it's day.x and day.y. So I could say day.x <= 9
and day.y <= 9, or day.y >= 141 and day.x >= 141, and then I could add in more logic to make sure that I have nothing later than day 150. But debugging all of that is just going to become a headache, and this is already getting long. What I'm going to do instead is define a vector and then use that within my filter statement. This will allow us to flex some of those base R muscles for defining a vector.

If we think about the days we want, I want days 0 through 9, and 0:9 will create that vector. If I do 141:150, that will give me a vector of 10 values from 141 to 150. I can concatenate those two together with the c() function, so now I have a vector of values from 0 through 9 and 141 to 150. I'll call this days_wanted, which has those 20 values.

Now I can replace all this logic with day.x and a special operator, %in%, followed by days_wanted. This will return all the rows where day.x is in days_wanted, and I can then also say and day.y %in% days_wanted. Running that, I can see that I have day 0, day 1, day 141, and so forth. Let's go ahead and pipe this into count() on day.y, and we see that, sure enough, we only have a bunch of distances for days 0 through 9 and 141 to 150. We could do the same thing with day.x, where again we see the same days being represented. So now we know that our data frame only has rows corresponding to those 20 days in days_wanted.

Now I can go ahead and do a select(), because I only need three columns: samples, b, and distances. I don't need all those .x and .y columns, because those were really only brought in to help me filter my data frame. So
I'll do samples, b, distances, and again we go down to our three-column data frame. We can then pipe this into pivot_wider(), and now we have a 227 by 228 data frame. We see that we do have that diagonal of zeros and that our data frame is symmetric: the value for F3D1 against F3D0 is the same as the value for F3D0 against F3D1.

Now I need to convert this tibble into a distance matrix. A distance matrix is a special type of matrix, and in both of them all of the values in the object are of the same type. If you look at my tibble, the samples column is of type character. So I need to get rid of that samples column, and then I can convert the rest to a distance matrix. It will use the column names as the row names because it knows that it's a distance matrix. I'll go ahead and pipe this into select() and do -samples to get rid of that samples column, so now everything in here is of the same type, double. I can then pipe that into as.dist(), and now I see that this has been printed out as a distance matrix with the correct row names and the correct column names. I will call this dist_matrix. Just to prove it to ourselves, we could also do str(dist_matrix), and we see that it's of class dist, that it's numeric, and that we've got the labels. Everything looks good.

Now that we have our distance matrix, we're all set to use it as input to a variety of other functions that will allow us to make visualizations and run analyses that are frequently used in microbiome and ecological studies. In the next video, we will take this distance matrix and make a principal coordinates analysis ordination, and in the subsequent episode we'll use the same thing to make an NMDS (non-metric multidimensional scaling) ordination. So that you don't miss any of those episodes, please make sure that you've subscribed to the channel and clicked that bell notification icon, and please, please, please tell your
friends about what we're doing here. It's been great to see all the positive feedback from these recent episodes. All right, take care, and we'll see you next time for another episode of Code Club.
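For reference, the steps walked through in this episode can be sketched end to end roughly as follows. This is a minimal sketch on made-up data, not the project's script: the four sample names, their distances, and the metadata values are hypothetical, and the object and column names (dist_tbl, sample_lookup, days_wanted, dist_matrix, samples, b, distances, sex, animal, day) are written as they are spoken in the episode, so the real script may differ in detail. It assumes the dplyr, tidyr, and tibble packages are installed.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-ins for the episode's dist_tbl and sample_lookup
sample_names <- c("F3D0", "F3D9", "F3D25", "M1D141")
toy_dist <- matrix(
  c(0.0, 0.3, 0.6, 0.7,
    0.3, 0.0, 0.5, 0.8,
    0.6, 0.5, 0.0, 0.4,
    0.7, 0.8, 0.4, 0.0),
  nrow = 4, dimnames = list(sample_names, sample_names)
)
dist_tbl <- as_tibble(toy_dist, rownames = "samples")

sample_lookup <- tibble(
  samples = sample_names,
  sex = c("F", "F", "F", "M"),
  animal = c("3", "3", "3", "1"),
  day = c(0, 9, 25, 141)
)

# Days to keep: 0 through 9 and 141 through 150
days_wanted <- c(0:9, 141:150)

dist_matrix <- dist_tbl %>%
  # one row per pair of samples
  pivot_longer(cols = -samples, names_to = "b", values_to = "distances") %>%
  # metadata for the sample named in the samples column
  inner_join(sample_lookup, by = "samples") %>%
  # metadata for the sample named in the b column; the shared column
  # names pick up the default .x and .y suffixes
  inner_join(sample_lookup, by = c("b" = "samples")) %>%
  # keep a distance only if both samples come from a wanted day
  filter(day.x %in% days_wanted & day.y %in% days_wanted) %>%
  # drop the metadata, which was only needed for filtering
  select(samples, b, distances) %>%
  # back to a square table, then to a dist object
  pivot_wider(names_from = b, values_from = distances) %>%
  select(-samples) %>%
  as.dist()

str(dist_matrix)
```

In this toy example the day 25 sample is dropped from both the rows and the columns, which is the same effect the episode gets when the table shrinks from 348 to 227 samples.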