Hey folks, if you've been following along, you know that we've been looking at different ways of analyzing a distance matrix describing the similarity of microbial communities collected from mice in a study my lab did a number of years ago. The default strategy most people take with these types of data is to build an ordination. I find ordinations, for the most part, to be totally underwhelming: if you have a large number of samples (and by a large number I mean more than about 10), you just get a blob of points right on top of each other, and it doesn't really tell you that much.

The data set we're working with is unique in that it's a time course. We collected fecal samples from these mice over the course of about six months, with two periods of really intensive sequencing: days 0 through 9 post weaning and days 141 to 150 post weaning. One of our observations was that the community early after weaning was very unstable; later, as the mice got older, the community stabilized. We can see that in the ordination we generated before, where we start with a big fat cloud of points that then shrinks down, so that's one way to get at this idea of stabilization. Today I'd like to take another approach by going back to that distance matrix and thinking about how we can parse it apart. What I want to do is generate the figure I'm showing you here now.
On the x-axis we have the number of days between the samples we're comparing. Days 1 and 2 are one day apart, days 2 and 3 are one day apart, days 1 and 3 are two days apart, and so forth. Within the early range we can get from day 0 to day 9, so that's nine days apart, and for the late range we can get from day 141 to 150, which is also nine days apart. We can take all of these differences in time for each mouse and calculate the median or average dissimilarity for any interval between the samples we collected. We can then compare those values for all mice, early and late, and see whether the community is more stable early or late. You might expect that with a pretty stable community, the dissimilarity between two time points a day or two apart should really be no different from that between two samples eight or nine days apart. Is that the case? Well, we'll see; that's what the plot we generate today will tell us.

I have a script started for us already. You can get the data and code you need from GitHub by going to the blog post linked in the description below, and I also have a video here that will help you out. This figure is going to be generated in a file called timelag_plot.R. I've got about 22 lines of code here that will output a distance matrix comparing all of the time points for these two periods in the mice's lives, for about a dozen different mice. I'll go ahead and run this and we'll get going. This generates a distance matrix that we have called mice_dist. I'm going to take that matrix, mice_dist, and pipe it into as.matrix() to convert it from a distance matrix to a regular matrix, then into a tibble with as_tibble(), taking the row names and assigning them to a column that we'll call sample.
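That conversion step can be sketched like this; I'm assuming mice_dist is the dist object produced by the setup code in timelag_plot.R, so treat this as a sketch rather than the exact script.

```r
# Sketch of the conversion step, assuming mice_dist is the dist object
# created by the ~22 lines of setup code in timelag_plot.R
library(tidyverse)

mice_dist %>%
  as.matrix() %>%                  # dist object -> square numeric matrix
  as_tibble(rownames = "sample")   # matrix -> tibble; row names become a "sample" column
```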
And sure enough, we now have this data frame with a sample column and the square distance matrix. We now need to get this into a tidy format, where we'll have a column for the sample and a column for the names that were across the columns. To do that we can use pivot_longer(), excluding the sample column. That gets us down to three columns: sample, name, and value. Sample and name are really the same thing, but they reflect what was on the rows versus what was on the columns. I'd like to filter this, because we currently have a square matrix, and you can see we also have the self-comparisons of each sample against itself. If I do filter(sample < name), that gives me one of the two triangles, and you'll see I no longer have those self-comparisons either.

From the sample and name columns I want to extract the animal identifier as well as the day post weaning, and we'll do that with mutate(). For animal_a, I'll do str_replace() on sample, taking everything from the "d" onward, the pattern "d.*", and replacing it with nothing; now we see we've got "f3" for animal_a. I can copy that down to do the same type of thing for animal_b, which comes from the name column. I also want day_a, which will be str_replace() on sample, this time taking everything up through the "d", the pattern ".*d", and replacing it with nothing. What we notice, though, is that day_a comes out as type character, so I want to wrap that in as.numeric(), and we'll copy this to make day_b from name. So now we've got our columns for animal_a, animal_b, day_a, and day_b.

Next I want to filter this down, because I'm not really interested in comparing across 141 days like we have here; I want to look within each of the two periods. I'm also only interested in comparing within a mouse: I don't want to compare female 3 to male 3; I want to compare female 3 to female 3. So we can filter on animal_a == animal_b, and we now have about 2,000 rows rather than the 25,000 we had before, all of them within-animal comparisons.

Now we want to look at the difference in days, so I'll come back up to the mutate() and add diff = day_a - day_b. But we might have day_a bigger than day_b, or day_b bigger than day_a, so I'll wrap that in abs() to take the absolute value. Then I'll add diff < 10 to the filter, and now we have 971 rows. Again, these are comparisons of the same animal to itself, and we can see the diff column: day 1 to day 2 is a difference of one, day 1 to day 3 is a difference of two, and likewise day 0 to day 2 is a difference of two.

Now we can start thinking about aggregating by that diff column. Let's think about how we want to group the data. We want to group by diff so that we can plot that difference on the x-axis, and we want to draw a line for each animal, so I'll also group by the animal; I'll say animal_a, may as well pick one. We also want to separate early and late, so I need an early-or-late variable. I'll add an early variable to the mutate(), a logical set to day_a < 10. At that point in the pipeline we may still have rows where day_a is less than 10 but day_b is larger than 10, but once we've filtered to the same animal and diff < 10, a comparison can't have one day early and one day late, so defining early from day_a alone works fine. We can then group_by(diff, animal_a, early), and do a summarize() with median = median(value) to roll up the value column.
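Putting the tidying, extraction, and aggregation steps together, a sketch might look like the following. Here dist_tbl is a name I've made up for the tibble from the previous step, and I'm assuming sample names of the form "f3d141" (animal id, then "d", then day post weaning).

```r
# Sketch of the tidy/extract/aggregate pipeline described above.
# dist_tbl is a hypothetical name for the tibble from the previous step;
# sample names are assumed to look like "f3d141".
library(tidyverse)

dist_tbl %>%
  pivot_longer(-sample) %>%                  # long format: sample, name, value
  filter(sample < name) %>%                  # keep one triangle; drop self-comparisons
  mutate(animal_a = str_replace(sample, "d.*", ""),           # drop "d" and everything after
         animal_b = str_replace(name, "d.*", ""),
         day_a = as.numeric(str_replace(sample, ".*d", "")),  # drop everything through "d"
         day_b = as.numeric(str_replace(name, ".*d", "")),
         diff = abs(day_a - day_b),          # interval in days, order-independent
         early = day_a < 10) %>%             # TRUE for the day 0-9 period
  filter(animal_a == animal_b,               # only compare a mouse to itself
         diff < 10) %>%                      # only compare within a period
  group_by(diff, animal_a, early) %>%
  summarize(median = median(value)) %>%      # median dissimilarity per interval
  ungroup()
```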
So now what I have is diff, animal_a, early, and the median, and I think we are ready to plot this. I'm going to go ahead and ungroup(), and then we can feed this into ggplot(). We have a variety of aesthetics to include: on the x-axis we want diff, on the y-axis we want the median, and we want to color by early, so the lines for the early period get one color and the lines for the late period another. We'll also set the group aesthetic to animal_a, since I want one line for each animal, and then we'll add geom_line(). And look, we get a sawtooth shape. What's happening is that because we're grouping only by animal_a, at each value of diff there's an early and a late point for each animal, so the line connects the early and late values for that animal, then moves on to the next diff, and so forth. What I'd like to do is group by the combination of animal_a and early, so to speak, and to do that we can use paste0(animal_a, early), which creates a dummy variable that groups the data by both the animal and whether it's early or late.

Now we get one line per animal per time period: down here are all the late lines for each of the mice, and up here are the early lines, each line representing an individual mouse. Sure enough, the late lines sit at the bottom, and the difference in community structure between any two adjacent days (day 1 to day 2, day 2 to day 3, that kind of thing) is about the same as between day 0 and day 9. That's pretty fascinating. With the early time points, by contrast, as the interval gets larger, the distance between those community structures also gets larger. Fascinating!

What I'd like to do now is add a smooth line through these, and I can do that with geom_smooth(). Here I'm going to add an aes() and set group = early. If I left the grouping as animal_a plus early, it would draw a smooth line through each of my individual lines, which is not what I want; I want one smooth line through the early cloud of lines and one through the late cloud. Sure enough, we now see those lines fitted through our plot, with the standard-error ribbon, and the fit lines are a little thicker than the others. To make them pop a bit more, I'll add se = FALSE and size = 2, and then make them bigger still by going up to 4; that's nice and thick. We can also take down the thickness of the individual lines with something like size = 0.25 in geom_line(); I think that makes it clearer which is the fit line. Of course, you could play around with the widths of these different lines to suit your own personal style.

It's been a few episodes since we took a figure and tried to make it look presentable for, say, a publication, and that's what I want to do in the rest of today's episode. As I always do, I'll start with theme_classic() to clean things up and give a nice clean background. We can also add labs(), with x as "Days between time points" and y as "Median Bray-Curtis distance". Let's also clean up the breaks on our x-axis; we don't want to be talking about 2.5, 5, and 7.5 days, so we'll do scale_x_continuous(breaks = 1:9), which gives nice pretty breaks along the x-axis that I'm happy with. Then let's get some better colors than these red and green defaults, and for that we'll use scale_color_manual(). I'll set name = NULL to get rid of the legend title, which currently says "early". Now we'll set breaks to TRUE
and FALSE, then values of "blue" and "red", and for labels "Early" and "Late". This is looking pretty attractive; you could play around with the colors and use whatever color scheme you want, but I'm pretty happy with these. One thing I don't really like about the legend is that the line in the legend is the same thickness as the smooth fit through my data. I'd rather it be a little thinner, perhaps not as thin as the individual mouse lines, but somewhere in between. We can fix that using the guides() function. The guide we want to change (you can think of a guide as being like a scale or a legend) is the color one, so I'll say color = guide_legend(), and inside that I'll set override.aes to a list object, list(size = 1), which gives us a thinner line in the legend. Note that you could add other attributes in here as well: if I did linetype = "dotted", I'd get a dotted, thinner line in my legend. That's not what I want, so I'll remove it, but it's a nice feature that lets you customize the shape, appearance, and colors of your legend separately from what you have in your actual figure.

I'm reasonably happy with how this figure looks; I think I'd be willing to submit it with a publication. Now I want to export the figure from RStudio so that I can include it with my manuscript. I could click on Export and then "Save as Image", but I don't prefer that approach; I prefer to output it using ggsave(). So I'll do ggsave(), name the file time_interval.png, and set width = 5 and height = 3. You're going to want to check with the journal, or the size of your slides, or wherever this figure is going, for the dimensions that suit the publisher or your slide deck. That gives me this figure, which I think looks pretty good; I'm pretty happy with it. We'd probably want to put a caption down below to help explain what's going on, but otherwise I'm happy with the way this looks.

Again, I think this is so much better than an ordination, which is a scatter plot with a whole bunch of points right on top of each other. With all this data we would have something like 226 points on top of each other, and it's really hard to see anything at a fine-scale level. It's certainly hard to see this trend of increasing distance as you increase the number of days between time points for the early samples, whereas the late time points are much more consistent day to day. So I think this is a pretty cool way of interpreting the data that is much better than doing an ordination. If you have time-course data like I do here, I would strongly encourage you to think about these types of plots.

In the next episode I'll show you another way to look at these time-series data without consulting an ordination. So that you don't miss that episode, please subscribe to the channel so that YouTube is sure to notify you when I release it. Keep practicing, and we'll see you next time for another episode of Code Club!
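For reference, here's a sketch of the complete pipeline we built in this episode, under the same assumptions as above: mice_dist is the dist object from the setup code, and sample names look like "f3d141" (animal id, "d", day post weaning). Treat it as a sketch of what we discussed rather than the exact timelag_plot.R script.

```r
# Sketch of the full time-lag figure from this episode.
# Assumes mice_dist exists and sample names look like "f3d141".
library(tidyverse)

mice_dist %>%
  as.matrix() %>%
  as_tibble(rownames = "sample") %>%
  pivot_longer(-sample) %>%
  filter(sample < name) %>%
  mutate(animal_a = str_replace(sample, "d.*", ""),
         animal_b = str_replace(name, "d.*", ""),
         day_a = as.numeric(str_replace(sample, ".*d", "")),
         day_b = as.numeric(str_replace(name, ".*d", "")),
         diff = abs(day_a - day_b),
         early = day_a < 10) %>%
  filter(animal_a == animal_b, diff < 10) %>%
  group_by(diff, animal_a, early) %>%
  summarize(median = median(value)) %>%
  ungroup() %>%
  ggplot(aes(x = diff, y = median, color = early,
             group = paste0(animal_a, early))) +
  geom_line(size = 0.25) +                              # thin individual mouse lines
  geom_smooth(aes(group = early), se = FALSE, size = 4) + # thick fit per period
  theme_classic() +
  labs(x = "Days between time points",
       y = "Median Bray-Curtis distance") +
  scale_x_continuous(breaks = 1:9) +
  scale_color_manual(name = NULL,
                     breaks = c(TRUE, FALSE),
                     values = c("blue", "red"),
                     labels = c("Early", "Late")) +
  guides(color = guide_legend(override.aes = list(size = 1)))

ggsave("time_interval.png", width = 5, height = 3)
```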