Hey folks, if you've been following along in recent episodes of Code Club, you know that I have been looking for alternative ways of representing distance data, other than throwing it all into the garbage can that's an ordination. Sorry, I didn't mean to call it a garbage can, but it kind of is, isn't it? We throw all the data into one figure and we tell our audience, look, don't you see the pattern I see? And they're like, hmm, not really. So by looking at alternatives to ordination, we can hopefully help our audience better see what we see in the data and communicate what we want them to see.

One of the figures that really inspires me to think differently about ordinations comes from a paper published in Nature a number of years ago by Peter Turnbaugh, out of Jeff Gordon's lab, when he was a graduate student there, I believe. On the x-axis there are different combinations of individuals. They were looking at lean and obese twins as well as their mothers, and calculating different ecological distances between combinations of these individuals. They used a UniFrac distance on the y-axis, and then, like I said, these different combinations on the x-axis, rather than throwing all the data into an ordination and saying, good luck, hope you can find a difference there.

There are a couple of things I don't like about this figure, which I'll briefly mention here. First of all, the y-axis is really zoomed in, running only from 0.66 to 0.82. That's a range of just 0.16, so differences that appear pretty big are actually quite small. You might have a distance that's statistically different, say between twin-mother and twin-twin, but biologically, is a difference of about 0.02 really all that meaningful? I don't think so. The other thing is that they show the data as bar plots rather than as the individual points, and they plot the standard error.
What that does is really hide the amount of variation in the data. What I would prefer to do is show the individual points as a jitter plot, and perhaps map over that the median and the interquartile range.

So here in RStudio, I have a starter script called jitterdistplot.R. You can get a copy of this: there's a link down below in the description, and there's also a video up here with instructions for taking the information at that link to get the data, the code, everything you need to hit the ground running and follow along with me. Like I said, this is the script to get us going, and it's very similar to what we've had in the past few episodes. In those episodes, we looked at days 0 through 9 and days 141 through 150 of this mouse experiment that my lab did, because we were comparing the time series over those two periods in the animals' lives. What I'd like to do now is pick day 9 and day 141: calculate the distance between day 9 and day 141 within each mouse, as well as look at the variation at day 9 across all the mice, and the variation at day 141 across all the mice. So we'll get three different groups.

I'll go ahead and run all this code so we can get the tidyverse, vegan, and everything else loaded to generate that figure. As we've been seeing, we have micedist as a distance matrix; I believe it's a 24 by 24 distance matrix. So let's take micedist and pipe it to as.matrix(), then pipe that to as_tibble(rownames = "samples"). We get the data frame that we're perhaps familiar with at this point. Then we'll do a pivot_longer() on everything but the samples column, and then a filter() on samples < name. name is the default column that the column names went into, and the distances themselves went into the column called value.
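Here's a minimal sketch of that tidying pipeline. I'm standing in a tiny synthetic matrix for the episode's micedist object, and the sample names (mouse ID, the letter D, then the day) are my assumption about how the real samples are labeled.

```r
# Sketch of the dist-matrix tidying pipeline with fabricated data
library(tidyverse)

set.seed(19760620)
community <- matrix(runif(12), nrow = 4,
                    dimnames = list(c("M1D9", "M1D141", "M2D9", "M2D141"), NULL))
micedist <- dist(community)  # stand-in for the episode's Bray-Curtis distances

long_dist <- micedist %>%
  as.matrix() %>%
  as_tibble(rownames = "samples") %>%
  pivot_longer(-samples) %>%   # column names go to "name", distances to "value"
  filter(samples < name)       # keep each pair once; drops the diagonal too

long_dist
```

The samples < name trick works because string comparison keeps exactly one of each unordered pair of sample names, so every pairwise distance shows up once.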
So I need to get the animal ID as well as the day out of both the samples column and the name column. I've been doing this with a series of str_replace() calls inside mutate(), and it occurred to me that we could actually do it with two separate() calls. So I'll do separate() on samples, and for into I'll create a vector: animal_a, then day_a, and the separator is going to be that "D". We can now see that it split that column, but one thing we notice is that day_a is of type character rather than numeric. For our application I don't really care whether it's numeric or character, but you might. So let's add the argument convert = TRUE. That asks the question: can these columns be converted from character to something else? And sure enough, it says yes, day_a can actually be an integer, which is a type of number, right? Then we can repeat this on the name column: I'll do name, change those a's to b's, and leave convert in place. So now we have animal_a, day_a, animal_b, day_b, and value. So much easier than all those mutates, right?

Now I want to take this data frame and identify three different categories. The first will be different animals at the same time point, day 9. The second will be different animals at the same time point, day 141. And the third will be the same animal at day 9 and day 141. Anything else, we want to get rid of. So I'm going to pipe this into mutate(), and to do all this complicated logic I'm going to use the case_when() function. We'll do comparison = case_when(), and it's going to get pretty hairy, so I'm going to open up some white space here. What we'll do is: animal_a not equal to animal_b, and day_a equals 9, and day_a equals day_b. This will be coded as "early".
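The two separate() calls might look like this; the toy rows are fabricated, and the "mouse ID + D + day" name format is my assumption.

```r
# Sketch of splitting sample names with separate() and convert = TRUE
library(tidyverse)

pairs <- tibble(samples = c("M1D141", "M1D9"),
                name    = c("M1D9",   "M2D9"),
                value   = c(0.61, 0.42))

tidy_pairs <- pairs %>%
  separate(samples, into = c("animal_a", "day_a"), sep = "D", convert = TRUE) %>%
  separate(name,    into = c("animal_b", "day_b"), sep = "D", convert = TRUE)

tidy_pairs  # day_a and day_b come back as integers thanks to convert = TRUE
```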
We'll take the same code, and instead of day 9 we'll do day 141, and this will be "late", right? Then we'll do animal_a equals animal_b, so this will be the same animal, but we want day_a equals 9 and day_b equals 141, and that's going to be "same". Then what I'm going to have is TRUE: basically, if a row in my data frame doesn't satisfy any of the earlier logic, it falls through to this fourth condition, which is always true, and comparison gets set to an NA character that we can then filter out and get rid of.

And sure enough, we can see some NA values in here. Let's do count() on comparison. We've got 55 earlies, 55 lates, and 121 NAs. That's not right. So let's double-check what we've got going on here. animal_a equals animal_b, that's right; day_a equals 9 and day_b equals 141... If I had to guess, those are probably supposed to be flipped, right? So let's do 141 for day_a and 9 for day_b. Now, good: we have 55 earlies, 55 lates, 11 sames, and 110 NA values. That makes sense: if there are 11 animals that have both a day 9 and a day 141, then at day 9 there would be 11 times 10 divided by 2 comparisons. 11 times 10 is 110, divided by 2 is 55. So this makes sense.

We then want to get rid of those NA values, so we'll go ahead and do drop_na(). Good. Now we have 121 rows, and with 55 plus 55 plus 11 being 121, we're in good shape. Let's pipe this into ggplot(). For the x aesthetic we're going to put comparison, and the y will be value, and then we'll do a geom_jitter(). Now we see our jittered cloud of points, which looks good. They're a little bit fatter than I might like, so let's do width = 0.25, and that tightens them up a bit. I'd also like to add a summary statistic to this, so let's go ahead and do stat_summary() with fun.data = median_hilow.
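Here's a sketch of that case_when() logic with the day_a/day_b flip already applied. My reading of why the flip is needed: after filter(samples < name), "141" sorts before "9" as a string, so for a same-mouse pair day_a ends up being 141 and day_b ends up being 9. The toy rows below are made up for illustration.

```r
# Sketch of categorizing pairs with case_when(), then dropping leftovers
library(tidyverse)

pairs <- tribble(
  ~animal_a, ~day_a, ~animal_b, ~day_b, ~value,
  "M1",        9,    "M2",        9,    0.40,  # different mice, day 9
  "M1",      141,    "M2",      141,    0.45,  # different mice, day 141
  "M1",      141,    "M1",        9,    0.61,  # same mouse, both days
  "M1",      141,    "M2",        9,    0.55   # mixed pair -> NA, dropped
)

categorized <- pairs %>%
  mutate(comparison = case_when(
    animal_a != animal_b & day_a == 9   & day_a == day_b ~ "early",
    animal_a != animal_b & day_a == 141 & day_a == day_b ~ "late",
    animal_a == animal_b & day_a == 141 & day_b == 9     ~ "same",
    TRUE ~ NA_character_                 # anything else gets flagged
  )) %>%
  drop_na(comparison)                    # then thrown away here

count(categorized, comparison)
```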
I used the autocomplete here, and you'll notice it put median_hilow with its own pair of parentheses. If I run this, I get an error message: Error in quantile: argument "x" is missing, with no default. That's because fun.data wants the name of the function, not a call to the function, so I need to remove those parentheses. Let's run this and see what we get. It's a little bit hard to see, because the point for the median and that range are the same color as our points, right? You can kind of see it over here for "same", where we have a smidge larger point for the median, and we also get the 95% confidence interval.

What I'd like to do is make this point a bit larger, and make the point and the line red so it really sticks out. And then I want to use the interquartile range, that is, the range between the 25th and 75th percentiles. So let's do color = "red", and size = 1, say. Then we want to use fun.args, which takes a list where we can pass in the arguments we want: in this case, the confidence interval. I don't remember exactly how to do that, so let's go look at the documentation. I can type in ?median_hilow, and it shows that it's a helper function that wraps a variety of functions from the Hmisc package. What I want is smedian.hilow, which brings me to that documentation, and I can see the arguments for smedian.hilow: x, the vector of numbers; conf.int; and na.rm = TRUE. So I'm going to copy conf.int = 0.95 and put it there, and then replace the confidence interval: instead of 0.95, I'm going to use 0.50. And while I'm here, I'll go ahead and make my geom_jitter() points gray with color = "gray".
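The jitter-plus-summary layer might look like this. Note that median_hilow() wraps Hmisc::smedian.hilow(), so the Hmisc package needs to be installed for the summary layer to render; the data frame here is fabricated just to have something to plot.

```r
# Sketch of geom_jitter() plus a median/IQR summary via median_hilow
library(tidyverse)

set.seed(42)
dat <- tibble(comparison = rep(c("early", "late", "same"), each = 20),
              value = runif(60, min = 0.3, max = 0.8))

p <- ggplot(dat, aes(x = comparison, y = value)) +
  geom_jitter(width = 0.25, color = "gray") +
  stat_summary(fun.data = median_hilow,           # function name, no ()
               fun.args = list(conf.int = 0.50),  # IQR instead of a 95% CI
               color = "red", size = 1)
p
```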
And to make those gray points pop a little bit more, I'll do theme_classic(). So now we've got those red circles for the median of each cloud of points, and we have the interquartile range, indicated by those whiskers, on top of our gray points, which, all in all, I think works pretty well. Great.

Next I want to clean up our x- and y-axis labels, so we can make this shipshape for possibly putting into a publication. Let's start by adding the labs() function so we can add labels to the x and y axes. For x, I don't like having an x-axis label when I've already got categorical labels there, so I'm going to put NULL, and NULL will get rid of it. For y, we'll put "Bray-Curtis distances". Good. So we have our Bray-Curtis distances and no label on the x-axis.

Let's then go ahead and add scale_x_discrete(). We'll do breaks = c("early", "late", "same"), and then we need labels, right? So we'll do "Inter-mouse distances at 9 DPW" (days post weaning), and then we'll do the same thing for 141 days post weaning. I'm probably going to have to add some line breaks in here before it's all said and done. Then I want "Intra-mouse distances between 9 and 141 DPW", and I've got a misspelling there in "distances". And again, all the labels on the x-axis run into each other, which is not good. So I'm going to put in some line breaks with \n between that "Inter-mouse" or "Intra-mouse" and "distances", and then I'll go ahead and put some in here as well. I think that looks pretty decent.

The next thing I want to do is take on this y-axis. I critiqued that Turnbaugh figure by saying that they had kind of zoomed in and made the differences look bigger than they probably really were. So what I'll do is add scale_y_continuous(). We'll do limits; let's try 0 to 1 to get us going. And we'll do breaks of seq(0, 1, by = 0.2). And there we go: we're zoomed out to a range of 0 to 1.
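Pulling the label and axis cleanup together, a sketch might look like the following; the exact placement of the \n line breaks is a matter of taste, and the data are again fabricated for illustration.

```r
# Sketch of the axis-label and scale cleanup
library(tidyverse)

set.seed(42)
dat <- tibble(comparison = rep(c("early", "late", "same"), each = 20),
              value = runif(60, 0.3, 0.8))

p <- ggplot(dat, aes(x = comparison, y = value)) +
  geom_jitter(width = 0.25, color = "gray") +
  labs(x = NULL,                          # drop the redundant x-axis title
       y = "Bray-Curtis distances") +
  scale_x_discrete(
    breaks = c("early", "late", "same"),
    labels = c("Inter-mouse\ndistances\nat 9 DPW",
               "Inter-mouse\ndistances\nat 141 DPW",
               "Intra-mouse\ndistances between\n9 and 141 DPW")) +
  scale_y_continuous(limits = c(0, 1),    # full 0-1 range, not zoomed in
                     breaks = seq(0, 1, by = 0.2)) +
  theme_classic()
p
```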
You know, I like zooming out a little, because I can see that the distances we're talking about here really aren't that large in the grand scheme of things. But at the same time, you can still see that the distance within the same mouse at two different time points is larger, on the median, than the distance between mice at the same time point. That's what I'm seeing here, and I appreciate how that looks.

So let's go ahead and save this to make sure everything looks good. I'll do ggsave("jitterdistplot.png"), and I'll do width = 4 and height = 4. I'm happy with the way that looks.

One thing you might be wondering about is how we'd change the order of our three categories here. Currently it's alphabetical: early, late, same; E, L, S, alphabetical, right? We can change that by creating a factor out of comparison. So I'll come back up to this mutate() statement and add another line: comparison = factor(comparison, levels = c("same", "early", "late")). I think I've got too many parentheses going on here somehow. So now we have the intra-mouse differences, that "same" column, going first, followed by early and then late. You know, I kind of like this better, because what I'm most interested in is the distance between those time points within the same animal. I want that first, and then I can compare back to the variation at day 9 and the variation at day 141. You can play around with this and get the appearance that you want, but I'm pretty happy with the way this looks.

So I really hope you've appreciated this dig into how we can look at alternative ways of representing distances, other than putting them into an ordination, right? We've talked about a variety of tools that work well for time series data, spatial data, or, in this case, categorical data. I would really encourage you to focus in on asking: what is the question?
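Before wrapping up, the factor reordering and save steps just described might be sketched like this; the output file name follows the episode's script name, which is my guess at what was saved, so adjust to taste.

```r
# Sketch of reordering the categories with factor() and saving the figure
library(tidyverse)

set.seed(42)
dat <- tibble(comparison = rep(c("early", "late", "same"), each = 20),
              value = runif(60, 0.3, 0.8)) %>%
  mutate(comparison = factor(comparison,
                             levels = c("same", "early", "late")))  # "same" first

p <- ggplot(dat, aes(x = comparison, y = value)) +
  geom_jitter(width = 0.25, color = "gray") +
  theme_classic()

ggsave("jitterdistplot.png", plot = p, width = 4, height = 4)
```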
What is it that you want your audience to see? Then show it to them, right? Don't give them a big mess of an ordination and expect them to do all sorts of visual gymnastics to compare points. Give them something clean like this and say: these points are far more similar, or far more different, than everything else; that's what I want you to see. Just make it crystal clear. Too often, I think, people give their audience an ordination and expect them to do all the interpretation and the visual comparison themselves. That's just too hard. Again, I think the approaches we've seen in the past few episodes work really well. If you have alternative ideas for different ways to think about ordination, let me know, and maybe I could try them out with this data set as well to share with others.

Anyway, so that you don't miss that and other exciting episodes I have in mind coming up, please be sure that you've subscribed to the channel and clicked the bell icon so you get notifications. Give me a thumbs up, and leave me a comment below with any thoughts that you might have about what we're doing. Definitely tell your friends about what we're doing, and I'll see you next time for another episode of Code Club.