When I read the scientific literature and look at the figures, especially in fields like microbiology or systems biology, anywhere we have just tons of data, I feel like people take the adage that a picture is worth a thousand words and think, you know, if I just plug more crap in there, it'll be worth like 10,000 words. In reality, it goes from being worth a thousand words to being worth, well, not much. Hey folks, I'm Pat Schloss, and no, no, no, I don't think your figures are all crap. As I've shown, I think some of my own figures could be much improved. In today's episode of Code Club, I want to look at an approach that many people have been taking: combining data from an ordination, which looks at variation in community structure, with other variables so that they can show more types of data in a single figure. Previously we looked at an ordination where we had the two axes of a principal coordinate analysis, color indicated the disease status of the patients, and we sized each point according to the diversity of that sample. An alternative approach we might try is to put the first principal coordinate on the x-axis, which is the axis that explains the most variation in the data, and then on the y-axis, instead of the second principal coordinate axis, put diversity. Then we could have our points colored by disease status. That's exactly what we're going to try to do today. Along the way, we'll develop a different kind of scatter plot than we've seen in recent episodes, and at the same time we'll review and reinforce some of the concepts and techniques that we've already been working through. I'm here in RStudio in my project for these Code Club episodes.
If you'd like to get caught up and follow along with me, be sure to check the link up above for how you can go about installing R and RStudio, installing the tidyverse package, and getting the data that I'm working with as we go through these episodes. Also, down below in the notes, there's a link to a blog post that includes the code I'm starting with here, so that you'll be in the same place that I am. What we've got going on here is similar to previous episodes. We're reading in the principal coordinate analysis data. We've got the metadata from this data set, from a paper published by my lab a number of years ago by Alex Schubert. We've got the alpha diversity data, so we can look at things like the inverse Simpson index or Shannon diversity, whatever we'd be interested in. We join that all together here. I've also read in the information on how much variation is explained by the first and second axes. We also have the color scheme that we've been using in our previous episodes of Code Club. Something I think is important is that if I have the same variable, say the same disease status, represented across multiple figures, I want it to be the same color. That helps my audience know that if they see gray in this figure, it means the same thing in that other figure. So it's helpful to be mindful of what your colors mean across your different figures and whether you're using a consistent color scheme. But that's not what I want to talk about today. What I want to talk about, again, is preattentive attributes. Those preattentive attributes are the first things your audience sees when they look at your plot. The two strongest preattentive attributes are position along our x and y axes, which is where we see separation. The second might be things like color and shape. And third might be something like the size of a point. There are many other preattentive attributes, but I think those are the strongest ones.
They're also the ones we've talked about most frequently in these episodes. So I'll start with my metadata_pcoa_alpha data frame. We'll pipe that into ggplot(). For our mapping to the aesthetics, x is going to be axis one and y is going to be inverse Simpson. Again, that's the diversity metric that I, that Alex, that we used in that paper. You could use Shannon or any other diversity index you'd be interested in; hopefully it doesn't change the story. Then for color, I'll use disease_stat. I'll add geom_point(), and I like theme_classic(). I'll save this with ggsave() as schubert_pcoa_diversity.tiff. Give that all a run, and looking at my TIFF, I have this figure. Let me go ahead and size that, so I'll do width equals 4.5 and height equals 4.5. Again, if you've watched me, you know I'm not a big lover of the RStudio plotting tab. I think it's useful for exploratory data analysis. But if you're trying to make a plot that's going to go into a manuscript or a presentation, or on a website or somewhere else, I think it really pays to develop it as though you're running this outside of RStudio, giving it all the parameters for how you want to save it. That includes the dimensions. So now we look at our ordination, or, well, it's not an ordination, we look at our scatter plot, and we see that the x-axis is our first principal coordinate axis and the y-axis is the diversity metric. Then we've got coloring by our different disease status groups. Blue is the non-diarrheal controls; these are our otherwise healthy people. The green are the people who are positive for C. difficile. And the red are our diarrheal controls, the people who have diarrhea but don't have C. diff. What we can see is that, yeah, it does seem clear that our healthy people are on the right side of that first axis, whereas the diarrheal controls and the cases are to the left.
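The basic plot described above can be sketched roughly as follows. This is a minimal sketch, not the code from the episode: the toy data frame stands in for the joined metadata, PCoA, and alpha diversity table, and the column names axis1, invsimpson, and disease_stat, as well as the level names, are assumptions based on how they are spoken.

```r
library(ggplot2)

# Hypothetical stand-in for the joined data frame used in the episode.
# The column names and the disease_stat level names are assumptions.
metadata_pcoa_alpha <- data.frame(
  axis1        = c(-0.30, -0.22, -0.10, 0.05, 0.21, 0.33),
  invsimpson   = c(3.1, 5.4, 4.2, 8.9, 14.6, 11.3),
  disease_stat = c("Case", "DiarrhealControl", "Case",
                   "DiarrhealControl", "NonDiarrhealControl",
                   "NonDiarrhealControl")
)

# First principal coordinate on x, diversity on y, disease status as color
p <- ggplot(metadata_pcoa_alpha,
            aes(x = axis1, y = invsimpson, color = disease_stat)) +
  geom_point() +
  theme_classic()
```

To write the figure out at fixed dimensions instead of relying on the plotting tab, a call along the lines of ggsave("schubert_pcoa_diversity.tiff", width = 4.5, height = 4.5) would follow the plot.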
And that inverse Simpson index really does seem to pull it apart. All these groups have a fair amount of variation to them, but the non-diarrheal controls do seem to have higher diversity relative to the people with diarrhea. I think this is kind of known, that stool consistency is a really strong variable for driving community diversity. We'll critique this later, but first let's get it looking attractive, and then we can go in and critique it. All right, so I'd like to get my color scheme in here. I'll do scale_color_manual() with name equals NULL. Then I'll do breaks equals, and that's going to be these things here: diarrheal control, case, non-diarrheal control. Let me get this all on one line. Then we can do labels. Actually, I'm going to put this in the order I want it. And actually, looking at my levels here, these don't look right. I wonder how long I've been doing this for. I think I want the non-diarrheal controls to be first, then diarrheal control, and then case. Let me double check that when I run it. All right, so I want non-diarrheal control first, then diarrheal control, then case, and my labels are going to be Healthy, Diarrhea, and C. difficile positive. Then I'll add the values, which are going to be these values up here: healthy_color, diarrhea_color, and case_color. Let's give this another run. We see that we've got healthy first, diarrhea second, and C. difficile positive third, and we've got our coloring as we'd like it. What should we do next? Let's go ahead and clean up our x and y axis labels. To do that, we can use labs(). We'll do x equals PCo Axis 1, and I'll go ahead and put in my glue syntax: open parenthesis, open curly brace, what is this variable up here called, explained one, close curly brace, and I'll put a percent sign with the close parenthesis. Then for y, I'll do Inverse Simpson Index. And that didn't work.
That's because I forgot to actually use glue(). So let's go ahead and wrap the label in glue() here. Really, folks, I need you to yell at me when I'm doing stupid things. There we go, we've got 7.4%. And perhaps we could add the word explained. All right, so that looks good. I'd now like to move my legend inside the plotting window, maybe into the upper left corner of the plot. We can do that with the theme() function. We can set legend.position equal to a two-element vector of x and y positions. Let's do, say, 0.2 for x, and for y we can make it something like 0.8. Let's give that a run and see where it puts it, and then we'll futz around with where we want to move it. So that looks pretty decent. The other thing this does is that, instead of taking a third of the plot for the legend, it gives us that room back to get more separation in the data. So that looks good. Let's go ahead and bring the rows of the legend closer together. Do you remember how we do that? We can use legend.key.height, and I think 0.5. Let's do 1 and see what happens. I kind of feel like it's supposed to be of type unit, so let's do unit(1, "lines"). That brings it together a little bit. Let's bring it together a little bit more with 0.75. Yeah, that brings it together pretty nicely. Maybe we'll move it up a little bit, and that would be the y position, so let's put 0.9. Again, so much of making a visual is futzing around with where we want to put things. Nice, I think that looks really good. Now, C. difficile is not italicized, but we can italicize it using the ggtext package, which I loaded up here. I should mention that all of these packages come installed with the tidyverse, except for ggtext. If you install tidyverse, you get readxl and glue, but ggtext has to be installed separately.
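The color scale, axis labels, and legend placement described above can be sketched as follows. This is a sketch under stated assumptions rather than the episode's exact code: the toy data, the hex-free color names, the variable name explained1, and the level names are all placeholders, and only the 7.4 value and the 0.2, 0.9, and 0.75 settings come from what is said in the episode.

```r
library(ggplot2)
library(glue)

# Assumed placeholders: the episode defines its own colors and a variable
# holding the percent of variation explained by the first axis.
healthy_color  <- "gray"
diarrhea_color <- "blue"
case_color     <- "red"
explained1     <- 7.4

# Hypothetical toy data; column and level names are assumptions.
toy <- data.frame(
  axis1        = c(-0.30, -0.10, 0.05, 0.33),
  invsimpson   = c(3.1, 4.2, 8.9, 11.3),
  disease_stat = c("Case", "DiarrhealControl",
                   "NonDiarrhealControl", "NonDiarrhealControl")
)

# The braces are only filled in when the string is passed through glue()
x_label <- glue("PCo Axis 1 ({explained1}% explained)")

p <- ggplot(toy, aes(x = axis1, y = invsimpson, color = disease_stat)) +
  geom_point() +
  scale_color_manual(
    name   = NULL,
    breaks = c("NonDiarrhealControl", "DiarrhealControl", "Case"),
    labels = c("Healthy", "Diarrhea", "C. difficile positive"),
    values = c(healthy_color, diarrhea_color, case_color)
  ) +
  labs(x = x_label, y = "Inverse Simpson Index") +
  theme_classic() +
  theme(
    # Numeric position as used in the episode; newer ggplot2 releases
    # prefer legend.position.inside and may emit a deprecation warning.
    legend.position   = c(0.2, 0.9),
    legend.key.height = unit(0.75, "lines")
  )
```

Keeping breaks, labels, and values in the same order is what lets each group pick up its intended label and color.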
So hopefully you can figure out how to install that package if you knew how to install tidyverse. All right, so how do we do that? Well, if we come back here to our label, C. difficile positive, we can use markdown, and the markdown for italics is a single star on either side. But to get that to work, we then need to come down to theme() and set legend.text equal to element_markdown(). That makes the text in the legend render as markdown, so if it sees markdown like the stars, it'll make the text italic. And sure enough, we now have C. difficile positive with the species name italicized. Let's go ahead and add a title to our plot. Looking at our figure, there are two things we want the audience to pick up on. First, there's separation in the community structure according to disease status. Second, healthy individuals have a higher diversity than individuals who have diarrhea. So I'll say: healthy individuals have a different microbiome and higher diversity than those who have diarrhea or diarrhea with C. difficile infection. Okay, let's give that a run. And of course, the title runs off the side there, as we've seen in the past. I would also like this title to be left justified with the plot rather than with my axis, so let's go ahead and fix that, and then we'll go back and clean up the title. We can set plot.title.position, and we can get the title justified to the left by giving it "plot". Now we see it is left justified, and that looks great. Next, let's put in line breaks. I'm going to end up coloring the different groups that I've named in the title, so I'm going to go ahead and use element_markdown() here too. Let's add that, so plot.title is going to be element_markdown() as well. The first break is going to be after the "and", so we'll put a <br> there. And before the "or" I want another break. That got us the other break. Now we want to go ahead and add our coloring.
If we use glue(), we can then use backslashes to split the string across lines, which makes it easier for us to see what's going on. So let's double check that that works. Yeah, we got the same thing. Now we're ready to insert our markdown or HTML code to make it more attractive. For healthy individuals, we'll do span class equals, then a single quote, color, colon, and then in curly braces healthy_color, and then close the opening span tag. Then we'll do strong to make it bold. After the text, we back out of the strong and back out of the span. Let's make sure this all works. And bam, healthy individuals is bold black, but not bold gray. Sorry, this shouldn't be class, it should be style. There we go, we've got our gray, bolded healthy individuals. Next we want "those who have diarrhea" to be blue and bolded. Again, I'm going to put in those backslashes to make it easier to read what's going on. This is going to be the same thing I have here, but we'll of course change the color to diarrhea_color. We then have strong, then "those who have diarrhea", and then we need to back out of the strong and back out of the span. Then we're on to "diarrhea with C. difficile infection". We'll do the same thing we had up above, with case_color, and then back out of the strong; strong is the bolding. And there, boom, nice. Then we want C. difficile to be italicized. I think we can put stars in like we did for the legend, and that will italicize C. difficile. Nice, now that's italicized, very good. All right, let's clean up our code a little bit. Part of the problem with really long titles or strings like this is that you can't see everything on the screen at once, and that gets kind of annoying. Okay, I think we're good there. Run that to make sure I didn't screw anything up. What do you think? I like this. I think it looks pretty effective.
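The colored, bolded title built up above can be sketched like this. It is an illustration under assumptions, not the episode's code: the color values and toy data are placeholders, and instead of backslash line continuations inside one long string, this sketch passes several short strings to glue(), which joins them with no separator.

```r
library(ggplot2)
library(glue)
library(ggtext)

# Assumed placeholder colors; the episode defines its own values.
healthy_color  <- "gray"
diarrhea_color <- "blue"
case_color     <- "red"

# style (not class) carries the color; <strong> bolds; stars italicize.
title_text <- glue(
  "<span style='color:{healthy_color}'><strong>Healthy individuals</strong></span> ",
  "have a different microbiome and<br>higher diversity than ",
  "<span style='color:{diarrhea_color}'><strong>those who have diarrhea</strong></span><br>",
  "or <span style='color:{case_color}'><strong>diarrhea with ",
  "*C. difficile* infection</strong></span>"
)

# Hypothetical toy data; column and level names are assumptions.
toy <- data.frame(
  axis1        = c(-0.30, -0.10, 0.33),
  invsimpson   = c(3.1, 4.2, 11.3),
  disease_stat = c("Case", "DiarrhealControl", "NonDiarrhealControl")
)

p <- ggplot(toy, aes(x = axis1, y = invsimpson, color = disease_stat)) +
  geom_point() +
  labs(title = title_text) +
  theme_classic() +
  theme(
    plot.title.position = "plot",            # left-justify to the whole plot
    plot.title          = element_markdown() # render the HTML and markdown
  )
```

Without element_markdown() on plot.title, the tags and stars would be drawn as literal text in the title.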
I think with this title, we probably don't need to show the legend. Again, we've seen this before: in geom_point() we can set show.legend equals FALSE, and that will hide the legend. What do we think of this visual? Well, I don't know. We still have a fair amount of overplotting here. Right here I see these two gray points, and there's a blue point in between and behind them. There's some of that going on over here as well. I'm pretty confident the statement in the title is true, that I have a lot of gray points above where these other points lie, although we'd have to do a statistical test to make sure that's actually true. I don't think there's a difference between the diarrhea samples and the diarrhea plus C. difficile samples, here in the blue and red. But again, what we're showing is that there's separation in community structure and there's also separation in diversity, and that there are communities that, at least on the first principal coordinate, have similar community structures but perhaps very different diversity. So if that's important to you, then I think this could be an effective way to look at the data. For my question, looking at the relationship between community structure, community composition, and C. difficile status, I don't know that it really helps. I think it's interesting to know about, but it's not going to be a figure in my paper, right? And if I were to present this in a talk, I would probably pull it apart more and wouldn't be so tempted to throw a whole bunch of data into the same plot. That being said, I think this is more effective than what we saw previously, where we had our ordination axes on both x and y and sized the points by the inverse Simpson index. Now, you know, I think we could have a discussion about what this shows and whether or not this is better than two separate plots.
I think in some contexts there is something to be gained from having them both together like this. The thing I really want to emphasize here is that, again, the x and y positions, those preattentive attributes, are really strong, and when we look at the plot we instantly see this lateral separation, kind of going up and to the right. That's the first thing we see. The next thing we see is, oh, the gray points are to the right and the blue and red points are kind of down and to the left, right? So by focusing on those things that your audience is going to see first, you can really drive home the message that you want them to see. Is this the perfect graph? I don't think so. I don't think it really says a whole lot more than a separate plot showing diversity alongside one showing the ordination. But, you know, depending on the context, this might work for you. One thing I want to point out before we go is that this really works best when you have principal coordinate axis one on the x-axis. In my opinion, it doesn't make sense to put principal coordinate axis two or three or any of the other axes there. It certainly doesn't make sense to use an NMDS axis. The NMDS axes are totally just fictitious. You could run the same ordination multiple times with NMDS and get the same layout, but it might spin. I have seen things recently where someone made a plot like this, but on the x-axis they put an NMDS axis. That makes no sense to me. I think it was actually the second NMDS axis, which makes absolutely no sense. If you find this type of analysis, and learning about R code like this, interesting, please be sure to check out the tutorials that I have up at riffomonas.org. I have a lot of material going through microbial ecology data, as well as data from other sources that have really little to do with microbiology. It's all up there for free. Those tutorials are also the basis for three-day workshops that I teach.
So if you'd be interested in taking one of those workshops, shoot me an email at riffomonas at gmail.com and I'd be happy to tell you about the upcoming workshop schedule. Anyway, like I said, tell me if you find an effective version of this plot out there in the wild, and maybe we can critique it, if you will, in a future episode. If you've got something like this that you'd love to share, please, by all means, let me know. I'd love to talk with you about it, perhaps here on another episode of Code Club. Well, keep practicing with these materials. The only way you're going to get better is by constantly practicing and always trying to do better. Hopefully you're seeing my own growth from when we originally published that paper to today, the evolution in my thinking and skills, and really those of my lab, in generating better and better plots. Please tell your friends about these episodes and help spread the word. I have been getting a lot of positive vibes that people are really enjoying these episodes looking at data visualization. Well, we'll see you next time for another episode of Code Club.