If you're like me, you never run out of ways to describe your data. Unfortunately, in scientific publishing, we're typically limited to about five figures per paper. And while we can have multiple panels per figure, the more panels we get, the smaller each individual panel gets, making the text and the points harder and harder to see. So if we have complicated microbial ecology data, like I've been describing in recent episodes, as we add more and more panels to those figures, it just gets way too hard to see. And your figure ends up being basically the size of a full page, which is not ideal. So what do we do about this? Well, some people have taken to effectively merging different panels together, so that one figure, or a small number of figures, represents the amount of data that you'd typically find in many figures. Is this a good idea? Let's see one approach today, and I'll let you know by the end of the episode what I really think of it. Hey, folks, I'm Pat Schloss and this is Code Club. In recent episodes of Code Club, I've been trying to rehabilitate a figure that my lab published as a supplemental figure in a paper a few years back, looking at the variation in the gut microbiota of people with and without Clostridioides difficile infections. Along the way, we've also been developing a rubric, which I have at least kind of written down or in the back of my head, for how we might think about interpreting and assessing the quality of different visuals that we see in the literature, as well as visuals that we make ourselves for describing what's going on. One of the things we've talked about are pre-attentive attributes. Pre-attentive attributes are, again, the things that your eyes first see when they come upon a paper. The two strongest pre-attentive attributes are the position on the x-axis and the position on the y-axis.
In an ordination, we use an ordination technique like PCoA or NMDS to separate the points on the x and y axes based on their dissimilarity to each other. In PCoA, the x-axis represents the axis that explains the most variation in the data and the y-axis, the second most. We've also used a secondary pre-attentive attribute, which is color and grouping. The coloring is the coloring of the point by the treatment group, or the diagnosis group, or the disease group that the samples correspond to. In our case, we're looking at healthy people without diarrhea, people with diarrhea, and people with diarrhea as well as a C. difficile infection. We used color, and the coloring also allows us to group the data, because if I see a group of red points together, or blue points, or gray points, my eye kind of integrates and pulls those together. We've also helped this by putting ellipses around those three different groups. And so that secondary pre-attentive attribute really helps our audience to see the groupings that we think are most important. Great. In our ordination, we already have separation on the x-axis and y-axis. We have coloring as a pre-attentive attribute, as well as those ellipses to group by disease status. And so looking at Figure 1A, I wonder, can we also add another set of information, which is the diversity of those individual samples? Could we maybe change the size of the points? This is called a bubble plot. So by changing the size of the points, can we map on diversity information? I would consider this kind of a tertiary level of information. It's not my primary emphasis in looking at the plot. My primary emphasis, again, is on those x and y axes, the separation of the points. Then, by using color, I'm separating perhaps by disease group.
And so then this third level, or maybe even fourth level, of analysis would be looking at diversity by mapping information from what we already had in Figure 1A onto the ordination. Well, that sounds like a great project for today's episode of Code Club. So we'll see how we can do this in R and RStudio. And then we'll reflect on whether or not this was a good idea after all. If you're just joining us for the first time, welcome. Up along here somewhere, I'll put a link to a video on installing R, RStudio, and the tidyverse, and getting the data that you need to follow along if you're interested. Also, if you want the code that I'm starting with, go down below in the show notes. There's a link to a blog post that has the code that I'm starting with here today. So this is where we left off after the last episode where, you'll recall, we incorporated the percent variation explained on the two axes. I think this looks pretty nice. I'm still a bigger fan of the NMDS, but working with the PCoA will help us to move forward with the progression of these episodes and where we go next. Anyway, again, the goal is that I want to size each of these points according to its inverse Simpson diversity, and we'll then get different-sized points depending on the diversity. So how do we do this? Great. Well, in this code that I've given you, I've already put code for reading in the alpha diversity information and joining it to the metadata and PCoA information. So we now have one large data frame, metadata-PCoA-alpha, that has all of the columns from our ordination, our metadata, and our alpha diversity information. We now have our ggplot call, and I want to add size as a variable. I will go ahead and put this on a further line for our mapping. I'll do size = invsimpson. And that should be pretty good. That should do what we want. So let's go ahead and give this a run and see what we get.
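As a rough sketch of the size mapping just described, the snippet below builds a small synthetic data frame and maps a diversity column to point size. The data are made up, and the column names (axis1, axis2, disease_stat, invsimpson) are assumptions standing in for whatever the joined metadata, PCoA, and alpha diversity data frame actually contains.

```r
library(ggplot2)

# Synthetic stand-in for the joined metadata/PCoA/alpha diversity data frame.
# All values and column names here are hypothetical.
set.seed(1)
toy_samples <- data.frame(
  axis1 = rnorm(90),
  axis2 = rnorm(90),
  disease_stat = rep(c("healthy", "diarrhea", "c_diff"), each = 30),
  invsimpson = runif(90, min = 1, max = 30)
)

# Position carries the ordination, color carries the group,
# and size carries the inverse Simpson value (a bubble plot).
bubble_plot <- ggplot(toy_samples,
                      aes(x = axis1, y = axis2,
                          color = disease_stat,
                          size = invsimpson)) +
  geom_point()

bubble_plot
```

With no size scale specified, ggplot2 picks the point sizes and draws a size legend on its own, which is the starting point for the clean-up steps that follow.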
And so again, this is a bubble plot where the individual points are sized by the value of the inverse Simpson index. One of the things we did earlier was turn off the legend in geom_point with show.legend = FALSE. I got some feedback on Twitter and elsewhere that people hadn't known about this, so I'm happy to show you that. Well, now we're going to remove that and actually show the legend. But don't worry, we'll come back and see another way that we can turn off the legend. So if we run this, we now see that we have two legends. We have the color corresponding to the disease status as well as the inverse Simpson. And we see that this circle size is an inverse Simpson of 10 and this larger one is 30. Now the thing to know about human perception is that our eyes interpret area, not radius. And so that makes it a bit challenging. One of the things I also notice is that I've got some circles, say down here, that are actually smaller than 10 and don't really fit in well with this range of inverse Simpson values. So what I'd like to do is maybe clean this up a little bit. Another challenge that I see is that these normals, or healthy individuals, are gray and they're large. And because they co-locate in the ordination, they all fall on top of each other and just look like this big gray blob. So maybe that 30 is a little bit larger than what we really would like to see. How do we turn the legend off for the color and keep inverse Simpson? Well, we can add a new function called guides. In ggplot, those aren't really called legends. They're called guides or scales. What we can do is say color = "none". And we can also say fill = "none". That will get rid of the legend for both our color and the fill that we use for stat_ellipse. And then we will also add in size = "legend". So you'll see that gets rid of the legend for the color and the disease status and leaves us with a legend for the size related to inverse Simpson.
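The guides() step described above might look something like the following sketch. It uses synthetic data with assumed column names, and it puts the size mapping inside geom_point() so that the ellipses are computed from color and fill alone, which is a choice made for this example rather than something taken from the episode's code.

```r
library(ggplot2)

# Hypothetical data; column names are assumptions.
set.seed(1)
toy_samples <- data.frame(
  axis1 = rnorm(90),
  axis2 = rnorm(90),
  disease_stat = rep(c("healthy", "diarrhea", "c_diff"), each = 30),
  invsimpson = runif(90, min = 1, max = 30)
)

legend_plot <- ggplot(toy_samples,
                      aes(x = axis1, y = axis2,
                          color = disease_stat,
                          fill = disease_stat)) +
  # Filled ellipses, one per disease group
  stat_ellipse(geom = "polygon", alpha = 0.2) +
  # Size is mapped only for the points
  geom_point(aes(size = invsimpson)) +
  # Hide the color and fill legends, keep the size legend
  guides(color = "none", fill = "none", size = "legend")

legend_plot
```

Compared with show.legend = FALSE on a single layer, guides() works per aesthetic, so one call can hide the disease status legend everywhere while keeping the size legend.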
Again, we want to clean this up a little bit and make it better. To modify that legend, we're going to use scale_size, and scale_size allows you to modify the radius as well as the area of your plotting symbol. But know again that in human perception, we tend to scale by area and not by radius. And because the size can get quite large, as we saw, what I'll do is set range to a vector of small to large values. Let's do, say, 0.5 to 4. And let's give this a run. Setting that range now gives us smaller points than what we were getting before, where they could be up to six. So they'd be even, you know, twice the size of what we're seeing here for the largest points. One of the things I don't like about the legend is that it doesn't really have a bottom end, right? If you had zero, there's not going to be a point there, because it's zero, right? And so this point down here where my crosshairs are doesn't show up on the legend. We don't have a good sense of what that diversity value is. So we want to modify scale_size very much the same way that you can see right below it for scale_color_manual. We'll go ahead and do breaks, and I will do 1, 10, 20, 30. Labels will be the same vector. And I will also do limits of c(0, NA). That NA will let the upper end float freely. I can also do name equals inverse. Let's see if it'll take a backslash: inverse Simpson index. I think that's probably going to look a little funky, but we'll see. And we'll go ahead and put each of these on its own line, so it's a little bit easier to read. Let's give this a run and see what it looks like. Yeah, it does look a little funky. Maybe we want to change the size of the font there. And that looks pretty good. We could move the legend down a little bit so it's closer to what's going on. We've seen this in previous episodes. Again, down here, legend.position.
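The scale_size() settings walked through above can be sketched as follows, again on synthetic data with assumed column names. The range, breaks, labels, and limits are the values mentioned in the episode. The legend title is written here without any line break, since the transcript does not make clear exactly what was typed.

```r
library(ggplot2)

# Hypothetical data; column names are assumptions.
set.seed(1)
toy_samples <- data.frame(
  axis1 = rnorm(90),
  axis2 = rnorm(90),
  disease_stat = rep(c("healthy", "diarrhea", "c_diff"), each = 30),
  invsimpson = runif(90, min = 1, max = 30)
)

scaled_plot <- ggplot(toy_samples,
                      aes(x = axis1, y = axis2,
                          color = disease_stat,
                          size = invsimpson)) +
  geom_point() +
  scale_size(
    name = "Inverse Simpson index", # a "\n" in the name would wrap the title
    range = c(0.5, 4),              # smallest and largest point sizes
    breaks = c(1, 10, 20, 30),      # values shown in the legend
    labels = c(1, 10, 20, 30),
    limits = c(0, NA)               # start at 0; NA lets the top end float
  ) +
  guides(color = "none")

scaled_plot
```

One detail worth knowing: scale_size() scales the area of the point with the value, while scale_radius() scales the radius, so the former is usually the closer match to how the sizes get read.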
We can set the x and y coordinates relative to its position in the plotting window. And I think the center of the legend is basically where it gets plotted. We want it to come over to the left a little bit, so let's do 0.8, and down a little bit, so let's do 0.8. That's maybe a little bit too close. Let's move it to the right just a smidge, so let's do 0.9. Again, it helps to do this in the actual format of the figure that you're going to be working with. If I was doing this down in my plotting window within RStudio, it would look totally different. And I think for reproducibility's sake, we want to be able to script all of this as much as possible. That looks decent. The title of this legend seems a bit big. So I think what we can do is legend.title. We'll do element_text, and we can then do size equals, let's do 12. I'm not quite sure what that will look like, but we'll give it a rip and see what we get. So that made it, I think, bigger. Let's go ahead down to 8. Again, this takes some tweaking. And as I mentioned, it's best to do this within the format of the image that you're working with. I think we've got a margin problem, in that the box is going over the inverse Simpson index. I'm of two minds: we could either remove the box or we can fix it. We're using a negative margin on the top, so let's go to minus one and see what that looks like. Yeah, so let's go ahead and do zero. That looks better. You know, we could even maybe go to one. I think that looks pretty decent. And so again, now we have the size of our plotting symbol in the legend corresponding to the full range of points that we see over in the ordination. So let's look at this figure, now that we've got it looking perhaps about as good as we could hope to get it. What do we think about this? What do we think about mapping that tertiary variable of diversity onto the size of the plotting symbol?
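A minimal sketch of the legend tweaks described above, on synthetic data with assumed column names. Two things here are guesses rather than the episode's code: the margin is applied through legend.margin, because the transcript does not say which theme element carried the margin, and the legend position is written in a version-aware way, because ggplot2 3.5.0 and later prefer legend.position = "inside" together with legend.position.inside over a bare numeric vector.

```r
library(ggplot2)

# Hypothetical data; column names are assumptions.
set.seed(1)
toy_samples <- data.frame(
  axis1 = rnorm(90),
  axis2 = rnorm(90),
  disease_stat = rep(c("healthy", "diarrhea", "c_diff"), each = 30),
  invsimpson = runif(90, min = 1, max = 30)
)

# Place the legend at x = 0.9, y = 0.8 in relative plot coordinates.
if (utils::packageVersion("ggplot2") >= "3.5.0") {
  legend_position <- theme(legend.position = "inside",
                           legend.position.inside = c(0.9, 0.8))
} else {
  legend_position <- theme(legend.position = c(0.9, 0.8))
}

themed_plot <- ggplot(toy_samples,
                      aes(x = axis1, y = axis2,
                          color = disease_stat,
                          size = invsimpson)) +
  geom_point() +
  scale_size(name = "Inverse Simpson index", range = c(0.5, 4)) +
  guides(color = "none") +
  legend_position +
  theme(
    legend.title = element_text(size = 8),             # smaller legend title
    legend.margin = margin(t = 1, r = 4, b = 4, l = 4) # assumed element; points
  )

themed_plot
```

Because the position is relative to the plotting area, the same numbers can look different at a different output size, which is why it helps to check them in the saved figure rather than in the RStudio plot pane.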
Well, you know, I think it's clear that these points down here are one, or really close to one, because they're so small. These other points are difficult to resolve, probably because we have so many points here and they're actually large points. So they're points with large diversities, but it's really difficult to differentiate between them. We could make their alpha smaller, but I don't really know what it means to have a blue and a red point overlap with each other. It just gets a bit muddled and isn't really ideal. Again, humans are really bad at interpreting differences in area and, while I can differentiate between one and 10, I think 20 to 30, for me at least, is hard. And if something was 25, I don't know that I could tell whether it was 20 or 30. And that might be the difference that we're expected to see between the different treatment groups that we have. And so it really doesn't help us, I don't think. If we look at Figure 1A, the average or the median (I forget what we did) for the non-diarrheal controls, so the healthies, was about 10, and for the diarrheal controls it was about six. So we're kind of looking at points within that range. And it's just difficult to see what's important, right? So, yeah, we can size them by diversity, but I can't help my viewers see what I want them to see in this, right? If I were to say that people with diarrhea, and people with diarrhea and C. diff, have lower diversity than healthy individuals, I'm not convinced that they could see that with these data. I mean, there's this mess over here, and there are points over here that are similar in size. And I've got some smaller healthy points in here suggesting diversity may be down around five or six. And so I don't know that that helps.
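To put rough numbers on why 20, 25, and 30 are hard to tell apart, here is a small back-of-the-envelope calculation in R. It assumes the simplest case, where a point's area is directly proportional to its diversity value, so its diameter grows with the square root of the value. The plot in the episode also sets a minimum point size, so this is the general idea and not a reproduction of its exact sizes.

```r
# Diversity values to compare (the legend breaks plus 25)
diversity <- c(1, 10, 20, 25, 30)

# If area is proportional to the value, diameter is proportional
# to the square root of the value. Express each diameter relative
# to the diameter of the largest point.
relative_diameter <- sqrt(diversity / max(diversity))

comparison <- data.frame(
  diversity = diversity,
  relative_diameter = round(relative_diameter, 2)
)
comparison
```

Under that assumption, going from 1 to 10 roughly triples the diameter of the point, while going from 20 to 30 changes it by only about a fifth, which fits the experience of the large points all looking alike.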
I don't know that it helps to merge the diversity data on top of the ordination data. So what else could we have done? Well, another approach would have been to take the disease status and, instead of mapping that to color, map it to the shape of the symbol. And then we could use color to create a gradient corresponding to the diversity index. And while that certainly has some, I don't know, attractiveness, the problem is that if I had circles, squares, and triangles, my eye is not going to be able to integrate by shape as well as it can by color. Right. Like, I can really help the viewer to see that there are a lot of red points here and a lot of gray points here, even if I don't have the ellipses, right? Because all those gray points are together, they read as a group. And so the color is helpful in this case in a way that I don't think the shape would be, because we have so much data and it's highly variable. The other problem with color on a gradient is that, say we went from white to red or white to blue, it gets really difficult to differentiate between, say, 70 percent blue and 100 percent blue. We just don't have that quantitative a grasp of variation in color. I know people love it for heat maps, but that's really one of the problems with heat maps: people can't visually interpolate colors. So I really feel like this is a case where we should let the ordination do its own thing, looking at variation in community structure. And we should have a separate figure for diversity information. This comes back to something I constantly preach to people, which is that we should strive for our figures to do one thing well. And adding this is now making it do one thing poorly. Right. Like, sure, it still does separation by space well. But it then adds on all these different-sized shapes, and it doesn't do that well.
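For comparison, the alternative encoding discussed above, with disease status on shape and diversity on a color gradient, could be sketched like this. The data are synthetic, the column names are assumptions, and the white-to-blue gradient is just one of the example gradients mentioned. The point is to make the trade-off easy to try out, not to recommend it.

```r
library(ggplot2)

# Hypothetical data; column names are assumptions.
set.seed(1)
toy_samples <- data.frame(
  axis1 = rnorm(90),
  axis2 = rnorm(90),
  disease_stat = rep(c("healthy", "diarrhea", "c_diff"), each = 30),
  invsimpson = runif(90, min = 1, max = 30)
)

shape_gradient_plot <- ggplot(toy_samples,
                              aes(x = axis1, y = axis2,
                                  shape = disease_stat,  # group by symbol shape
                                  color = invsimpson)) + # diversity as a gradient
  geom_point(size = 2) +
  scale_color_gradient(name = "Inverse Simpson index",
                       low = "white", high = "blue")

shape_gradient_plot
```

Plotting this on real data makes both problems easy to check: whether the three shapes still read as groups without color, and whether nearby diversity values can be told apart from the shade alone.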
It doesn't help us to see the variation in diversity very well. And so we should save that for a different plot. What I always tell people is that if you're contemplating making a figure, draw the figure to answer the question you want it to answer. If you give me a figure and you say, well, I don't really have a question, I'm just kind of showing data, well, that's a loser. That's not going to work. But if you say, I want the figure to show this and this and this and this, well, stop, you're trying to show too many things. So really try to simplify the message of your figure down to one thing, maybe two things if you have to. But make sure that if it's doing two things, it's actually doing those two things. And I think what we see here is that it's really not doing a good job of differentiating biodiversity. In the next episode, we'll start looking at how we can differentiate biodiversity. We'll start pivoting from looking at this ordination to going back and looking at Figure 1 from this paper. Anyway, I hope you found this discussion of creating a gradient of sizes for your points useful. There are times when that works pretty well. If you want to see a really cool video of a bubble plot, as these are called, check out Hans Rosling talking about what's now called the Gapminder data set. He's looking at various economic data on x and y, and he then sizes the points by some other variable and animates it, right? So he's got like four things going on. But there's enough differentiation in the size of the points that it really helps make the point strongly. So I think mapping a variable to size works well when you have large variation, or large differentiation, I should say, in the data. But when you've got small variation between points like we see here, a small amount of differentiation, it just doesn't work very well. It looks cool, maybe. But then when you step back and say, well, what does that mean?
I don't know. Right. Anyway, if you find this stuff interesting and you want to dig into it more, certainly give this a thumbs up. That will inspire me to keep going. Be sure you subscribe so you see the next videos that come out. But also, if you want a more systematic look at how to work with microbiome data within R using the tidyverse, by all means, head over to riffomonas.org/minimalR. There's a link down below in the notes, and I have a full tutorial that I use to teach my three-day workshops that's up there for free for anyone to use. And I really think you can learn a lot of great basics about using the tidyverse to better understand microbiome data. Well, I hope you've been practicing and trying to incorporate these ideas into your own research. By all means, let me know if you have any questions or ideas. If there are plots out there in the wild that you would like me to take a look at and give you my feedback on, what I like, what I don't like, perhaps how we could try to recreate it or even make it better, let me know. I would love to have that be part of our dialogue of what we're doing here on these episodes. Well, please tell your friends about these episodes and we'll see you next time for another episode.