If I told you that your audience would have three seconds to look at a visual of your data, how would you design that visual? What would you do differently from what you normally do with your data? I'd like to argue that maybe that's what we should do with our data, because the ability to interpret a graph quickly goes hand in hand with the ability to interpret it easily, and we want to make things as easy as possible for our audience. How do we do this? It comes back to a concept called pre-attentive attributes. Never heard of those before? Well, stay tuned for today's episode of Code Club.
Hey, folks, I'm Pat Schloss and obviously this is Code Club. In today's episode, I'm going to continue talking about some of the elements in the rubric that I shared with you in my last episode, and in particular a concept called pre-attentive attributes. This may sound like mumbo jumbo or jargon, but I think it's really important for helping us think about how we design our data visuals. We want things to be easy on our reader. We want to have empathy for our audience. We want it to be easy for them to understand what's going on. One of the things I like to think about is my eyeballs. As I'm looking at a figure, how much do my eyeballs twitch back and forth and up and down trying to understand what's going on in the visual? If my eyeballs don't move much, then it's easy to read. If they're moving all over the place, left and right between the legend and the data, it's not easy to interpret.
What are pre-attentive attributes? These are the attributes, or in ggplot terms the aesthetics, of your plot that your audience is going to see within the first second, maybe the first half second, and that they're going to use to form an impression of what's going on in your data. So what are some pre-attentive attributes?
Well, position on the axis, right? Is something higher or lower on the y-axis? Is it to the left or right of other points on the x-axis? Spatial position is one of the strongest pre-attentive attributes, because we're really good at comparing the relative positions of different points. We might also think about color, and how different colors could indicate categorical variables or perhaps continuous variables. We might use shape to indicate different groups or categories of the samples we're plotting. Shape and color are, I think, seen as roughly equal pre-attentive attributes for categorical data. We might also think about length, right? In a bar plot you have the length of the bar, or perhaps the length of a line or line segment, to indicate distance or variation. We might also think about the width of a line. If one line is wider than another, then perhaps that line has more emphasis, or more of whatever it is we're measuring behind it.
But another attribute that I want to spend a fair amount of time talking about today is grouping. If we can draw a circle or some kind of ellipse around a group of points and lasso them together, we can say these points all go together. So while I'm perhaps still showing you the individual points, what you really need to focus on is the grouping, since we can compare one grouping to another.
If you were with me in the last episode, you know that we built a scatter plot of an NMDS ordination of some data that I published in a supplemental figure many years ago, a figure that wasn't very good. What I'd like to do is think about how we can use grouping within that ordination to make it easier for the audience to understand what's going on. If you look at this ordination, one of the things you perhaps begin to notice is that your eyes are going all over the place, right?
There are a couple hundred points here in the visual. There's color indicating the different groups of patients, and there's the x and y position of each point. So what my eye does when I look at this visual is go back and forth, up and down, up to the legend and back, trying to interpret what's going on. But if I were to group those points somehow, perhaps by drawing some type of lasso, if you will, around the three different groups of patients, it would become much easier for my audience to understand what's going on. This has become a fairly common approach to looking at ordinations in the field of microbiome research. There are a couple of different variations on that ellipse approach that we will talk about today. We'll build four different variations on the plot that we made in the last episode.
As always, if you'd like to get caught up, I have a video from a couple of weeks ago where I showed how to install R and RStudio for Windows as well as Mac, how to install the tidyverse, and how to get the data that I'm using in these Code Club episodes. Also, if you look down below in the notes, there's a link to a blog post for today that includes the code that I'm going to be generating as we work through today's content. If you're excited about the possibility of learning more about R, you're in the right place. Down below in the notes, I've got a couple of links to some tutorials that I've made. The minimalR tutorial is all about generating data visuals using microbiome data. I also teach three-day workshops that go over the material in those tutorials, so be in touch if you're interested in learning more. And by all means, please be sure to like this video, subscribe to the channel, and hit that bell icon so you know when the next video drops, where we'll continue to build out our skills in visualizing data.
Let's go ahead to RStudio. I will fire up RStudio by launching that Rproj file again, so I'm in my correct working directory. I'm going to open up that Schubert NMDS .R file and run everything to get to the point that we were at in the previous episode. The figure that it generated is not the figure that I ended with in the last episode. You'll notice that for some reason my legend is truncated here on the right. I'm not quite sure what's going on with that. This is really a point of frustration that I have with RStudio, and perhaps it's also my fault. What ggsave, which I'm using here, does by default is save the image at the size it sees down here in the lower right corner. If I set the dimensions explicitly, I think it'll look better. So if I do width equals five and height equals four, that looks a little better, but it still looks kind of funky. Again, this is a frustration of mine with RStudio: it's really hard to get a figure to look the way you want in a reproducible way unless you are working off of the saved TIFF. I'm really not sure what happened there, whatever.
So this is an opportunity to show you a few things. I think one of the problems I have is this coord_fixed function. Let's add some arguments to it. Let's do xlim, and with c() I'll go from negative 0.8 to 0.8, and then ylim also from negative 0.8 to 0.8. Give that a run. We could mess with the margins later, but for now I'm happy that I've got my legend back, and we can move on without worrying about it too much. This underscores that if you're making a figure that is going out for more public consumption, it pays to use ggsave to save it directly and to set the width and the height. I think part of the reason mine looks so weird is that I've got things zoomed in so that you can see them more easily. But anyway, I think we'll be in good shape.
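The two fixes described above, pinning the axis limits inside coord_fixed and giving ggsave an explicit width and height, can be sketched roughly as follows. This is a minimal sketch rather than the episode's actual script: the toy data frame, its column names (axis1, axis2, disease_stat), and the output file name are all assumptions made for illustration, and the file is written as a PDF to a temporary directory only so the example runs anywhere.

```r
library(ggplot2)

# Hypothetical stand-in for the ordination data; the real data frame and
# its column names may differ.
set.seed(1)
toy <- data.frame(
  axis1 = runif(30, min = -0.6, max = 0.6),
  axis2 = runif(30, min = -0.6, max = 0.6),
  disease_stat = rep(c("healthy", "diarrhea", "case"), times = 10)
)

p <- ggplot(toy, aes(x = axis1, y = axis2, color = disease_stat)) +
  geom_point() +
  # Keep a 1:1 aspect ratio and fix both axes to the same range
  coord_fixed(xlim = c(-0.8, 0.8), ylim = c(-0.8, 0.8))

# Explicit dimensions (inches by default) make the saved file independent
# of the size of the RStudio plot pane. The file name is an assumption.
out_file <- file.path(tempdir(), "toy_nmds.pdf")
ggsave(out_file, plot = p, width = 5, height = 4)
```

Changing the file extension (for example to .tiff or .png) changes the output format, provided the corresponding graphics device is available on your machine.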
I don't want to spend a whole lot of time messing with that. I want to get to thinking about how we can group our data. As I look at this ordination, my eyes are really drawn to all of the points, and I'm going back to the legend to try to see: are there more gray points down here? Are there more red points up here? Are there more blue points up here? What's going on? So I'm spending a lot of time with my eyes going back and forth, trying to interpolate the data and see where the centroid is. I think the first thing we can do to help our audience is put down a point in the middle of each of the three clouds to indicate where its centroid is.
How do we do that? Well, I'm glad you asked, we're in the right spot. We've got this metadata_nmds data frame, which as we saw in the previous episode has our metadata and then, joined at the very end, the two axis columns that we're plotting. I'm going to group_by disease_stat, which is the variable we're using for our color. To get the centroid, I will do a summarize, where axis1 equals the mean of axis1 and axis2 equals the mean of axis2. I'll also add .groups = "drop" so I don't get those annoying messages about grouping. Now we see the centroid coordinates, and we can imagine a point being plopped down at each of those spots. Good. I need to save this as a data frame that I'll call centroid.
Then I can add extra points, extra data, to my ggplot. This geom_point is using data from metadata_nmds. I can add another geom_point that uses data from centroid. Its mapping will be aes with x equals axis1, y equals axis2, and color equals disease_stat. And I'll add that.
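The centroid calculation described above amounts to a grouped mean of the two ordination axes. Here is a minimal sketch of that step, assuming the data frame and columns are spelled metadata_nmds, disease_stat, axis1, and axis2 as they sound in the episode; the simulated values and the three group labels are purely illustrative and are not the published data.

```r
library(dplyr)

# Simulated stand-in for metadata_nmds (an assumption, not the real data)
set.seed(19760620)
metadata_nmds <- tibble(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 50),
  axis1 = rnorm(150, mean = rep(c(0, 0.2, -0.2), each = 50), sd = 0.15),
  axis2 = rnorm(150, mean = rep(c(-0.2, 0.2, 0.2), each = 50), sd = 0.15)
)

# One row per disease status, holding the mean position on each axis
centroid <- metadata_nmds %>%
  group_by(disease_stat) %>%
  summarize(axis1 = mean(axis1),
            axis2 = mean(axis2),
            .groups = "drop")   # return an ungrouped data frame

centroid
```

Keeping the summary columns named axis1 and axis2 means the centroid data frame can be handed to a second geom_point with the same x and y mapping as the main layer.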
And let's go ahead and run that. Now the problem is that it's plotting the centroids with the same plotting symbol as everything else, so I have no idea which of those points are the new ones I added. I need to change that. I can use a different plotting symbol by setting the shape outside of the mapping argument, so I can do shape equals 15. If we look at that, it's still not obvious to me where they are, so I'll make the size bigger. Let's do size equals five. And bam, there we go. We now have these large squares indicating the centroids of the three different clouds. So we can say, yeah, the grays are down here at the bottom, the reds are over here on the left, and the blues are more towards the right. We can do a statistical test later to figure out whether or not those differences are significant. We'll probably save that for a future episode.
One of the things I don't like about this, though, is that it makes the symbols in the legend much larger squares. I don't want the legend to show the square, I still want it to show those circles. So I can add another argument to the second geom_point, which will be show.legend equals FALSE. That way, and be sure I rerun ggsave, I have my previous legend with the circles, and I have those squares to indicate where each centroid is. Good. So that's approach one: plopping in the centroids to make it much easier for our audience to see what's going on. If they're familiar with this type of plot, it will be easier for them to see that. What I do now is compare these three squares rather than the couple hundred points that are behind them. That's helpful, but my eye might still want to wander, comparing this red square to all the other squares.
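Approach one, as walked through above, can be sketched as a second geom_point layer drawn from the centroid data frame. This is an illustrative sketch under assumptions: the simulated data, the group labels, and the gray/blue/red palette stand in for the real ordination and may not match the episode's code exactly.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for metadata_nmds (an assumption, not the real data)
set.seed(19760620)
metadata_nmds <- tibble(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 50),
  axis1 = rnorm(150, mean = rep(c(0, 0.2, -0.2), each = 50), sd = 0.15),
  axis2 = rnorm(150, mean = rep(c(-0.2, 0.2, 0.2), each = 50), sd = 0.15)
)

centroid <- metadata_nmds %>%
  group_by(disease_stat) %>%
  summarize(axis1 = mean(axis1), axis2 = mean(axis2), .groups = "drop")

p_centroid <- ggplot(metadata_nmds,
                     aes(x = axis1, y = axis2, color = disease_stat)) +
  geom_point() +
  # shape and size sit outside aes(), so every centroid gets the same
  # large filled square; show.legend = FALSE keeps circles in the legend
  geom_point(data = centroid,
             mapping = aes(x = axis1, y = axis2, color = disease_stat),
             shape = 15, size = 5, show.legend = FALSE) +
  scale_color_manual(values = c(healthy = "gray",
                                diarrhea = "blue",
                                case = "red")) +
  coord_fixed(xlim = c(-0.8, 0.8), ylim = c(-0.8, 0.8))

p_centroid
```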
So here's a second approach that's commonly used. I don't know what it's properly called, so I'm going to call it a star plot. Tell me down below in the comments if you know the actual name for it. I call it a star plot because, if we take the red points, we'll draw a line from the red square out to each of the red points, from the blue square out to each of the blue points, and the same for the gray. Let's give that a shot and see how it looks. I have a feeling that we have so many points that it might just look really bad.
How do we do that? Well, let's start like we did back here. I'm going to copy this, but I'm not going to do summarize, I'm going to do mutate, and I'm going to call the new columns centroid1 and centroid2. What we're doing is taking our big data frame, metadata_nmds, grouping by disease_stat, so the three different disease statuses, and then for each of those creating columns holding the mean of the axis1 column and the mean of the axis2 column. That's our centroid. Then I'll add an ungroup to the end of that. In the output, at the end here, we see the two centroid columns. I will save this as star.
And you know what, so that we can keep the first plot, I'm going to bring this down below. I'm going to save the Schubert NMDS figure as centroid, and then copy down my code so that when I put it up in the notes, it's easier for you to see what's going on. Okay. Again, what we want to do is plot a line segment from the centroid out to each point. Whoops, I didn't copy everything I needed, I also need that ggplot line. I'm going to change some things: instead of metadata_nmds, I'm going to put star. And I'm going to add a third geom, which will be geom_segment.
What geom_segment needs is x and y, and it also needs xend and yend. So xend will be centroid1 and yend will be centroid2. I'll go ahead and put these on different lines so it's easier for you all to see. Let's see what this does. I'm kind of curious. We get an error. All right, so I'm getting an error that centroid1 was not found. That is coming from the geom_point where I'm plotting the centroids. I need to add inherit.aes equals FALSE there, because that layer is trying to find centroid1 and centroid2 in the centroid data frame, which only has axis1 and axis2. So I think we're in good shape now. Let's go ahead and run this.
And yeah, that looks pretty trippy. You can't see the squares in there. Not a big deal. Again, that looks pretty wild. It's a little bit distracting to have all that going on. Where I've seen this be a bit more effective is when you have fewer points. At the same time, I don't know that you really need the lines. I'm going to show you another way that we can think about grouping these points together. We might also think about removing the points at the tips. We could do that by commenting out the whole geom_point. Now we have the stars without the exterior points, but we also don't have the centroids showing up very clearly. We can kind of see where they are. Actually, you know what, we've got the segments on top of the centroid points. So let's change the order of the layers and see if that makes it easier to see. Now we can kind of see that we have them there. What happens if we make the color of the centroids black? Let's remove color equals disease_stat from there and do color equals "black".
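The star plot described above, including the inherit.aes fix, can be sketched as follows. It is a sketch under assumptions rather than the episode's exact code: the simulated data and group labels are illustrative, and placing xend and yend in the top-level aes() is my reading of why the centroid layer went looking for centroid1.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for metadata_nmds (an assumption, not the real data)
set.seed(19760620)
metadata_nmds <- tibble(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 50),
  axis1 = rnorm(150, mean = rep(c(0, 0.2, -0.2), each = 50), sd = 0.15),
  axis2 = rnorm(150, mean = rep(c(-0.2, 0.2, 0.2), each = 50), sd = 0.15)
)

centroid <- metadata_nmds %>%
  group_by(disease_stat) %>%
  summarize(axis1 = mean(axis1), axis2 = mean(axis2), .groups = "drop")

# mutate() keeps every sample row and attaches its group's centroid
star <- metadata_nmds %>%
  group_by(disease_stat) %>%
  mutate(centroid1 = mean(axis1),
         centroid2 = mean(axis2)) %>%
  ungroup()

p_star <- ggplot(star, aes(x = axis1, y = axis2, color = disease_stat,
                           xend = centroid1, yend = centroid2)) +
  geom_segment() +                 # spoke from each sample to its centroid
  geom_point() +                   # the samples at the tips
  # Drawn last so the squares sit on top of the spokes. inherit.aes = FALSE
  # stops this layer from looking for centroid1/centroid2, which do not
  # exist in the centroid data frame.
  geom_point(data = centroid,
             mapping = aes(x = axis1, y = axis2),
             color = "black", shape = 15, size = 5,
             inherit.aes = FALSE, show.legend = FALSE) +
  coord_fixed(xlim = c(-0.8, 0.8), ylim = c(-0.8, 0.8))

p_star
```

Commenting out the plain geom_point() line gives the version with spokes only and no points at the tips.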
One thing to remember is that if something is inside the mapping, then we're mapping data from a column in the data frame to some aesthetic or, again, pre-attentive attribute. If it's outside of the mapping, then everything gets the same treatment, right? So the three centroids will all have the same color, the same size, and the same shape. The black isn't doing it for me. Maybe we could make the square the same color as the lines, but make the border black. Let's see what happens if we do fill equals disease_stat. Then we get the default colors, so one thing we might want to do is copy the manual color scale down and make it scale_fill_manual. That gives you a clear square for where each centroid is. Again, I'm not totally sold that I like this visualization. I think it's kind of busy with all those spikes going around. I think this might be a bit more effective when you have fewer points, so that it's not so noisy and doesn't look like fireworks going off. What we're coming to is the benefit of drawing an ellipse around the three different clouds of points, so let's give that a shot.
For the next plot that I want to make with you, the third plot, I'm going to copy from what we had earlier and bring that down. Again, we've got our metadata_nmds, and to remind you what that looked like, this was the plot with our three centroid points. I'm going to remove those centroid points to get a clean ordination, like so. And we're good. Let me copy down a ggsave, and I'm going to call this one ellipse. What we're going to do is draw ellipses around the three different clouds of points. How do we do this? It's actually not so bad. What you can do is take stat_ellipse and add it to the ggplot pipeline. Let's give that a run and see what it looks like. And I will open this.
And what you see is, well, big ellipses around our points. One of the things I don't totally like about this is that each ellipse is much larger than its cloud of points. This red ellipse goes way out beyond where the points are. Now, the ellipse is actually a statistical transformation. That's why it's stat_ellipse. It assumes a distribution for the data in order to draw these ellipses, and our data don't really follow one. I really just want the ellipse to make the plot look pretty. One of the arguments is type, and the default is "t", which uses a multivariate t distribution. You can also use "norm" for a multivariate normal distribution. I think the t distribution gives you wider tails, and so a larger ellipse. Let's stick with the default for type. What you can change is level. By default the ellipse is drawn at the 95% confidence level, so let's do, say, 0.7. Again, I'm not so interested in the statistical transformation, I want a general idea of where the points are. Maybe that's better, in that it brings the ellipse in and constrains it a bit. Let's make it a little bit larger with, say, 0.8. Give that a shot. Let's split the difference and do 0.75. I think that looks okay. Again, I'm not going for some precise fitting of the data. There's something called a convex hull, and maybe in a future episode we'll talk about building those. That's something you can easily do with a package called ggforce. But for now, let's work with pure tidyverse.
Those ellipses, I think, look fairly decent. What I'd like to do is shade the ellipse, so fill the ellipse, but put it behind the points. How do we do that? Well, to put it behind the points, the ellipse layer has to come before geom_point in the code, and to get something we can fill, we add geom equals "polygon". That fills the ellipse with this dark color, which is not what I want.
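The ellipse layer discussed above can be sketched in its simplest outline form like this. The simulated data and group labels are assumptions for illustration; the only arguments taken from the discussion are level and the choice to leave type at whatever ggplot2 uses by default.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for metadata_nmds (an assumption, not the real data)
set.seed(19760620)
metadata_nmds <- tibble(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 50),
  axis1 = rnorm(150, mean = rep(c(0, 0.2, -0.2), each = 50), sd = 0.15),
  axis2 = rnorm(150, mean = rep(c(-0.2, 0.2, 0.2), each = 50), sd = 0.15)
)

p_outline <- ggplot(metadata_nmds,
                    aes(x = axis1, y = axis2, color = disease_stat)) +
  # Listed before geom_point so the ellipses are drawn underneath the
  # points. level = 0.75 pulls the ellipse in from the default of 0.95;
  # type is left at the ggplot2 default.
  stat_ellipse(level = 0.75) +
  geom_point() +
  coord_fixed(xlim = c(-0.8, 0.8), ylim = c(-0.8, 0.8))

p_outline
```

Because color is mapped in the top-level aes(), stat_ellipse draws one ellipse per disease status without any extra grouping code.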
Instead, what I want to do is fill equals disease_stat. Then we get the default color scheme, which again isn't what I want. But like we saw earlier, when we were making those centroids with a fill that differed from the border color, I'm going to use scale_fill_manual. That gives us ellipses that are the same color as the points, which isn't exactly what I want either. So I'm going to pick colors that are muted versions of the colors I already use. Instead of gray, I'll use light gray. Instead of blue, I'll use dodger blue. Instead of red, I'll use pink. This gives us ellipses that, for the most part, encompass all of our points. These are opaque ellipses, so you can't see what's behind them. What we can do, up here in stat_ellipse, is add alpha equals 0.2. That gives us an ellipse that's shaded but not so intrusive, right? I think this does a pretty nice job of showing the general distribution of the data.
One thing I am noticing is that the legend again has a box in it to indicate the polygon geometry. We can remove that as we saw earlier. Do you remember what we used? Yell it so I can hear you. Just show.legend equals FALSE. That gives us back our legend as we had it before, with those three points. Perhaps what we might also do is fill that with white, and then that looks pretty decent, right? I think that looks fairly attractive. Tell me what you think down below in the comments. I think ellipses are a good way to bring in that pre-attentive attribute of grouping, so it's easier for our audience to see what's going on. If we had more time, something that I might do is get rid of this legend entirely and instead put "C. difficile positive" over here in red text, "diarrhea" over here in blue text, and "healthy" down here in gray text, right?
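The shaded-ellipse version built up above can be sketched as follows. As before, this is an illustrative sketch and not the episode's exact code: the simulated data and group labels are assumptions, and mapping fill inside stat_ellipse (rather than in the top-level aes) is one of several equivalent places to put it.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for metadata_nmds (an assumption, not the real data)
set.seed(19760620)
metadata_nmds <- tibble(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 50),
  axis1 = rnorm(150, mean = rep(c(0, 0.2, -0.2), each = 50), sd = 0.15),
  axis2 = rnorm(150, mean = rep(c(-0.2, 0.2, 0.2), each = 50), sd = 0.15)
)

p_ellipse <- ggplot(metadata_nmds,
                    aes(x = axis1, y = axis2, color = disease_stat)) +
  # geom = "polygon" makes the ellipse fillable, alpha keeps it from
  # hiding the points, and show.legend = FALSE keeps the polygon box
  # out of the legend keys.
  stat_ellipse(aes(fill = disease_stat), geom = "polygon",
               level = 0.75, alpha = 0.2, show.legend = FALSE) +
  geom_point() +
  scale_color_manual(values = c(healthy = "gray",
                                diarrhea = "blue",
                                case = "red")) +
  # Muted counterparts of the point colors for the ellipse interiors
  scale_fill_manual(values = c(healthy = "lightgray",
                               diarrhea = "dodgerblue",
                               case = "pink")) +
  coord_fixed(xlim = c(-0.8, 0.8), ylim = c(-0.8, 0.8))

p_ellipse
```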
That way I don't have to go back and forth trying to figure out what color corresponds to each group. I know where it is. Maybe that's something we can tackle in a future episode, or maybe you can use some of your Google skills to figure out how you would do that. All right, so I like this plot.
There's one other thing that I want to try that is a little bit different. Again, I'm going to copy this ggplot chunk down, and I'm going to get rid of stat_ellipse. Let's see what this all looks like. We need to add our ggsave line, and I will call this one density. So we've got what we've been working with, and I can also get rid of the scale_fill_manual. I hope you can see that so much of building plots can be, it doesn't have to be but can be, taking something you've done before, copying it, pasting it, modifying it, and seeing if you like it better. Anyway, what I will add is geom_density_2d. What this should do is create a density map on our ordination for each of the three diagnosis groups. Let's see what this looks like. And wow, that looks pretty wild. But I think you can kind of see it, right? The gray points are focused right around here, which is about where that centroid was. Our red points are clustered up at about 11 o'clock, if you think of this as a clock face. And our blues, our diarrheas, are more towards one o'clock or so. I kind of like this. I feel guilty saying that. But I think I need to turn off the legend here with show.legend, and then I also need to get rid of the fill. That gives us an interesting plot, right? What happens if we turn off the points? I mean, I think I need the points. There's something called xenographics, which are data visualizations that are very different from what you're used to. I think this is kind of up there.
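The density-contour variation described above can be sketched by swapping the ellipse layer for geom_density_2d. The simulated data, group labels, and palette are assumptions for illustration. Note that drawing this plot relies on ggplot2's two-dimensional kernel density estimate, which may require the MASS package to be installed.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for metadata_nmds (an assumption, not the real data)
set.seed(19760620)
metadata_nmds <- tibble(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 50),
  axis1 = rnorm(150, mean = rep(c(0, 0.2, -0.2), each = 50), sd = 0.15),
  axis2 = rnorm(150, mean = rep(c(-0.2, 0.2, 0.2), each = 50), sd = 0.15)
)

p_density <- ggplot(metadata_nmds,
                    aes(x = axis1, y = axis2, color = disease_stat)) +
  # One set of contour lines per disease status, colored like the points;
  # show.legend = FALSE keeps the contour lines out of the legend keys
  geom_density_2d(show.legend = FALSE) +
  geom_point() +
  scale_color_manual(values = c(healthy = "gray",
                                diarrhea = "blue",
                                case = "red")) +
  coord_fixed(xlim = c(-0.8, 0.8), ylim = c(-0.8, 0.8))

p_density
```

Dropping the geom_point() line leaves only the contours, which is the more cartoon-like version that might suit a lay audience.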
Normally, I think, you would have a density plot of one variable. So if I were to present this, it might be hard to really justify. Perhaps it would be better with even more points. I don't know what I really feel about this. It's kind of out there, and it might be something that would be cool to plot and hang on my wall. Who knows? Anyway, there's also a geom_density_2d_filled. Let's see what that looks like. Yeah, I think it's only doing one variable. It's not doing two variables. So let's go back to geom_density_2d and add the geom_point back.
Something else we might think about doing with this is turning off the points. Depending on your audience, you may or may not want to confuse them with the points, right? If it's going into a scientific publication, I think you should show all the points. If it's something that you're showing to more of a lay audience, then maybe you turn off the points. You can show that healthy people have a community structure that's very different from people with diarrhea and people with C. difficile, but that people with C. diff appear to be a little bit different from people with diarrhea. How would we do that? We could come back up here, turn off geom_point, and see this, right? This is almost a cartoon example of what the visualization might look like. I'm going to put those points back in there and call it good.
Again, I'm pretty happy with how this looks. Tell me what you think of the ellipses. Sometimes I feel like they're a bit of chartjunk. Sometimes I'll see ellipses around, like, four points, and it's kind of like, really, do I need an ellipse to tell me that those four points are all grouped together? I think that's kind of obvious.
But I think in something like this, where you have more points, it is helpful to have the ellipses to make it easier to understand what's going on. Again, let me know what you think down below in the notes. Anyway, I hope this has been helpful in showing you four different ways that we can think about the pre-attentive attribute of grouping our data and, if you will, drawing a lasso around our points to make it clearer to our audience what's going on in the data and what we ultimately want them to see. Now, are the diarrhea samples in blue and the C. difficile samples in red significantly different? I don't know. I'm going to assume that the healthies are, but we'll save that for another episode, where we'll think about using vegan, another R package, to test the significance of that.
Anyway, see if you can give this a shot with your own data. It doesn't have to be ordination data. You could do it with any type of scatter plot data, drawing these ellipses to help group the data and make it instantly clear to your audience what they should be focusing on and what comparisons they should be making. And of course, we want to make things easier for our audience, because then they're going to be much more likely to keep reading our paper, stay engaged, and ultimately get the most out of our work. All right. Well, that's enough for today. I hope you found this interesting. Please practice with this. Check out the code that I'm posting, tweak it, and modify it to make it your own. And again, see if you can apply it to your own data and draw ellipses or contour maps or whatever around your points. Anyway, keep practicing, and we'll see you next time for another episode of Code Club.