 It's pretty easy to critique other people's data visualizations. You didn't have to make it, right? You didn't have to deal with the constraints and restrictions that the developer of that visual had to make. Well, I think I've perhaps been leading you down a path of thinking that there is an ideal way or perfect way of visualizing relative abundance data for microbial communities. In the last episode, I showed you my preferred approach. I didn't mean that it was perfect, but it's my preference. It's what we call a jitter plot or a strip chart, where we can see each individual point for a categorical variable. Now, the downside of that, as I showed in the last episode, was that if you have hundreds of points, then it just looks like a big blob of points. And it doesn't really tell you everything that you might want about your data. Well, today, we're going to add some more summary statistics to our plot, perhaps get rid of those points, and see if we don't come up with a better visual. Hey, folks, I'm Pat Schloss, and this is Code Club. As we've been marching through recent episodes looking at a variety of approaches of visualizing relative abundance data, there have been a few themes that I've come back to time and again, as I've tried to evaluate my data visuals. So we've talked about things like, is it easy to discriminate between our different taxa? So for example, with the stacked bar chart, as you add more and more taxa, the discrimination between your different colors in your color palette become harder and harder to make. Is it possible to make comparisons across different treatment groups? So if we have one taxa that we want to compare across different treatments or different groups of, say, patients, can we easily make that? And so we talked about the problem of stacked bar charts again, is that perhaps you can do that with a taxa on at the bottom, or at the top if you're, you know, thoughtful in how you design your visual. But those ones in the middle are harder to compare. And so that's why we talked about bar plots. So we put every taxa across the x-axis, and that way we have a common baseline on the y-axis to make our comparisons. Another theme that we've talked about is how difficult is it to see the amount of variation in the data? Well, in the last episode, we showed all the points. And that was helpful to some degree, because we could see all the points, we could see the variation in the data. Whereas if we look at something like a pie chart or stacked bar chart, bar plot or the heat map, we're showing the average and we don't show the amount of variation in the data. But that strip chart really did show us the amount of variation across all the individuals within each of our disease status groups. The fourth consideration is how difficult is it to see the end or the number of subjects or samples that you have in each grouping. Now, we did that, we showed all of our points in the last episode for that jitter plot or strip chart, whatever you want to call it. But you get to a point where there's so much data that is really difficult to see the distribution of the data, right? That our points kind of glom onto each other and make this looks like blob of data. And so it's hard to see what's going on. Sure, you can see the distribution, sure, you can indicate the median with, you know, that crossbar. But there's just too much data. So there are tradeoffs, right? And so, like I said, there is no perfect way to visualize the data that might get all four of those aspects right when visualizing relative abundance data. The other thing to consider is that we perhaps want to look at a lot of taxa. If we're looking at the phylum level, we might have four phyla and an other category. But say we want to go to the genus level or the O2 level, we might have 20 taxa that we want to represent. And so that way, then, you know, it gets harder and harder to see the data, see the distribution and the data, because things are going to get more compact with each other. So again, there are tradeoffs, there are constraints. And so in today's episode, what I want to look at is perhaps, how can we add more summary information to the data, so that we don't have to show all of the data, all of those points, and that, you know, we'll tell our readers, hey, we got a lot of data. We're not, we're not showing you summary statistics based on three values. We're showing you summary statistics based on 100 values, right? And I think that's far more compelling. And hopefully, you know, the reader is just going to have to trust us that we're not being loose with our statistics. Yeah, I mean, if you're, if you're showing a box and whisker plot and you've got three points, I mean, what's the point, right? Anyway, so that's what we're going to dig in today. Let's go ahead over to our studio and we'll see if we can't get this going. As always, if you want the code that I'm starting with, down below in the description is a link to a blog post where you can get this code that I'm working with today. Also across the top, there is a link to another video episode where you can see how I installed our studio, got the data, got the packages all installed, and you'll be right where I am and then ready to press play and follow along with me, which really is the best way to learn this material. Don't just sit back, take it all in. You really got to engage the material and practice it with yourself, practice it with what I'm doing. And then the pro move would be to move it and work with it on your own data. That would be so much more powerful than just sitting back and watching it and you'll learn so much. And in the end, hey, you'll have a plot like mine, but for your data and isn't that a win? All right, so I'm not going to go through everything in this code, but basically what we're doing is we're reading in O2 data, taxonomy data, metadata, throwing it all together, and then calculating relative abundances. And as I have here, we're looking at it at the phylum level. Then we're also pooling together any taxa that fall below a 3% threshold for any of the different disease status groups. So if the relative abundance is above 3%, for a taxon in any of the three disease status groups, we're going to keep it in the final output. This code chunk here from line 61 on down, goes ahead and builds a jitter plot, strip chart, whatever you want to call it, using geom jitter, and then uses stats summary to put a bar across each of the clouds, indicating where the median is. I'm going to go ahead and run all this and show you where we're starting. Again, we've got our jitter plot here of formicides, bactyroidettis, proteobacteria, rucomycrobia, and that pooled category of other for three disease status groups, people who are healthy, people who have diarrhea and are C. difficile negative, and people who have diarrhea, but are C. difficile positive. And again, those horizontal black lines indicate the median of the distribution of all the points. So the first thing I'd like to do with this figure is go ahead and convert it from a jitter plot to a box and whisker plot. I'll go ahead and comment out my geom jitter and stats summary lines, because I might want to reuse that code later on. I might want to add a jitter plot on top of the box and whisker plot. We'll see. It's always best to comment things out and then delete them later than delete them now and wonder, gee, I wish I still had that code. Anyway, so we'll do geom box plot. And let's start there and add the plus sign. This is the box and whisker plot version of our jitter plot. You'll notice the solid line in the middle of these rectangles. I guess it's not the middle, but in the midst of the rectangle is the median. The top and bottom edge of the rectangle represent the 75th and 25th percentiles. That's the intra-cortile range. The whisker extends one and a half times the intra-cortile range, the IQR. And so we then have these extra points at the top and for bactoid eddies over here below the whiskers indicating extreme values. Something that you may have noticed or remembered from looking at the jitter plot was that there are actually a handful of points up here around, you know, 95, 100 percent for the fremicides and proteobacteria. Those are all kind of getting collapsed into the same x-axis coordinate. And so this point actually represents many more points than what we're seeing here. And so with that in mind, what I think I'd like to do is go back, remove these outlier points and overlay on top of my box plot, my jitter plot. So again, we'll come back up to our ggplot and instead of fill, I'll do color. So that will set the border color. I'm going to leave my geombox plot. Let's get geomjitter back in the game. Instead of pch21, I'll use 19. I no longer need the stroke or the size. Those are things we did to kind of game the system, if you will. And if we look at scale fill manual, we want scale color manual. And let's just double check that there's nothing else we might want to change. And I think that all looks pretty good. So again, this is our box and whisker plot with the jitter plot incorporated as well. There's a couple things I want you to notice. First of all, we still have our outliers from our box and whisker plot. So here, you can kind of see this blue point. It looks a little oblong, right? So there's a point there that is the outlier from the box plot. And there's a point there that's there from the jitter plot. And you can kind of see that better, maybe up here at the top of this diarrhea-cetaph negative proteobacteria column. So we need to turn off the outliers. The other thing I kind of noticed is that if again, if I look at this proteobacteria triplet of three columns, that the blue box seems to do a pretty good job of enveloping the cloud of points. But the red seems to be shifted to the left, just a smidge, and the gray to the left, or to the right, or whatever, just a smidge. So I think I need to play with my positioning of either the jitter plot or the box and whisker plot. First, we'll come back up to our geombox plot, and we'll do outlier.shape equals na. And that will get rid of the outlier. The other thing that I'll do then is position equals position dodge. And I will then give it width 0.8. And so now I no longer have those extra outlier points from the box and whisker plot. And it does appear that that gray box is better centered on the cloud and the same for the red. So that all looks pretty good. So I think this looks pretty reasonable. One thing that I kind of feel like is that sometimes the box and the median line kind of gets lost in the data. So if you look at like the firmicities on the far left side here for the gray, that it gets kind of hard to see where the box and whisker are for the gray, because there's so many points there, right? Certainly if my jitter were wider, you might not even be able to see any of the gray box. So that's a that's a challenge, right? And again, this comes back to the problem that there's there's just a ton of data being shown here. And I don't know that we really need to show all of the individual data points, we could do different things like changing the alpha of the points, making the points smaller. But in the end, if we make things smaller or make them more transparent, then they're harder to see. So what are we really gaining by having all those points? Before we totally ditch the idea of the jitter plot, let's go ahead and convert this to look at the genus level to see what this looks like when we have a lot more data, because I think that might also further convince us that there's just too much data in the jitter plot for when we have a large number of samples, I will come back up to the top of my code here. And you'll recall that I have this filter on level phylum, let's do genus, and we'll run it and see what things look like. And again, what we've seen before is that the names all kind of run into each other. And what you also notice, of course, is that I will, you know, for these other classes from about the fifth set to the right, you can't even see the box and whisker for any of the taxa. And for most of these on the left, it's really hard to see those as well. Let's go ahead and flip our x and y coordinates. So I'll make tax on y relative abundance on x. And then I'm going to come back through and make sure that I update everything. So this should be scale x continuous. You know, I don't even know why I have scale x continuous in here. We'll do y for null and x for relative abundance. And then my axis text, x needs to be y. And I think that all looks good. So this looks like a bit of a mess. I think you can get a sense of what's going on the data enough to know kind of the strengths and weaknesses of this approach. Again, because we have so many points, it's really difficult to see the boxes and boxes whiskers the median lines of the box plot along with our points. Things just get lost in the cloud of points because we have so much data. Maybe this would work better if we had say 10 to 20 samples in each of the taxonomic groups. But as it is, I think we just have way too much data to be showing this way. We might do a few things again, we might make the figure longer than wide. That might give us more separation. We could also perhaps make the box and whiskers like black or a different shade of gray, blue and red. But ultimately, you know, I think I think there's just going to be so much going on in the figure that that's not that's not what I want to do. You might give it a shot and decide you like it. But for my aesthetic purposes, that's not really what I want to do. So I think what we'll do is maybe step back from the box plot and let's look at incorporating some of the stat summary functionality to see if we can replace kind of this mess of data to make it a little bit more of a cleaner depiction of the relative abundances. As we build that out, I'm going to come back up and recreate my plot at the phylum level. So let's come back up to our GMs. Again, one of the things that I really like about organizing my code by putting all my GMs together is that if I want to go change a GM or add in a GM or take out a GM, they're all in the same place. And that just makes it nice to have my code organized. So I always know where the different parts of the plot are. I encourage you to do the same. All right. So I had GM box plot. I'm going to go ahead and remove that for now. I'm going to uncomment stat dot summary. And you'll recall that we used fun median with GM crossbar. So the crossbar GM is the box with the median line. Let's go ahead and replace fun with fun dot data. And I will do median underscore high low. And so median high low is a function that returns the median, as well as the upper and lower confidence interval on our data. And again, GM is that the box of the box and whisker plot. And here we go. We have our boxes. But what you'll notice is that the boxes are much longer than what we had with the box and whisker plot. Something I'm curious about is what is the confidence interval that median high low is using? If I do question mark median high low, because that is a function that returns the median, as well as the Y min and Y max, I learn over here in my help that these are wrappers around functions from the H MISC package. And so it is designed to make it easier to work with stat summary. So we can see the H MISC documentation for more details. So we're using median high low, median underscore high low as a wrapper as a replacement for S median period high low. So we'll click on this. And if we come down and we look at S median high low, we see that the confidence interval that it appears to be using is 0.95. Indeed, if I look back at my figure, that makes a lot of sense that 95% of the data is within those rectangles. So I wonder how do I change that confidence interval? I'm going to go back to stat underscore summary in my help. And again, looking down through the usage description here for stats summary, I see these various fun arguments. And I see a fun args, which allows me to give it a set of arguments to go into the function. And if I come down somewhere here, optional additional arguments to be passed onto the functions. So let's go ahead and try that out with with our stats summary, where we can do fun dot args equals let me put this on a separate line list as a function. And then I will do conf dot int equals 0.5, add that comma, great, we see now that by using that fun dot args argument, we can change the confidence interval on the median high low function. Let's try a different geome. Perhaps we could try geome point range. And so here we've got I see a line. But it's hard to see anything else. So I'm going to go ahead and turn off the geome jitter and see if we can't get a better looking plot. Wow, that cleans things up a bit, right? So now we see that we have our points indicating the median, as well as a line indicating the 50th percentile. And so I think this looks a lot cleaner than what we had previously. I am getting an argument that it doesn't know what to do with the parameter width. And that again, makes sense because we don't typically think about using a width with a dot plot or line plot. I'm also going to turn off that size. I'm curious what happens if we turn off position if it naturally dodges, or if it puts things right on top of each other. Yeah, so it puts those right on top of each other. Let's go ahead and turn back on the dodge. Yep, so that dodge got those points separated apart, which looks good. I noticed we also lost our legend along the way. So back up here in stat summary I had show legend equals false. Because we got rid of geome jitter, which was providing the data for the legend, we need to keep that. So we'll go ahead and re insert that. I'm also going to change the position dodge to be say width equals point four. And that brings those in a little bit tighter to give perhaps more separation between the different taxonomic groups. And we see, of course, that we've got our legend back in the upper right corner of our plot. Let's come back and now look at the genus level. Again, I will change my level from phylum to genus. So again, this is still a bit of a mess with the ordering and the position of the legend and all that. Again, we've removed the jittered points to give us a cleaner figure. Yeah, we're not showing all the data, but we're showing the 50th percentile, right? So 50% of our data falls within those lines. We could come back and let's look at, you know, let's look at the 95th percentile and let's see what that looks like. And that gives us really wide span and variation in our data. And for some of these things like chronobacter focaliocola interococcus vectorities, the range is so wide that I don't think it's really meaningful. So I'm content to stick with the 50th percentile. Something I've noticed in previous episodes with the genus level is that a lot of our data are falling into that other category. So like the median between 20 and 30% of our data is falling into that other category. So let's go ahead and change our threshold for pooling from 3% to 1%. I'm also going to change the dimensions of my outputted figure to be a height of let's say seven. And maybe I'll make it a little bit more narrow of a width of five. Great, we've got a lot more genera depicted here. One of the things I'm really enamored with with this code is that I can very quickly pivot between phylum, genus, different pooling levels, different numbers of taxa without my whole figure just kind of blowing up. I think that's just such a slick thing about ggplot and why it's so nice to work in that environment. Anyway, one of the things I'm also noticing is that from Akramanthia back on up to this unclassified bacteria pool that I can't really see any difference between those different groups. I know because I set the threshold that one of the three disease status groups is greater than 1%, but I can't really see that difference. Sure, maybe I can for this like Agathobacter you know, Pheasant or Roseberry, but I can't really see it. Something we might think about doing in a future episode is seeing how we can put the relative abundance on a log scale. That will introduce other complications, but we'll deal with that in another episode. So maybe what I'll do is go ahead and put this back to 3% because again, I can't see the difference when I go down to that 1% threshold. The other thing I'll do is flip the order of my taxa as well as my disease status groups. So first things first, we'll come back up and we'll change this pool from 1% to 3%. And then up here, my disease status groups, I'll change the order, put in case first and non-diarrheal control last. Because of the way the factoring works in the plot, it actually turns out to be the opposite. Again, let's not think too deeply about it. We want things to look right. And then down here where we define our taxa as a factor, we reorder by the mean relative abundance across all three disease status groups in a descending order. So I'm going to set that to be false so it's in a ascending order. So it looks a lot better I think. I'm noticing that my legend is in the way and also that some of the points in the point range, the dodge isn't enough. So I'm going to move the legend and increase the dodge width. So again, we did that dodge width up here in stat summary. So let's give that 0.5. And then our legend position is currently in that upright corner. I think the exposition of 0.8 is good. Let's go down to 0.6. So we've got the legend to a good position, but I'm going to change the dodge width just a little bit. You all if you've watched this before know that I am really good at puttering with things and just tweaking and tweaking and tweaking. So I think this looks pretty nice. I've got my genera across the y-axis again and then across the x-axis, my relative abundances. I've got my points to indicate the median as well as the range to indicate the confidence interval. Of course, in a caption, I would need to indicate the n for each of the three disease status groups as well as what that line range represents. I could even if I wanted, go ahead and insert the n into this legend here. I guess if you're looking at this at first glance, I'm really hoping that you'll trust me. You'll trust me that I'm not trying to pull a fast one on you in terms of what I'm showing in the data. And so I'm more content to leave kind of the nitty gritty of the details down in that figure caption at the bottom where we put those things for our scientific papers. So otherwise, I think this looks pretty good. I'm happy with the way it looks. Let me know down below in the comments if there's anything you would change. Again, it's a sacrifice, right? We're getting rid of that jittered data, but it's just cleaner. It's easier to see what's going on. And we are relying on our audience to trust us a little bit in what we're showing them. I think sometimes we worry so much about showing all the data and making things as transparent as we can that we lose something in the storytelling of what we're trying to say with our data. Again, what I'm trying to say with this type of figure is that there are individual populations of bacteria that are differentially represented between my three disease status groups. So healthy people tend to have a lot more bacteriities than people with diarrhea. People with diarrhea tend to have more enterococcus than people that are healthy and so forth. So another thing that we all know is that this is kind of me telling a story visually or me looking at the plot and making inferences. I haven't run any statistical tests to tell you that any of that's actually statistically true. So in a future episode, we'll come back and we will identify which of these comparisons is statistically significant. And maybe we'll even use that information to get rid of more taxa so that we're only looking at the comparisons that are statistically significant. So anyway, that's for a future episode. For now, I'm happy with the way this looks. Again, let me know down below in the comments what you think is getting rid of the jittered data. Too much of a sacrifice for you or are you cool with that? I think because we have many taxa and we have these three groups and we have so many points that those clouds of points just get kind of smushed together. And also if you're to plot on top of that like a box plot, you can't see the box plot because there's so many points again. There's just so much going on. So again, I know it's a sacrifice and I'm trying to talk myself into this if you can't tell. But I think this is the superior way to show that this type of data for again, trying to identify populations that are differentially represented between our different disease status groups. Let me know. Tell me if you're convinced. I think I'm finally convinced. Again, take this code, apply it to your own data, see what it looks like. And again, that is the best way to learn how to use our use the ggplot functions to make attractive figures. I think this is an attractive figure. I would be happy with publishing this if I were to add on the statistical information. Anyway, keep practicing with all this. Please tell your friends about Code Club. Make sure they're subscribed. Friends don't let friends watch videos without them being subscribed. Anyway, we'll see you next time for another episode of Code Club.