If you're like me, whenever you try to visualize data, you quickly run into the problem of trying to represent way too much information in a single figure. That's why, when I make visuals, I try to keep things as simple as possible while still conveying the information. But conveying that information is a real challenge. We have more variables we'd like to show. We have more summary data we'd like to show. Where does it end? Well, in today's episode, I'll show you a strategy for combining raw data with summary data. In the last episode, we talked about drawing a single horizontal line to indicate the median. Today, we'll show more of the distribution of the data by overlaying our strip chart with a box plot.

Hey folks, I'm Pat Schloss and this is Code Club. As scientists, we think it's all about the data. Something I've really tried to hammer home in these episodes is that it's really about our audience. We have to have empathy for our audience. That's perhaps soft and squishy, and it isn't what we hardened scientists like to think about; we'd rather think about just the data. No, we've got to have empathy. We have to help our audience understand what's going on in the data, what we see in the data. We have to help them read our data in the visual. One of those tools is overlaying raw data with summary statistics. As I mentioned, we've seen this already with drawing a median line. Today, what we'd like to do is show the interquartile range, as well as some measure of which points are outliers in our distribution. We can achieve that by overlaying a box plot on top of a strip chart, also called a jitter plot. And that's exactly what we're going to talk about in today's episode. I'd love it if you could fire up RStudio and follow along with me as I go through today's episode. The code that I'm starting with, you can find at a link down below in the description.
And up across the top is a video link where you can find instructions on installing RStudio, getting the tidyverse, as well as the raw data that I'm using, and how to set up your project directory so that my code should work for you.

Anyway, as a recap, we are loading the different packages that come to us from the tidyverse package. Hold on to this set.seed. We're reading in the metadata and the alpha diversity data, and then joining that all together. We're defining some colors as well as getting the number of individuals in each of three disease status groups. The data that we're looking at comes from a study of the gut microbiota of individuals with and without C. difficile infections. We had a group of people that were healthy controls, a group of people that had diarrhea but not C. diff, and a group of people with diarrhea and C. diff. To be tested for C. diff, you first have to have diarrhea. Those are the three groups.

Anyway, we then made a jitter plot, which shows columns of points for each of the three disease status groups. Within each of those clouds, the position on the x axis is randomly determined. That set.seed I had up above pegs the random number generator so that when I run this multiple times, even though it's random on the x axis, the points will still fall in the same spot. We then have a bunch of other styling. So let me go ahead and run all this to make sure it works. Now I've got all my libraries and data loaded, and you can see my strip chart here. It's also called a jitter plot. I'll go back and forth. Anyway, I don't know what most people call it. Tell me down below in the comments: what do you call these plots, jitter plots or strip charts? The geom in ggplot is geom_jitter, so maybe we should just call it a jitter plot. Anyway, we also have here a horizontal line indicating where the median is for the data. That's only the 50th percentile.
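For readers following along without the study files, here is a minimal sketch of that starting point: a jitter plot with a median line. The data are simulated, and the column names disease_stat and invsimpson, the group labels, the seed values, and the jitter width are all assumptions made for illustration; they are not taken from the episode's script.

```r
library(ggplot2)

# Simulated stand-in for the study data (values and names are assumptions)
set.seed(1)
groups <- c("Healthy", "Diarrhea, C. diff negative", "Diarrhea, C. diff positive")
df <- data.frame(
  disease_stat = factor(rep(groups, each = 40), levels = groups),
  invsimpson = c(rlnorm(40, 2.5, 0.6), rlnorm(40, 2.0, 0.7), rlnorm(40, 1.6, 0.7))
)

# Fixing the seed before building the plot keeps the random x offsets repeatable
set.seed(19760620)
p <- ggplot(df, aes(x = disease_stat, y = invsimpson, fill = disease_stat)) +
  geom_jitter(shape = 21, color = "black", width = 0.25, height = 0) +
  # Giving y, ymin, and ymax the same value collapses the crossbar to a line
  stat_summary(fun = median, fun.min = median, fun.max = median,
               geom = "crossbar", width = 0.6, show.legend = FALSE) +
  labs(x = NULL, y = "Inverse Simpson index") +
  theme_classic()
```

Printing p draws the figure. The fun, fun.min, and fun.max argument names assume ggplot2 3.3.0 or later.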
We'd also like to indicate the 25th and 75th percentiles, which are the hinges on the box of a box and whisker plot. All right. The first thing I'll do is comment out geom_jitter and stat_summary. I don't want to delete them because I want to work with them later in the episode. The geom we'll use to make the box plot is geom_boxplot, and I'll add that to the workflow. What we get then are these box plots. I've talked about this before, so I'm repeating myself, but that legend doesn't add anything because we have the x axis labels corresponding to the color in each column. So I don't think that legend, like I said, adds anything at all. Great. We have a box and whisker plot, and we have more real estate because that legend is gone.

Let's break down what we're seeing here in the output. First of all, we have this thicker horizontal line that indicates the median of the observations of the inverse Simpson index for each of our three disease status groups. We then have a rectangle, the box. The bottom is the 25th percentile, so 25% of the data falls at or below this line. The 75th percentile is the upper border of that box, so 75% of the data falls at or below it. Sometimes these borders are called hinges. And of course, the median is the 50th percentile. The distance between the 25th and the 75th percentile is the interquartile range, the IQR.

This line then, the whisker: people often ask, what is that whisker? Shouldn't it go to the maximum point or the minimum point? No. The whisker reaches out at most 1.5 times the length of the IQR from the edge of the box, and it stops at the most extreme data point that falls within that distance. So this whisker isn't as long as this one because we had a point right here at about four, right? The whiskers are not symmetrical. They only go as far as the data go.
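The whisker rule is easier to see with numbers. Below is a small base R sketch that computes the box, the whisker ends, and the outlying points by hand for a made-up vector; the values are hypothetical and chosen so that one point sits far above the rest.

```r
# Hypothetical observations with one extreme value
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Box: 25th, 50th, and 75th percentiles
q <- quantile(y, c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences sit 1.5 * IQR beyond each edge of the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
inside <- y >= lower_fence & y <= upper_fence
lower_whisker <- min(y[inside])
upper_whisker <- max(y[inside])

# Anything beyond the fences is drawn as an individual point
outliers <- y[!inside]
```

Here the box runs from 3.25 to 7.75, so the IQR is 4.5 and the upper fence is at 14.5. The upper whisker stops at 9, the largest value inside the fence, and 30 is plotted on its own.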
And if there are data beyond one and a half times the IQR, then we see those as individual points. Most people don't know that, I find, and it's a bit confusing what the whisker represents. So a question for you to answer down below in the comments: what do you think, or did you think, that whisker represented? And if it's new to you that it's 1.5 times the IQR, go ahead and hit that thumbs up button so that I know you're getting something out of this video. All right, so that's one thing to note. Again, when we're trying to have empathy for our audience, we want to make sure they know what they're looking at.

These points then are kind of like outliers. They're not outliers in the sense that they're bad data or that something wrong happened in the analysis, but they have extreme values. They are definitely part of the analysis, and we want to keep them. A problem with this depiction of the data, though, is that I don't know if this is one point or two points or three points. If the points are on top of each other, I can't tell, because there's no jitter. We talked about jittering in the last episode. So one benefit of adding the raw data to this depiction would be that we could see whether that was a single point or multiple points.

Another thing we could do is turn the alpha down to, say, 0.5. So let's go ahead and do that. We can do alpha equals 0.5 and run that. Then we see that we've got two points here that are a lighter gray, more transparent. This point here is darker, and this point here is also darker, indicating that there are perhaps two points on top of each other. Because the outliers are not jittered, we see them on top of each other. Something else you may have noticed about this plot is that my blue and red got more muted. That's because the fill color has also had the alpha applied to it.
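As a rough illustration of why the fill got muted, the sketch below builds two versions of the box plot on simulated data: one with alpha set for the whole layer, and one using geom_boxplot's outlier.alpha argument, which only touches the outlier points. The data frame and its column names are assumptions made for this example.

```r
library(ggplot2)

# Simulated stand-in for the study data (values and names are assumptions)
set.seed(1)
groups <- c("Healthy", "Diarrhea, C. diff negative", "Diarrhea, C. diff positive")
df <- data.frame(
  disease_stat = factor(rep(groups, each = 40), levels = groups),
  invsimpson = c(rlnorm(40, 2.5, 0.6), rlnorm(40, 2.0, 0.7), rlnorm(40, 1.6, 0.7))
)

base <- ggplot(df, aes(x = disease_stat, y = invsimpson, fill = disease_stat))

# alpha applies to the whole layer: box fill and outlier points both fade
p_alpha <- base + geom_boxplot(alpha = 0.5, show.legend = FALSE)

# outlier.alpha fades only the outlier points; the box fill stays solid
p_outlier_alpha <- base + geom_boxplot(outlier.alpha = 0.5, show.legend = FALSE)
```

Printing the two plots side by side shows the difference: overlapping outliers look darker in both, but only the first version washes out the box colors.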
So if I only want to set the alpha for the outliers, I could say outlier.alpha equals 0.5. Now my fill is solid, but my outliers have that lighter shading. I don't totally like doing that with the alpha because I want to see the data. Ultimately, I want to see how many points are there. So I'm going to go ahead and remove that outlier.alpha. But it's important to remember that there are things we can do to change the way those outliers appear.

What we will do now is add back this geom_jitter line. We're going to leave all the same styling that we had when we first started this episode. What we see now are the raw data stacked on top of our box plots. We see that, again, the widths are different. We can modify that. Also, I might go back and change the fill color of my boxes so that they contrast more with the points. And thank goodness I drew that black line around my circles; otherwise all the points in here would just kind of fade into the background of the box plot. The other thing I notice is that the outlier points from my box plot are still here. Thankfully, they're a different color. But they're at the same y level as the jittered points. So this is this and this is this, right? I need to get rid of those outlier points from the box plot so I'm not double-representing the data.

So let's take that on first. What we can do in our geom_boxplot is set outlier.shape equals NA. Running that, we get rid of those outlier points, and we no longer have that double representation. Okay, let's go ahead and change the fill color to not be so full. We'll make it 50% transparent with alpha equals 0.5. That again gives it a more muted background and allows the data to really pop out more. You could probably make it more muted if you want, say with 0.25.
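Putting those two changes together, a sketch of the overlay might look like the following. It uses simulated data; the column names, the seeds, and the jitter width of 0.25 are assumptions, not values from the episode.

```r
library(ggplot2)

# Simulated stand-in for the study data (values and names are assumptions)
set.seed(1)
groups <- c("Healthy", "Diarrhea, C. diff negative", "Diarrhea, C. diff positive")
df <- data.frame(
  disease_stat = factor(rep(groups, each = 40), levels = groups),
  invsimpson = c(rlnorm(40, 2.5, 0.6), rlnorm(40, 2.0, 0.7), rlnorm(40, 1.6, 0.7))
)

set.seed(19760620)  # keeps the jittered x positions repeatable
p <- ggplot(df, aes(x = disease_stat, y = invsimpson, fill = disease_stat)) +
  # outlier.shape = NA hides the box plot's own outlier points so each
  # observation is drawn only once, by geom_jitter; alpha mutes the fill
  geom_boxplot(outlier.shape = NA, alpha = 0.5, show.legend = FALSE) +
  # shape 21 is a filled circle with a border, here drawn in black
  geom_jitter(shape = 21, color = "black", width = 0.25, height = 0,
              show.legend = FALSE)
```

Layer order matters here: the box plot is added first so the points sit on top of the boxes rather than behind them.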
So that's cool. Let's make the box the same width as the data. I'm going to guess that it's the same width we used here, so 0.6 is what we want. We'll do width equals 0.6, stealing that from our stat_summary from before. And that looks good. That's about the same width as the actual cloud of points, maybe a little bit wider. But I think that looks pretty good. And again, it allows the data to really pop in front of that box.

If you were to change the color aesthetic instead of the fill aesthetic, that would change the color of the lines in the box plot. Let me show you what that looks like real quick. We're not going to use it, but we can also do color equals disease_stat. Then you can see that the border colors are different. Anyway, let's get rid of that because I think I like the black border, both on my points and on the box plot.

As I was mentioning earlier, this whisker that comes off the box extends up to one and a half times the length of the IQR, stopping at the last point within that range. Again, that causes confusion. My sense is that most people in your audience will kind of automatically ignore that line. They'll see it, they won't understand what it means, and it might confuse them. And then they're like, I'm out. So we perhaps have three options: we could leave it and educate our audience about what it means, we could get rid of it, or we could extend it to the maximum value. I think all three are reasonable choices depending on what you're trying to say. So how do we do that? Well, funny you should ask, because I actually know. There's an argument we can give geom_boxplot, which is coef. This is the coefficient that you multiply by the IQR to get the maximum length of that whisker. The default is 1.5. So if I do coef equals 1.5, I should get the same output. And thankfully I do. If I do coef equals zero, I get rid of the whiskers, right.
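To compare the whisker options, here is a small sketch that wraps the box plot in a helper and varies only coef. The helper name box_with_coef is hypothetical, the data are simulated, and the column names are assumptions. A very large coefficient (1000 here) is one way to push the whiskers out to the minimum and maximum.

```r
library(ggplot2)

# Simulated stand-in for the study data (values and names are assumptions)
set.seed(1)
groups <- c("Healthy", "Diarrhea, C. diff negative", "Diarrhea, C. diff positive")
df <- data.frame(
  disease_stat = factor(rep(groups, each = 40), levels = groups),
  invsimpson = c(rlnorm(40, 2.5, 0.6), rlnorm(40, 2.0, 0.7), rlnorm(40, 1.6, 0.7))
)

# Hypothetical helper: same box plot each time, only the whisker rule changes
box_with_coef <- function(coef) {
  ggplot(df, aes(x = disease_stat, y = invsimpson, fill = disease_stat)) +
    geom_boxplot(coef = coef, width = 0.6, alpha = 0.5,
                 outlier.shape = NA, show.legend = FALSE)
}

p_default <- box_with_coef(1.5)   # whiskers reach at most 1.5 * IQR from the box
p_none    <- box_with_coef(0)     # whiskers collapse onto the box edges
p_full    <- box_with_coef(1000)  # whiskers reach the minimum and maximum
```

The computed whisker ends can be inspected with layer_data(p_full), where the ymin and ymax columns hold the lower and upper whisker positions for each group.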
And if I do coef equals a big number, let's just pick 1000, then the whiskers should extend to the full range of our data, which they do. I think that looks interesting. I don't know that the line extending to the min and max actually adds anything, because my eyes can see where the min and max are if I'm actually seeing the data. So if you're showing the data with the box plot, I don't know that the whiskers really help. I think the box helps you see where that 25th to 75th percentile is. I don't know that I really need the whiskers. So for me, I'm going to make coef zero and get rid of those whiskers. I think that looks pretty nice. It does look weird without the whiskers. But at the same time, if people don't know what that 1.5 times the IQR means, then it's not really helpful. And frankly, I spend my time trying to visually multiply one and a half times the IQR to see if that actually works. That's distracting. I'm not spending my time interpreting the data the way that you, the presenter, want me to interpret it. Anyway, that's cool.

All right. So again, we have combined our raw data (it's not really raw, but it's our individual patient data) with the summary statistics of that box plot. We've been on a kick of talking about stat_summary. As I've mentioned, when I get a new tool, I love to use it wherever I can because I'm just so excited to have it. It's kind of like I've got a hammer and everything's a nail. So how might we go about creating this box plot without using geom_boxplot, but using stat_summary instead? Well, let's see. For now I'm going to comment out geom_boxplot, so we have it there to compare the syntax to. I'll go ahead and grab this stat_summary and uncomment that.
And instead of fun equals median, which is what drew the median line, I'm going to do fun.data equals median_hilow. What fun.data expects is a function that outputs a data frame with three values: y, the position on the y axis; ymin, the lower edge of something; and ymax, the upper edge. That something can be an error bar or a point range or something like that. median_hilow will return the median as well as the 2.5th percentile and the 97.5th percentile. I actually want the 25th and 75th. So what I'll do then is fun.args equals 50, and that should give me the 50% interval, the range from the 25th to the 75th percentile. I will then use crossbar like I was using before. That crossbar was a bit of a hack: if you give y, ymin, and ymax the same value, then you get a box that's just the median line. Okay, let's go ahead and run that and see what this looks like. And, problems: this shouldn't be 50, it should be 0.5. And sure enough, we get our rectangles, the boxes from our box and whisker plot.

One difference between this and what we got using geom_boxplot is the fill; there the fill color had an alpha of 25%, I believe. We also don't need that size equals 0.5. I think that's the default, as is this color black. So I will make alpha equals 0.25, give that a run, and it should look the same. And sure enough, there we go. We've got our boxes and our points, and we've achieved this using stat_summary. Now, which do I prefer? I'd probably prefer geom_boxplot, because if I did want to draw the whiskers, it would be a lot easier to include them than it is here with geom_crossbar. I could add the whiskers, but it would take some more finagling. Also, I don't totally know how I would do that, and I do know how to make the box plot. So the box plot works just fine. Anyway, again, we have options and flexibility. And that's nice.
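The sketch below shows the shape of that stat_summary call on simulated data. Because ggplot2's median_hilow is a wrapper that needs the Hmisc package installed, the example swaps in a hand-written stand-in, median_interval, which is a hypothetical helper and not part of ggplot2. It is written on the assumption that it returns the same three quantiles median_hilow would for a given conf.int. The data, column names, seeds, and jitter width are also assumptions.

```r
library(ggplot2)

# Simulated stand-in for the study data (values and names are assumptions)
set.seed(1)
groups <- c("Healthy", "Diarrhea, C. diff negative", "Diarrhea, C. diff positive")
df <- data.frame(
  disease_stat = factor(rep(groups, each = 40), levels = groups),
  invsimpson = c(rlnorm(40, 2.5, 0.6), rlnorm(40, 2.0, 0.7), rlnorm(40, 1.6, 0.7))
)

# Hypothetical helper: median plus the bounds of the central `conf.int`
# share of the data, returned in the y / ymin / ymax form fun.data expects
median_interval <- function(y, conf.int = 0.5) {
  probs <- c(0.5, (1 - conf.int) / 2, (1 + conf.int) / 2)
  q <- quantile(y, probs, names = FALSE)
  data.frame(y = q[1], ymin = q[2], ymax = q[3])
}

set.seed(19760620)  # keeps the jittered x positions repeatable
p <- ggplot(df, aes(x = disease_stat, y = invsimpson, fill = disease_stat)) +
  # conf.int = 0.5 gives the 25th and 75th percentiles; 0.95 would widen the box
  stat_summary(fun.data = median_interval, fun.args = list(conf.int = 0.5),
               geom = "crossbar", width = 0.6, alpha = 0.25,
               show.legend = FALSE) +
  geom_jitter(shape = 21, color = "black", width = 0.25, height = 0,
              show.legend = FALSE)
```

With Hmisc installed, replacing median_interval with median_hilow in the fun.data argument should give the same boxes. Note that fun.args takes a named list, so the interval is passed as list(conf.int = 0.5).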
The nice thing about stat_summary, though, is that it's much easier to change the interval. If I did want to go back to the 95% interval, I could change that one number and get that expansion in the size of my box. Anyway, I'm going to stick with the 50% interval, and we will be good to go.

Looking at this new figure, the difference between what we started with and what we have now is that not only do we have the median line, but we also have the top and bottom of the box to indicate the 25th and 75th percentiles. This adds information about not only the central tendency of the data, the median, but also the distribution of the data. We can see that for healthy individuals, the IQR actually appears to be wider than it is for people with diarrhea who are C. diff negative. And that's even wider than for people who have diarrhea and C. diff. Whether or not that difference in variation is meaningful, who knows. But it allows me to see in that box where 50% of the data lie. And I think that's good. I don't know that it's important. I'm not totally sold that I need to know the shape of the distribution to that level. I don't know that adding this ink to the amount of data I have gives a proportional gain in information. But again, people see things differently and have different aesthetics and personal senses of style in how they like to present the data. I would encourage you to play with this and experiment. As I'm showing you, when I create a new visual, I try to stop, think about what I like and what I don't like, and see if I can make modifications to make it better, while also realizing that there are natural limitations to what we can do with these types of visuals and that there is no perfect visual. There will always be things that we can critique and wish that we could make better.
If you're not seeing that you can always make something better, then I suspect you're probably not being critical enough of your own work. I would encourage you to always be thinking about how you can make your visual better. And also, how can you make yourself better? Anyway, go ahead and see if you can combine the raw data with the box and whisker plot using your own data. See if you like having the whiskers or not. Be sure to ask your friends, maybe people at a lab meeting, to tell you what your box and whisker plot is actually saying. Do they know what those whiskers actually represent? I think it'd be really illuminating. If you ask people, let us know down below what they tell you those whiskers represent. I would love to know as well.

I hope you've gotten a lot of value out of today's episode on how we can combine raw data, the data for the individuals, with a summary statistic representation using the box plot, using geom_boxplot as well as stat_summary. Please be sure to tell your friends about Code Club and the various ways that we've been working with different types of data visualization. I think this is also hopefully making it perfectly clear that ggplot is an amazing tool for making all sorts of different visuals. It's amazingly flexible, and there's so much there that I'm always learning more. I hope you're learning as we go along too. Anyway, like I said, keep practicing and we'll see you next time for another episode.