Whenever we present the results of an analysis we've done, there's an element of trust between us and the audience. I am telling the audience what I see in the data. I mean, I have a bias. I have a position. I have something I am trying to tell you, right? The data do not speak for themselves. At the same time, the audience is trusting that I'm not cooking the data. They're trusting that I am using appropriate statistical techniques. They're trusting that I'm using visualization approaches that don't have sleight of hand behind them and that aren't deceiving. There's an expression: in God we trust, all others bring data. So how do we bring the data? How do we "show me the data"? Well, we can show the summary statistics, as we've discussed in previous episodes. In today's episode I'm going to show you how we can marry that with the raw data, so you can show the data and the summary data to make an attractive and compelling visual. We'll also discuss the trade-offs between showing the raw data or not, and why showing the raw data isn't always the best idea for an elegant and compelling visual.
Hey folks, I'm Pat Schloss and this is Code Club, where we show you the data. My favorite type of plot when I've got continuous data for the y-axis and categorical data for the x-axis is something that's called either a jitter plot or a strip chart, depending on where you're looking. The geom we're going to use is called geom_jitter. This shows all of the points as a column, or I guess, if you flip your axes, as a row. Anyway, we'll make a column of points where the x-axis position within each column is randomly determined, and so it doesn't really mean anything. But you get that horizontal separation within a category so you can see all of the data.
What I'd also like to do, using the stat_summary function that we recently learned about, is draw a line segment going across that data that shows the median value for each column of points. That way my audience can see the distribution of points for my three disease status groups as well as a summary statistic, the median, so they can see how the median values differ between the groups.
I'd love to have you follow along with me here in RStudio. To get the code I'm starting with, go down to the description below, where there's a link to a blog post with this starting code chunk. Also up above here will be a link to a video showing you how to install R, RStudio, and the tidyverse, how to get the raw data that I'm using, and how to set up this RStudio project that we're in.
All right, so I am loading my tidyverse, readxl, glue, and ggtext libraries for making these attractive plots. I'm reading my metadata in here from an Excel spreadsheet. I'm getting my alpha diversity information and then joining those all together. I also have here a variety of variables for defining colors that I'm going to use, as well as different code chunks to calculate the number of individuals within each of my disease status categories. Before I forget, I'm going to go ahead and run those lines of code. If you've watched any of these episodes in the past, you know that I frequently forget to run those lines. Starting here at line 41, I now have a chunk of ggplot code to build a plot that, for each of the three disease statuses, shows the median as well as the 95% confidence interval. The median is shown with the ball, and the confidence interval is depicted using a black vertical line for the error bar. Anyway, I'm going to go ahead and comment these out because I'm not going to use them just yet, and instead what I'm going to use is a function called geom_jitter. And we will go ahead and insert that.
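As a rough sketch of the step just described, the snippet below adds geom_jitter with no arguments to a categorical-by-continuous plot. The data frame toy and its columns disease_stat and invsimpson are hypothetical stand-ins; the episode's real data and variable names are not reproduced here.

```r
library(ggplot2)

# Hypothetical stand-in for the episode's joined metadata and alpha
# diversity table; the column names and values are made up for illustration.
toy <- data.frame(
  disease_stat = rep(c("healthy", "diarrhea_cdiff_neg", "diarrhea_cdiff_pos"),
                     each = 10),
  invsimpson = c(seq(4.1, 22.3, length.out = 10),
                 seq(1.2, 18.7, length.out = 10),
                 seq(1.1, 15.4, length.out = 10))
)

# geom_jitter() with no arguments: points are drawn like geom_point(),
# but each one gets a small random offset so they do not overlap.
jitter_plot <- ggplot(toy, aes(x = disease_stat, y = invsimpson,
                               color = disease_stat)) +
  geom_jitter()

# Call print(jitter_plot) to draw the figure.
```

With a color aesthetic mapped and no further arguments, this version also draws a legend.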
No arguments. I'll go ahead and run it all. You'll recall that in the introduction I commented that the beauty of a jitter plot is that we see all of the data, and within each of the three categories we have a random dispersion on the x-axis. The position on the x-axis within this gray, blue, or red cloud doesn't mean anything. What's important is that it's gray, blue, or red. It's all squished together because I have this big legend off here to the right. I'm going to go ahead and turn that off with show.legend = FALSE. We now see better separation, with more real estate used by the data rather than by the legend. I personally don't like to show the legend in this case because I've already got a label here on the x-axis for my three different disease statuses corresponding to the colors. So the legend doesn't add anything.
Okay, so again, good. We've got this cloud of points indicating the three different groups. Something that we could do with geom_jitter perhaps would be to set width = 0.5. Give that another run and that will make our cloud a little bit more narrow, or I guess a little bit more wide. I always forget what direction things are going to go. So let's try 0.25, and we then see that that makes it a little bit narrower. I always forget the default values, and so playing with it a little bit can help. If I make this 0.15, we'll see that it's a little bit narrower still. And so that looks pretty good. Perhaps too much separation between our three columns, so I think I'll go back to width = 0.25.
One thing to note about this figure is that, again, the x-axis position is determined by a random number generator. So running it a bunch of times will actually get a different plot each time. Let me run it again and we now see that the points moved, right? Running it again, we see the points moved again. Perhaps we would like the points to not move, to stay still and to be consistent. There's two ways to do this.
The first is to use a function called set.seed and to give it a number. You can give it 1, 2, 3, whatever. I like to give it my birthday, 19760620, June 20th, 1976. And what this will do then is that we see this cloud of points. We run it again and nothing changes, right? And so it's pegged. It's consistent. Whenever you're using a random number generator, you always want to set the seed, and you want to have a decent reason for the seed that you set. The result should not depend on the seed you pick. I pick my birthday because it's my birthday. You could use 1, you could use 10, you could use 7 if that's your lucky number or whatever, though lucky numbers are stupid. Anyway, I use my birthday because it's important to me. That's one way to do it.
An alternative is another function that you can use through an argument to geom_jitter, which is position. We can then say position = position_jitter, and the arguments to that would be width and seed = 19760620. And now if I run this several times... I forgot a parenthesis here. Now if I run it, I'll get one version of the plot. Again, it's going to be different than setting the seed outside of it. But if I run it again, it stays the same, right? So there's two approaches to setting the seed. I would encourage you to set the seed. Maybe if you're doing data exploration, it's not so critical. But for reproducibility down the road, I would encourage you to set the seed, either using set.seed or within position_jitter. I don't know why they have both options. I think it's probably worth knowing both, because as soon as I say "just use set.seed", I'm sure there's going to be a reason to do it within the function. Perhaps you don't need to memorize this, but know that it's possible to set the seed within the function.
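The two seeding approaches described above can be sketched as follows. The data frame toy and its columns are hypothetical, and the comparison through layer_data() is only an illustrative way to check that the jittered positions repeat; it is not something shown in the episode.

```r
library(ggplot2)

# Hypothetical toy data standing in for the episode's data frame.
toy <- data.frame(
  disease_stat = rep(c("a", "b", "c"), each = 20),
  invsimpson = rep(seq(1.5, 20.5, length.out = 20), times = 3)
)

base_plot <- ggplot(toy, aes(x = disease_stat, y = invsimpson,
                             color = disease_stat))

# Approach 1: set the global seed, then build the plot. The seed line has
# to be re-run right before the plotting code each time, because R's
# random number generator moves on after every use.
set.seed(19760620)
p_global <- base_plot + geom_jitter(width = 0.25, show.legend = FALSE)

# Approach 2: hand the seed to position_jitter(), so the jitter is
# reproducible regardless of the global random number state.
p_local <- base_plot +
  geom_jitter(position = position_jitter(width = 0.25, seed = 19760620),
              show.legend = FALSE)

# layer_data() returns the computed (jittered) positions; with a fixed
# seed, two builds of the same plot should give identical x values.
same_positions <- identical(layer_data(p_local)$x, layer_data(p_local)$x)
```

As the transcript notes, the two approaches pin the points in different places from each other, so it is worth picking one and sticking with it within a project.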
I'm going to go ahead and remove the position_jitter argument and go with the set.seed up here. And you know what, sometimes if I use set.seed, it's the very first thing that I run when I'm setting up and running my code. Great. So we see what our plot looks like here. We've got our nice jitter plot.
One of the things that I kind of don't like about this, again, is that we have overplotting of points. There are two things that we might try. I guess three things. We could make it wider, right? That way there's more space to have the points separated on the x-axis. But then it gets kind of fat and not very attractive. Another approach might be to reduce the alpha. If we use an alpha of, say, 0.3 or a half, then the points will be lighter, but where they stack on top of each other they get darker. Another thing we could do is to put a black border on each of the points and then change the fill. So let's play with the alpha first and then we'll play with the fill.
Let's do alpha = 0.33. That way if there's three points on top of each other, they'll look a lot more solid. Otherwise it'll be lighter. Perhaps that's too light. But again, you can see that down in here we have more points on top of each other. So let's change that to 0.5 and see what that looks like. That looks a little bit better. That gray gets kind of muted. But we can see more easily now that there are more points in this diarrhea and C. diff negative column that are stacked up down at these lower diversities than, say, up here. Whereas there are more points for the healthy that are around eight on the inverse Simpson index. So I'm not totally sold on changing the alpha. Let's try changing the color of the border to be black and then using the color as the fill.
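A small sketch of the transparency option discussed above, again on hypothetical toy data. The stacked_opacity helper is not from the episode; it assumes the usual alpha compositing rule, where n overlapping points of alpha a have a combined opacity of 1 - (1 - a)^n, which may differ in detail between graphics devices.

```r
library(ggplot2)

# Hypothetical toy data standing in for the episode's data frame.
toy <- data.frame(
  disease_stat = rep(c("a", "b", "c"), each = 20),
  invsimpson = rep(seq(1.5, 20.5, length.out = 20), times = 3)
)

# Semi-transparent points: isolated points look light, stacked points darker.
set.seed(19760620)
alpha_plot <- ggplot(toy, aes(x = disease_stat, y = invsimpson,
                              color = disease_stat)) +
  geom_jitter(width = 0.25, alpha = 0.5, show.legend = FALSE)

# Illustrative helper (an assumption, not part of ggplot2): combined opacity
# of n overlapping points under standard alpha compositing.
stacked_opacity <- function(alpha, n) 1 - (1 - alpha)^n

three_at_third <- stacked_opacity(0.33, 3)  # about 0.70
three_at_half  <- stacked_opacity(0.5, 3)   # 0.875
```

Under that assumption, three overlapping points at alpha = 0.33 come out at roughly 70% opacity rather than fully solid, which may be part of why 0.33 looks light on screen.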
So I'll change color to be fill, I'll get rid of the alpha, and I'm going to do shape = 21. Shapes 21 through 25 allow you to change the color of the border and the fill of the center of the shape. And then I will do color = "black", which I think is the default anyway. Give that a run. Now we see we don't have our colors, but we do see that we have the points with that dark circle around them. I kind of like that look. I need to change my scale_color_manual to be scale_fill_manual. That then gives me the color scheme that I've been using all the way through. What this gives us are each of our points with a black circle around the outside and a fill color that corresponds to the color scheme we've been using throughout. I'm not totally sold on this. I do kind of like having that black border. I think it makes the points stand apart from each other rather than being a full blob. Let me know down below in the comments what you think of this depiction. Do you prefer leaving it as we had it, with shape 19 and full alpha? Or do you prefer changing the alpha to be a little bit more transparent? Or do you prefer to have these borders on each of our points? Let me know down below and maybe we'll think more about how we use this in future graphics that we make.
All right. The next thing I want to do is put in a line here for the median. We talked about doing this in the last episode, where we learned about the stat_summary function. I'm going to go ahead and take these out, or rather uncomment them. I only need one because I only want to draw that one line. Instead of fun.data, I'm actually going to use fun, short for function, and I'm then going to put in median. With the linerange, I don't think it's going to work. I think if I run this, it's going to complain. Yeah. So it's got missing data and it's complaining. So we don't want linerange.
I could put in point and that would give me a black point in the middle of my data, which is there. Right. That's kind of hard to see. Of course, I could make it huge. I could say size = 5. But I'd rather have a black horizontal line. So I'll go ahead and remove that, and what I'll put in here instead is crossbar. I now get a horizontal bar, a bar across the data. And this is a little bit of a hack. To show you where things are going in the next episode, consider what happens if I use fun.data instead of fun. So fun is a function that summarizes the data and only spits out one value, y. But if you use fun.data, it reports the y, the ymin, and the ymax. So if I did median_hilow, then show.legend, crossbar, running all that, what we get is a box spanning the 95% confidence interval of the distribution of the inverse Simpson index. But if you want to learn more about that, come back for the next episode. And please be sure that you're subscribed to the channel so that you know when that next video drops.
All right, so I'm going to put this back to fun as median and crossbar, and we get our nice horizontal bar across our data. That's perhaps a little bit wider than I would like it to be. Maybe I only want it the same width as the data. So I'll go ahead and do width = 0.25. Running that now, that's too short. So it seems that the width on these things is different: the width on stat_summary is different than the width on the jitter. That doesn't quite seem to be the full width that I want, so let me go up to 0.6. And again, you can play with these stylings. They don't matter a whole lot. But I think what those black lines on top of the data do is really call your attention to where the central tendency is, where that median value is. And we see that all of these groups have a fair amount of spread in the data.
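Putting the pieces from this part of the walkthrough together, here is a minimal sketch of bordered, filled points with a median crossbar on top. The data frame toy, its column names, and the fill colors are hypothetical, and height = 0 is an addition that is not in the episode; it is there only because the toy y values are integers and would otherwise be nudged vertically by the default jitter.

```r
library(ggplot2)

# Hypothetical toy data with easy-to-check medians (6, 5, and 3).
toy <- data.frame(
  disease_stat = rep(c("a", "b", "c"), each = 5),
  invsimpson = c(2, 4, 6, 8, 10,
                 1, 3, 5, 7, 9,
                 1, 2, 3, 4, 20)
)

set.seed(19760620)
median_plot <- ggplot(toy, aes(x = disease_stat, y = invsimpson,
                               fill = disease_stat)) +
  # shape 21 has a separate border (color) and interior (fill);
  # height = 0 keeps the y values exact (not used in the episode)
  geom_jitter(shape = 21, color = "black", width = 0.25, height = 0,
              show.legend = FALSE) +
  # fun returns a single y value per group; the crossbar geom then draws
  # only its middle line because ymin and ymax are not supplied
  stat_summary(fun = median, geom = "crossbar", width = 0.6,
               show.legend = FALSE) +
  # placeholder colors, not the episode's palette
  scale_fill_manual(values = c(a = "gray", b = "blue", c = "red"))

# Medians computed by the stat_summary layer (layer 2), one per group.
group_medians <- layer_data(median_plot, 2)$y
```

One likely reason the two width values behave differently: in ggplot2 the jitter width is applied in both directions, so width = 0.25 spreads points over a total of 0.5, while the crossbar width is the full length of the bar. Swapping fun = median for fun.data = median_hilow, as previewed in the transcript, additionally needs the Hmisc package installed.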
But the median value for the healthy is larger than that for the two diarrheal columns, whether you're C. diff negative or C. diff positive. I hope you found this useful for thinking about how we can plot summary data on top of the raw data.
Now, I don't claim that this is the perfect figure. I like this depiction of the data because it shows all of the data. What are the downsides, or what are the problems, with this depiction? First of all, with this many points, with healthy at 155, we have a lot of overplotting, right? We have points down in here where there's a cluster of points that are on top of each other, and it's hard to see what's going on. Also, the horizontal distribution is random, but that doesn't mean it's uniform, right? It's certainly not uniform. And so we have a lot of points over here, perhaps, but not so many over here. That can skew your interpretation of the data because it doesn't show the points evenly distributed across the x-axis.
The other thing that's a challenge with this is that even though you and I understand what's going on in this figure and what the x-axis represents, if your audience has never seen this, they're going to be confused. Whenever I teach people about these types of jitter plots, someone always asks me, what does the x-axis mean within each of the columns? Even though I tell them it's random, people will still ask, what does it mean? It's random. It doesn't mean anything, right? And so people require a little bit of extra coaching to tell them what the visual means. People aren't so used to interpreting these types of visuals. So, again, that's why it's important to keep that in mind and, as we've talked about before, to have empathy for your audience, because they may not know what this means, right? So what can you do to help them if this is in a paper?
Well, maybe down in the caption you can say that the horizontal variation within each category is determined randomly to separate the points, because otherwise they'd fall on top of each other, right? So maybe put that in there. Or certainly when you're giving a talk, a quick sentence to describe what's going on in this plot, maybe the first time you show a plot like this, would show a lot of empathy for your audience and help them out. Maybe you'd repeat it twice just for the people that didn't really get it the first time around.
Anyway, I do think this is an attractive way to show all of the data as well as that horizontal bar. I showed you a glimpse of the future, of what things would look like with a box plot. So again, be sure that you stay tuned for the next episode, where we'll go further into showing more summary data on a strip chart like this. Personally, I feel like having all that information from a box plot on top of data like this gets kind of busy. That's why I prefer the simple horizontal bar, but certainly there are people that disagree with me. So stay tuned and we'll have that conversation in the next episode.
I hope you found this conversation about plotting summary data on top of your raw data useful for showing your audience the data and the distribution of the data. Yeah, we can make those comparisons, but by showing the audience the raw data, they get a better feel for what you're dealing with and what you're seeing: that, yeah, the inverse Simpson index can vary wildly for healthy individuals, right? Even for people with diarrhea, whether they're positive or negative, we can also see great variation in their diversity, and there are people with diversities that are above the median for healthy people. So perhaps diversity isn't the only thing, right?
That's the added flavor that this figure shows that we couldn't get by just showing the summary statistics with a simple bar or an error bar on top of that. All right. Well, I encourage you to play with this type of visualization on your own data. Let me know down below in the comments which of those different aesthetics you liked: changing the alpha, giving the points a colored border, or leaving them plain without a border. Let me know if you've tried this with your own data and what you've thought about it. I do feel like this type of visual tends to work better when you have fewer points. So perhaps you're doing a mouse experiment and you have five or 10 mice per treatment group. This works really well in a place where a box plot probably doesn't, just because a box plot is so much summary information for a small number of points. But if you have a lot of points, we kind of test the limits of where a strip chart like this, or jitter plot as it's also called, is useful. Anyway, please tell your friends about Code Club and we'll see you next time for another episode.