As we develop our skills with anything, but especially with programming, I find I get to a point where I feel like I've got a really powerful tool and I can use it to solve all sorts of problems, so I don't really need to do much to develop my skills with other tools within the language. I fall into this trap frequently, and those of you who have ever taken a workshop from me know that I learn something every time I teach one. I'll do something, or someone will ask me to do something, and I'll do it and be like, huh, that worked. That's really cool. Well, that happened this past week in my lab meeting. In lab meeting we have something like a journal club that we call Code Club, like this, where a different person presents each time. For a long time I've been asking Nick Lesniak, a grad student in my lab, to give us a presentation on the stat functions. These stat functions are part of ggplot and they're really poorly documented. You'll see them as stat underscore something; we saw stat_ellipse a few episodes back when we were drawing ellipses around our ordinations. Well, Nick gave an awesome presentation on stat_summary, and it was so awesome that I'm interrupting all the other videos I had planned to do this one right now, because I think stat_summary is awesome and you need to know about it. What does it do? It adds what we'll call a summary layer to our visuals, and that's exactly what I'm going to talk to you about in today's episode of Code Club.

Hey, folks, I'm Pat Schloss. This is Code Club, and I can't believe I'm so excited to learn a new function in ggplot and, of course, to share it with you. That's the exciting part. So what is it doing? Well, we've seen something like it previously with stat_ellipse. If you recall, we had an ordination.
It was a scatterplot with continuous variables on x and y, and we had clouds of points in different colors for the different disease statuses of our samples. We wanted to put ellipses around the clouds to highlight that those points grouped together, so we used stat_ellipse and didn't think much of it. Perhaps we thought, well, why isn't there a geom_ellipse? What's actually happening, now that I've learned more about these stat functions, is that stat_ellipse was adding a statistical layer on top of our ggplot plot. Remember that ggplots are layered, right? We've got the data, the geoms, the scales, the labels. Well, we can also add a statistical layer. With the ellipse, that statistical layer was fitting something like a multivariate distribution to our points and drawing the two-dimensional distribution around them. That is a summary of the data, right? Then, probably using a polygon geom, it drew a polygon with, you know, a hundred sides or whatever the default is, summarizing the shape of that distribution, which is pretty cool. But that only makes sense to me now that I've learned about stat_summary, which is what I want to talk with you about today.

So what does stat_summary do? If you recall from the last episode, we made a bar plot of the average inverse Simpson diversity index, which is a measure of the complexity of a community, for samples from three different diagnosis or disease status groups. We also put on error bars indicating the standard error. But before I could create that bar plot and those error bars, I had to summarize the data. I had to do a group_by and summarize to get the mean and calculate the standard error, and we also got the N to know how many points there were. Then I had to use that summarized data as the input to ggplot.
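The summarize-then-plot workflow described above might look roughly like the sketch below. It runs on simulated data, and the names `metadata_alpha`, `disease_stat`, `invsimpson`, and the summary column names are assumptions made for illustration, not the names in the episode's script.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the joined metadata and alpha diversity table.
# The names metadata_alpha, disease_stat, and invsimpson are assumptions.
set.seed(1)
metadata_alpha <- data.frame(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 30),
  invsimpson = rlnorm(90, meanlog = 2, sdlog = 0.6)
)

# Step 1: summarize by group before plotting
invsimpson_summary <- metadata_alpha %>%
  group_by(disease_stat) %>%
  summarize(mean_inv = mean(invsimpson),
            se = sd(invsimpson) / sqrt(n()),
            lower = mean_inv - se,
            upper = mean_inv + se,
            n = n())

# Step 2: plot the summary table, error bars first so they sit behind the bars
p <- ggplot(invsimpson_summary,
            aes(x = disease_stat, y = mean_inv, fill = disease_stat)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.5) +
  geom_col(show.legend = FALSE)
p
```

Switching from the mean to the median, or to a different interval, means editing the `summarize()` call and rerunning the plot.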
Well, let's say instead of the mean I wanted to plot the median, or instead of the standard error I wanted some other type of interval, say the 25th to 75th percentiles. Things like that. Then I'd have to go back to that group_by and summarize workflow, update it, and rerun my plot. What if we could do that directly in our ggplot pipeline? That's exactly what we can do with stat_summary, and that's why I'm excited to share it with you today. Looking ahead to alternatives to the bar plot, I can now see where stat_summary will really come in handy, because we can show all of our data, which was one of the knocks against the bar plot, along with summary statistics on top of it. So we're going to get exposure to it today, and then we'll see it repeated as we go along, because I think this is a great tool, and my bad habit whenever I learn a great tool is to use it everywhere.

Here's our code as we left it at the end of the last episode. If you would like a copy of this code, check the link in the description below; it will take you to a blog post where you can get the code I'm starting with. If you would like instructions on how to install R, RStudio, and the tidyverse and get the data I'm working with, check out the video that I've linked above. That will give you everything you need to get going. I'd love to have you work along with me, try this out yourself, and even experiment. All right, so we're loading our libraries up at the top here. We're loading in our metadata and our alpha diversity data, and then we join those together. I've got some colors for the healthy, diarrhea, and case groups.
I now have this summary data frame where, as I mentioned, I take metadata alpha, group it by disease status, and get various summary statistics. I then feed this in as my input to ggplot. I'd rather not have to do this summary, for a couple of reasons. First, it's more code that I have to maintain. It's also not as flexible, right? If I want to update it to use the median, or a 50 or 95% confidence interval, there's a lot of code I have to change in here to get it done. Also, if I wanted to show the raw data along with the bar plot, I'd have to use two data frames in my ggplot workflow, and that's just a pain in the butt. So what I'd like to do is get rid of all this, and for now I'm going to go ahead and delete it. But I still need these three variables for my Ns. The way I can get them is to take metadata alpha and pipe it to count on disease stat, reminding myself that I have to run everything ahead of that to make sure it all works. Now I get a data frame that has my disease statuses and my Ns, which is exactly what I had before. Let me call this disease count, and I can pop it in here for these three. It's a lowercase n, I noticed. So that all works, and we should be good to put those values into the axis labels when we get there.

The next thing is that I'm now using metadata alpha; I'm not using the inverse Simpson summary anymore. So I'll go ahead and replace that with metadata alpha. Good. Now we come to our geom_errorbar and geom_col. I'm going to leave these here but comment them out for now. In ggplot's aes, for my x I still want disease stat, for my y I want inverse Simpson, and for my fill I'll say disease stat.
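As a sketch of the two steps just described, the counts can come from `count()` and the plot can start from the raw table. The data are simulated, and the names `metadata_alpha`, `disease_stat`, `invsimpson`, `disease_count`, and `healthy_n` are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in data; all object and column names are assumptions
set.seed(1)
metadata_alpha <- data.frame(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 30),
  invsimpson = rlnorm(90, meanlog = 2, sdlog = 0.6)
)

# One row per disease status, with the number of samples in a column named n
disease_count <- metadata_alpha %>%
  count(disease_stat)

# Pull out a single N, for example to build an axis label later
healthy_n <- disease_count %>%
  filter(disease_stat == "healthy") %>%
  pull(n)

# Start the plot from the raw table: no precomputed mean, min, or max columns
p <- ggplot(metadata_alpha,
            aes(x = disease_stat, y = invsimpson, fill = disease_stat))
```

At this point `p` has data and aesthetics but no layers yet; the summary layers get added on top of it.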
For the most part that looks the same, except that instead of the mean, min, and max variables I now have inverse Simpson. So where am I going to get that mean, min, and max? Well, I thought you'd never ask. We can now do stat_summary, open and close parentheses, and let's try the default and see what happens. We'll go ahead and run this. If you look at the output, what we have kind of resembles what we had with the bar plot: we've got a point to represent the mean, and a range of plus or minus one standard error. Our legend is a bit screwed up, and we'll come back to that in a moment. If you look at the red message that came out, it says that no summary function was supplied, defaulting to mean_se. This is the summary function that stat_summary is using and, as you can perhaps guess by the name, mean_se calculates the mean and gives plus or minus the standard error, which is exactly what we were plotting in our bar plot, except there we used a bar. To get rid of that message, we can say fun.data = mean_se. We can also use different geometries. The default is using the point range, where the point is the mean and the range is plus or minus the standard error. So we could do geom = "bar" and run that, and we get our bars, right? We don't get our error bars, and that's because we need a different geom for those. We can do that with a second stat_summary with geom = "errorbar". I'm going to put that before the bar one because I want the error bars to fall behind the bars, and voila, we get our error bars. Similar to what we saw last time with geom_errorbar, we can also add width = 0.5 as an argument to make the cap on the error bar a bit narrower. That looks pretty slick. Very nice, and that works great.
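A minimal sketch of those two `stat_summary()` layers follows, assuming a data frame called `metadata_alpha` with columns `disease_stat` and `invsimpson`. Those names and the simulated values are stand-ins, not the episode's data.

```r
library(ggplot2)

# Simulated stand-in data; object and column names are assumptions
set.seed(1)
metadata_alpha <- data.frame(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 30),
  invsimpson = rlnorm(90, meanlog = 2, sdlog = 0.6)
)

p <- ggplot(metadata_alpha,
            aes(x = disease_stat, y = invsimpson, fill = disease_stat)) +
  # Error bars: mean plus or minus one standard error, drawn first so they sit behind
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.5) +
  # Bars: the height is the mean that mean_se() computes for each group
  stat_summary(fun.data = mean_se, geom = "bar")
p
```

Both layers compute their own summary from the raw values, so no separate summary table is involved.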
Something we're seeing is that we have our legend here on the right. As I've mentioned in the past, just as we did with our geoms, we can add an argument like show.legend = FALSE to get rid of that legend and give the plot the full width. So I've created the same plot that we had last time without using group_by or summarize or calling geom_col. All of the heavy lifting for that plot was done in these two lines, and they replace the two lines up here that I commented out as well as the code up above that I deleted for doing all those summaries. So that's really slick.

As an alternative to mean_se, I could use mean_cl_boot to get the confidence interval using a bootstrapping procedure, which is a non-parametric approach. If we run all this, we see, I believe, a taller confidence interval. Let me go ahead and move the error bars in front of the bars. So we see those confidence intervals. They're still built around the mean and look roughly mirrored about it. An alternative to mean_se and mean_cl_boot would be, say, median_hilow. Let's use that for both layers and see what we get. What we get out are much larger intervals. These are 95% intervals, where the median is the top of the bar and the interval runs from the 2.5th to the 97.5th percentile. We see that the distributions are not mirrored across the median or the mean, so this gives you a better representation of the variation in the data. There are a variety of these functions that we can use for fun.data to summarize our data. The way they work is that I could call median_hilow and give it metadata alpha, dollar sign, inverse Simpson, and run that, and it will output values for y, ymin, and ymax.
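To see what these summary functions hand back, they can be called directly on a numeric vector, as in the sketch below. The vector is simulated and its name `invsimpson` is an assumption. `mean_se()` ships with ggplot2, while `median_hilow()` and `mean_cl_boot()` are ggplot2 wrappers around Hmisc functions, so the sketch only calls them when Hmisc is installed.

```r
library(ggplot2)

# Simulated, right-skewed values standing in for the diversity column
set.seed(1)
invsimpson <- rlnorm(90, meanlog = 2, sdlog = 0.6)

# mean_se() returns one row: y (the mean), ymin and ymax (mean -/+ 1 SE)
se_summary <- mean_se(invsimpson)
print(se_summary)

# These two wrap Hmisc functions and need the Hmisc package installed
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(median_hilow(invsimpson))  # median with the 2.5th and 97.5th percentiles
  print(mean_cl_boot(invsimpson))  # mean with a bootstrapped 95% interval
}
```

Each call returns the same three columns, `y`, `ymin`, and `ymax`, which is why the functions can be swapped in `fun.data` without touching the rest of the plot.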
Then stat_summary, with whatever geom I pick, plots y at the top of the bar, and the confidence interval runs from ymin to ymax. We also saw mean_se, right? Calling it the same way, we again get the mean along with the min and the max. What we see is that the median is actually lower than the mean, and that median_hilow gives us these broader intervals. Beyond the mean_cl_boot that we saw, there's also, I believe, mean_cl_normal, which assumes normally distributed data. We could also do mean_sdl, which gives you the mean plus or minus, I believe, two times the standard deviation. You can see that two times the standard deviation gives you an error bar that goes below zero, which is kind of a giveaway that the data are not evenly or normally distributed around the mean. So that's not ideal. One of the problems with the bar, again, is that you're using a lot of ink for not much data. It hides the data, and on its own it doesn't show the distribution. What we saw with median_hilow and its interval was that we could begin to see the shape of the distribution.

Now that we've looked at the functions, let's look at the different geoms that we can use with stat_summary. The default geom, with the point and the range, is geom_pointrange, which is what we've seen already. We could also do the error bar geom, which gives us the error bar but not the point. So if you want the cap on the error bar and the point, you could then add another layer with the point geom, which gives us both the point and the error bar. Something to notice is that I'm not getting color anymore, because I'm using disease stat as my fill. If I change that to map color to disease stat and use scale_color_manual, then my points and my lines get those different colors.
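The error bar plus point combination, with color mapped in place of fill, could be sketched as below. The data are simulated, the names `metadata_alpha`, `disease_stat`, and `invsimpson` are assumptions, and the colors in `status_colors` are hypothetical placeholders. Because `median_hilow()` depends on Hmisc, the sketch falls back to `mean_se()` when that package is missing.

```r
library(ggplot2)

# Simulated stand-in data; object and column names are assumptions
set.seed(1)
metadata_alpha <- data.frame(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 30),
  invsimpson = rlnorm(90, meanlog = 2, sdlog = 0.6)
)

# Hypothetical colors; the ones in the episode's script may differ
status_colors <- c(healthy = "gray40", diarrhea = "blue", case = "red")

# median_hilow() needs Hmisc; fall back to mean_se() when it is not installed
summary_fun <- if (requireNamespace("Hmisc", quietly = TRUE)) median_hilow else mean_se

p <- ggplot(metadata_alpha,
            aes(x = disease_stat, y = invsimpson, color = disease_stat)) +
  stat_summary(fun.data = summary_fun, geom = "errorbar") +  # interval with caps
  stat_summary(fun.data = summary_fun, geom = "point") +     # central value
  scale_color_manual(values = status_colors)
p
```

Points and error bars have no fill to speak of, which is why the mapping has to move from `fill` to `color` before the manual colors show up.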
If I want my error bars to be black, I could also set color = "black" on that layer, so the points will be colored and the lines will be black. I can then change the individual sizes and other arguments to get what I want. I could change the width to make the cap narrower, I could change the size of the point to make it larger, whatever you want to do. Let's do one iteration of that to give you a sense of what I'm talking about. With the error bars, I could say width = 0.5, and with the point I could say size = 3. So I'll have a larger ball and a narrower cap on our error bars.

One other thing to think about, and this is a little bit silly because there's no real reason to connect these three points. We're not looking at temporal data or connectedness or pre and post, right? But to show you that we can connect them, we could also use the line geom. I think we need to add group = 1. We'll run that, and what we get is a line between our three points. It's using the color from the previous one, so again I could say color = "black" and we'll get a black line connecting those three points. We could, of course, put this line before the point layer so the point sits on top. Anyway, hopefully you can see that I haven't explicitly calculated the median or the mean or whatever these summary statistics are. That's all happening within stat_summary. And you can see that I can very easily change the statistic that I'm plotting and then show it with these different geometries. I could also do a ribbon, which creates a ribbon between these points. You see that it's black, so that's clearly not ideal. Maybe I'll move that layer up and retry it. Now you see that we've got this ribbon, or polygon, behind the points.
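Stacking those layers might look like the sketch below, with the ribbon first so it sits at the back. The data are simulated and the names `metadata_alpha`, `disease_stat`, and `invsimpson` are assumptions. Writing the grouping as `aes(group = 1)` and giving the ribbon a light gray fill are choices made for this sketch, not necessarily what was typed in the episode.

```r
library(ggplot2)

# Simulated stand-in data; object and column names are assumptions
set.seed(1)
metadata_alpha <- data.frame(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 30),
  invsimpson = rlnorm(90, meanlog = 2, sdlog = 0.6)
)

# median_hilow() needs Hmisc; fall back to mean_se() when it is not installed
summary_fun <- if (requireNamespace("Hmisc", quietly = TRUE)) median_hilow else mean_se

p <- ggplot(metadata_alpha,
            aes(x = disease_stat, y = invsimpson, color = disease_stat)) +
  # Ribbon between ymin and ymax across the three groups, drawn first
  stat_summary(fun.data = summary_fun, geom = "ribbon",
               mapping = aes(group = 1), fill = "gray85", color = NA) +
  # One line through the three central values; group = 1 joins the categories
  stat_summary(fun.data = summary_fun, geom = "line",
               mapping = aes(group = 1), color = "black") +
  # Black error bars with a narrower cap
  stat_summary(fun.data = summary_fun, geom = "errorbar",
               color = "black", width = 0.5) +
  # Larger points, still colored by disease status
  stat_summary(fun.data = summary_fun, geom = "point", size = 3)
p
```

Layers are drawn in the order they are added, so reordering the `stat_summary()` calls is what moves the ribbon or the line behind the points.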
Again, this is silly for this example, but I just want to emphasize that you can use so many geoms with this function to get different appearances, and that lets you tell your data story the way you want to tell it. An alternative to the error bar geom would be linerange, and then I can get rid of width = 0.5. This isn't totally different from what we saw before with pointrange, except that it lets me change the appearance of the line separately from the point. Before, if I had used pointrange, the whole point and range would have been colored by the disease status group, and the size of three that I used would have scaled up both the line and the point. If I use linerange instead, I can modify the appearance of the line and the point separately. So I'll put this back to point and rerun this. I think this is a much more elegant and simple depiction of the data than what we saw before with the bar plot plus or minus the standard error. We've got a lot less ink here. We're showing the median, which is a better depiction of the central tendency of these data. We've got the 95% interval calculated from the data rather than two times the standard deviation, which we saw had problems because it went below zero, and that doesn't make any sense. So we get this empirically determined interval, which is nice. You might quibble and say it would be nice to also include the data points here. I might push back and say, do you really want 155 points shown here? Might that be a little over the top, too much information for our audience to look at? Sure, if you had, like, five points, yeah, show all the data. But if you have 189, 94, eh, probably not. One other thing to comment on is that our y-axis automatically went down to zero because some of our diversity values were quite small.
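The line range plus point version described above could be sketched like this. As in the other sketches, the data are simulated and the names `metadata_alpha`, `disease_stat`, and `invsimpson` are assumptions, with a fallback to `mean_se()` if Hmisc is not installed.

```r
library(ggplot2)

# Simulated stand-in data; object and column names are assumptions
set.seed(1)
metadata_alpha <- data.frame(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 30),
  invsimpson = rlnorm(90, meanlog = 2, sdlog = 0.6)
)

# median_hilow() needs Hmisc; fall back to mean_se() when it is not installed
summary_fun <- if (requireNamespace("Hmisc", quietly = TRUE)) median_hilow else mean_se

p <- ggplot(metadata_alpha,
            aes(x = disease_stat, y = invsimpson, color = disease_stat)) +
  # Plain vertical line from ymin to ymax, no caps, styled on its own
  stat_summary(fun.data = summary_fun, geom = "linerange", color = "black") +
  # Central value as a separate layer, so size and color apply to the point only
  stat_summary(fun.data = summary_fun, geom = "point", size = 3)
p
```

Keeping the line and the point in separate layers means each can be styled without affecting the other. In recent ggplot2 releases the thickness of the line is controlled with `linewidth` rather than `size`.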
There is debate in the data visualization world about whether or not you need to include zero. For bar plots there does seem to be consensus that yes, you should include zero, because humans perceive area before we perhaps perceive position along the y-axis. I personally like to include zero on the y-axis as much as possible. Perhaps that means scaling your y-axis accordingly so that you have a good reference point on it. That lets people see that, yeah, this point is higher than the blue or the red, but also that it's roughly twice the height, twice the value, of the blue and the red. Anyway, I think that does tend to be a little bit context dependent. As much as possible I try to include zero on the y-axis, but know that there's debate out there.

So I hope this was as interesting to you as it was to me to learn a new function, something that I know I'll be using a lot more in the future. I can't tell you how many times I've made two data frames, one of the raw data and one of the summary data, and then used them together in ggplot. It's just a little bit kludgy. To me, stat_summary is a game changer for a lot of my data presentation, because I can do the summary within ggplot. I'm not going to use these mean or standard error values anywhere else, so it's not necessary to store them in a separate data frame. It makes my code simpler. And as we've seen here, with slight tweaks we can make our data visualization simpler, more powerful, and more impactful. Go ahead and try this with your own data. I'd really encourage you to explore the different geoms you can use, which are not well described in the documentation. Try things out and see what happens. We'll be trying some of them out in future episodes as well.
That way we keep getting practice. It's not enough to just learn the function. You have to keep using it to truly integrate it into your workflows, and I can tell you, I'm going to be integrating this into mine. So I'd encourage you to check it out with your own data. Please tell your friends about what we've talked about here in Code Club and all the excitement of new functions. It's not that exciting, but it's, like, Pat exciting, you know? Anyway, keep practicing with us and we'll see you next time for another episode of Code Club.