Are you one of those people who, when you make a plot for a paper or a presentation with two columns or bars of points and a line across the top of them, puts a bazillion stars there to indicate that you have a really minuscule p-value and a really big result? Yeah, I thought you might be. Hey folks, I'm Pat Schloss and this is Code Club. In reality, you only need one star to indicate that something is significant. You don't need all those stars to say something is super-duper significant. I'll explain why later in the episode, but before we get there, we first have to figure out whether we need any stars at all to indicate significance. How do we do that?
Well, the data that we're looking at are inverse Simpson diversity indices for three different groups of people. The first group doesn't have diarrhea; they have normal bowel function, you might say. The second group has diarrhea, abnormal bowel function, you might say. The third group has diarrhea but is also infected with Clostridioides difficile. To get tested for Clostridioides difficile, or C. diff, you first have to have diarrhea. So we have these three disease status groups that we've been comparing over the past few episodes as we make plots of their inverse Simpson diversity indices. What we've noticed is that the people without diarrhea, the normal or healthy individuals, have a much higher diversity than people with diarrhea, whether or not they have a C. difficile infection. That looks like it might be significant, but we don't know. We also don't know whether the difference between people with and without C. difficile who have diarrhea is significant. Actually, the people with C. difficile look like they might have a slightly higher diversity than people without C. difficile. So we want to run a test to figure that out.
So how are we going to make these comparisons? Well, the data that we have are not normally distributed. We could try to transform them by taking the square root, squaring, or doing all sorts of other transformations to make them normally distributed. But take my word for it: with these data, I have not been able to get them to be normally distributed so that we could use a parametric test. So instead we're going to use a non-parametric test called the Kruskal-Wallis test. Now, this is not a workshop or a class to teach you statistics. Go take a statistics class. There are lots of other people doing YouTube videos about statistics, and I'm sure the people in the stats department at your home institution would love to teach you about statistics.
Anyway, if we find that the Kruskal-Wallis test, again this non-parametric test, yields a p-value less than 0.05, that will mean that at least one of the three groups is different from the others. The task then becomes figuring out which one or ones are different from the others. To test that, we will do pairwise Wilcoxon tests on the three pairs of comparisons that we can make. We need to correct for multiple comparisons so that we don't inflate the risk of falsely detecting a difference that is not actually there.
Finally, and I am assuming we will find differences, we want to see how we can express that graphically. We will do that with two ggplot functions, the first being geom_line and the second being geom_text, which will allow us to draw those little bars on our plot along with a little star or an "n.s." And finally, we'll talk later about what the importance of significance is, and whether we need all those stars anyway.
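The normality point made at the start of this part can be checked with a short sketch. It is not from the episode: the Shapiro-Wilk test is one common way to check normality and is an assumption here, and the values are simulated stand-ins, not the real diversity data.

```r
# Simulated, right-skewed stand-in for a diversity index (NOT the real data)
set.seed(19760620)
invsimpson <- rexp(200, rate = 0.1) + 1

# Raw values plus two common transformations
candidates <- list(
  raw  = invsimpson,
  sqrt = sqrt(invsimpson),
  log  = log(invsimpson)
)

# Shapiro-Wilk normality test for each version of the data;
# a small p-value is evidence against normality
normality_p <- sapply(candidates, function(v) shapiro.test(v)$p.value)
normality_p
```

If none of the transformations gives data that look normal, a non-parametric test such as Kruskal-Wallis is a reasonable fallback, which is the route taken in the episode.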
I hope you're excited to dig into this material with me today. Let's go over into RStudio and get going. If you want the code that I'm starting with, down below in the description of this video is a link to a blog post where you can get my starting code. Also, across the top here, I've got another video that will help you install R and RStudio and get the tidyverse, the other packages, and the data that I'm working with in these videos. I'd love it if you followed along. You'll certainly learn a lot more following along, making your own mistakes and solving them, than just watching me do it and watching me make my own mistakes.
Anyway, let's look at the code that we have here. We're loading a handful of libraries that come to us from the tidyverse. We're setting a random number generator seed here on line 6, because we are doing a jitter plot and the x-axis position within each cloud is randomly determined. We read in the metadata and the alpha diversity data, and we join it all together. We've got some styling information here for the different colors, as well as counts of how many patients we had in each of these groups. Then we have the code to generate the strip chart, and finally we save all that into a TIFF.
As you can see, and as I showed in the introduction, this is the jitter plot for the three disease status groups. We might think that these two groups are probably not significantly different from each other. It is kind of odd that the C. diff positive group has a slightly higher diversity than the C. diff negative diarrhea group, but I doubt it's significant. What we'd like to have is a bar that goes across here that is black or so, and then maybe has an "n.s." on top of it saying not significant, if it isn't significant. And if this difference between the healthy and the diarrhea groups is significant, we'd like to have another bar across the top here, one
that has a star, or perhaps you could put in the p-value or whatever. So that's the task for today.
To get started, we'll open up some space here in the code and call kruskal.test. We'll give it invsimpson as the column that we want to test, then the tilde for the formula notation, then disease_stat, and then data equals metadata_alpha. If we run that, this is the output that comes back, showing us that we have a tiny p-value: 2.2 times 10 to the minus 16, or smaller than that. Actually, I don't know how many stars you'd want for that. Like 16 or so? Please don't.
So let's store the output in a variable called kt. One of the cool things that we can do is kt dollar sign p.value, and that will show us the actual p-value. We see that it is indeed less than 2.2 times 10 to the minus 16; it's actually 3.14 times 10 to the minus 21. The 2.2 times 10 to the minus 16 is just the cutoff R uses when it prints p-values, not the smallest number that R can express. So it is small. There's no doubt that at least one of the groups in the study is statistically different from the others. The next question then is which of the groups are different from the others; let's get at that.
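The Kruskal-Wallis step described above might look like the following sketch. The names metadata_alpha, invsimpson, and disease_stat are inferred from the spoken transcript, the group labels are assumptions, and the data are simulated for illustration only, so the numbers will not match the episode.

```r
# Simulated stand-in for the joined metadata and alpha diversity table
# (values are made up; only the column names follow the transcript)
set.seed(19760620)
metadata_alpha <- data.frame(
  disease_stat = rep(c("NonDiarrhealControl", "DiarrhealControl", "Case"),
                     times = c(150, 90, 90)),
  invsimpson = c(rlnorm(150, meanlog = 2.7, sdlog = 0.5),
                 rlnorm(90,  meanlog = 1.8, sdlog = 0.7),
                 rlnorm(90,  meanlog = 1.9, sdlog = 0.7))
)

# Formula interface: response ~ grouping variable
kt <- kruskal.test(invsimpson ~ disease_stat, data = metadata_alpha)

kt          # printed summary; very small p-values print as "< 2.2e-16"
kt$p.value  # the actual number stored in the result object
```

The printed summary and the stored value can differ because printing rounds anything below roughly 2.2e-16 to "< 2.2e-16", while the object keeps the full value.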
We can then use pairwise.wilcox.test, as you see here. The arguments that you give it are different from those of kruskal.test, which is really annoying. What we'll do is give it metadata_alpha dollar sign invsimpson, and then we need a g, or grouping, variable, which will again be metadata_alpha dollar sign disease_stat. Let me make this a little bit bigger so you can see the output. It tells us the comparison that it made and then it gives us the pairwise comparisons. We see that the non-diarrheal control, which is what we call healthy, is significantly different from the people with diarrhea and from the cases, the people with diarrhea plus C. difficile, whereas the comparison between diarrheal control and case is not significant.
At the bottom here, it tells us that it is adjusting the p-values using the Holm method. I don't use the Holm method, and I'm not super familiar with it. The method that I see more commonly used in the literature, and that I use myself, is Benjamini-Hochberg. I can use Benjamini-Hochberg by adding p.adjust.method equals, in quotes, BH, and the output then tells me that's the method it used. We find a very similar story: the non-diarrheal control is significantly different from people that have diarrhea, regardless of their C. difficile status, but there's no difference between people that have diarrhea with and without C. difficile. So that's great.
Again, we could save this as pairwise or pt; I'll call it pt. We could also get pt dollar sign p.value, and that comes out as a matrix if we wanted to play with it. What you could do, if you're trying to make this automated and reproducible, is build in some logic here to say that if kt dollar sign p.value is less than 0.05, then run this other test, right?
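One way to sketch the pairwise step and the "only if the overall test is significant" logic is shown below. The object and column names follow the transcript; the group labels, the simulated data, and the sig_labels helper matrix are assumptions added for illustration.

```r
# Simulated stand-in data (made-up values; column names follow the transcript)
set.seed(19760620)
metadata_alpha <- data.frame(
  disease_stat = rep(c("NonDiarrhealControl", "DiarrhealControl", "Case"),
                     times = c(150, 90, 90)),
  invsimpson = c(rlnorm(150, meanlog = 2.7, sdlog = 0.5),
                 rlnorm(90,  meanlog = 1.8, sdlog = 0.7),
                 rlnorm(90,  meanlog = 1.9, sdlog = 0.7))
)

# Experiment-wide test first
kt <- kruskal.test(invsimpson ~ disease_stat, data = metadata_alpha)

# Only run the pairwise comparisons if the overall test is significant
if (kt$p.value < 0.05) {
  # Note the different interface: a data vector (x) and a grouping vector (g)
  pt <- pairwise.wilcox.test(
    x = metadata_alpha$invsimpson,
    g = metadata_alpha$disease_stat,
    p.adjust.method = "BH"  # Benjamini-Hochberg instead of the default Holm
  )

  pt$p.value  # lower-triangular matrix of adjusted p-values (NA elsewhere)

  # Hypothetical helper: turn each adjusted p-value into a plot label
  sig_labels <- ifelse(pt$p.value < 0.05, "*", "n.s.")
  sig_labels
}
```

The matrix has one row and one column fewer than the number of groups, so with three groups it is 2 by 2, with one NA cell for the comparison that is not made.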
And then, if different values in the pt matrix of p-values were significant, you could do different things. I'm not going to go too far into the weeds there, because I want to move on to thinking about how we can draw those lines and stars on our figure. An important thing to remember, though, is that we do not run this pairwise test unless the experiment-wide test, the Kruskal-Wallis test, is significant. If the Kruskal-Wallis test is not significant, then we do not go on to the next step. So that's a little bit of statistics that I'll drop on you. But again, talk to your local statisticians for more help on how to appropriately model the data. This is a very basic comparison that even I feel up to the challenge of looking at.
What we know now is kind of what our intuition showed us, which was that the blue and red groups are not significantly different from each other, but that the gray group is significantly different from both the red and the blue groups. So again, we'd like to draw little segments on the plot to make it clear to our audience which groups are significantly different. So how do we do that? I thought you'd never ask.
Something you may not have noticed is that this entire block of code builds a plot, but I stored it in a variable called strip_chart. So if I run this, nothing happens. I don't get a plot generated. But I can then type strip_chart, and that will open up my plotting window and show the plot in the lower right corner. strip_chart is a variable. It's an object in R, and every object in R has a way of being printed, right? If I have a data frame and I write its name at the prompt and press Enter, there's a special way to print out the data frame. Or if I type in three, there's a special way to print out numbers, right?
It's pretty basic, but it'll print out three. Or when I printed out the result of the Kruskal-Wallis test, there was special formatting of that output. Well, it's the same thing with the strip chart or any other plot we make. But the nice thing about this is that I can save strip_chart as a variable and keep adding to it, right? We've seen, as we built out this plot, that we added different layers to it.
So the first thing that I want to do is draw a line segment between the blue and the red clouds. To do that we will use geom_line, and we will also create a special data frame for it. Typically when we create a ggplot, we call ggplot and its first argument is the data that's being piped in, and the geoms inherit that data. We can also explicitly give a geom its own data frame through its data argument, and that's what we're going to do here.
So again, our object for our plot is strip_chart, and I will pipe that to geom_line. The positions on the x axis will be two and three, and on the y axis I'll put it at about 23, and we'll go from there and see how that looks. So we'll do data equals tibble, with x equals a vector of two and three, and y equals 23 and 23. That looks good. And then we still need our aes, so we'll do x equals x, y equals y. Let's run this and see what happens.
Ah, I have a problem here. The error asks, did you use a pipe instead of a plus? Yes. I'm so used to piping stuff that I forgot this should be a plus, because I'm adding this geom to the previous plot.
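The stored-plot idea and the plus-versus-pipe point can be sketched as follows. The toy data frame and its column names are assumptions, and data.frame() is used where the episode uses tibble() so that only ggplot2 is needed. This toy plot maps only x and y, so it is simpler than the plot in the episode.

```r
library(ggplot2)

# Toy stand-in data (made-up values, hypothetical column names)
set.seed(19760620)
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 20),
  value = c(runif(20, 5, 30), runif(20, 1, 20), runif(20, 1, 20))
)

# Assigning the plot to a variable draws nothing
strip_chart <- ggplot(toy, aes(x = group, y = value)) +
  geom_jitter(width = 0.25)

# Typing the name prints the object, which is what draws the plot
strip_chart

# Layers are added with `+`; piping a plot into geom_line() raises an error.
# The segment gets its own small data frame through the `data` argument.
with_bar <- strip_chart +
  geom_line(data = data.frame(x = c(2, 3), y = c(23, 23)),
            aes(x = x, y = y))
with_bar
```

Numeric x positions work on the discrete axis because the groups sit at positions 1, 2, 3 in the order of their factor levels.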
So let's run that again. I've got another error: object disease_stat not found. That's happening because up above, where I define the ggplot object and create the plot, I had fill equals disease_stat. Well, the data frame that I just created for geom_line doesn't have disease_stat, nor do I want it to. So what I'll do instead is add inherit.aes equals FALSE. I'm also going to break this across a couple of lines so that it's not quite so messy. What we get out is a nice black line segment over the top of our diarrhea and C. diff negative and our diarrhea and C. diff positive groups. So that looks good.
Let's repeat this now, creating a line from over the top of our healthy group to the middle between the two diarrhea groups. So we'll do one and 2.5 on the x axis, and on the y axis we'll do maybe 33 and see how that looks. Again, I'm going to copy this down and put a plus there. So I said one and 2.5, and let's do 33 there. We'll give this another run, and we see that we now have our horizontal bar up here at 33 as well as the one at 23.
What we'd like to do now is annotate these lines with text in the middle, and we can do that with geom_text. I will copy this first geom_line and replace the line with text. My x coordinate is going to be 2.5, and my y I'm going to make 23 for now. That looks good. Then I'll put label equals "n.s." Let's give this a shot and see what happens. Great. Our "n.s." is right on the line. We can bump that up a little bit by using 24, and that looks great.
Okay, now for our star. Let's copy the same geom_text down, but this time we're going to use 34, and we want the x position to be midway between one and two and a half. One plus two and a half is three and a half, and then we take half of three and a half.
I believe that is 1.75. And we will put in a star, and there's our star. Let's make our star a little bit bigger, and we can do that with size equals, say, two. That made it really small. The size here is the font size, measured in millimeters rather than points, and the default is a little under four. So let's maybe make it 12. That's really big. So let's make it maybe eight, and that looks pretty good. We can probably bring it down a little bit, so let's do 33.5, and that looks pretty nice. So we've got our star, we've got our "n.s." for not significant, and life is good. We've got our bars as well, indicating the groups that we're comparing.
Something that I'll leave for you is that sometimes people like to have additional bars that come down from the ends of these horizontal bars. How would you make those? Here's a hint: it's exactly the same as what we did to make the horizontal bars, but instead of changing the x coordinates, you're going to change the y coordinates. So see if you can get those four little tags inserted, and let me know down below in the notes if you're successful. See if you can be the first person to post some code to make those little tags, and I'll give you a special star.
Anyway, what do we think of this? The first thing to talk about is the one star. Why do we only have one star when we have this really freaking significant p-value? Well, p-values are significant or they're not. It's either less than 0.05 or it's not. If you have an astronomically small p-value like we have, that doesn't mean that the difference is bigger, right? It tells you that we're confident that there is a difference between our medians, in this case, because we're comparing medians. If you want to know about the effect size, well, look at the data, right? Compare these bars down here if you want to know what the effect size is. If you want to know if something is significant, that is a yes-or-no question.
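Putting the annotation steps from this part of the episode together gives something like the sketch below. The coordinates (23, 24, 33, 33.5, size 8) are the ones spoken in the episode; the simulated data, the group labels, and the use of data.frame() instead of tibble() are assumptions.

```r
library(ggplot2)

# Simulated stand-in data (made-up values; column names follow the transcript)
set.seed(19760620)
groups <- c("Healthy", "Diarrhea, C. diff negative", "Diarrhea, C. diff positive")
toy <- data.frame(
  disease_stat = factor(rep(groups, times = c(40, 30, 30)), levels = groups),
  invsimpson   = c(runif(40, 5, 30), runif(30, 1, 20), runif(30, 1, 20))
)

# The base plot maps fill to disease_stat, as in the episode
strip_chart <- ggplot(toy, aes(x = disease_stat, y = invsimpson,
                               fill = disease_stat)) +
  geom_jitter(shape = 21, width = 0.25, show.legend = FALSE)

# Each annotation layer brings its own data, so inherit.aes = FALSE stops it
# from looking for disease_stat (the source of the "object not found" error)
annotated <- strip_chart +
  geom_line(data = data.frame(x = c(2, 3), y = c(23, 23)),
            aes(x = x, y = y), inherit.aes = FALSE) +
  geom_line(data = data.frame(x = c(1, 2.5), y = c(33, 33)),
            aes(x = x, y = y), inherit.aes = FALSE) +
  geom_text(data = data.frame(x = 2.5, y = 24),
            aes(x = x, y = y, label = "n.s."), inherit.aes = FALSE) +
  geom_text(data = data.frame(x = mean(c(1, 2.5)), y = 33.5),  # midpoint, 1.75
            aes(x = x, y = y, label = "*"), size = 8, inherit.aes = FALSE)

annotated
```

The y positions are specific to the range of the data being plotted, so they would need adjusting for another data set.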
No question. I'm almost tempted to put "significant" and "not significant" on those lines to make it clear. But putting some lines with two stars, some with three, some with one, and some with four just doesn't mean what you think it means. Put that it's significant or that it's not significant.
This is a relatively simple plot in terms of the number of comparisons. Let me make this bigger. As people get more and more comparisons, or things are more and more different, you start getting more horizontal lines going on here. Imagine that all three of these groups were different. We'd want to have perhaps three different lines indicating whether or not things were similar. What do you do there? Well, it can get confusing, because you have all these lines flying all over the place. Again, you don't put multiple stars. Just put one star, please.
In fact, what people will sometimes do instead of stars or "n.s." is to use letters. We could put an "a" here and say these two are similar, they're in group a, and then say this one is in group b. So instead of drawing lines, you mark which things are in the same group and which are in different groups. I don't know if saying that makes sense. But again, you could group the clouds that are similar to each other and say these are b and these are a. Sometimes, though, you will have something in the middle that's in both a and b.
And so that a or b gives you an indication of what group things belong to, and if something is in a and not in b, then it's significantly different from the things in b. That's more of a stylistic choice about how we present the p-values or the results of the hypothesis test.
Ultimately, it does get really confusing and difficult for the audience to interpret, especially when you've got a whole bunch of horizontal lines going all over the place. So I would encourage you not to feel like you always have to represent this in the figure, but perhaps to save it for the legend or for the text. It's also an argument for having fewer groups, fewer treatment groups or disease status groups like we have here, rather than many, because the number of comparisons goes up the more groups you have.
Please tell your friends about what we've done here today in Code Club. I think another thing that they would appreciate hearing is that you don't have to take your figure from here, put it into PowerPoint, and draw your lines and stars there. You can do it directly here in R without having to further modify the plot. Keep practicing with this, go ahead and work on the exercises I've given you, and we'll see you next time for another episode of Code Club.