 Scientists love to put stars and bars on their plots to indicate what comparisons or what treatment groups are significantly different from each other. I've talked before about why you don't need to put five stars next to one of these comparisons to indicate that it's super duper significant. No, you really just need a single star. Anyway, one of the challenges is, you know, how do you put those bars and stars onto the plot if when you're creating the plot from the beginning, you don't know how many comparisons you're going to have and you don't know where exactly you want to put those stars and bars. Well, we'll figure that out in today's episode of Code Club. Hey, folks, I'm Pat Schloss. In recent episodes of Code Club, we've been building up a figure to display the differences in relative abundance of bacterial populations that are found in patients that have varying disease statuses. And these data are coming from a paper that my lab published a number of years ago, looking for biomarkers associated with Clostridioides difficile infections or C. diff infections. Right. As you can see in this plot, healthy individuals are indicated by the gray points and lines. Those that have diarrhea, but no C. difficile infection are blue. And those that have diarrhea and C. difficile are red. Now, in that original study, one of the things we found was that it was really difficult to differentiate between people with diarrhea plus and minus C. difficile piece of cake to differentiate between people that were healthy and people with diarrhea. But within the diarrhea between people with and without C. diff was a lot harder. Well, what we've been developing in the recent episodes of Code Club is this plot. Again, the ball indicates the median relative abundance across maybe close to 100 individuals for each of these three groups. And the line indicates the range of the interquartile range. Because there are so many points in the dataset, we thought it was just way too busy to show all the data. And so we're substituting that making the compromise by putting in the range of the interquartile range. Now, when we look at this, I naturally want to see that there are significant differences in the data, right? So if I look at this Bactroides OTU two, I'm pretty confident that the healthy individuals are significantly different from the diarrhea, the people with diarrhea, right? But are these two are diarrhea samples that with and without C. diff? Are they significantly different from each other? We don't know. Well, in the last episode, we ran a set of statistical tests where we use non parametric Kruskal-Wallis test. And then on the comparisons that were significant, we then did a pairwise Wilcoxon test. What I'd like to do is take that information and put it onto this plot, so that we can then draw lines between different disease status groups for each of our bacterial taxa to indicate what comparisons are significant. Let's head over to our studio where we'll get going working on today's code. We've got a lot to do today. If you want to get this code that I'm starting with, please, please, please do that. Down below in the description, there's a link to a blog post where you can grab this code. Also across the top, as you probably know by now, there's a video where if you haven't installed our studio or the tidyverse or gotten this data, you can go there and get that. So you can be following along. And that is so important that you follow along, do what I'm doing, and then take it and adapt it to your own data. So you can really make it your own. And you can really just, you know, flourish in your skills with R. Anyway, as we've seen before, we load all these different libraries, we get our metadata, O2 counts, we figure out the number of sequences we have per sample so we can calculate a limit of detection, we process our taxonomy data, we get our O2 relative abundances, we pool things, pool O2Us where the maximum median for any of the three disease status groups for that taxon is less than half a percent. Anything lower than that, then you're really kind of looking, you know, looking at medians that are based on only a handful or a couple of reads. We then pull it all together and we come up with this O2 disease relabund data frame that has for all the subjects, their disease status, each O2U that's in them, as well as the relative abundance of that O2U and its taxonomy. Then in the last episode, we went ahead and we calculated the experiment-wide significance for each O2U using the Crosco-Wallis test. We did that test on each O2U to figure out which O2Us were significantly different across the three disease status groups. And if we look at experiment significance, we find that we have this data frame of our taxa, our data, and our P experiment. Now, something that came up after I published the video is that some eagle-eyed observer noticed that this line here where I calculate P experiment using P adjust didn't actually do anything. So that kind of sucks. So to show what they saw, I'm going to do P.value in this select. And I want to run all these lines to show you what they saw. And I should have noticed that my P value, which was the raw P value out of the Crosco-Wallis, is the same as what I thought was the P adjusted. The other thing that you perhaps notice is that this data frame is still grouped by taxa. P adjust was working. It just works within groups, right? So if my data are grouped, then it's adjusting within the group. And because there's only one P value in that group, then it doesn't really do anything. So what I could have done was after unnest, I could do ungroup and then pipe that in. And so now if I give this another run, now I see that my P values are different from the raw P value and the adjusted P value. If I didn't want to have to do the ungroup, and perhaps what I should have actually done was I could get rid of this group by taxon, and then I could say nest data equals minus taxon. And so that means nest in the data column, everything but the taxon column. And so if we run this, we now see that again, everything but the taxon column is grouped within the data column. And if I then go ahead and rerun down to that select, I again see that my P values are different, showing that it worked. I'm going to go ahead and get rid of that P value because I don't care about it so much. And update this to look at experiment significance. And we get our updated data frame. Now, for this example, that didn't matter, right? Nothing changed, nothing was significant that should have been significant, and so forth. So I'm not so worried about it. But you never know down the road, you might have some borderline P values where that P adjust actually matters. And so you want to make sure that your data are not grouped before you do P dot adjust. The next step in the pipeline then was to take experiment significance. And for those ot us that were significantly different that has significant difference among the three groups, you then we then ran a pairwise Wilcox test. And we could then run this to figure out what comparisons were significantly different, right? And so this builds in a correction for multiple comparisons in both the experiment wide as well as the pairwise, we used BH as our P adjust method, which is short for Benjamin Hatchberg. So we're in good shape there. Now again, what I want to do is I want to take this testing data and use that to map on to our plot, our bars and our stars. Before I can do that, though, I need to clean up my code because I kind of left it in a bit of a mess from the last episode. So I'm going to grab all this testing data. And I actually don't need that final ungroup. So I'm going to delete that. Because again, the data was ungrouped. And I'm going to throw that down here at the end. And I'm going to save it. And what I want to do, because I kind of moved things around, is I'm going to take this O2 disease relevant, and I'm going to put this back as part of my pipeline to build the plot. And we'll make sure that it works and looks good. And that this is then what we're going to add on to. Excellent. This is the plot I had hoped to generate. And so we've got our code back to working. I'm going to come back in here to my code. And I'm going to save this as my O2 you plot. And what we're going to do is we're going to use O2 plot. And we're going to add on to O2 plot our stars and bars, right? So I'm going to go ahead and grab gg save and move that to the very end, because I like to have that at the end for what we're, you know, incorporating all the changes to the figure. So we've got our experiment significance here, as we've seen, and we now have our experiment significance going into the pairwise test. So I will call this pairwise test. And so now I have the variable or the data frame that we'll call pairwise test. And again, we have our pairwise comparisons for each O2, we have group one and group two, and the associated p value. And these p values are all significant, they're all less than point zero five. And so I'd feel comfortable drawing a line between say the non diaryl control or the healthy, and the case, which is the seat of positive, and then putting a star next to it because the p value again is quite small. So how are we going to do this? I'm going to do O2 you plot. And I'm going to add to it, right? So I've got the plot saved as a variable. And just like I've added lines, you know, like I added back up here, like theme classic and the theme and the labs and all that other stuff, I can keep adding stuff onto it. And so that's what I'm going to do. I'm going to add on extra geomes and extra styling to add those bars and the stars to indicate significance. And what we'll do is we'll do a geome segment. And geome segment, we will give it data. And that data that it's going to use is my pairwise test. And I'll have an AES for aesthetics. And we will say, we'll have to give it x equals something x and equals something y equals something and y and equals something, right. And so those are the required aesthetics for geome segment. So we're basically giving it coordinates that we want to draw a segment between. To get started, I'm going to put x and x and as let's do 90. And then my y and y and I need to figure that out, because I'm going to put that in my pairwise test as a data frame. So I'm going to put y and y and and I'm going to create those columns now in pairwise test. So what I'm going to do is I'm going to use mutate to create a column called y and one called y and to do this, I want to create another column that I will perhaps call tax on position. And tax on position is going to be the y axis position that the label for that tax on is being plotted at. So to show you if I do as dot numeric on tax on so tax on is a factor. And so a factor is a vector special vector that is numbered and has orders and has names or labels associated with those numbers. I see that tax on position has these numbers. So my bloudia is at position two, and back droidies at position 11. If I go back to my plot, I see bloudia is at y position two. And back droidies is up here at 11 and 12, I think, right. And so that is going to be the position that I am drawing my segments from using tax on position. I need to program in the dodge that I was getting out of position dodge before when I did the stat summary with the ball and the lines. That's not going to be that hard. So we'll go ahead and do why because I need a column for y and y and for the coordinates to draw the line between. And I would do if else. And so if else group one equals equals, and I'll say non diaryl control. So if the group one is non diaryl control and again to show you what that looks like, then that's going to be the gray bar right the gray line. And so then my y position should be tax on position. And I'll say plus 0.2. Otherwise, I'm going to do tax on a position minus 0.2. And you know what, I'm just going to call it pause, because this is getting too long. All right. And so then I need y and which will be if else group one equals equals. And that's going to be a diaryl control. And that will be pause. So if that's true, then we want to use that position. And actually come to think of it, I want this position to be just the position right. So if it's if group one isn't non diaryl control, then it's going to be diaryl control. And again, the ordering is non diaryl control, diaryl control and then case. And so that line needs to start at up above bump for not for the healthy at the tax on position for non diaryl control. And then and then it needs to go if it's to diaryl control, it's going to go to position. And if it's case group of group two, rather, not group one, if group two is diaryl control, it's going to end at the position. And if it's false, then it's going to be case. And then I want it to be position minus 0.2. Okay, so that gets me my y my y end. Let's go ahead and run this and see what we get. So I'm getting an error message that object disease stat not found. And that is happening, because it's inheriting the AES from the original ggplot command, right? So I can do inherit dot AES equals false. And now what I see is I have a vertical line at position 90. And of course, there are three lines or two lines overlapping here. And so I need to dodge those apart on my x axis. So to dodge the lines horizontally along the x axis, I need to add an x and x and column. So x is the special one x and will be the same as x because it's a vertical line. So to do this, we're going to do something like if else, but it's called case when. So do case when. And so case when you can give it a logical question. And if it's true, then you you then give it the value to use if it's true. If it's false, it then goes to the next logical question, right? So we'll do case when group one equals equals non diaryl control and group two equals equals diaryl control, right? So if that's the case, then we'll do a tilde. I'll then say 60. And then we'll repeat this, right? Because we have other groups, right? So non diaryl control case, I will then make that 75. And then non diaryl control should be diaryl control for the third line to case. And I'll make that 90. Okay. And so then that is good. And so again, what's going to happen is it evaluates this first question. If it's true, we get 60. If it's false, then it evaluates this next question of is group one non diaryl control and case. If that's true, it returns 75. Otherwise, it comes down and does diaryl control in case. And that's 90. Generally, you want you want something that you know will evaluate to true, I'm going to live on the edge and not worry about it so much. And so we'll go there. And then we'll do x and equals x. So I'll save that. And now I'm going to change x to the x and x and to the x and lucky there, we've got two lines indicating the comparison between healthy and diaryl, seed of negative and healthy and diarrhea seed of positive for our all of our ot us. The LS stippies shows a significant comparison between the two diaryl samples, which is interesting. So that's good. The next thing I want to do is be able to put in a star next to those lines to indicate that it is in fact significant to do that. I need to again modify my my pair wise test data frame. And where we're going is I'm going to add a geome text. And so that is going to take again data equals pair wise test. And I need AES and I'm going to have x equals x star and y equals y star. And so those are the two aesthetics. And I will then have label equals star. And again, I don't want to forget that I want to do inherit dot AES equals false. So now I need to get those x star and y star variables. And so I'll go ahead in here into this mutate statement, my x star is going I'm going to say x star equals let's do 1.1 times x. I think that should be good. I should bump it to the right, like 10%. We've got these weird logarithms. I'm not totally sure how that will look. We also need then y star. So for y star, I'm going to borrow this case when statement down. And for if group one is non diaryl control, the diaryl control, that again is going to go from up point to to the right to the position. So that's going to be position plus 0.1. Because I want to start to be in the center of that vertical line. For the non diaryl control case, that is going to be right at pause. And then for diaryl control case, that's going to be ultra pause minus 0.1. And I'm picking 0.1 as my increment for the star, because that's half the length of the line. Let's go ahead and see what this looks like. And there we go. We've got our stars. Actually, the stars are, I think there must be some space underneath the star, the vertical justification. That's okay. I'm going to go ahead and bump it down just a smidge. And so maybe instead of 0.1, I'll do 0.05. And then the position will also have to be dropped down 0.05. And then this will have to go down another 0.05. So I'm pretty happy with the way this looks. One thing, as you've heard me say that I worry about a little bit, is hopefully it's clear that this bar is the comparison between the healthy and the diarylcid of neg. And this bar is between the healthy and the diarylcid of positive. We'll see. Maybe we'll do some focus groups and figure that out. One thing I don't like about this, though, is I don't like having this column of stars and bars. I would rather have this column be adjusted over to the left. So it's kind of hugging the data. To do that, what I'd like to do is increment it to perhaps 10% to the right of the largest 75th percentile. So again, these whiskers or the line segments indicate the intra-cortile range. So if I can figure out what is the largest 75th percentile for all of the taxa across all three disease status groups, then I can use that value and increment it by 10, 20, 30%. And that is then what I can use as my exposition for drawing those lines and the stars. So I need to calculate the 75th percentile for each of those disease status groups for each taxa, and then return the largest 75th percentile for each taxa. So I'm going to do that using a map function. And again, I'll kind of rough in the code to do mutate max, quartile equals, and I'll say map, and then dot x equals data. And then the formula that I need to use is get max quartile. And we will then give it dot x, and then we'll pipe that into the rest of the pipeline. But I may need to make this get max quartile. So get max quartile will be a function, right? And so we give it an argument x, and we've got the function body here in the braces. And then I will pipe x into group by, and it's already coming in data is chunked by O2. So I'll group by disease stat. And then I will pipe that to summarize. And I will do third q as quantile. And I will give it the relabund column. So I'm taking all the relative abundances for each disease status group in this taxon, and prob equals 0.75. Okay. So that's going to get me the three quartile values, I'll say dot groups equals quote drop. And then we will pipe that and do a summarize and do a max quartile. And that then will be the max of the third q. And I then will pull max quartile as the output from get max quartile. So I forgot the function keyword. So we'll do function all that. And so that loads it. And so now if I run these first two lines, so we get a column for max quartile. And this is a list of type double, right? So there's a special map function we can use, which would be map underscore DBL. And so now if we use max double, then we should have a column with the actual value for the maximum quartile. So now as this pipes through, we get down to here where we're defining x. And what I want to put in here is max quartile, and I'll just say times 1.1. And here I'll do the same thing, perhaps times 1.3. And here perhaps I'll do times 1.5. So it's complaining that it can't find object max quartile. And I wonder if I don't have a select function up here. Yeah, so I forgot to bring in the max quartile. So I'll go ahead and put that in. And that looks pretty good. We've got the stars and bars closer to our data. You know what, maybe I'll increment the value a little bit more. So it's not so on top of each other. And so I'll come back up and maybe I'll make this 1.3, this 1.5, maybe 1.6, let's say, and then 1.9. So I think that looks good and gives a decent amount of space. If you've watched any number of these videos, you know, I can spend all day just kind of futzing with the position of things in my plots. And so I think that looks pretty good. Again, the thing that I just kind of wrestle with is whether or not it's clear that the bar represents the comparison between the groups that the that the line is connecting. I hope so. You know, we could put that information in the legend. If we had to, maybe we could go ahead and use like the same approach to put little whiskers on the on the bars here. I don't really want to do that. An alternative approach could be over here in the margin just to the right of the y-axis. I could put like abc. So, you know, for like, for right, for coca cola, I could put like a for healthy and then bb for the two diarrheal samples to kind of indicate significant scripts to say that like a is distinct from b, right? And then down here where we have for alastepis, the three different would be like abc. I don't really like that either. Let me know down below in the comments what you think of this. Maybe I'm overthinking things too much. I'm kind of happy with the way this looks. Ultimately, you know, I think these data are pretty clear about what's significant just by eyeballing it without even having to really look at a p value. The one thing we know is not significant are these other categories where we pulled together those otus. So I'm pretty happy with the way this looks. And I think I think it turned out pretty well. Again, the nice thing that we've done is use the data itself to determine the positioning of our bars, the length and span of the bars and the position of those stars to give it that nice dynamic look. Again, I could change the amount that I'm pooling the data by and then it should work. So I'm going to go ahead and try that as just kind of my final hurrah for today's episode. And I will come all the way back up to the top where we did pool. And I'm going to change this to 1%. And so changing it to a 1% pooling worked perfectly. Everything adjusted without me having to go into the code and muck with anything. So I'm really happy about how that worked. And I think that looks that looks really good. So again, let me know down below in the comments what you would do to make it more clear what's going on in the comparisons. Is this clear enough? Or would you do something else to clarify it? Again, we could maybe put little whiskers on it, but I don't want to mess with that. I think that would just kind of make everything look all the more busy. Alternatively, we could put letters indicating the significance groups. I think that also just kind of makes things look messy and kind of confusing. So I think this is good. I'm pretty happy with it. And I'll leave it here for now. Anyway, I really hope you found this interesting. Again, there's a lot of different things going on here. You know, if I had to boil it down to what did we cover? Well, we talked about adding other layers onto an existing plot. We talked about building out a data frame and adding columns, where we're using if else, as well as case when both of those are really powerful functions for conditionally creating other data, right? We created new columns based on the values in other columns. And so the ability to use those conditional values is really awesome. Anyway, I hope you found this interesting. And if nothing else, it's kind of a capstone to tie together a bunch of other concepts. Keep practicing with this. I think it's really valuable to see concepts in other contexts. And that really helps you to reinforce the skills that you're learning. Please tell your friends about what we're doing here in Code Club. I've got some exciting episodes coming up here in the next few days and weeks. And I want you to be tuned in. So be sure that you're subscribed and you're getting notifications about what's going on. Thanks for coming along and spending your time with me today. I really hope you've gotten something out of this and we'll see you next time for another episode of Code Club.