It's pretty common in biology to need to run a statistical test many times across many different entities in whatever system you're studying. For example, you might look across the 4,000 E. coli genes for differences in gene expression under different conditions. Or you might look at 4,000 different taxa within a microbial community to find those that are differentially represented across your communities. This can be very helpful for figuring out which features to represent in your data visuals. That's exactly what we're going to talk about in today's episode of Code Club. Hey folks, I'm Pat Schloss. When we're making a visual of our microbial community data, as we've already noticed, because we have hundreds or thousands of different taxa in a community, we can't possibly show all of them in one figure. People try, I know: you see those stacked bar charts with legends of hundreds of different taxa, and you just can't see them all, right? Anyway, we've talked about that; been there, done that. So what we've been looking at is a plot of the median relative abundance, along with a line indicating the interquartile range between the 25th and 75th percentiles, for a variety of disease status groups, plotted for the most abundant taxa. Well, maybe you want to show something other than the most abundant taxa. You might also want to look at biologically interesting organisms, things you think are interesting that you still want to show regardless of whether they're statistically significant or even abundant. In my work, we also do a lot of colon cancer research, and a popular organism there is Fusobacterium. So even if we don't detect Fusobacterium, we might still want to include it in the plot.
That way people know we looked for Fusobacterium, whether or not we saw it. Alternatively, besides abundance or biological interest, we could also look at statistical significance and only show those taxa where there's a statistically significant difference. I'd say in most cases, if there's no statistical significance, is the bug really interesting? If previous work found it interesting and you're saying it's not, then maybe, but for the most part I'm really only interested in the taxa that are statistically significant across our disease status groups, as well as those that are most abundant. In today's episode, we'll do both using some great functions from the broom package. This is an idiom, a set of functions, that's a bit different from what we've seen before, but it has parallels with the group_by and summarize functions. Instead of group_by and summarize, we'll do nest and then mutate with the map function. I think we've actually seen the map function in previous episodes of Code Club, but we'll see it again here for running our own statistical tests. Because we have three disease status groups, we'll do an experiment-wide comparison using the non-parametric Kruskal-Wallis test, and if we get a significant result there, we'll follow up with pairwise Wilcoxon tests. So we'll get to see this nest, mutate, map stream of functions multiple times to get some practice before I send you off to work with your own data and see if you can't implement these on your own. Let's head over to RStudio now and see if we can't implement these functions. Again, we're going to add library(broom). It was installed with the tidyverse, but you need to load it separately with its own library call. If you haven't been watching, go ahead and check out those previous episodes.
We're working with data from a paper my lab published a number of years ago looking for biomarkers in the feces of people with and without C. difficile infections. We've got people who are healthy, people who have diarrhea but no C. difficile, and people with diarrhea who do have C. difficile, about 300 people in the study overall. We've got metadata telling us which disease group each person belongs to, the number of OTUs for each person, the number of times each OTU occurs, and the number of sequences per sample. We've got all this great taxonomy data that we spent the last episode making pretty, and then we get our relative abundance data. Then we've got this taxon_pool step where we figure out which taxa to pool together based on their abundance. Finally, we join it all together to make a plot. I'm going to break into the pipeline right about here, where we're reordering our taxa based on their relative abundance. This gives us a data frame with our sample ID, disease status, taxon, relative abundance, and the median. I'll go ahead and save this as otu_disease_rel_abund, and we can run otu_disease_rel_abund to confirm we've got that to work with going forward. To motivate this a little, I'll do a filter to grab Alistipes, so taxon == that value; I've got to put it in quotes. Filtering the dataset down to Alistipes gives us the sample ID, the disease status, the relative abundance, and everything obviously is from that OTU 6; don't worry about that median column. We've got 338 total rows, data from those 338 different subjects. What I'd like to do is run a statistical test to see whether the relative abundance of this Alistipes OTU varies across the three disease status groups. So what we can do is pipe this to kruskal.test.
Then I can give it rel_abund, tilde, disease_stat, right? We're explaining the variation in relative abundance by the disease status column. And I can say data equals period. With these pipelines, the period tells kruskal.test that I want the data coming through the pipe to be used as its input data frame. The output gives me the result of the Kruskal-Wallis test and tells me my p-value is super small, less than 2.2 times 10 to the minus 16. Now, this would be useful in a sense: I could copy and paste this many times for all the OTUs I'm interested in. But that's not DRY, right? DRY is "don't repeat yourself," and copy-paste is very error prone and really difficult to manage and keep track of. So what we're going to do is take this framework and use it to run the statistical test on all of our data. I'll leave this code here to help us see the parallels between what we're doing with the broom package and what we've already seen for a single OTU. Again, we take otu_disease_rel_abund, and I'm going to pipe that into a group_by on taxon, so I'm treating each taxon separately. I can then call nest, and what we get with the nest function is that it bundles the data for each group together. Now I have a column called data that is a column of tibbles, a column of data frames, which is pretty cool. Each row here is one of 12 different taxa, the 12 most abundant OTUs we had. We wouldn't necessarily have to keep that abundance filtering, but we've got it, so it's fine. And then we have a separate data frame for each of those 12 taxa. The important thing to remember from the nest output is that we now have a column called data holding each of those data frames.
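For reference, the two steps just described, the single-OTU test and the group_by/nest setup, look roughly like this in code. This is a sketch: the object and column names (otu_disease_rel_abund, taxon, rel_abund, disease_stat) and the exact taxon label are my rendering of what's typed on screen, not guaranteed to match exactly.

```r
library(tidyverse)
library(broom)

# Single-OTU test: filter to one taxon, then pipe into kruskal.test,
# using `.` to stand in for the data frame coming through the pipe
otu_disease_rel_abund %>%
  filter(taxon == "Alistipes (OTU 6)") %>%
  kruskal.test(rel_abund ~ disease_stat, data = .)

# To scale this to every taxon, group and nest instead: nest() collapses
# each taxon's rows into a single tibble stored in a `data` column
otu_disease_rel_abund %>%
  group_by(taxon) %>%
  nest()
```

Note that because `.` appears as a named argument (`data = .`), magrittr does not also inject the data frame as the first argument to kruskal.test, which is exactly what we want here.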
I can then use mutate to create a new column that will hold the output from our statistical test; basically, we'll have another column made up of multiple objects. I'll call it experiment_tests, for the experiment-wide tests, and set it equal to map. What are we mapping over? Well, we're mapping over that data column, which map receives as the argument .x. Then we give it formula notation, so we'll use a tilde. I don't know a good way to explain why we need that tilde here, except that it works, so just go with it. Okay, so we call kruskal.test; I'll copy and paste the formula down and maybe put it on a separate line so it's easier to see. So we've got rel_abund explained by disease_stat, and this time our data argument isn't going to be the period, it's going to be the value of that .x argument. And I think we're missing a parenthesis, maybe. What we get back is a data frame with our taxon names, our data, and the column experiment_tests, and we see that it's a column made up of each test object; that's the type of the data. We can make that a bit more useful by piping each test to a function called tidy, which takes each test and turns it into a data frame. Now we see our taxon, our data, and experiment_tests, where experiment_tests is of type tibble: it's made up of all these little data frames. If I want to see what's in that column, I can pipe this to unnest(experiment_tests). Now that we've run unnest, we've exploded open that column of data frames and can see the taxon, the original data (still nested), our test statistic, the p-value, the parameter, and the method. That p-value is not corrected for multiple comparisons, so we need to go ahead and correct it.
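Put together, the nest, mutate-with-map, tidy, and unnest chain from this section looks something like the following sketch, again assuming the column names above:

```r
# Run the Kruskal-Wallis test separately on each taxon's nested tibble,
# tidy each result into a one-row data frame, then unnest to expose it
otu_disease_rel_abund %>%
  group_by(taxon) %>%
  nest() %>%
  mutate(experiment_tests = map(.x = data,
                                ~ kruskal.test(rel_abund ~ disease_stat,
                                               data = .x) %>%
                                  tidy())) %>%
  unnest(experiment_tests)   # exposes statistic, p.value, parameter, method
```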
So I'll pipe this to mutate, and I'm going to create p.experiment, which will use p.adjust. p.adjust adjusts for multiple comparisons; the argument we give it is p.value, and the method I'm going to use is "BH", capital B-H, which is short for Benjamini-Hochberg. We now see that our p-values changed a little bit, but for the most part these are all very, very small. I'll then select taxon, data, and p.experiment so we keep just those columns, and if you wanted, you could also filter for p.experiment less than 0.05. What we'll see is that the "Other" category now goes away. So again, we've gone through this iteration once to run the experiment-wide test, taking each OTU and asking whether it differs significantly between the three disease status groups, and what we saw was that they all did except for that "Other" category. Now the question is: of those three disease status groups, which ones are significantly different from each other? So I'll go ahead and save this as experiment_significance. We're going to basically repeat what we just did, except instead of the experiment-wide significance, we're going to look at the pairwise significance for all of these OTUs that were statistically significant. You'll see that the syntax is very similar. We take experiment_significance, where we've already got the data nested, so I'll do another mutate. I'll call the column pairwise_tests, and we'll use map again, with .x equal to data and pairwise.wilcox.test, giving it x equals rel_abund, g equals disease_stat, and p.adjust.method equals "BH". We'll have a small problem with this, though, in that pairwise.wilcox.test does not have a data argument, so we can't write it with the same formula-plus-data notation we used up here for kruskal.test.
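As a sketch, the correction, column selection, and filtering steps saved as experiment_significance might look like this (same assumed names as before):

```r
experiment_significance <- otu_disease_rel_abund %>%
  group_by(taxon) %>%
  nest() %>%
  mutate(experiment_tests = map(.x = data,
                                ~ kruskal.test(rel_abund ~ disease_stat,
                                               data = .x) %>%
                                  tidy())) %>%
  unnest(experiment_tests) %>%
  mutate(p.experiment = p.adjust(p.value, method = "BH")) %>%  # Benjamini-Hochberg
  select(taxon, data, p.experiment) %>%
  filter(p.experiment < 0.05)   # drop taxa with no experiment-wide signal
```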
So instead, we'll have to use dollar-sign notation to access the columns of the data frame in .x, which isn't too bad: we can do .x$rel_abund and .x$disease_stat. As we saw earlier, we then need to pipe this to tidy. Again, this gives us our taxa, the data, p.experiment, and then those pairwise tests, where each entry is a three-by-three data frame. Then, like we did before, I can unnest(pairwise_tests), and we now get our taxon, the data, p.experiment, group1, and group2, where we have, say, diarrheal control versus case, or non-diarrheal control versus case, as well as our p-value. And again, these p-values are adjusted for multiple comparisons using Benjamini-Hochberg. So this first p-value is 0.055, which is greater than 0.05, so not significant. Again, we can take that and filter for p.value less than 0.05, and we get 23 total rows of significant results. If we wanted to clean this up a little, we could select taxon, group1, group2, and p.value. One thing I'm noticing is that my data frame is still grouped on taxon, so I'll go ahead and add ungroup to my pipeline; now we're no longer grouped on taxon or any other variable. What we can see, for example for Alistipes, that OTU 6 I picked originally as our test case, is that all three of the disease status groups have significantly different relative abundances. Whereas something like Agathobacter OTU 20 is distinct between non-diarrheal control and case, and between non-diarrheal control and diarrheal control. So basically, the two diarrheal groups are not significantly different from each other, but they are significantly different from the healthy subjects. The last thing I'd like to do is go ahead and remove this pooling by abundance. I'll also remove the three lines we had here for looking at that one OTU.
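The pairwise step, with the dollar-sign workaround for pairwise.wilcox.test's lack of a data argument, might be sketched as follows (pairwise_significance is a hypothetical name for the saved result):

```r
pairwise_significance <- experiment_significance %>%
  mutate(pairwise_tests = map(.x = data,
                              ~ pairwise.wilcox.test(x = .x$rel_abund,
                                                     g = .x$disease_stat,
                                                     p.adjust.method = "BH") %>%
                                tidy())) %>%
  unnest(pairwise_tests) %>%   # gives group1, group2, p.value per comparison
  filter(p.value < 0.05) %>%
  select(taxon, group1, group2, p.value) %>%
  ungroup()                    # drop the lingering group_by(taxon)
```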
And so again, we did that pooling up here with that taxon_pool data frame, so I think we could use otu_rel_abund as the input to all this. Let's give that a shot. That took a moment or so to run, and if we look at experiment_significance, we see there are 348 OTUs that had a significant p-value after correcting for multiple comparisons. This can then be fed into the next stage of doing those pairwise Wilcoxon tests to figure out which comparisons were significantly different from each other, and here we see 566 significant pairwise comparisons. That's not quite double what we saw before, and I think that fits what we saw in the study: the diarrheal samples, the cases and the diarrheal controls, are generally not significantly different from each other, but they are significantly different from the healthy individuals who didn't have diarrhea. We kind of see that for most of these OTUs. In general, I think we're going to want to use a combination of statistical significance and abundance. One thing I come back to frequently is the idea of how many read counts a relative abundance actually represents. For example, if I do my pooling at 0.5%: we had n_seqs per sample of 1507, and 1507 times 0.005, that half a percent, represents about seven and a half sequences at the median number of sequences per sample, if I were looking on a per-read basis rather than at relative abundance. That's probably about as low as I want to go. I could cut the threshold down to, say, 0.001, but then I'd be basing things on a median of about one and a half sequences per sample, and that's probably not really what I want to do. So I think ultimately I'll go ahead and bring the pooling back and use otu_disease_rel_abund, because I need to have some kind of critical mass of reads for my analysis to really be robust.
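The back-of-the-envelope arithmetic behind that threshold choice is simple enough to check directly (1507 is the per-sample sequence count quoted above; `n_seqs` is just a name for it here):

```r
n_seqs <- 1507       # sequences per sample, as quoted above

n_seqs * 0.005       # 0.5% threshold: 7.535 reads, about seven and a half
n_seqs * 0.001       # 0.1% threshold: 1.507 reads, too few to trust
```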
There are other ways of getting significance out of relative abundance data. We'll talk about some of those methods in coming episodes, things like LEfSe and metastats; people also like DESeq2. Ultimately, they're all trying to do the same thing: identify features, OTUs, that are differentially represented between different groups of samples. So I'm going to leave this here for now. What we're going to do in the next episode is take the result of these pairwise tests and use that to annotate our figure, showing which comparisons are and aren't significantly different by putting in some stars, bars, and NS labels, so when you look at those relative abundance plots it's easier to see which comparisons are significant. So stay tuned for that. After that, we'll go ahead and look at some of these other methods like LEfSe to see whether they give us any difference in our results. Anyway, lots of cool stuff; we're going a little into the weeds for a more science-based audience than you'd probably target if you were presenting this to the general public. Again, the bigger point here is to learn about the broom package and that combination of functions, nest, mutate with map, and then unnest, to repeat an analysis on different chunks of a data frame. Here we're grouping our data frame by OTU; you could instead group by disease status and get summary statistics or some other comparison across other chunks of a data frame. And again, this isn't only for relative abundance data. You can use it in all sorts of other applications where you're getting more complicated, multi-column output from a function. Remember that group_by with summarize requires that the summary function return only a single value, whereas here you can see we're getting many columns out of our analysis.
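To see that contrast with group_by/summarize outside the microbiome example, here's a minimal, self-contained toy sketch; the data and the lm() model are invented purely for illustration:

```r
library(tidyverse)
library(broom)

# Toy data, unrelated to the microbiome example, just to show the idiom
set.seed(1)
toy <- tibble(group = rep(c("a", "b"), each = 20),
              x = rnorm(40),
              y = rnorm(40))

# summarize() works when each group collapses to a single value...
toy %>%
  group_by(group) %>%
  summarize(median_y = median(y))

# ...but a model fit returns multi-column output, so nest/map/tidy/unnest
toy %>%
  group_by(group) %>%
  nest() %>%
  mutate(fit = map(data, ~ lm(y ~ x, data = .x) %>% tidy())) %>%
  unnest(fit)
```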
So again, the broom package is really powerful and makes these kinds of more complicated analyses a lot more straightforward than they might otherwise be. I encourage you to play around with it and see if you can find places to use this new idiom with your own data. Report back and let me know what you found, and if you run into any problems, we can dig into this a little more in the future; I'd love to do that. I know I still get a little rusty using the broom package's functions in my own work, and practice is good, right? Even for people like me who have been coding a lot. The broom package is still relatively new in the world of the tidyverse, so keep practicing and you'll get better. Please be sure to subscribe to the channel and tell all your friends about what we're doing here, and we'll see you next time for another episode of Code Club.