 Did you know that you can run other programs from within R in RStudio? Pretty cool, huh? Well, in today's episode, we'll do just that when we run the lefts of function from the mother software package as part of a script that will write to generate those classic lefts of plots. And then, well, maybe we'll discuss why those plots aren't such a great idea. Hey folks, I'm Pat Schloss and this is Code Club. If you've been following along in recent episodes, you know that I've been looking at a variety of approaches of representing relative abundance data for microbiome data for three different disease status groups from a study that my lab published a number of years ago. Those three disease status groups are healthy people, people with diarrhea, and people with diarrhea who also have C. difficile. We wanted to know are there biomarkers, are there bacterial taxa associated with those three different disease status groups so we could perhaps learn a little bit more about the disease? What we've done in recent episodes is that for each OTU or each taxonomic level, we did a Krosko-Wallis test. We then corrected P values for the multiple comparisons. And then for those comparisons that were significantly different across the three disease status groups, we then did pairwise Wilcoxon tests to figure out which of the different taxa were differentially represented across the three disease status groups. And then in the last episode, we went ahead and put those beautiful stars and bars on all of our data to indicate which comparisons were significant. Doing that requires a fair amount of sophistication in your use of R, right? You needed to know how to use Nest and MAP and Krosko-Wallis and Wilcox, pairwise Wilcox, all these different things, right? Well, one of the things that's been really handy in the microbiome literature is people developing statistical tests that kind of perform generic tests like a test across different disease status groups, say, and doing it in a packaged way. So you put your data in, you get results out and then you're good to go. These pre-packaged tools also often output visuals for the user that many times users will take and then use in their own papers without further modification. Hopefully if you're following along and you're interested in a video like this one, you realize that that's just not gonna cut it. So this figure was taken from the original paper that described the lefts of package. And we will see figures like these populate the literature now. On the left is a cladogram that shows each kind of ring of this radial phylogenetic tree is a different taxonomic level. I'm not such a big fan of this. We'll move on and talk about that another day if people are interested. If I look at this, you know what this looks like? It looks like a pie chart. The right side, I see this type of plot showing up quite frequently. And so what are some things to like or not like about this figure? First of all, on the x-axis is the LDA score which is in a log 10 scaling. So tell me, what is an LDA score? I mean, really, come on, tell me, what is an LDA score? Well, if you can't tell me what an LDA score is then please don't publish this figure because if you don't know what it is then why do you expect your audience to know what it is? This LDA score is a measure of the effect size. It is a measure of how separated are the two different groups based on this one taxonomic grouping. And so if you have a value more to the right then that is a more associated in this case with Tibet rag knockout or Tibet, I don't know what these are. These are mass genotypes. And if it's more to the left then it's a rag too knockout, right? And so we can then see that these biomarkers are skewed depending on the type of mice that they're coming from. The way Lefse works is to do a Wilcoxon test on each taxonomic grouping. It identifies those OTS or features that have a P value less than a threshold, say 0.05. And then instead of doing a correction for multiple comparisons like we have used in the past like Benjamini Hatchberg or perhaps you've heard of Bonferroni, what it then does is apply this LDA, the Littier Discriminate Analysis to then rank the features by their effect size. And so those features that have a larger effect size are then represented if the effect size is, say, bigger than two or three and those that are smaller than are kind of ignored. Again, this figure is showing a pair-wise comparison between two genotypes. In our case, we have three different groups. And so that would be three pairs of comparisons. As we've seen in previous episodes we've been working on developing this figure where we can show the actual relative abundances of the three different disease status groups. Their level of variation for each taxonomic grouping that we're interested in. Again, to represent the same information with these LEFSA plots, we would need three LEFSA plots. And I don't think the LEFSA plots are doing that much work, right? They're showing you the effect size but they don't actually show you like the actual relative abundances. So in this episode, I'm gonna show you how we can run the LEFSA function from the mother software package within R in RStudio. We'll create the files that we need. We'll call mother to run it and then we'll read in the output files and then generate a figure just like this. Let's go ahead and head over to RStudio now and we'll get going. If you'd like to follow along today's episode with your own computer, I strongly encourage that you do that to really help learn the material all the better. Down below in the description here is a link to a blog post associated with today's video where you can get this starting code chunk. Also across the top here, I'll put a link for instructions on installing R, RStudio, the tidyverse and getting the data that we're gonna be working with. You see that we load in the libraries. We then read in some taxonomy information, get nice formatting and things like that. We then read in the metadata for this read Excel and then we read in a shared file. I'll go ahead and load all this into my current RStudio session and we'll get going. One of the things that we'll need before we can proceed into mother is two files to run the LEFSA function. So we'll need a shared file, which is a mother formatted file that we've been kind of working with in recent episodes as well as a design file that tells LEFSA what kind of sample each sample is, what group or disease status group each sample belongs to. Also, in some cases, in some projects, you might have different samples represented in your shared file than you might in your metadata file, right? So perhaps you sequence things but not everything got fully sequenced and so you might have more things in your metadata file than in your shared file. So we need to make sure that those two files are the same that contain the same information on the same samples. So we'll go ahead and look first at metadata, which again contains all sorts of metadata, the sample ID and a whole bunch of other stuff. So I'll do a select on sample ID and disease stat. And so now if I look at metadata, I'll have those two columns and only those two columns. Also for my shared file, that is going to be a massive file with a whole bunch of columns and rows. It's got 338 rows and 2,491 columns. There's 2,488 total OTSUs that we're gonna be looking at. What I'd like to do is make sure again that my metadata and my shared file contain information on the same samples. I can do that using an interjoined function where I could give it a shared file and metadata. And because the columns indicating the sample ID are different, I need to use the buy argument. So I will give buy the column for the shared file first, group equals and then it's sample ID from the metadata file. If I do call names on this as part of a pipeline and then look at tail, I see that the very last column is my disease stat. So I'm gonna go ahead and save this as shared design. So with this shared design file, I can then specify different disease status groups that I want to compare. So I'll get two different disease status groups and then from that, I'll create a shared file and a design file and then I can run that into mother. So let's do an example, right? So we could do shared design and pipe that to filter and I could say disease stat equals equals non-diarrheal control or disease stat equals equals diarrheal control, right? Maybe I'll put a line break here just to keep things on separate lines and make things so they don't run off the edge of the screen. And I now see that I have how many samples. I've got 244 samples. And so this then is a shared design file for my non-diarrheal diarrheal control comparison. And so maybe I'll call this NDC DC. So with this, I could then create a design file and a shared file for this specific comparison. I can do NDC DC and I could then pipe that to a select and I will do minus disease stat and I can pipe that to write TSV. And I wanna put this in a directory that I'll call processed data. I have a directory called raw data. So I'm gonna put this in process data because these are not raw. I'm gonna keep my raw data, my process data separate and then I will output this as Schubert.ndcdc.shared. I run that and it complains because it cannot open the file, process data Schubert.ndcdc.shared. And so what I can write would be dir.create processed data and that then creates a directory that now pops up over here in my files tab. And so I can now run this there and I then put Schubert.ndcdc.shared into that file. I wanna do something similar to create the design file. So I'll do NDC DC and pipe that to select and here I'll get grouped as well as disease stat. And I will then pipe that out to write TSV. And again, this is going to be processed data Schubert.ndcdc.design. So again, looking in my process data directory I now have those two files and I'm ready to run this in mother. One problem is that I don't have mother. Firing up my browser. I'll go ahead to github.com forward slash mother forward slash mother. This brings you to the mother github repository on the right side down here. There's a link for releases. The latest release as of I'm filming this here at the beginning of June 2021 is version 1.45.3. That brings you to the latest release version. I'm running mine on a Mac. So I want this Mac OS X-1014.zip. If you are on Windows, then you want a mother.win.zip. And if you're on Linux, you want the mother one of these Linux ones. So I'll go ahead and download mother OS X-1014. So I've got my mother directory in my downloads and I can go ahead and move that into my code club directory here. And now looking in code club I see that I have the mother directory with everything I need to run mother. If you're in Mac land, they've got all sorts of security provisions going on now. To get mother to work, you need to right click on it and then hold the command key open. And that will then open up a dialogue that says, are you sure you want to open this? Yes, go ahead and open it. And then that fires up mother. And so I'm gonna quit and we'll be good to go now there. Coming back to our studio, there's two ways, at least two ways that we can run mother from within our studio. So first I can go up to tools and then do terminal, and I can say new terminal and that opens up a new terminal down here in the lower right corner. I could then do mother forward slash mother and that gets me my mother session from within our studio pretty slick, eh? So I can then at this prompt run lefse. So I could do lefse and I could do shared equals shubert.ndcdc.shared and design equals shubert.ndcdc.design and I can say input dir equals processed data. Close parentheses and we can run this and it runs lefse, right? And it then produces this output file, okay? Another way to run mother from the command prompt would be to quit out could then do mother forward slash mother and we can then run commands for mother from the command line. And so we do that by putting it in quotes. So we do a quote and then a pound sign and then I'm gonna paste in the actual lefse command and then a closing quote, right? So this will run lefse without me having to go into mother and run the lefse function directly. So again, this way of running it from the command line will be very helpful as we think about how to run lefse from within R. So I'm gonna go ahead and paste that lefse command here. Again, remember it was mother forward slash mother and then we did a quote, pound, lefse, all that and a closing quote at the end. So this is a long string and what I will do is create a variable that I'll call command and I need to then wrap all this, the mother function call in a single quote, okay? So I can't do double quotes because I've already used the double quotes and so it doesn't matter which you use first, the single quotes or the double quotes but we wanna load this command string. Now within R we can type system command and so system command will run that command from within R and so pretty slick. It's then running a mother from within R. Isn't that pretty cool? And then this will output this file which is Schubert ndcdc 0.03 lefse summary and so that's good. And so we can then take this file and read it in and make our nice plots, okay? But I wanna do this for ndc non-diarrheal control versus case and diarrheal control versus case. So what I'd like to do is turn this series of commands into a function so that I can give it the different disease statuses I wanna compare as well as a tag for what I wanna name the files. Getting going on this, I'm gonna create a function that I'll call run lefse and so we'll do run lefse equals function and I'll give it x, y and tag. X and y are going to be these values so non-diarrheal control and diarrheal control in this case, that was the example that we used. Of course, next time we run it, x and y will be whatever value we use when we call run lefse. I'll then call this x, y because I'm not very creative with names and then I'll need to use x, y here as well as here. For this tag, I can use glue to put in the tag but I need to write glue around all this so that what the glue function will do is insert the tag into my file name here. I can do the same thing down here with the design file by again doing glue and then replacing that ndc, dc with a pair of braces and tag in there and then I can do all this and here again, I'm gonna need to put in my glue but we'll get this going here and again, I can then wrap this in glue. Okay, so that will be good. I'm gonna go ahead and indent these lines over a bit and then we need to put in our closing curly brace and then I'm going to return the mother output file name, the lefse summary file and so if we look at our terminal, that was this and we'll go ahead and put that in quotes and I'll replace this ndc, dc with tag and curly brace and of course I now need to put glue around all this. I can then run this with run lefse where I do a non-diarrheal control, diarrheal control, gotta spell it right and then I'll do ndc underscore dc, run that. So again, we can then repeat this for our other three comparisons but for now I'm gonna call this ndc, dc and again, I can copy this down to non-diarrheal control ndc case where I'll then put in case here and then in my tag, I'll do case and then here I'll do dc underscore case and I can then do diarrheal control there and case and then dc case, right? Again, now we have these three variables ndc, dc ndc case, dc case that contain the file names of that lefse summary file that we generated in mother, right? Let's go ahead and make one of those classic if you will, lefse plots. So we'll go ahead and read underscore tsv ndc, dc and we get a sense of the layout. It's giving us a bunch of errors. They're not errors, they're warnings of parsing failures. So the problem is that like for row seven, it says five columns expected, three columns actual and then if we look at seven, what we see is that this is a case where there wasn't a significant value. Anyway, looking at this table, we see in the first column, the otu, the log max mean, the class as well as the LDA value as well as the p value. So again, if the p value is less than 0.05, then it calculates the LDA and then tells you which class has the higher mean or the higher median value. Again, let's think about this plot on the right side of the screen here. And what it is, it's a bar plot that's vertical. So the y-axis is the otu and the x-axis is the LDA score. Interestingly, the LDA score is positive for one case and negative for the other case. And so our LDA values are all positive. And so those that are associated with one of the two disease status groups will need to make negative and the other will have to keep positive. Let's go ahead and start here with the read TSV and DC. I'm gonna do a drop and a on LDA. So again, this will get rid of any row from our data frame that has an NA value in that LDA column. Then I want to modify my LDA column. So I'll do mutate LDA and I'll say, if else, and I'll say class equals equals and I'll make my non-diarrheal control go off to the negative side, right? So if the class is non-diarrheal control, then I want negative one times LDA. Otherwise, I'll keep the plain LDA value, right? And so now what I see is that I misspelled the real. And so let's run this again. Okay, now we get our non-diarrheal control samples have a negative LDA. Our diarrheal control are positive. So they're going off to the right and we can then again ship this off to GD plot where our AES, our X will be LDA, our Y will be OTU. And so we'll come back and substitute OTU in with the actual taxonomy name. But for now, let's go ahead and do geom call. And again, what we see is we've got our OTUs on the Y axis or LDA on the X axis. We don't have our color and we don't have things sorted by LDA. We also would like to eventually bring in the taxonomy information for those labels. Let's start with the color. And so we'll do fill equals class. Good, we've got things colored by class now. To get things in the right order, we'll go ahead and add a line to this mutate where I will do, I'll make OTU to be a factor. So I'll do FCT reorder on OTU and we'll reorder it by LDA. And sure enough, we now have things reordered. So the diarrheal controls are up at the top. The non-diarrheal controls are towards the bottom. That looks good. I'll go ahead and join in the taxonomy data frame with inner join. And I'll take the stuff coming through the pipeline with taxonomy and then I need to do a buy because my columns have different names. So in the pipeline coming through, it's all caps OTU and in the taxonomy, it's lowercase OTU. Add that to the pipeline. Again, that's a strong argument for keeping your columns the same capitalization. So now instead of OTU, what I actually want to use here is taxon because the taxon, the stylized name is what I actually want on that y-axis. So taxon and then I want to do theme classic to get rid of all the grid lines and all that stuff. And then theme, I'll do access.text.y equals element markdown. So element markdown will take those stars in my taxonomy names and make them italicized. And I get that from the ggtext library. ggtext is installed with the rest of the tidyverse. And so there's like way too much stuff here, right? And so you can see maybe the font's too big or I think really ultimately there's way too many taxa on the y-axis. To get rid of some of those OTUs, what I'll do up here in my pipeline, I'll do a filter to keep those LDA values that are greater than four and get rid of some of those ones that have a smaller effect size. Great, so that kind of whittles down the OTUs to a smaller number of features that's a little bit more manageable and more on the scale of what we already saw with our plot where we're showing the relative abundances for the three disease status groups. And we've got a nice styling of our taxa. So let's go ahead and start with changing our labels. So I'll do labs, y equals null, and then x, the styling that they use in the default plots from Lefse is LDA score, and then log space 10, add a plus to the end of that. And then for my x-axis, I'll do scale x continuous. And I'm going to do limits equals C minus six to six. And I want breaks, I'm gonna do, I'll do seek minus six to six by equals two. And with that, it'll be minus six, minus four, minus two, zero, two, four, six. Go ahead and add a plus to that. So that's looking better, better styling. Let's go ahead and change the colors now. And so we'll do scale, fill, manual, and we'll do name equals null. And then our breaks will be non-diarrheal control and diarrheal controls. All right, and then our labels will do healthy and we'll do diarrhea. And I'll do C difficile negative. I know it's another star after difficile. And then to match the colors, we'll do values. So again, the values off to the left are red and those to the right are green. And so I will do non-diarrheal controls will be red and then dark green and we'll add a plus to the end of that. So that matches pretty well. The colors, the colors are bad, right? Should not be using red and green because a large fraction of people are unable to differentiate between these two colors. So let's go ahead and maybe change these back to our stylings that we've been using throughout these episodes. I also need to go ahead and add element markdown for the legend text. So again, instead of red and green, our healthy had been gray and our diarrhea seed of negative had been blue. And I also then need to do legend dot text is element markdown. And so that's pretty good. I think I'll go ahead and put a line break in there in the title and maybe I'll make the size of the plot a little bit shorter. So those rectangles, the bars aren't so tall. And so maybe my height, I'll make four. I'll go ahead and put in a line break in here between diarrhea and seed of negative. So I think this looks pretty good. Definitely much better than the color scheme that you get for the defaults. Again, I'm not totally thrilled with this visual. I would need to repeat this two more times to generate the comparison between the diarrhea seed of negative and the diarrhea seed of positive as well as between the healthy and seed of positive. I already have that in that other plot we generated where we saw the relative abundances as well as the variation in the relative abundances. I feel like as we've been stepping through these episodes we've identified a couple of limitations with different visuals. So you'll recall that like a stack bar chart and pie chart shows you no variation in the data. I mean, it's also got other problems, no variation in the data and no sense of what is significantly different from other things. This shows you significance, right? In a way like the LDA scores kind of like a pedi. It's the effect size, but so it's showing you like what's different, but it's not showing you the average relative abundance. It's not showing or the median relative abundance. It's not showing you the variation, right? Again, if we come back to this version of the visual we see the median, we see the amount of variation. We can see what's significant or not. Perhaps it's a little bit more challenging to interpret the effect size because again, we've got this log scale on the x-axis. But I can kind of, I can at least compare that like for Bactroides, that's like a 10 fold variation relative abundance. Again, that's O2U2, where again, if I come back here to O2U2, that's Bactroides here, I've got a negative four point something LDA score for healthy. It's not really intuitive what that means. So if I really wanted to use LEFSA for some reason, I think what I would do would be to run all three pairs of comparisons through LEFSA and MOTHER. I would then get the LDA scores to figure out which O2Us are associated with which treatment groups, those pairwise comparisons. And then perhaps come back to this plot and perhaps instead of these stars being generated from the pairwise Wilcoxon test, I would then put stars on here and I would say, the stars indicate that the pairwise Wilcoxon was significant and there was an LDA score greater than whatever threshold you use. Here we used four, but perhaps you could go down to two and you could use that to indicate your star. At the same time, I'm not so convinced that I probably would use LEFSA. I don't think that I'm getting that much more information. And by using these prepackaged tools like LEFSA or Metastats or other things that are out there, you really kind of pigeonholed into their experimental design. Say I had people that went through two different treatments and I had people before and after each treatment and I wanted to know which treatment changed the people's microbiota the most, right? Well, LEFSA would not help me do that because I would not be able to incorporate the paired nature of the data. I could do that type of a paired analysis by again rolling my own statistics through something like a Wilcoxon test, a paired Wilcoxon test, like we saw in the previous episodes using MAP. So again, I'm not totally convinced that these types of LEFSA plots really convey that much information. And I don't know that it's worth having multiple LEFSA plots like this. When again, it's just far more information that you could show in those other plots where we're showing the relative abundance and the variation in the data. Agree with me? Disagree with me? Let me know down below in the notes, in the comments, I'd love to hear what you think. I feel like because the LEFSA R package and I think it's available through CHIME, outputting these types of visuals directly that people are like, ah, there's a file that has a figure. I will use that directly in my paper that that kind of removes a lot of resistance that people face in analyzing their data. So hopefully going through this episode today, you'll see how we can make the plot. If you have a PI that wants you to use a LEFSA plot, how could we make them better by picking better colors perhaps? You can better understand kind of the strengths and weaknesses of these LEFSA plots, but also the other thing that I wanted to drive home in this episode is how cool it is that you can run other programs from within R in RStudio and doing that with mother. You know, I certainly think there is a reason to run everything through a bash script like we've seen in older episodes of Code Club and then bringing that within R, but you know, it's pretty powerful to be able to run commands from the command line from within R using that system function call. Give that a shot. Let me know what you find out. Let me know if you find that useful for your work. Hopefully you see the value in doing that. Keep practicing. Tell your friends about what we're doing here and we'll see you next time for another episode of Code Club.