 If you look closely at recent announcements for special issues in journals or requests for applications for grant proposals, one of the things you might notice, well at least I notice, is that there's this call for greater mechanism or getting more towards function. But this is really, in my mind, a reductionist mindset that doesn't lend itself very well to studying the microbiome. Why is that? Well, because we're studying communities of bacteria. Unfortunately, the method that we typically use of analyzing microbial communities at the population level to try to get more to a mechanism is to treat each taxa, each otu, each genus, each implicant sequence variant as its own thing, right? And so we're basically trying to pursue Koch's postulates. And if you've ever seen me give a talk, at least over the last 10 years, you'll know that I don't think much of Koch's postulates because it ignores the ecology of organisms. And the idea that we know is true for the microbiome, or at least we think is true for the microbiome, that it's actually multiple bacteria working together, or perhaps in one context, one type of bacterium, in another context, a different type of bacterium, but they all cause the same disease. The analogy that I use in my talks is of the human genome, right? So the number of traits that are Mendelian and driven by one gene is vanishingly small, whereas if you look at something like height, it's controlled by dozens of genes, right? And so I think most traits in human genetics and the microbiome are driven by dozens of genes or dozens of microbes, not a single gene or a single microbe. So unfortunately, things like Lafse or doing t-tests or DC or whatever you're going to do, really get you towards Koch's postulates. What is that one bug causing the disease? That is the motivation for the package my lab's developed called MicropML. And so while MicropML can be used for any type of data, I'm going to be talking about in the context of microbiome data, because MicropML is a package that will allow you to use machine learning algorithms and methods to identify biomarkers of disease or of any kind of trait that you're interested in in your population of people or samples or whatever it is you're looking at that has a microbiome. And so instead of looking at individual biomarkers separate from each other, we're looking at collections of biomarkers, collections of populations that can be used to classify individuals or samples as being from different phenotypes or different characteristics or whatever, right? So in these episodes of Code Club as we march through introducing MicropML, what I'm going to do is revisit a data set that my lab's published on a lot. And if you've looked at my minimal R teaching resources, you'll be familiar with and that's the Baxter data set. Now I'm doing that because a collaborator of mine wants me to analyze the data in yet a different way, because we're trying to get a larger collection of samples that we can kind of continue to validate this approach of, you know, perhaps someday getting into patients and seeing if we can predict whether or not somebody has colon cancer based on a stool sample. So that's my motivation. That's my application interest. But now that you can do this, if you've got soil or water, if you're looking at people with cystic fibrosis, if you're looking at whatever, right? And it's been widely used already, even though it's been published for for just a few months. Again, this problem that we're trying to overcome is the mindset, the reductionist mindset, that we need to look at individual bugs and say, what is the bug? What is the bacterium that is causing colon cancer? And I say, what are you talking about? There's dozens, most likely. And they're most likely working together to cause disease. And furthermore, it could be the loss of organisms that is causing colon cancer. So before we get into those fancy machine learning tools, I want to show you what I mean by pursuing that O2 by O2 or genus by genus approach of identifying populations that are associated with colon cancer. So let's go ahead and go over to our studio and we'll get going with today's work. Remember that if you'd like to follow along with me, down below in the show notes for today's episode, there'll be a link to a blog post for today's episode that has all the instructions you need to get caught back up. I'll also put the code, the R script that I'm working on in there as it currently stands before I do anything here in our studio. But I would love for you to get the whole project so that you can do everything with version control and everything else that I do in today's episode so that you can you can have the best possible learning experience. Anyway, here I am in our studio in the lower right corner here, you can see the status of my directory. You'll know that I am in the right directory because I've got micro ML demo off of my desktop. You can have micro ML demo wherever you want. But what's important is that you are in that directory. And that is your current working directory. I got to this working directory by double clicking on my micro ML demo our proge file. So again, up in the upper right corner is my git tab and you'll see that it is blank, which means that I haven't changed anything. And so we are ready to go with today's episode where we're going to take the code from the last episode this genus analysis. And this code we are going to take. And we're going to ask the question of what general are significantly different between people with and without screen relevant neoplasias. And again, we're going to start with that code that we finished producing in the last episode, which again reads in our shared file or taxonomy file or metadata, and then generates this composite data frame that kind of throws it all together. I'll go ahead and run this and we'll make sure that it hopefully works. All right. And so if I do composite, we'll get this output for the data frame. And you'll see that we've got our group. So the subject ID, the taxonomy, the relative abundance, and then a whole bunch of other metadata about each of the subjects, all in one big composite data frame. So as I mentioned, I want to look at each genus and figure out what genre are significantly different between people with and without screen relevant neoplasia. My guess is that it's not going to be very many. And that's again, because the human microbiome is very patchy across individuals. So to do that, we're going to repeat an analysis that we did a few episodes back using the Schubert data where we had three groups. So because we had three groups that we are comparing, we use the Kruskal-Wallis test and then the Wilcoxon test. But today we're going to do the Wilcoxon test because we only have two groups, right? True false on screen relevant neoplasia. So let's go ahead and do composite. And then we'll pipe this into the nest function. This is a little bit different than what I did last time. You may recall last time I kind of goofed things up a little bit with correcting for p values because I didn't do this step. So we will nest. And we're going to nest all of the data except for the taxonomy column. So we'll say data equals minus taxonomy. And looking at this, now what you see is that we have two columns. We have a column for the taxonomy and a column called data. And so that data comes from this nest statement, right? So the data column is all of the columns except, and so that's the minus sign, the taxonomy column, right? And so now we have two column data frame. The first column is of type characters, so the name of each of the genera. The second column then is of type list, and it is a collection of different tables or the collection of data frames, 490 rows by 20 columns. So 490 rows, there are 490 subjects in the study, and then 20 columns of metadata that we could use for doing all sorts of tests. Now what we'll do is we will pipe this into a mutate function. And again, the mutate function allows us to create new columns. And so this mutate column that I'm going to create is going to contain the results of my statistical tests, right? And so to do that, I'm going to create a column that I'll call test, I'll then say test equals map. And so the map function allows you to take a function and iterate it over different rows within your data frame, right? And so what are the rows that I'm going to iterate a function over? Well, it's going to be over the different rows of my data column, right? So each row of data is going to get fed into whatever function I give map here. And so it's going to be the Wilcox test, because again, I have two different values, or two different levels or categories within that SRN column. And so to do this, I'm going to say dot x equals data. So I normally leave off the argument names, because it's more typing, right? Well, in this case, I find that it helps me to put in the argument dot x, so that I remember what the data are that we're going to be iterating over. And you'll see what I mean here in a moment for, for why at least I find that to be helpful. And then we're going to then, again, define the function that we're going to iterate over. And so I can then say tilde, and that tells us that tells R at least, or the map function, that this is going to be a function. And then we can then say Wilcox dot test. And we are going to use the Wilcox syntax now. And I'm going to say relabund. And so I want to explain the relative abundance within each of this, these tables, right? So within this, a betrophia, table in the data column, I'm going to take the relabund column from this data frame, and then explain it using the tilde by SRN. So if I wanted to use lesion, or some other variable, I could put, I could put in that variable name here instead of SRN. But again, I want to see how that generate vary across different SRN values. Then I can say data equals period x. And again, that's where I find having this argument name dot x comes in handy, because that is going to then get plopped in here for data for the Wilcox test. So this data is an argument name for Wilcox test. It's not the same as this data column that I have in my nested data frame. Okay, so we'll run this. And now what we get out is this table that now has a new column again, because we use mutate to create a new column of type list like we saw for data. And it's each value as of h is a value h test, right? So to get this to be tidy or to be a data frame in my test column, I'm going to pipe the output of Wilcox test to a function called tidy. And that will tidy, if you will, the output of the Wilcox test that comes to us from the broom package, which means up at the top here, I need to do library broom. And so now if I come back down and rerun this pipeline, I now see that this changed from being h test up here to being of type tibble one row and four columns down here. And that looks right. I'll now pipe this to a unnest. So unnest test. And that will take this test column and unnest it. And what you'll see now is that we've got a 280 by six column tibble, where again, for each of my 280 genera in the dataset, I now have the results for all of those Wilcox and tests. And so we see there's a statistic, a p value method and the alternative test. I'm interested in this p value column, because the next thing that I want to do is I want to correct for multiple comparisons, right? So if I do, let's just round to 300 statistical tests, then by chance, if I use a 0.05 threshold to assign significance, then I'm going to have that's five and 100. So if I've got 300, it's gonna be 15 false positives, basically, right? And so if I again do, so let's see what we get actually, I'm curious. So if we do filter p dot value less than 0.05, we get 26, right? And so we would again assume that 15 of those or give or take, right, are false positives, right? So we again need to correct for multiple comparisons. And so to do that, we'll do another mutate where we will create a column that I'll call p dot adj. Let's just say p adjust equals, and then we'll say it equals the output of the p adjust function. It's not always a good idea to name variables the same things as a function, but this will be okay. And so the argument for this that we're going to give it is the p dot value column. And we'll say method equals bh. So that's a Benjamin Hatchberg method for correcting for multiple comparisons. Other methods you may have heard of will be like Bonferroni. Bonferroni tends to be really strict, whereas Benjamin Hatchberg is not so strict, not so conservative, and has pretty good performance that people find with microbiome data, microarray data, people do microarrays, I don't know, whatever. So now if we look at this output, we see that we now have this p adjust column, which is great. And we can then pipe that into filter. And then we want not p dot value, but p dot adjust less than 0.05. And so now we find we have seven genera that have an adjusted p value less than 0.05. And so I'm pretty happy with that, that is a very small number of genera. I suspect if you look at different environments or different problems, you don't have this patchy problem that we have in the microbiome field. So you might have maybe many more taxa that are significantly different between your different categories. Anyway, let's go ahead and clean this up a little bit by doing a select to get the taxonomy column and the p dot adjust column. And so now we're going to have a two column data frame right with the taxonomy and the p adjust value. And I'm going to go ahead and call this sig underscore genera. And so what I'd like to do is to take these significant genera and use that to filter my composite data frame so I can make a plot showing the relative abundance values for every subject in my study for these seven different genera. If I take composite and pipe that to an inner join with sig genera, and I'm going to join by the taxonomy column. I now have those seven genera. And I can now pipe this into ggplot where I'll do aes. And I'm going to put my relative abundance on the x and my taxa on the y. And so I'll do x equals relabund, y equals taxonomy. And I'll do color equals srn. And we might fiddle with things here and before we sign off, but I will go ahead and do geom jitter. And yeah, let's see how things look and we'll tweak as we go. Now let's go ahead and save this. I'll do gg save. And I'm going to put this into a directory that I'll call figures. And we will then say figures and let's call it significant genera dot tiff. And I'll do width equals six height equals four. And I think I need to create a directory so you could do this with the finder or whatever. Actually, let me show you the terminal tab. So there's a terminal tab that opens up here. And you'll see that I've got some funny stuff here. It's raining. It's been raining for like weeks, it seems in Ann Arbor. But I'm here in my micro op ml demo directory. If you don't have this terminal tab, then I believe you can come up to tools, terminal, new terminal to get this going, right? So I can then enter terminal commands, right? So if you're in Mac or Linux, I can do mkdir, what I want to call figures, right? And so now if I look to an LS in here, I do have that figures directory. I believe it's the same argument if you're in Windows, right? So let me go ahead and save that. Voila, we have our very ugly jitter plot that maybe we'll do a few steps here to make it look a little bit nice. But what I'm trying to do here really is kind of drive home the point of the patchy distribution of those taxa that are significantly different between people with and without a screen relevant neoplasia. One of the things that I just have to do is to dodge my points. And so I forgot to add the position argument to geom jitter. Again, back up here in geom jitter, I can do position equals and this is always a tongue twister for me position equals position jitter dodge position equals position jitter dodge. So when I teach workshops and have to say that constantly, oh, it's just horrible. And so we've got a dodging of our points there. I think we can separate it a little bit better to make it easier to see what's going on. So I'll do dodge width equals 0.8. And I'll do jitter.width equals 0.3. Let's add some space. And so we now see that we have better separation between those clouds of points. And so what you hopefully notice is that for most of these taxa, the points appear to be right at zero, right? Like there's no variation away from zero for those. Porphymonus does seem to have within the people with SRNs higher values. But for most of these, it's kind of hard to see what's going on. Also for lagnosporacy, unclassified, there's just so many points there, there's about 240 points per category of SRN true or false. So what I'd like to do next is let's go ahead and map on a stat summary point range attribute so that we can see the median as well as the 50% confidence interval, the intra-cortile range. So to do that, I'll go ahead and do stat underscore summary. And we'll do fun.data equals median high low. Again, this is stuff we've seen before in the context of that Schubert data. But we're seeing it in a whole new light now, right? And so we'll do fun data, median high low, I'll do fun.args equals list, and then I'll do conf.int equals 0.5. Good. So that will give us the 50, 50% confidence interval, the intra-cortile range. The default is a 95% confidence interval. That's good. And then let's do geome equals point range. And so we get. And so it's hard to see. But you can perhaps see the point here in the middle. It's a little bit bigger, but it's the same color as my true value. And I think I again have this dodging problem. So I will add to this position equals position dodge. And let's go ahead and try dodge width equals 0.8. I think I actually don't need the dodge width. I think I just need width, and that's got a D. So now it's really hard to see because the points aren't right on top of each other, but you can kind of see that the point for the point range is a bit bigger than all of the other points. But really, you also don't see the line for these points because, again, the 50% confidence interval is probably right around zero. So what I'd like to do is a trick to make the color of my point range to be black while keeping the color of my SRN categories to be the colors that they are. To do that, the trick I'm going to rely on is that there are different plotting symbols that look the same, but that have different aesthetics with them. So we're using, I think, that a fault is 18, which is a solid circle. If we use 21, then that will get you a solid circle of a fill color, but with a border of a color color, right? So if I do fill equals SRN, and then for geom jitter, if I do shape equals 21, shape 21 again is that circle, where you've got one color for the inside for the fill and a different color for the outside. We get the same basic appearance, but now what we can do in stats summary is I can go ahead and do color equals black. And now I get a black circle and line or range, indicating the 50% confidence interval. And again, that black circle indicates the median. So I'm happy with the way this looks color wise, right? Something I'll do is let's go ahead and turn off the legend for stats summary. So I'll do show dot legend equals false. That allows me to have my colored legend. I believe the stats summary was on top of the geom jitter legend. So that looks good. One other thing I want to do is let's go ahead and put the x axis on a log scale. You may recall from other episodes that there's a trick we need to do with that, because if you take the log of a zero value, so again, if most people don't have these populations of bacteria, then they're going to have zero relative abundance, right? And so taking the log of that is going to throw off all sorts of warnings. So I will go ahead and do scale x log 10. And I'm going to come back up here to my relative abundance. And these values are in fractions. So I'm going to go ahead and modify it to be a percentage. So I will do a mutate rel abund equals 100 times rel abund. But I'm also going to add a very small value to my relative abundance. So there's about 10,000 sequences in all of my samples. So I'm going to add one over 20,000. And so that'll be a really small value added to everything. And that would be like an imperceptible change to values where, you know, somebody has a, you know, actually has that population up and I added a plus at the end of my mutate should be a pipe. So again, we get separation on the x axis of our relative abundances by putting things on that scale x log 10. We can see kind of a subtle difference for this lack no spray CA unclassified a subtle difference in the median value of its relative abundances. It's not huge, right? Like it's not something that you would like be really excited about for lack no spray CA to be a biomarker. But one of the things that I think is cool is that people with SRNs actually have less of these unclassified lack no spray CA than people without SRNs. So maybe if SRNs are actually involved in tumor genesis, it's the loss of a population. We don't know, right? We're looking at associations not causation. So you always have to keep that straight. But it does appear for the most part, these other six taxa, higher values of those taxa would indicate more likely to have a screen relevant neoplasia. Because we have a little bit of time, why don't I go through and see if we can't clean up a little bit of this appearance? I'm mainly interested in this plot to tell me, you know, it's kind of diagnostic information, right? Like what do the relative abundances look like? I'm not super anxious to kind of present this for a talk or for a paper, but at the same time, why not practice the skills that we've been developing to see if we can make this look a little bit better? One of the first things that you know are really important to me, probably too important to me are these blasted underscores in italization. So let's go ahead and take on that. To do that, we can again come back up here and I'm going to add a mutate line to mutate the taxonomy column. And so what I want to do is I want to put stars on either side of the name. And then I can use gg text with the element markdown theme argument or value to get those to be italicized. And I'll use str replace again. And I think what I'm going to first do is I'm going to put stars on either side of the names. And then I'll come back and clean up those unclassified names. So I'll take string of place on taxonomy. And then the pattern, I'm going to do parentheses dot star parentheses. So those parentheses will save whatever is being matched. And then we will then output it with a star back back one star. And so back back one is taking the value from within the parentheses and plopping it into the replacement pattern. Again, if we look at this, we now see that we've got stars on either sides of our names. I now want to clean up the case where we have those unclassified. And to do that, I'll come back up and I'll add another str replace. So we'll do taxonomy equals str replace on taxonomy. And what we'll do is we'll look for the case where we have star and then dot star underscore unclassified star, right? And so I want to save the part that's the dot star. And also because a star has a special value in a pattern, it means match the preceding character, zero or more times, but I actually want it to match the star, I then need to use those back slashes. So this will match that. And then I want to replace it with what I want to say unclassified. And then I'm going to put in a line break so I can put unclassified and then lacknosporaceae. And then I'll do star back back one star. Good. So all that's missing now is loading ggtext so that we can get that element markdown to work. So do library ggtext. And then I will add my theming and I'll do theme classic. I like that clean look without the background or the grid lines. But then we'll also use theme to add axis dot text dot y equals element markdown. Right. So now we've got our italicized labels. Pat is happy. Let's go ahead and clean up our x and y axis labels real quick. And so here what I could do would be a labs x equals relative abundance. And I'm going to put a percent sign and then why I'll say no. That's good. Let me also do scale color manual and we'll say no. And so we're going to change the coloring right. And so I want in the legend the true and then false or the false and then true. And so we'll do we'll do breaks equals C and then F comma T and then values equals C and then for false I'm going to put in gray and for true. I'll go ahead and put in Dodger blue. And then my labels I'll say healthy. And then otherwise SRN. I can again do scale color manual. I'm going to copy that down and replace the color with fill. And that way then no one will know that I have different colors for my fill and for my color and my legend there looks pretty good. So the last thing I'd like to do is flip the order that the clouds of points are being presented in so that I have healthy and then the SRN. So I can fix that by coming back up to my mutate statement here at the top of my code block where I'm generating the plot and I will make SRN a factor and give it in the order that I want those categories of SRN. So do SRN equals factor on SRN. And then I'm going to set the levels. And so the levels again are the order that I want them. So I'll say C and I'll do T comma F by default. It's alphabetical order. So it'd be false and then true. I want to flip that right. So to have true and then false. So this looks pretty good. Almost even like possibly even publication quality. Who knows. One other thing I might do would would have been to perhaps order the taxa by their average relative abundance. But again, I'm not so interested in this being for publication quality. I want it more for diagnostic or more for demonstration purposes. And the thing I want to demonstrate is again that, you know, except for unclassified lack of spray CA in people with SRNs, there is not a taxa here that is in every patient in the study. The human microbiome is tremendously patchy. And so if we were to say what population is associated with colon cancer, SRNs, well, there isn't one here, right? However, what the data suggests to us is that, you know, if you have a high level of porfumonus or a high level of Peptostracoccus or Peptinophilus or Parfumonus, right? Or Anarococcus or probably even this mogibacterium, then you're more likely to have colon cancer. And so if I were to say, you know, look for causative agents of colon cancer, I would start looking at these, right? And saying that perhaps it's not just one population that's associated with colon cancer, but, you know, either of these or any of these, and that we could mix and match them. And that's why, as we move forward through these episodes, we're going to start looking at methods that allow us to look at an ensemble or a composite set of taxa to make a diagnosis. Again, we are not looking at causation here, we're looking at association. Still, if I'm going to look for causative agents, I'm going to look for things that have associations first. I think the other thing that is pretty cool in this example at least with this unclassified lacknosporaceae, is that it appears to be the loss of lacknosporaceae that moves you from being healthy to not healthy. So it's not the presence of a bug, but it could also be the loss of an organism, right? And of course, the these statistical methods like the Wilcoxon test, they actually kind of struggle when you have a lot of zeros in your data set. And so that our median values are zero for all of these genera, indicates that more than half of the people in the study for either category do not have these bacteria, right? And so there could be populations that are even more poorly distributed that could be stronger biomarkers of an SRN, right? So there could be a population that's only in say 10% of people. But if those 10% of people have that marker, that is a sure sign that they have an SRN, right? But our statistical tests like Wilcoxon test or T test or any of these other traditional univariate statistical tools will not detect them. And so that's again, the motivation for why we need to keep going through these episodes to learn about micro open ML and how we can apply machine learning techniques to microbiome data. So if you want to jump on this train to learn all these great techniques, please be sure that you're subscribed down below and that you click that bell icon so you're notified whenever I post the next video, I'm going to be trying to post these on Monday, Wednesdays and Fridays around noon. But in case you forget, and I always forget myself when things are coming out, please be sure that you've subscribed and that you're getting notifications. And of course, please tell your friends about what's going on here. I've gotten emails from people telling me that they've learned so much because their friend or their PI told them to go check out a video. And then they send me the plot that they make. And that just makes me so happy to see. So be sure to pay it forward. And the best way that you can thank me for what we're doing here, you don't have to thank me. But the best way you can thank me if you want to is to send your send this to a friend, or to send me a plot that you've made. Anyway, keep practicing with this, please try to apply these approaches with your own data. And we'll see you next time for another episode of Codeclub.