Just because a microbial population is differentially represented between two groups of samples, does that mean it's a good tool for classifying a new sample into one of those two disease status groups? Well, that's exactly what we'll take on in today's episode of Code Club. Hey folks, I'm Pat Schloss and this is Code Club.

In the last episode, I showed how we could use the Wilcoxon test, a non-parametric test, to compare the relative abundances of samples from two different groups and identify those bacterial populations in the subjects' microbiomes that were significantly different in their relative abundance. The two groups of samples I'm looking at, and most interested in, are people who are healthy and people who have what's called a screen-relevant neoplasia (SRN) in their colons. Those are individuals who are at much greater risk of colon cancer, or who actually have it, and so we would want to refer them for further evaluation and clinical care. I'm not a clinician, but this is what I'm told.

Anyway, so that's an open question, right? We previously found seven populations of bacteria that were significantly different in their relative abundance. Are those good instruments for classifying individuals into either group? Oftentimes we see that a population, in this case Porphyromonas, is higher in people who have an SRN, or screen-relevant neoplasia, and so we think: that is the bug that's causing the disease, or that's a bug that, if you have it, means you have colon cancer. You're in trouble, right? What we saw, of course, was that most people in fact didn't have any of those populations. The microbiota are so patchy across people that a population might be a good biomarker in some individuals, but that biomarker is not universal across all people. What's most likely is that the presence or absence of an SRN is driven by dozens of different populations of bacteria, so looking at just one doesn't do what we need it to do.

Before we jump into the methods that let us look at collections of bacteria, what we're going to do in today's episode is talk a little bit about how we can evaluate the ability of an individual population, or another characteristic of a patient, to classify them as having an SRN or not. We'll start by looking at a measure called FIT, the fecal immunochemical test (FIT is much easier to say), which measures the amount of blood in a person's stool sample. You do not want blood in a stool sample; that is an indicator of something being wrong in your colon. So we'll look at that, and then we'll see if we can't apply the same approach we use for the FIT result to those seven populations. That will put us in a good position to move forward as we think about machine learning algorithms in our next episode. Don't worry, we're going to get there, but there's some groundwork we need to lay first.

As always, if you want to get caught up with where I'm at today, down below in the description for this episode there is a link to a blog post that will help you get set up. I am working on this project in RStudio; I want to see how much I can do in RStudio before I go to other tools. I'm using RStudio with version control through Git and GitHub, and in that link there are instructions for how you can get what I have.
Also, if you don't want to mess around with the version control stuff, the blog post has the code that I'm going to finish with at the end of this episode. Alternatively, you can go to the blog post for the previous episode and get the code that I'm starting with today. Anyway, the information's there, and if you have any questions, by all means leave a comment or question down below.

All right, I am looking at my genus_analysis.R script. I've gone ahead and run it already; everything works. To give you the lay of the land: I load three packages. There's the tidyverse, which is actually a collection of packages like dplyr, ggplot2, and readr, the types of things we've already used quite a bit. The broom package, which is installed with the tidyverse, allows us to tidy up statistical output; we used it for the Wilcoxon tests. And ggtext is another package that lets us style our text using markdown or HTML; we used that in the previous episode to get our taxa names italicized. We then set the seed. We load the shared file, which again has the count information for each taxon in each of the 490 samples. We have the taxonomy data, which tells us the taxonomic name for all of the genera in our analysis; again, we could be doing this at any taxonomic level we want, but I'm going to do it at the genus level. And then we load the metadata, which gives us information about the disease status of the different subjects in the study.

We define two logical columns. One is srn: these are people with advanced adenomas or carcinomas in their colons. The other is lesion: anybody that's got a regular or advanced adenoma or a carcinoma. We then joined that all together and identified the significant genera. Finally, we built a strip chart to look at the differences in relative abundance for those seven taxa that were differentially represented. Now, one of those seven, the unclassified Lachnospiraceae, was actually higher in healthy people and lower in people with SRNs, whereas the other six were higher in people with SRNs and lower in healthy people. We'll want to keep that in mind as we go forward.

I've gone ahead and run this already. What we're going to look at, again, is the composite data frame, which has our subject ID, the taxonomy, the relative abundance, the FIT result, and so forth; composite has all of the taxa. So what I'm going to take are the first two lines from where I generated the plot, because that inner_join effectively filters the rows down to the taxa that were significantly differentially represented. I'll copy that, paste it down here, and remove the final pipe for now. We see that we get our sample ID, the taxonomy, the relative abundance, the FIT result, and all this other information. I'm not so interested in all that other information. What I want is the SRN column, the taxonomy information, the relative abundance, the FIT result, and the group ID. So I'm going to tidy this up a little bit; and I shouldn't use "tidy" so loosely, "clean it up" is perhaps a better way of saying it. I'll do a select on group, taxonomy, rel_abund, fit_result, and srn. This data frame now has my group, taxonomy, relative abundance, FIT result, and SRN information.
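To make that concrete, here is a minimal sketch of the cleanup step. The toy composite and significant_genera objects below are fabricated stand-ins for the real data, and the exact column names are my assumptions based on how they're described in this and earlier episodes:

library(tidyverse)

# toy stand-ins for the real objects; the values are fabricated
composite <- tibble(
  group = c("A10", "A10", "A11", "A11"),
  taxonomy = c("Porphyromonas", "Blautia", "Porphyromonas", "Blautia"),
  rel_abund = c(0.010, 0.200, 0.000, 0.150),
  fit_result = c(1430, 1430, 0, 0),
  srn = c(TRUE, TRUE, FALSE, FALSE)
)

significant_genera <- tibble(taxonomy = "Porphyromonas")

# keep only the significant taxa, then just the columns we need
sig_composite <- inner_join(composite, significant_genera, by = "taxonomy") %>%
  select(group, taxonomy, rel_abund, fit_result, srn)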
What I'd like to do is ultimately have a column for the metric, the metric being the FIT result or the taxonomy, and a column for the score, which would be either the relative abundance or the FIT result value. To get there, I need to pivot wider to get each taxon to be its own column, and then pivot longer again so that each taxon's relative abundance becomes a biomarker alongside the FIT result, eight biomarkers in all. So we'll do pivot_wider with names_from = taxonomy and values_from = rel_abund. Now we see that we've got group, fit_result, srn, and then our seven different taxa. So that's in good shape. The next thing I want to do is bring it back to being longer. I'll do a pivot_longer with cols = -c(group, srn), because I want all the columns except group and srn, then names_to = "metric" in quotes, and values_to = "score". And so now I get four columns: the group, which is the subject ID; whether or not they have an SRN; the metric; and the score. I now have things tidy, so that I can again use my map function to create the data for what's called a ROC curve, the receiver operating characteristic curve, where we put one minus the specificity on the x axis and the sensitivity on the y axis. That's where we're going.

To help get there, for now I'm going to filter on metric == "fit_result". So now I have all of the data for the 490 subjects with their FIT result as the score. What I want to know is: as I raise the threshold on the score, do I do a better job of classifying people as having an SRN or not? I will tell you that the FIT is actually a great tool, and it's one of the most widely used tools for screening people to recommend whether they should go get a colonoscopy. The recommendation in the US, at least, is that you get a colonoscopy every 10 years, or that you get something like a FIT every year or every three years, and that if the score is over 100, you need to go get a colonoscopy. So that is the current most widely used noninvasive diagnostic in the United States, at least; other noninvasive diagnostics have been made, but the FIT is a big part of all those other diagnostics.

Anyway, let's think about how we can make a ROC curve based on the FIT result. We've actually done this before in previous episodes, where we made ROC curves based on diversity indices to predict whether or not somebody had diarrhea or whether or not somebody had C. difficile. We also used them way back when I was writing that paper on amplicon sequence variants. So we've seen some of this code before; I'm going to try to create it from scratch for you, but we might go a little bit quick. I'll be sure to put a link to one of those recent episodes across the top here if you want to see me go into it a little deeper.
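Before building the function, here is a sketch of the reshaping and the FIT-only filter just described, continuing from the toy sig_composite above; the metric string "fit_result" is my assumption about the column naming:

# widen so each taxon becomes a column, then lengthen so the FIT result
# and each taxon stack into a single metric/score pair of columns
test_data <- sig_composite %>%
  pivot_wider(names_from = taxonomy, values_from = rel_abund) %>%
  pivot_longer(cols = -c(group, srn),
               names_to = "metric", values_to = "score")

# restrict to the FIT result while developing the functions
test <- test_data %>%
  filter(metric == "fit_result")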
So I'm going to create a function that I'll call get_sens_spec, and we'll assign it with the function keyword. Then we need to give it our arguments. First, threshold; this would be like 100 for the FIT result, or any other value we want the sensitivity and specificity at. We'll also put in score, which would be the actual FIT result, and then actual, the diagnosis; this would be like the SRN column. And then we need the direction for the test, right? If higher values are associated with being positive, we want greater than; if lower values are associated, we want less than.

Okay, so we'll create the body for this. As I'm creating and testing this, I'm looking at data from the FIT result, so I'll call that output test. That lets me create some testing variables: threshold equal to 100, score as test$score, actual as test$srn, and direction as greater than. Let me load those four, so they're good to go for testing.

Now what I want to do is create a predicted vector. The predicted vector would be score greater than the threshold, right? If the score is greater than the threshold, it's predicted as true, and then we can compare the predicted with the actual. Because for that unclassified Lachnospiraceae we actually want less than, I'm going to modify my code here a little: if direction equals greater than, then I'll do score greater than threshold; otherwise, in the else, I'll do score less than threshold. That comes back with the predictions, so let me store that as predicted. So now we have predicted, and we also have actual, as a series of trues and falses.

This allows us to count things like true positives, true negatives, false positives, and false negatives. For true positives, I'll say TP, the number of true positives, is the sum of predicted and actual; where predicted and actual are both true, we add those up, since a TRUE has a value of one. True negatives will be the sum of not predicted and not actual, where they're both false. Then we need false positives, the sum of predicted and not actual, where predicted is true and actual is actually false; and false negatives, the sum of not predicted and actual, where predicted is false and actual is true. So now we have the four values of the confusion matrix, and we can print TP (111 here), the true negatives, the false positives, and the false negatives.

The first parameter we want to calculate is the specificity, which is the fraction of true negatives over total negatives; so that's TN divided by TN plus FP, the false positives. We also want the sensitivity, which is the fraction of true positives over total positives; so TP divided by TP plus FN, the false negatives. And we see that our specificity is 0.94 and our sensitivity is 0.48. So at a threshold of 100, we have roughly 50% sensitivity and 94% specificity.

Now we want to return this from the function as a table, so we'll build a tibble with specificity = specificity and sensitivity = sensitivity, and I'm going to wrap those names in quotes so it's clear they're the column names. This then outputs a small data frame. Very good. So I'm going to go ahead and comment out my test code, and let's run the function to make sure we get the same thing.
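Here is one way get_sens_spec might look, following the description above; the exact direction strings ("greaterthan"/"lessthan") are my assumption, since this is a sketch rather than the episode's code verbatim:

get_sens_spec <- function(threshold, score, actual, direction) {

  # call a sample positive when its score is beyond the threshold in the
  # direction associated with disease
  predicted <- if (direction == "greaterthan") {
    score > threshold
  } else {
    score < threshold
  }

  # the four cells of the confusion matrix
  tp <- sum(predicted & actual)
  tn <- sum(!predicted & !actual)
  fp <- sum(predicted & !actual)
  fn <- sum(!predicted & actual)

  tibble("specificity" = tn / (tn + fp),
         "sensitivity" = tp / (tp + fn))
}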
To test it, I call get_sens_spec with the threshold of 100, then test$score, which is my FIT result from the test data set, then test$srn, and then greater than for the direction, and I get back the output I had previously. So we're in great shape.

Now this works for one threshold, but I'd like to do it over many thresholds, right? To do that, I'm going to create another function that I'll call get_roc_data. For arguments, I'm going to give it my data, which I'll call x as a generic name, and I'll also need my direction. So again, to test this out, I'll set x to test and direction to greater than. What I'd like to do now is use the score column from x to make a set of thresholds, so I'll say thresholds equals unique(x$score). If I look at thresholds, I get all the unique threshold scores; there are 167 unique FIT values, out of 490 total subjects in the study.

So we're going to take thresholds, and for each value of thresholds, we're going to send it to get_sens_spec as the .x argument, like we saw in the last episode, with .x as the threshold. The other arguments, as we see here already, are x$score, x$srn, and the direction. If I run that, we get a bunch of funky output: all these sensitivity and specificity values, but in list format, because the default output from map is a list. If instead I use map_dfr, I get a table out, with my specificity and sensitivity columns, and we're in pretty good shape. We can sort our thresholds with sort(), and then if we run the map_dfr line again, our sensitivities and specificities come out sorted as well. That then gets returned from get_roc_data, and we should be ready to fold it into our main pipeline, so we don't have to just look at the test results.

So I think we have everything that we want. Let me run get_roc_data on our test data frame, with direction greater than. Before I run this, I need to comment out my test code up here and reload get_roc_data. Now if we run get_roc_data, everything looks good. So now we can come back, comment out these test function calls, and remove the test assignment. We need to get this ready to look at not just the FIT result, but also all the other taxa that we want to include in our analysis.

All right, so to do that, I'm still going to leave in the filter, but I'm going to do a nest with data = -metric. So now I have fit_result and data; if I hadn't done the filter, I'd have multiple metric rows, and we'll get there eventually, right? Now what I can do is a mutate, where I say roc_data = map, mapping over each row of data; so .x = data, since each row of data is a different metric that we're going to be looking at. We send that to the function get_roc_data, with .x as the input. But we also need the direction, right? So I need to add a column to indicate the direction for each metric I have.
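Before wiring in that direction column, here is a sketch of get_roc_data as just described, reusing the get_sens_spec sketch above; again, the names mirror the narration rather than quoting the episode's code:

get_roc_data <- function(x, direction) {

  # one threshold per distinct score, sorted so the curve draws cleanly
  thresholds <- sort(unique(x$score))

  # map_dfr row-binds the one-row tibble returned for each threshold
  map_dfr(.x = thresholds,
          ~ get_sens_spec(threshold = .x,
                          score = x$score,
                          actual = x$srn,
                          direction = direction))
}

# e.g., on the FIT-only test data
get_roc_data(test, "greaterthan")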
And so after my nest, I'll add a mutate statement to make a direction column; for now, I'll set it to greater than. Let's look and make sure: yep, now we have direction as greater than. I can then pipe this into my mutate, but instead of map, I'm going to use map2, because map2 allows us to give two input columns, right? So I can do .y = direction, and then inside the function call, direction becomes .y. I had also misspelled get_roc_data; with that fixed, we've got our metric, our data, our direction, and our roc_data. Let me then unnest the roc_data, and we get our sensitivities and specificities for the FIT result. Awesome.

Now we want to remove that filter statement so that we can look at the FIT result as well as all seven of our other biomarkers. So now if I run through the nest, I see that I've got eight metrics: fit_result and all the others. I have direction set to greater than, so instead of greater than for everything, I'll use if_else: if the metric is the unclassified Lachnospiraceae, I'm going to say less than; otherwise, I'm going to say greater than, right? And I was missing a parenthesis there. I could have made that one composite mutate, but whatever. So now I've got my greater thans and less thans set up right, and we should be able to run the whole shebang. Yes, now we've got our metric and our sensitivities and specificities. I'm going to do a select to make things a little simpler: metric, specificity, and sensitivity. And I'm going to call this roc_data.

So I'm going to take roc_data and use it as input to ggplot. In the aes, I'll make x one minus the specificity and y the sensitivity, and I'll add geom_line. I'll then save this with ggsave to figures/roc_figure.tiff with width = 6 and height = 4. Do you know what I did wrong? Yeah, I forgot to include an aesthetic for my different metrics, so I can then do color = metric. Good. So we now have separate lines for each of our biomarkers: the seven genera as well as the FIT result.

A couple things I'd like to do. I'd like to put in a 45 degree line to show what is effectively random classification, and I'd also like to add a dummy row for each metric, so that all the curves go out to a specificity of zero and a sensitivity of one. I'll go back up to get_roc_data and add a row with bind_rows, giving it a vector with specificity = 0 and sensitivity = 1. I'll reload get_roc_data, and now all my curves go out to that upper right corner. Let's go ahead and throw on a diagonal line with geom_abline, where the slope is one and the intercept is zero; that gives us the 45 degree line. We can make it not so pronounced with color = "gray", and we can add theme_classic. Good. So now we have that gray 45 degree line indicating where random classification would fall.

We see that the FIT result actually does quite well: at about 95% specificity, we'd get about 60% sensitivity, which is pretty good, and far better than any individual taxon on its own.
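Pulling that all together, here is a sketch of the final pipeline and plot, continuing from the toy objects above. The Lachnospiraceae label is my assumption about the exact metric string in the real data, and the dummy-row tweak to get_roc_data is shown as a comment:

# inside get_roc_data, after the map_dfr call, append the dummy point so
# every curve reaches the upper right corner:
#   ... %>% bind_rows(c(specificity = 0, sensitivity = 1))

roc_data <- test_data %>%
  nest(data = -metric) %>%
  mutate(direction = if_else(metric == "Lachnospiraceae Unclassified",
                             "lessthan",
                             "greaterthan"),
         roc = map2(.x = data, .y = direction,
                    ~ get_roc_data(.x, .y))) %>%
  unnest(roc) %>%
  select(metric, specificity, sensitivity)

roc_data %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = metric)) +
  geom_abline(slope = 1, intercept = 0, color = "gray") +
  geom_line() +
  theme_classic()

ggsave("figures/roc_figure.tiff", width = 6, height = 4)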
Something we could do is calculate the areas under these curves; I'm not going to do that in this episode. I do feel a little bit weird about extending all these lines out to that corner, a specificity of zero and a sensitivity of one, because we don't really have data out that far. The reason the curves go out there is, again, that for many of these taxa, both people with and without SRNs had zero relative abundance of those populations. So we don't truly reach a specificity of zero and a sensitivity of one for any of these biomarkers, except perhaps for the Lachnospiraceae.

This is a good foundation for helping us think about the performance of ensemble techniques, where we look at collections of biomarkers: collections of genera, or collections of genera plus FIT, to see how those models perform relative to any one biomarker on its own. As we can see, FIT actually does pretty well; it's a high specificity, if perhaps low sensitivity, test. But these biomarkers on their own, even though they are significantly associated with SRNs, are not great for classifying people, right? That's something we really want to work to improve, and see whether, by combining multiple biomarkers, we can improve our classification. We are going to start working on those methods in the next episode.

And if you want to be sure that you see that next episode, hit the thumbs up so that the algorithm knows that you love me and my videos. Even better would be to go ahead and hit that subscribe button as well as the bell icon, so that you are notified when the next video is released. Anyway, keep working with this material, with the data I've given you as well as your own data; that is how you will learn this material best. And we'll see you next time for another episode of Code Club.