In recent episodes of Code Club, I've been showing how we can use mikropml, an R package for implementing machine learning algorithms, to predict whether somebody has a screen-relevant neoplasia, or SRN (think colon cancer), in their colon based on the relative abundance of the bacterial genera found in a stool sample. What we found was that with L2-regularized logistic regression we could get an area under the receiver operating characteristic curve of about 0.64, while random forest models gave us an area under the curve of about 0.71. If we added the FIT result, a measure of the amount of blood in a person's stool sample, the AUC values went up by about 0.1. We might be tempted to run with the random forest models because they gave us the best area under the curve, but why might that be a bad idea? Stick around for this episode and I'll tell you why. While the ability to make a prediction, as measured by something like the area under the curve, is certainly important, another really important consideration is the interpretability of the model. Interpretability can be defined as the degree to which a human could understand the cause of a decision by looking at the model. Generally speaking, logistic regression models are highly interpretable because it's easy to directly extract the coefficients that serve as the weights for each of the features that go into the model. Random forest models aren't very interpretable because the forest is literally a forest of trees: you can think of the model as a collection of, say, 100 or 1,000 trees that each make a classification and are then aggregated to make the final classification. It's impossible to extract and synthesize all that information directly. On the far end of the continuum, things like neural networks are essentially uninterpretable.
You might think of them as a black box. There's a general trade-off between performance, as measured by something like the area under the curve, and the interpretability of the model. So we need to spend a little time talking about why we want a model in the first place. As with so many things in science, it depends on the question, right? My favorite dodge whenever anybody asks me a hard question about how to interpret data is, "it depends on your question." If your interest is more in the biological realm, trying to understand why a certain prediction was made, then interpretability is far more important. If I want to know which taxa, which genera, which features go into predicting that somebody has colon cancer, then I want the model to be highly interpretable. If, on the other hand, I'm more concerned with diagnosing people as having colon cancer, then I don't care why the model made that prediction; I want a high-performance model. Think of something like your Facebook feed or the YouTube algorithm serving you this video. Why would they show you this video? I don't know. There, you're more concerned with performance and less concerned with interpretability. But don't think that because a model is not interpretable you can't get any information out of it about how it's making decisions. There are a number of ways to get some interpretability out of a model like random forest. The approach I'll be talking about today is called the permutation importance test, which effectively re-assesses the model many times, and each time it randomly permutes one feature's values before assessing the model. That gives you a measurement of the effect of that individual feature, that individual genus, on the performance of the model.
The genera that, when permuted, are super detrimental to the performance of the model we'd say are important, right? It's not as interpretable as the weights from a logistic regression model because you see the importance, but you don't get things like the sign. With the weights, I can see whether a coefficient is positive or negative: if it's positive, the feature is predicting the presence of a lesion; if it's negative, it's predicting health. Looking at these different approaches to interpreting models is the focus of today's episode. I'm going to stick with the L2-regularized logistic regression model that we made previously, so we'll be able to compare the output of the two approaches for the same model. As always, if you want the code for this project, check out the link down in the description. I'm going to be using some output files that we generated in previous episodes. Those do take a bit of time to generate, but you can certainly do it, and I'd encourage you to check out those previous videos if you want to learn how. So we're here in a new R script that I'm going to call feature_importance.R. As always, I'll start with library(tidyverse) and get that loaded in. We're interested in the files in my processed_data directory, and as I showed you last time, we can do list.files() with path = "processed_data" and pattern = "*.Rds". What we see is about 400 files: my random forest models with and without FIT, and also my L2 genus models with and without FIT. In this episode I'm going to focus on the l2_genus Rds files without FIT. I have 100 of these files because we did 100 80/20 splits, where we trained on 80% of the data and then tested on the 20% that was held out. So looking back at the pattern, I can write "l2_genus_\\d*".
The \\d will match a digit and the * will match any number of them, right? Now I have the 100 Rds files, and each of these files contains the output of running my L2-regularized logistic regression model. I'll call these my l2_files, and I want to read them in. As I showed you in the last episode, I like to start with an example file and then scale up to all 100 files. For my test, I'll use l2_files[1], so test is one of these files. I'm noticing I don't have the path here, so I need to add an argument to list.files(): full.names = TRUE. Now I've got that full path. To read in test, I can use readRDS(); Rds is the special binary format made by R. And again, we see all this wonderful output. To get the trained model object out of this overall Rds file, I'm going to use the pluck() function; pluck() lets you pull an element out of a list. So I'll do pluck("trained_model"), which gives me that whole trained model, and I'll call this model. To get the coefficients from the model, we'll use the coef() function, which extracts coefficients from the model at some parameter value I give it. The parameter I'm going to give it is my best lambda value. Lambda is the hyperparameter that controls the regularization procedure, basically penalizing the size of the coefficients that go into the final model. So I can do model$finalModel; that is the model. If I look at that down here in R, I see a whole bunch of stuff, and thankfully we have this coef() function to extract the coefficients for the hyperparameter we desire: model$bestTune$lambda, the model's best-tuned lambda.
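Put together, the steps so far look something like the sketch below. The "processed_data/l2_genus_&lt;seed&gt;.Rds" naming is my assumption based on how the files were described in earlier episodes, so adjust the path and pattern to match your own output.

```r
library(tidyverse)

# Find the 100 Rds files from the L2 genus (no FIT) splits;
# full.names = TRUE keeps the directory on the front of each file name
l2_files <- list.files(path = "processed_data",
                       pattern = "l2_genus_\\d*\\.Rds",
                       full.names = TRUE)

# Start with one example file before scaling up to all 100
test <- l2_files[1]

# Pull the caret-trained model out of the list stored in the Rds file
model <- readRDS(test) %>%
  pluck("trained_model")

# coef() extracts the fitted coefficients at the lambda caret chose as best
weights <- coef(model$finalModel, model$bestTune$lambda)
```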
Again, we gave the model a bunch of regularization parameters to choose from, and it chose the best one. Now if we run coef() on the final model with the best lambda, what it outputs are the fitted coefficients for each of the 100 genera that went into the model. We see that it's a 101 x 1 sparse matrix of class dgCMatrix, which is a compressed sparse-column matrix format. In the first column we have an intercept value as well as all 100 genera with their weights. What we find, for example, is that Methylobacterium has a weight of about -8.9 x 10^-3, or -0.0089, so it seems to have a small negative impact: more Methylobacterium would predict that you'd be less likely to have a lesion, but it's not a really strong signal. Something like Porphyromonas, on the other hand, is positively associated with a tumor at about 0.027: if you have more Porphyromonas, you're more likely to have a screen-relevant neoplasia in your colon. Keep in mind that this is for one of our 100 splits, and over those 100 splits there may be some variation in the weighting of these parameters. So we need to clean this up so that we can work with it inside the tidyverse, read in the other 99 splits, synthesize it all together, and make a plot. The first thing we need to do is figure out how to convert this matrix into a tidy data frame, a tibble. I'd be tempted to convert it by piping the output of coef() to as_tibble(). Unfortunately, that gives an error message; we first have to convert it to a regular matrix. So we can say as.matrix() and then pipe that to as_tibble(), and what we get out are the weights in that first column. Unfortunately, we lose the names on the rows.
What you may have noticed in the output from coef() is that the first column doesn't actually have a name, and that's because it isn't a column at all; those are row names. Row names are something that are pretty much abhorred within the tidyverse. So we need to tell as_tibble() to take the row names and make them a column. To do that, we can say rownames = "feature", and that takes the row names and puts them into their own column called feature. We now have a tibble with 101 rows and two columns: feature along with those weights. I'm going to rename that second column because I don't want its name to be a number, so I'll do rename(weight = `1`); numbers as column names are just a pain, and I want to be explicit about what's in that column. All right, so now we have this tibble with 101 rows and the two columns for the feature and the weight. I'm pretty happy with the way that looks. One thing I'll do is add a mutate() to get the seed number. Again, what we had was test, the variable that holds the file name. So I'll say seed = str_replace() on test, and we'll extract that number. I had a pattern like that up here, and that pattern is basically what I want; I should put it in quotes: "processed_data/l2_genus_" followed by the \\d*, and what I want to capture is that \\d*, so I'll put it in parentheses. We'll then replace the whole match with "\\1" to get the seed out. What we get is a column that includes the seed for that file, and sure enough, we now have a seed column.
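The tidying steps above can be sketched as one pipeline. This assumes the "processed_data/l2_genus_&lt;seed&gt;.Rds" naming from earlier episodes, and the column name `1` in the rename() is what was shown in the episode; the name coef() produces may differ on your system, so check the tibble before renaming.

```r
# Tidy the sparse coefficient matrix into a tibble with a seed column
coef(model$finalModel, model$bestTune$lambda) %>%
  as.matrix() %>%                      # coef() returns a dgCMatrix; as_tibble() needs a base matrix
  as_tibble(rownames = "feature") %>%  # promote the row names to a "feature" column
  rename(weight = `1`) %>%             # give the lone numeric column an explicit name
  mutate(seed = str_replace(test,
                            "processed_data/l2_genus_(\\d*)\\.Rds",
                            "\\1"))    # capture the seed number from the file name
```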
I'll go ahead and put these on separate lines to make things look a little cleaner. Now we want to run this over all of the values of l2_files. To do that, as we saw in the last episode, I'll convert this chunk of code into a function that I can use with map_dfr() to run over all 100 of my l2_files values. So we'll write get_weights; that's my function name, and it takes file_name as its argument, the file that I'm getting out of that l2_files vector. We can tab this over for a little more white space, and instead of test I want to use file_name. Oops, I've got "functions" instead of "function". To test this, we can do get_weights(test), and we get the output we expected. Wonderful. We're in good shape now to iterate over all of our l2_files values: map_dfr() with the vector l2_files and the function get_weights, and I'll call the result l2_weights. That ran through, and now we have l2_weights: 10,100 rows, for the 100 seeds times the 101 features, the 100 genera plus the intercept. We also have the feature, the weight, and the seed columns. I don't know that we really need the seed, but it's there just in case. I don't really need the intercept either, so I'll probably end up removing it. The type of visualization I'd like for these data is something like my point-range plots, where I draw a point for the median weight for each feature and a range bar to indicate, perhaps, the interquartile range. And I don't know that I want all 100 features; maybe just the features with the largest absolute values, those being the features with the most importance in the model's classifications.
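The get_weights() function and the map_dfr() call described above might look like this sketch; again, the file-name pattern is my assumption based on the earlier episodes, and the `1` column name should be checked against your own coef() output.

```r
# Wrap the one-file workflow in a function so map_dfr() can apply it to all 100 files
get_weights <- function(file_name) {

  model <- readRDS(file_name) %>%
    pluck("trained_model")

  coef(model$finalModel, model$bestTune$lambda) %>%
    as.matrix() %>%
    as_tibble(rownames = "feature") %>%
    rename(weight = `1`) %>%
    mutate(seed = str_replace(file_name,
                              "processed_data/l2_genus_(\\d*)\\.Rds",
                              "\\1"))
}

# map_dfr() runs get_weights() on every file and row-binds the 100 tibbles
# into one data frame: 100 seeds x 101 features = 10,100 rows
l2_weights <- map_dfr(l2_files, get_weights)
```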
Those types of questions are the bread and butter of working in the tidyverse with things like dplyr and ggplot2. We'll start with l2_weights and pipe it to a filter() to remove that intercept: feature != "(Intercept)". We see that we now have 10,000 rows, which is good. We can then pipe this into ggplot() with aes(). I'm going to put my features on the y-axis and the weights on the x-axis, so x = weight and y = feature. One place to start is with stat_summary(): we'll do fun.data = median_hilow and fun.args = list(conf.int = 0.5) for that 50% confidence interval, and then geom = "pointrange". We can save this with ggsave() as l2_weights.tiff with width = 5, height = 6. What we get is a very busy plot that's really difficult to decipher. We've got all 100 features, with the weights on the x-axis, the median as a circle, and the range for the interquartile range. Instead of using stat_summary(), I'm going to go back and use the group_by() and summarize() functions to build out my own median and quartile values. Let's insert that here: group_by(feature), then summarize() with median = median(weight), and then l_quartile = quantile(weight, prob = 0.25). The quantile() function takes a vector and tells you the value at a given probability: at prob = 0.25, that's the 25th percentile, so it returns the value of weight such that 25% of the values in weight are smaller than the returned value. For the upper quartile, we use prob = 0.75. This gives us our interquartile range.
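As a quick aside, quantile() and median() are base R functions, so you can see exactly how they behave on a toy vector before trusting them on the weights:

```r
# quantile() returns the value at a given probability; the 25th and 75th
# percentiles bracket the middle half of the data
x <- 1:9

quantile(x, prob = 0.25)  # 25th percentile: 3
quantile(x, prob = 0.75)  # 75th percentile: 7
median(x)                 # the 50th percentile: 5
```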
We then get our features, our median, our lower quartile, and our upper quartile: all the values that were being represented in that plot. So let's recreate the plot before we start filtering to clean it up a bit. Instead of x = weight, we want x = median and y = feature, and then xmin = l_quartile, which is an aesthetic we can give to the geom_linerange() function, and xmax = u_quartile. And instead of stat_summary(), we'll use geom_point() and geom_linerange(). Oh, and I forgot to pipe the output of summarize() into ggplot(). We get the same plot we had before using stat_summary(), but I feel like doing my own group_by() and summarize() gives me a little more control over the data. Because what I'd like to do, again, is sort my features by their median weight, and I'd also perhaps like to remove some of the things that are close to zero. Really, I want to focus on the things at the extremes of the median range. Let's start by sorting all of our features by their median weight. We can do that up here, where after the summarize() we can do a mutate() on feature with fct_reorder(): the factor we're reordering is feature, and we'll order it by median. And don't forget that pipe. Now we get the ordering with Gemella and Porphyromonas up at the top and Lachnospiraceae unclassified at the bottom. Let's think about the range of points we're actually interested in. Again, I'm not interested in all of these, in part because I can't see them all, so I'd like to focus in. If I pick something like an absolute value of 0.02, then I'll only have about four features returned, so maybe I'll start with 0.01 and we can go from there.
So we'll go back up here, and after the mutate() I'll do a filter() on the absolute value of my median, so that -0.02 becomes 0.02, and I'll say 0.01: if the absolute value of the median is greater than 0.01, the feature will be in the output. This gives us a much more limited set of features and their weights, and we can clearly see which features are positively associated with the presence of an SRN and which are negatively associated. We see things like Dialister, Faecalibacterium, and Lachnospiraceae associated with health, whereas things like Gemella and Porphyromonas, and we see Fusobacterium up here also, are more associated with the presence of screen-relevant neoplasias, colon cancer, and the like. So this is all good. And again, what we're seeing here are more features than what we saw using the Wilcoxon test. All 100 features are important in the model, but, as we see, to varying degrees. What I'd like to do now is turn our attention to how we can look at the importance of different features when we don't have weights. Say we're doing random forest, where the model isn't interpretable through weights, and we instead use that permutation importance test. Now, something that I did that I didn't tell you about: if you look at my run_split.R file, the script that runs each of those 100 80/20 splits, you'll notice that I set find_feature_importance = TRUE. That's what runs the permutation test. It makes the model take a lot longer to run, but if you want to do this next step of looking at the permutation importance test, you need to have run with find_feature_importance = TRUE.
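Before moving on to the permutation results, the weight-plot steps walked through above can be collected into one pipeline. The 0.01 cutoff and the output file name are from this episode; treat them as starting points for your own data.

```r
# Point-range plot of the median weight and interquartile range per genus,
# keeping only features whose median weight is meaningfully far from zero
l2_weights %>%
  filter(feature != "(Intercept)") %>%
  group_by(feature) %>%
  summarize(median = median(weight),
            l_quartile = quantile(weight, prob = 0.25),
            u_quartile = quantile(weight, prob = 0.75)) %>%
  mutate(feature = fct_reorder(feature, median)) %>%  # sort taxa by median weight
  filter(abs(median) > 0.01) %>%                      # focus on the extremes
  ggplot(aes(x = median, xmin = l_quartile, xmax = u_quartile, y = feature)) +
  geom_point() +
  geom_linerange()

ggsave("l2_weights.tiff", width = 5, height = 6)
```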
If we do readRDS(), and I think I still have test in here as the variable, what you'll find is that the last data frame in the output holds the feature importance values. I think it's the third element of this list: a feature_importance data frame where we have perf_metric, perf_metric_diff, and the names, and at the bottom the actual metric that was used, the name, as well as the method and the seed. What perf_metric is telling us is: what is the area under the curve if we permute Methylobacterium? What you see is the difference between 0.619 (forgive me, I initially misread it as being off by a factor of ten) and the actual value, which was maybe about 0.625; the difference was about 0.005. That's the difference, which I've got highlighted here, between the AUC using the full data set and what we found permuting Methylobacterium. Oddly enough, for other things, like Collinsella, if you permute them the model actually gets better. That's weird, but it happens; as you can see, we have some values in here that are negative, but in general their magnitudes are pretty small, so we're not really concerned about that. I'm primarily going to be interested in the features that have a positive importance and a larger value. So what I'd like to do is, again, step through how we could make a plot similar to the one we just made with the weights, but using the perf_metric and perf_metric_diff columns from the feature_importance element of the list we get from each of those Rds files. To get that feature_importance element out, I'm going to start by just grabbing this get_weights function as a starting place. I'll call it get_feature_importance. We read in the file name, we read in the file, and instead of plucking trained_model, I want to pluck feature_importance.
And you know what, to test things out, I'll go ahead and say file_name <- test; that way I can look at what I have here, and yeah, this is the data frame of feature importance values. That all looks good. I'm actually not going to call this model; I think I'll call it feature_importance, and that way I keep straight what the data actually represent. All right, so we're not worried about all the coefficient stuff. But what I sure would like to do is convert this to a tibble, because one of the downsides of a plain data frame is that you get 100 lines of output with all the columns, and it's just impossible to look through. So maybe what we could do instead is pipe this to as_tibble(). Now we get a tibble, and it's much easier to read. I'm not interested in the method, the perf_metric_name, or the seed, so I'll grab just three columns with select(): I think I want names first, then perf_metric, then perf_metric_diff. I'm not so interested in all the stuff we previously did for the coefficients, so this now is my get_feature_importance function, which I'll go ahead and load. And as we saw up above, if you ever forget how this map_dfr() function works, these are great examples you can paste down here. I'll call the result l2_feature_importance and put get_feature_importance in place of get_weights. If we look at l2_feature_importance, we again get a 10,000-row data frame with three columns: 100 features times 100 seeds is the 10,000, and then our three columns. And again, to show you how you might riff off of your own code, I'll grab what we used before with the weights and adapt it for the feature importance. We didn't have that intercept feature here, so I'm not worried about that. We also have this column called names.
So, you know what, if we wanted to, we could rename names to be feature. Just to practice, we can do rename(feature = names), and then pipe that into our group_by() and summarize(). We don't have weight; we have perf_metric and perf_metric_diff, and what I'll put in here is perf_metric_diff. We then mutate feature to reorder it, again by that median value. Let's remove the filter for right now by commenting it out, and I'll save this as l2_feature_importance.tiff. We have our 100 features, and we see that there are a handful that are negative: if you permuted them, the model would actually perform better. We see Prevotella and Lachnospiraceae up at the top. Let's focus on those with a median value greater than, oh, let's do 0.0025. We'll add the filter(), and this time I don't want the absolute value; I want the median to be greater than 0.0025 (and I forgot a zero at first). This plot's appearance is very similar to what we had before with the weights: the median indicated with a point, the line showing the interquartile range, and our taxa on the y-axis. I want to briefly come back to this topic of interpretability. One of the nice things about the weights is that we can easily see the positive versus the negative weights; you can't quite see that in the permutation test. Lachnospiraceae unclassified was strongly associated with health, as were Faecalibacterium and Dialister, and we see Dialister and Faecalibacterium here in the top five features by the permutation importance test, along with Gemella, Lachnospiraceae, and so forth. We also see different taxa showing up in the permutation test than what we see in the weights. The way I interpret the different features we see in the variable importance plot versus the weights, again, comes down to how the test is done.
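The feature-importance side of the episode can be sketched the same way as the weights. The column names (names, perf_metric, perf_metric_diff) follow what the episode showed in mikropml's feature_importance output, and the 0.0025 cutoff is the one settled on above.

```r
# Pull the permutation-importance data frame out of each Rds file
get_feature_importance <- function(file_name) {
  readRDS(file_name) %>%
    pluck("feature_importance") %>%
    as_tibble() %>%
    select(names, perf_metric, perf_metric_diff)
}

l2_feature_importance <- map_dfr(l2_files, get_feature_importance)

# Same point-range plot as before, but using the drop in AUC when a
# feature is permuted (perf_metric_diff) instead of a regression weight
l2_feature_importance %>%
  rename(feature = names) %>%
  group_by(feature) %>%
  summarize(median = median(perf_metric_diff),
            l_quartile = quantile(perf_metric_diff, prob = 0.25),
            u_quartile = quantile(perf_metric_diff, prob = 0.75)) %>%
  mutate(feature = fct_reorder(feature, median)) %>%
  filter(median > 0.0025) %>%   # positive importance only; no absolute value here
  ggplot(aes(x = median, xmin = l_quartile, xmax = u_quartile, y = feature)) +
  geom_point() +
  geom_linerange()

ggsave("l2_feature_importance.tiff", width = 5, height = 6)
```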
The weights are the actual weights that went into those models; it's very clear-cut, because there is a formula being fit for each of these splits. The permutation importance test, on the other hand, is a bit more of a heuristic, and it's not so clear-cut what's going on in the model. But if we're working with something like random forest, then you do have to use the permutation importance test; that's just the way it works. The other thing I'll point out is that this sometimes causes us to fall back into a reductionist mindset that there's one organism associated with disease. Ooh, Gemella looks really bad, or Porphyromonas looks really bad. In reality, all of these are bad. You might not have Gemella but you might have Fusobacterium, or you might have Fusobacterium and not Gemella, and those two together might be as bad, if not worse, than Gemella on its own. To me, that's one of the cool things about these types of approaches: instead of distilling a community down to one bug that's causing the disease, or one feature (going back to that idea of one feature that indicates spam), we can look at a whole collection of organisms associated with disease or associated with health. And health is good, too, right? I would certainly encourage you, and we mentioned this in Begüm's paper, that if you have several models that perform equally well, pick the one that's the most interpretable. You don't gain anything by using random forest if it does just as well as logistic regression, because with logistic regression it's just so much easier to interpret the data. Practice with this: try to apply it to the random forest data or the genus-plus-FIT data, or heck, apply it to your own data. Wouldn't that be amazing? That'd be wonderful.
If you do that, please let me know what you find, and if you run into any problems, I'd love to hear how you're doing with this. Again, all this output is coming from the great mikropml package, which really is the foundation for running these 80/20 splits and making it easy to interface with the caret package. So if you have other questions about mikropml, by all means holler, and we'll see if we can't work those into a future episode. Keep practicing, and we'll see you next time for another episode of Code Club. Thank you.