When it comes to assessing the performance of a machine learning algorithm, there are a lot of different approaches you can use. The one we've been using in previous episodes is the area under the receiver operating characteristic curve, or simply the AUC, the area under the curve. Unfortunately, the AUC is a bit of a blunt tool. Say we had an AUC of 0.7. There are many differently shaped curves that would all give an AUC of 0.7, suggesting those models all perform the same. However, at a specific specificity, one of those models might perform considerably better than the others. So my goal in this episode is to explore an alternative approach that lets us compare our models at specific specificities.

If we were to use a machine learning algorithm to predict who does and doesn't have colon cancer, as we've been doing in these recent episodes, the output from our model is a probability that each person has cancer. By default, all of the reports and metrics the software gives us assume a threshold of 0.5: above 0.5, we assume the person has cancer; below that, we assume they're healthy. But 0.5 isn't necessarily the best value. It seems sensible, but it's not necessarily appropriate, so we need to figure out what the appropriate value is. As an example, in the United States, for the FIT result, which is a measure of the amount of blood in a person's stool sample, if you have more than 100 units of blood in the sample, you're referred for colonoscopy because you're at higher risk; below 100, you're considered healthy. There's a trade-off between specificity and sensitivity, and based on that balance, some smart people figured out that 100 was the right threshold. My understanding is that in Europe they use a different threshold, maybe around 50 units, which reflects a different weighting of what's important in terms of sensitivity and specificity. So it becomes very difficult to compare different models and approaches when they all sit at different specificities or sensitivities. What we need to do is pick a specific specificity or sensitivity, let the other parameter vary, and then compare the models directly.

We've been talking about the area under the receiver operating characteristic curve, which is this big composite value. What we should do instead is go back and look at the curve itself. If we look at the curve, we can pick a specificity and then, by tracing that line up, compare the sensitivities of our various models directly. That's exactly what we're going to do in today's episode of Code Club. We've got a lot of ground to cover, so let's get to it.

You'll recall that we've created two different types of models: one was L2-regularized logistic regression, and the other was a random forest. For each of those two modeling approaches, we used different sets of features. In one set, we used only the microbiome data; in the other, we used the microbiome data plus the FIT result. So between the two types of features and the two types of models, we have four combinations, and the output I want is a figure with four receiver operating characteristic curves.
Now the hitch here is that we have 100 seeds for each of those four modeling combinations, which means we have 400 total receiver operating characteristic curves that we somehow need to synthesize down into four. So there's a lot to cover, and as I mentioned, we need to get moving.

The first step I want to take on is building a single ROC curve from one of those 80/20 splits. Again, we trained the model on 80% of the data and then evaluated the best fit on the held-out 20%. So let's look at the sensitivity and specificity curve, the ROC curve, for one example file. I'll do readRDS("processed_data/l2_genus_1.RDS") — we'll come back and read in all 400 files eventually. I'll call the output of readRDS() model. With model, we can pull out the trained model and use it as input to the predict() function, applying it to the test data with model$test_data and type = "prob". What type = "prob" does is output the probability that each sample in the test set was healthy or had a screen-relevant neoplasia (SRN; again, think cancer). We had about 500 total samples, so with an 80/20 split the test data should contain somewhere around 100 samples. Looking at the result, we get a data frame with a row for each of the 97 or so test samples and the probability of each being healthy or an SRN. I'll call this prob, for the probabilities.

The other bit of information I want is the observed data. What was actually observed — what was the classification clinically, not by the model? To get that, we can do observed <- model$test_data$srn, which gives us all those clinical classifications. We want to bring these together so we can look at the probability of an SRN alongside the observed diagnosis, and from there work out the sensitivity and specificity at each probability threshold for making the best possible classification. So I'll make a variable called prob_obs and do bind_cols(prob, observed = observed). If we look at prob_obs, we see those 97 rows with two columns of probabilities, for healthy and SRN, plus the observed column. I'm only interested in the SRN probability and the observed diagnosis, so I can do select(srn, observed). Looking at prob_obs, we now have those two columns of data. Again, my strategy when working with a whole bunch of data is to simplify things as much as possible and break down the individual steps. We're working with this one file here, but eventually we'll ramp it up to all 400 files.

Now that we've got the probability thresholds and the observed diagnoses, I want to figure out the sensitivity and specificity for each of these probabilities. I'll take prob_obs and do an arrange(), sorting the data frame in descending order by the SRN probability column. At the top we have a very high probability of an SRN, 0.67, and all the way down at the bottom we have 0.35, a low probability of an SRN.
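Here's roughly what that first chunk looks like as a script. This is a sketch based on the narration: the exact file name, the srn/healthy column names, and the $trained_model/$test_data structure of the saved object are as described in the episode, so adjust them to match your own files.

```r
library(tidyverse)

# One of the 400 saved runs: L2 logistic regression, genus-level features, seed 1
model <- readRDS("processed_data/l2_genus_1.RDS")

# Probability that each held-out test sample is healthy or has a
# screen-relevant neoplasia (SRN)
prob <- predict(model$trained_model, model$test_data, type = "prob")

# The observed clinical diagnosis for those same samples
observed <- model$test_data$srn

# Keep the SRN probability and the observed diagnosis, sorted so the
# most confident SRN calls are at the top
prob_obs <- bind_cols(prob, observed = observed) %>%
  select(srn, observed) %>%
  arrange(desc(srn))
```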
But of course, we see that there is an SRN down there too, right? To calculate the sensitivity and specificity, we need to know, for each of these thresholds — each of these probabilities — how many positive values sit at or above that threshold and how many negative values sit at or above it. So if we pick 0.6356, the second row, then we have two positives and no negatives; those two positives are true positives. And if there were any negatives — healthies — above 0.635, those would be false positives.

To calculate those, I'm going to use a trick I've used before, which leverages the fact that TRUE has a value of one and FALSE has a value of zero. I can do a mutate() and create a variable called is_srn, defined as (observed == "srn"); I'll put that in parentheses just to make it a little clearer. Now we have a column full of TRUEs and FALSEs. That allows me to create a variable I'll call tp, for true positives, using the function cumsum(), which takes the cumulative sum of the is_srn column going down the rows. And now we see that these top rows are all true positives: at this value or above, we have four TRUE values, so the number of true positives at that threshold is four. What we need to handle is the fifth case, where we have a high probability but the person is healthy — that would be a false positive. For that we can do fp = cumsum(!is_srn). The exclamation point means "not": TRUE becomes FALSE and FALSE becomes TRUE, so that healthy row becomes TRUE, and sure enough, over in our fp column we get a one. That is our number of false positives.

So we have columns for the true positives and the false positives. To get the sensitivity and specificity, we also need the total number of positives and the total number of negatives in the data set. To get that, we can take prob_obs and do count(observed), which gives us healthy and SRN with their n's. I'd rather have healthy and SRN as columns so they're easier to work with in the code chunk below, so I'll do a pivot_wider() with names_from = observed and values_from = n. Now we have a one-row table with healthy and srn columns, which I'll call total. With that, we can say sensitivity = tp / total$srn, which gives us our sensitivity values. Instead of calculating the specificity directly, I'm going to get the false positive rate, fpr = fp / total$healthy, and then specificity = 1 - fpr. Great — we can clean this up a bit with a select() on sensitivity, specificity, and fpr, which gives me the three columns I'll need going forward.

At this point, we have the sensitivity and specificity data we need to create a ROC curve for this initial file. But again, we've got four different models and 100 seeds for each of them, so it's time to think about how we compartmentalize this — chunk it apart into functions — so we can easily replicate it across all 400 RDS files.
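Pulling that together, the sensitivity/specificity chunk looks something like the sketch below, following the narration. The sens_spec name for the result is mine; the chunk was freestanding in the episode.

```r
# Total number of observed positives and negatives in the test set,
# spread into one row with healthy and srn columns
total <- prob_obs %>%
  count(observed) %>%
  pivot_wider(names_from = observed, values_from = n)

# TRUE counts as 1 and FALSE as 0, so a cumulative sum down the sorted
# probabilities counts the positives/negatives at or above each threshold
sens_spec <- prob_obs %>%
  mutate(is_srn = (observed == "srn"),
         tp = cumsum(is_srn),   # true positives at or above this row's probability
         fp = cumsum(!is_srn),  # false positives at or above this row's probability
         sensitivity = tp / total$srn,
         fpr = fp / total$healthy,
         specificity = 1 - fpr) %>%
  select(sensitivity, specificity, fpr)
```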
The first thing I'm looking at is the chunk of code we just used to calculate the sensitivity, specificity, and false positive rate. I'm going to cut it, move it up to the top, and use it to create my first function. We'll call it get_sens_spec_lookup — I'll explain the name in a moment — make it a function of data, and add a closing curly brace at the bottom, tabbing everything in. Instead of prob_obs, the body now refers to data. We can test this with get_sens_spec_lookup(prob_obs), and we get the same output. I'll assign that output to a variable called sens_spec_lookup.

What I'm thinking about is that I'm going to have the same kind of data for 100 splits for each of our modeling comparisons, and I don't anticipate the same specificities showing up in all of them. So imagine I wanted, say, a specificity of 0.97. What would the sensitivity be? It should be 0.133. But how do we get R to know that? I think this will take a few tricks, and I have a pretty good idea of how to do it. We can take sens_spec_lookup and, for the time being, do a mutate() on diff = specificity - x, where x is the actual specificity I want; I'll set x to 0.97. Now we can look at our specificities, and we've got this fourth column with the difference, and there's a point at which it goes from positive to negative. What I really want are the rows where the difference is positive, because among those I want the largest sensitivity. Does that make sense? So if we look at the diffs that are positive — these four rows — what I want is the largest sensitivity among them, and that's the sensitivity I want to output.

Instead of the diff column the way I've written it, I'm going to write this as a filter() instead. I don't need diff itself; I can filter on specificity - x > 0, and I get those same rows. Now I can do top_n() with n = 1 on sensitivity, and I get one row out at the bottom. I'll then mutate my specificity column to be x and my fpr column to be 1 - x. This seems like a nice encapsulated piece of functionality, so let's make it a function that takes in that specificity threshold. We'll call it get_sensitivity, feed it x as well as our data frame, which I'll call data, and use data in place of sens_spec_lookup in the body, adding some white space and closing it out. Running get_sensitivity(0.97, sens_spec_lookup) — again, 0.97, just to make sure we get the same value — we get the values we saw before. And we could change it to, say, 0.9 and get a different sensitivity and false positive rate for that 0.9. Hopefully you're seeing where we're going with this: we can iterate get_sensitivity over a whole bunch of specificity values. That's exactly where we're headed, so I'm going to grab this function and move it up to the top as well.
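As a sketch, here's what the two functions look like at this point, before the fixes we'll make in a moment. The names follow the narration; the function bodies are my reconstruction of what's described above.

```r
# Sensitivity, specificity, and false positive rate at every observed threshold
get_sens_spec_lookup <- function(data) {

  total <- data %>%
    count(observed) %>%
    pivot_wider(names_from = observed, values_from = n)

  data %>%
    arrange(desc(srn)) %>%
    mutate(is_srn = (observed == "srn"),
           tp = cumsum(is_srn),
           fp = cumsum(!is_srn),
           sensitivity = tp / total$srn,
           fpr = fp / total$healthy,
           specificity = 1 - fpr) %>%
    select(sensitivity, specificity, fpr)
}

# Largest sensitivity among the rows at or better than the desired specificity x
get_sensitivity <- function(x, data) {

  data %>%
    filter(specificity - x > 0) %>%  # strict >; we'll revisit this below
    top_n(n = 1, wt = sensitivity) %>%
    mutate(specificity = x, fpr = 1 - x)
}

sens_spec_lookup <- get_sens_spec_lookup(prob_obs)
get_sensitivity(0.97, sens_spec_lookup)
```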
I always like to have my functions at the top of my code. What we can do now is create a specificity vector, seq(0, 1, by = 0.01), and then do map_dfr() over specificity with get_sensitivity, also passing in sens_spec_lookup. That creates, as I said, a data frame with all of my specificities and their sensitivity values.

Something I'm noticing in this output is that we have 287 rows when we should only have one row per specificity value, and some specificities appear duplicated. So let's take that 0.88 and do some debugging. If I replace the 0.90 in the get_sensitivity call with 0.88, we do get two rows out. To solve that, we can come back up to our get_sensitivity function and add a distinct() line, which returns only the distinct, or unique, rows from the data frame. Now if we rerun at 0.88, we get one row. That's great. We can delete those test lines and rerun map_dfr(), and now we get 100 rows, one per specificity. One other problem I notice is that the output ends at 0.99, not 1. Coming back to the code, we can fix that by changing the filter in get_sensitivity to greater than or equal to. If we rerun, we now have the sensitivity and specificity for all 101 values of specificity from zero to one, in steps of 0.01.

Now we'd like to replicate that across all 400 of our RDS files. Because we currently have lines 31 to 44 as free-standing code, I want to turn that into a function too; then we can use map_dfr() to iterate it over all of the files, and life will be easy, hopefully. So I'll say get_pooled_sens_spec, give it file_name, and also give it the specificities. We add a curly brace, grab all this code, tab it in a notch, and work in some nice white space — I feel like Bob Ross when I'm talking about white space. I'll bring the specificity argument down into the body, and now we have this function, replacing the hard-coded file with file_name. Create that function in memory, and we can run get_pooled_sens_spec() on that file along with specificity, which I'll define before the call. We now have a function we should be able to iterate over all 400 of our files.

One little thing I notice, though, is that the data frame it outputs doesn't tell us anything about the type of model it came from. We could theoretically get 101 times 400 rows and not know where each came from. So we need a mutate() to tag each row with the type of model that generated those data. At the end of the map_dfr() call, I'll do a mutate() and say model = str_replace() on the file name. The pattern will be "processed_data/" followed by a capture group for the first part of the name, then the seed number — \\d to match a digit, with * for any number of digits — and ".RDS". We'll replace the whole thing with "\\1"; again, \\1 is whatever was matched inside those parentheses.
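With the distinct() and >= fixes in place, the updated get_sensitivity() and the new pooling function look roughly like this. The str_replace() pattern is my reconstruction of what's described above, so check it against your actual file names.

```r
# Fixed version: >= keeps specificity == 1, distinct() drops duplicated ties
get_sensitivity <- function(x, data) {

  data %>%
    filter(specificity - x >= 0) %>%
    top_n(n = 1, wt = sensitivity) %>%
    distinct() %>%
    mutate(specificity = x, fpr = 1 - x)
}

# Sensitivity at each desired specificity for one RDS file, tagged by model
get_pooled_sens_spec <- function(file_name, specificity) {

  model <- readRDS(file_name)

  prob <- predict(model$trained_model, model$test_data, type = "prob")
  observed <- model$test_data$srn

  prob_obs <- bind_cols(prob, observed = observed) %>%
    select(srn, observed)

  sens_spec_lookup <- get_sens_spec_lookup(prob_obs)

  map_dfr(specificity, get_sensitivity, sens_spec_lookup) %>%
    mutate(model = str_replace(file_name,
                               "processed_data/(.*)_\\d*\\.RDS", "\\1"))
}

specificity <- seq(0, 1, by = 0.01)

get_pooled_sens_spec("processed_data/l2_genus_1.RDS", specificity)
```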
And now we see we've got l2_genus in that final model column, so we're ready to iterate this function over all 400 of our RDS files. To get those files, we can do list.files() with path = "processed_data", pattern = ".*\\d*\\.RDS", and full.names = TRUE. Now we see those 400 files, so the pattern matched, and we're ready to feed them into map_dfr() along with our function name and specificity.

Okay, processing those 400 files took about five minutes to run. My code certainly isn't fast, and there are probably ways we could speed it up a bit, but for now I'm going to keep going. One thing I totally screwed up, though: I forgot to save the output to a variable name. Well, you have been very patient, and you are still watching this video, so I'm going to give you an extra special tip for watching all the way through. See, aren't you glad you keep watching? This is something I do very regularly. There is a special variable called .Last.value. What do you think .Last.value holds? The last value, right? So I'm going to assign .Last.value to pooled_sens_spec, and now pooled_sens_spec has all that output. This isn't totally reproducible — at the end I'll still want to rerun everything together to make sure it works — but it saves me another five or six minutes of rerunning right now. Again, that cool tip was .Last.value, which gives you the output of the last statement you ran in the console. Aren't you glad you kept watching? Good job.

Now we'd like to plot our data. But before we can make the plot, we need to group and summarize it, grouping by the modeling approach as well as the specificity. What I want is a line for the median sensitivity along with the interquartile range of the sensitivity. So let's get to that: we'll pipe this into group_by(model, specificity) and then summarize, with sensitivity = median(sensitivity), the lower quartile lquartile = quantile(sensitivity, prob = 0.25), and the upper quartile, uquartile, with prob = 0.75. Then I'll do .groups = "drop". In the output we now have the model, the specificity, the sensitivity, and the lower and upper quartiles, and we're in good shape.

Now we're ready to plot the data. We can feed this into ggplot() with aes(x = 1 - specificity, y = sensitivity, ymin = lquartile, ymax = uquartile), and then we'll pipe that into — I guess it's not pipe, we'll add — geom_line(), and we'll also do geom_ribbon(). I want color, so I'll add another aesthetic, color = model, and maybe put these things on separate lines. I'm going to put the ribbon behind the line, and I'll also do an alpha of 0.25: I don't want the ribbon to be the same intensity of color as the line; I want the line to pop out. We'll save this with ggsave() to figures/roc_curve.tiff with width = 5 and height = 5. Wonderful, we have our plot with our four ROC curves.
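Before we debug the plot, here's the pooling and .Last.value rescue from the first half of this step as a sketch; rds_files is my name for the file vector, which went unnamed above. The summarize and plotting code appear in their final form after the fixes below.

```r
# All 400 RDS files: one per model type / feature set / seed combination
rds_files <- list.files(path = "processed_data",
                        pattern = ".*\\d*\\.RDS",
                        full.names = TRUE)

# Slow (~5 minutes) -- and I forgot to assign the result...
map_dfr(rds_files, get_pooled_sens_spec, specificity)

# ...so rescue it from .Last.value rather than rerunning everything
pooled_sens_spec <- .Last.value
```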
One thing I noticed is that geom_ribbon() uses the fill aesthetic for the color of the ribbon, not color — I think color is actually the boundary. So let's do fill = model. And that didn't change anything, which is interesting, because the legend is showing the right colors for the fill as well as the border and the line. What I think is actually happening is that in my summarize() statement, because sensitivity = median(sensitivity) comes first, the quantile() calls are operating on that single summarized sensitivity value, not on the full column. Pretty sure this is what's happening: I need to grab that line and move the sensitivity calculation down below the quartiles. And now we have our ribbons of color and our solid lines. It turns out the color aesthetic on a ribbon is the border of the ribbon, and I don't want that. So I'll put fill = model in an aesthetic for geom_ribbon() and color = model in an aesthetic for geom_line(). That looks a lot better.

One other thing we could do: let's add theme_classic(), and let's also move the legend inside the plot, since we don't need all that white space over by the legend. We do theme(legend.position = c(0.8, 0.2)), giving it coordinates on the x axis and the y axis. It looks a lot better with the legend inside; it gives us more space for the plot within the panel.

One thing I'm noticing that I want to check out: if you look at this purple line for the random forest genus-with-FIT model, is that a diagonal? ROC curves are usually stair-stepped. So let's try something I haven't done a lot, because I don't use a lot of these types of plots: instead of geom_line(), let's do geom_step(). geom_step() gives our curve a step shape rather than a diagonal line, and sure enough, we now get the stepped feature you would expect for something like a receiver operating characteristic curve. That looks a lot better. I don't use geom_step() a whole lot, so it's cool to see it used here, but I think it's the more appropriate geom than geom_line(). You'll find a sketch of the finished summarize and plotting code just below.

Again, what I hoped to show you is that we can look at a receiver operating characteristic curve and, instead of the composite area-under-the-curve metric, look at each of the individual curves. We can pick a desired specificity — say 0.8 — draw up from it, and see what the sensitivities of our models are. That lets you begin a dialogue about the relative performance of the models, as well as about what's more important, specificity or sensitivity. Those are trade-offs, and in some cases there are no right, or certainly no easy, answers. I know within the oncology field they have constant debates over these types of things — what's more important, specificity or sensitivity — and it's just hard, right? If nothing else, I hope this shows you how I think through a complicated problem: how do we take 100 ROC curves per model that all have different values for their specificity, and bring them together so we can reflect the variation across the 100 seeds for four different models? Thanks again for sticking through. I hope you enjoyed that .Last.value tip; I know I use it a lot.
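Putting the fixes together, the final summarize and plotting code looks roughly like this. It's a sketch; the output file name follows the narration, and the legend coordinates are the ones chosen above.

```r
pooled_sens_spec %>%
  group_by(model, specificity) %>%
  summarize(lquartile = quantile(sensitivity, prob = 0.25),
            uquartile = quantile(sensitivity, prob = 0.75),
            sensitivity = median(sensitivity),  # last, so quantile() sees the full column
            .groups = "drop") %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity,
             ymin = lquartile, ymax = uquartile)) +
  geom_ribbon(aes(fill = model), alpha = 0.25) +  # fill colors the ribbon; color would be its border
  geom_step(aes(color = model)) +                 # geom_step gives the stair-stepped ROC shape
  theme_classic() +
  theme(legend.position = c(0.8, 0.2))            # move the legend inside the panel

ggsave("figures/roc_curve.tiff", width = 5, height = 5)
```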
If you've got other tips for other people, by all means feel free to put them down below in the comments. I'd love to hear what kind of tips you know about that you don't think anyone else knows about. And we'll all learn from it. Anyway, keep practicing with your R coding skills. And we'll see you next time for another episode of Code Club.