So if you've been following along in recent episodes of Code Club, you know that so far we've made models using L2-regularized logistic regression, and we compared a model using only genus-level microbiome relative abundance data against a model built using that same relative abundance data plus the FIT (fecal immunochemical test) result. We know that using the FIT result alone has an area under the receiver operating characteristic curve (AUC) of about 0.65. What we found was that adding the microbiome to the FIT does better than FIT alone, as well as better than the microbiome alone. So at least using L2-regularized logistic regression, we're not going to totally scrap the FIT; we're going to combine that information to make a better model. In the previous episode, I showed you how we could use mikropml to adapt the code we built with L2-regularized logistic regression to build random forest models. What I'd like to do with you in this episode is create a figure that allows us to easily compare the performance of our different modeling approaches; then, within each modeling approach, to compare the different sets of features, whether we include the relative abundance alone or the relative abundance plus the FIT; and then, within each of those feature sets, to compare the performance on the training set and the testing set. Along the way, we also want to incorporate the variation in the data across those 80/20 splits. As I talked about in the last episode, it's really important to keep our code DRY.
And so in this episode, I'll be revising our code at the end to make sure that it's as DRY as possible, with the use case that down the road we might want to go from having two different modeling approaches, again, L2-regularized logistic regression and random forest, to adding a third or fourth or fifth algorithm to see if we can't improve the performance over, say, the random forest model that I think will ultimately prevail. To kick things off, I'll start with library(tidyverse) to get that package loaded. Something else I'm thinking about is that I've got these performance.tsv files in my processed_data directory, so I can use the function list.files(path = "processed_data"). And if we run that... I misspelled that, processed_data. This is a handy-dandy function that is basically the same as, say, ls from your command line: it shows us all of the files that are in processed_data. I can add a pattern argument, where I give the part of the file name that I want to match, so I can say pattern = "performance.tsv". And this then gives me my four performance files, right? I've got my L2 genus plus FIT, my L2 genus, my RF genus plus FIT, and my RF genus. Good. So I now have those four files, and what I'd like to do is read those in and ultimately concatenate the four files together so we can then make a figure. All right. So we can do read_tsv(), and we can give it "processed_data/" followed by the L2 genus FIT performance file name, and I'll assign this to l2_genus. Okay, and we're going to have several of these. This one will actually be l2_genus_fit, and here I'll remove the fit. Then we'll copy these down for the RF versions. And you might be thinking, "Pat, that's not DRY, you've repeated the same line of code four times, except for the name of the file." Hold on to that; we'll come back to it. Okay. So again, we have these four files that are now read in.
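To make the list.files() step concrete, here's a minimal, self-contained sketch. The four file names are my assumption about what sits in processed_data (I'm writing empty stand-ins into a temp directory so the example runs anywhere):

```r
# Sketch of the list.files() step; the four file names below are assumed
# stand-ins for what is in the real processed_data directory.
dir <- file.path(tempdir(), "processed_data")
dir.create(dir, showWarnings = FALSE)
files <- c("l2_genus_fit_performance.tsv", "l2_genus_performance.tsv",
           "rf_genus_fit_performance.tsv", "rf_genus_performance.tsv")
file.create(file.path(dir, files))

# pattern is matched anywhere in the file name, so all four files return
perf_files <- list.files(path = dir, pattern = "performance.tsv")
length(perf_files)  # 4
```

Note that pattern is interpreted as a regular expression matched against each file name, which is why "performance.tsv" is enough to pull out all four files.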
And one thing that we will need is a variable that allows us to distinguish these four different types of files, right? If I look at, say, rf_genus and do select(method), I see it's all "rf". And if I do the same thing with rf_genus_fit, I see the same thing, right? So I'd like to modify my method column to indicate that this is actually RF genus FIT, not just RF. Okay. So what I can do is go ahead in here and say mutate(method = "l2_genus"). And again, I can copy this down. You might be saying, "Pat, this is not DRY," and you're right. This one gets _fit. And let's see, this should be l2, not lt. Then this is rf_genus, and again we can copy that to get rf_genus_fit. Let's put in some white space here so it's a little easier to read, and we can then load all that. And again, if I do rf_genus_fit and select(method), I now see I've got "rf_genus_fit". If I do the same thing with rf_genus, I see "rf_genus", and we can do the same with l2_genus and l2_genus_fit. Okay. Now what we can do is bind_rows() on these four data frames, right? That will take these four tibbles and concatenate them together. So I'll copy in these names, ending with rf_genus_fit. And now I have a data frame that's got 400 rows and 17 columns. I'm not interested in all 17 columns, so to make things easier to see and work with, I'm going to do a select(). I'll select seed, which again goes from one through 100; method; cv_metric_AUC; and AUC. The cv_metric_AUC is the training AUC and the AUC is the test AUC. Okay. So again, we have our seed, our method, and all that. Let's rename these columns so they're a bit easier to work with. We've seen this before: we can do rename(training = cv_metric_AUC), that's capital AUC, and then testing = AUC.
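Here's a compact sketch of the bind_rows(), select(), and rename() steps. The two toy tibbles stand in for the four files read from disk (the AUC values are made up, and the real files have 17 columns):

```r
library(dplyr)

# Toy stand-ins for the tibbles read from the performance files;
# values are made up for illustration.
l2_genus <- tibble(seed = 1:2, method = "l2_genus",
                   cv_metric_AUC = c(0.68, 0.70), AUC = c(0.71, 0.69))
rf_genus <- tibble(seed = 1:2, method = "rf_genus",
                   cv_metric_AUC = c(0.74, 0.73), AUC = c(0.76, 0.75))

combined <- bind_rows(l2_genus, rf_genus) %>%       # stack the tibbles
  select(seed, method, cv_metric_AUC, AUC) %>%      # keep only what we need
  rename(training = cv_metric_AUC, testing = AUC)   # friendlier column names
```

With the real four files and 100 seeds each, the stacked data frame comes out to 400 rows, as described above.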
And now we'll see that our columns are simplified to training and testing, and we've got seed and method. What I'm thinking about, again, is a plot where across the x-axis I have whether the data were training or testing, whether this was from an L2 logistic regression model or a random forest model, and whether I'm using genus or genus plus FIT. To get there, I need the AUC values in a single column, so I'm going to use pivot_longer() to gather together the training and testing columns. So again, we'll do pivot_longer(cols = c(training, testing), names_to = "training_testing", values_to = "AUC"). I need an equals sign in there. All right. So now we don't really need that seed column; we've got method, training_testing, and AUC. Maybe I'll go ahead and remove the seed just to keep things simple. Right. And so now that we've got the method, the training_testing, and the AUC, we can fire this into ggplot(). In the aes() call, across the x-axis I'm going to put my method, my y is going to be the AUC, and then I'll do, say, fill = training_testing. We can add a geom_boxplot(), and we can then ggsave() this as model_compare.tiff; I'll put it into figures/ and make the width, say, 8, and the height 5. So this is a good first step for our figure. There are a few things I don't like about it that I want to work through with you. The first thing I don't really like is that I have testing and then training. If you think about the process of building a model, it goes training and then testing, so I want to use a factor to flip that order. I can come back into my code, and after the pivot_longer(), I'll do a mutate() to recast training_testing as a factor where I set the levels to be training and then testing.
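The pivot and the first-draft box plot can be sketched like this, again with made-up toy values standing in for the real performance data:

```r
library(tidyverse)

# Toy version of the renamed data frame (values are made up)
perf <- tibble(seed = rep(1:2, 2),
               method = rep(c("l2_genus", "rf_genus"), each = 2),
               training = c(0.68, 0.70, 0.74, 0.73),
               testing  = c(0.71, 0.69, 0.76, 0.75))

long <- perf %>%
  select(-seed) %>%                                  # drop the seed column
  pivot_longer(cols = c(training, testing),
               names_to = "training_testing",        # old column names go here
               values_to = "AUC")                    # old column values go here

p <- ggplot(long, aes(x = method, y = AUC, fill = training_testing)) +
  geom_boxplot()
# ggsave("figures/model_compare.tiff", p, width = 8, height = 5)
```

After pivoting, each row holds one AUC value tagged by method and by whether it came from training or testing, which is exactly the shape ggplot() wants.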
So we will then do mutate(training_testing = factor(training_testing, levels = c("training", "testing"))), and we need to pipe that into the rest of everything. Now we have the training on the left and the testing on the right. What we see is that, in general, in this case the testing AUC is actually a little larger than the training, which tells you that the data are not overfit. The next problem I want to take care of is that I'm not such a fan of these box plots. As I've talked about in the past, I don't think most people understand what the boxes with the whiskers represent. So instead, what I'd like to do is draw a point-range plot, where we draw a point at the median and then a range for the 50% confidence interval, the interquartile range. To do that, we'll come back to our code. For now, I'm going to comment out geom_boxplot(), and we'll then do stat_summary(fun.data = median_hilow, geom = "pointrange"). So we've got our point range, but two things are going on here. First, our data are right on top of each other; it's not separating things out by training and testing. That might be because we are setting the fill rather than the color, so we want to change that, but we also need to put a dodge in our points. Second, this is drawing, I believe, the 95% confidence interval, or maybe it's even the full range; what I want is the 50% confidence interval. Okay, we'll change both of those things. In our ggplot(), instead of fill we'll do color = training_testing. We can then also add fun.args = list(conf.int = 0.5), close out the parentheses, and do position = position_dodge(width = 0.5). And so now we've got two points, one for training, one for testing, and we see the 50% confidence interval. So I feel a lot better about that.
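Putting the factor release and the point-range layer together looks roughly like this. The data frame is again a toy stand-in, and note that median_hilow() requires the Hmisc package to be installed when the plot is rendered:

```r
library(tidyverse)

# Toy long-format data (made-up values); levels = c("training", "testing")
# puts training on the left of the legend and plot
long <- tibble(method = rep(c("l2_genus", "rf_genus"), each = 4),
               training_testing = factor(rep(c("training", "testing"), 4),
                                         levels = c("training", "testing")),
               AUC = c(0.68, 0.71, 0.70, 0.69, 0.74, 0.76, 0.73, 0.75))

p <- ggplot(long, aes(x = method, y = AUC, color = training_testing)) +
  stat_summary(fun.data = median_hilow,              # median plus interval
               fun.args = list(conf.int = 0.5),      # 50% interval = IQR
               geom = "pointrange",
               position = position_dodge(width = 0.5))  # separate the groups
```

Using color instead of fill matters here because pointrange glyphs are drawn with lines and points, which respond to color, not fill.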
You could go in and use whatever confidence interval you want. I think this gives you the idea, right, that we're seeing that the data are not horribly overfit, and we're seeing that 50% of the data fall within that range. The next problem I want to turn our attention to is that the four different methods, as I've labeled them here, are being portrayed on an equal footing, which is fine; they kind of are. But what I'd rather have is the L2 models clustered together and the random forest models clustered together, so it's easier for my viewer to compare the L2 models to each other, and likewise the random forest models, while at the same time keeping them next to each other so I can easily compare across the four types of models, right? To do that, we're going to do a trick with something called facet_wrap(), which I think I've talked about before, but I can't remember. So we could do facet_wrap(), a tilde, and then the variable that we want to facet on. What I want to facet on is model, but we don't have a model column. So we need to come back up here to our mutate(), and I'll put this on multiple lines so it's not all scrolly left and right. We'll make a column in our data frame that we'll call model. Again, the model name is within the method column, so I'm going to use str_replace(). And here we get to use our friends, those regular expressions, which I still need to make an episode about, I think. We'll take that method column and remove everything from the _genus onward, so the pattern will be "_genus.*", and we'll replace that with nothing. That will create a model column. We then facet_wrap() on model, and I'm going to do nrow = 1 so I keep things on one row, side by side, and then add the plus sign. And now, okay, we're gaining on it. Trust me. We now see the L2 on the left and the random forest on the right.
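The regular-expression step that derives the model column can be checked on its own, outside of the mutate():

```r
library(stringr)

# "_genus.*" matches "_genus" and everything after it, so replacing the
# match with "" leaves just the model name
methods <- c("l2_genus", "l2_genus_fit", "rf_genus", "rf_genus_fit")
model <- str_replace(methods, "_genus.*", "")
model  # "l2" "l2" "rf" "rf"

# inside the pipeline this becomes:
#   mutate(model = str_replace(method, "_genus.*", "")) and later
#   ... + facet_wrap(~model, nrow = 1)
```

Testing a regex on a small character vector like this before dropping it into a pipeline is a cheap way to avoid silent mismatches.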
But we have the same x-axis for both of these facets. To solve that problem, we can come back up here to our facet_wrap() and add scales = "free_x". That frees up the scale, so to speak, so it only shows the range that's being represented within each of these two facets: for the L2 facet, I have the two L2 models, and for the random forest facet, I have the two random forest models. The next thing I want to do is move these facet labels from the top to the bottom. To move those labels, we'll do strip.position = "bottom". And now we see they're at the bottom, but we want them on the outside of the actual x-axis labels. We can change that using the theme() function, where we do strip.placement = "outside". And now we see that we have the L2 and RF on the outside. We also have this "method" label, so you can see that we have these three levels of labels. I want to get rid of that method label, I want to turn these gray rectangles into white rectangles so there's nothing to see, and I want to give these all better names. So I'll do labs(x = NULL). I also then want to give my model labels, instead of just "rf" or "l2", something more verbose, right? So I'll take model and do an if_else(): if it's l2, then I'll say "L2-regularized logistic regression", and otherwise I'm going to say "random forest". Okay, that'll look good. I'll probably put a line break in there, because I'm sure that's too long a label. I also want to change the method labels, so that if it's genus, it only says "Genus", and if it's genus_fit, it says "Genus + FIT". So again, I'll take method, and I'll say if_else() with the function str_detect(). I want to str_detect() on the method column: if it contains "fit", then I want it to say "Genus + FIT"; otherwise, I want it to say "Genus". So I'm getting an error.
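The two relabeling steps can be sketched as plain vector operations before they go into the mutate(). The "\n" line break in the long L2 label is my assumption about where the break would go:

```r
library(stringr)
library(dplyr)

methods <- c("l2_genus", "l2_genus_fit", "rf_genus", "rf_genus_fit")
model <- str_replace(methods, "_genus.*", "")

# Verbose model labels; the line break keeps the facet strip narrow
pretty_model <- if_else(model == "l2",
                        "L2-regularized\nlogistic regression",
                        "Random forest")

# Method labels: anything containing "fit" becomes "Genus + FIT"
pretty_method <- if_else(str_detect(methods, "fit"), "Genus + FIT", "Genus")
pretty_method  # "Genus" "Genus + FIT" "Genus" "Genus + FIT"
```

Because str_detect() returns TRUE/FALSE, it slots directly into if_else()'s first argument.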
And I think that's because up here with str_detect() I forgot to include a parenthesis. So again, str_detect() takes the method column and looks for that "fit" label within the values of method. If it's true, it returns "Genus + FIT"; if it's false, if the "fit" isn't there, then it says "Genus". And I've got four sets of closing parentheses here, and I'm not sure that's the right number. If I kind of toggle over them, that one's for the if_else(), this next one is for the mutate(), and I think these other two are extra. Now I'm getting another error in the model if_else(). This should say model == "l2", right, because I stripped out all that other stuff. So if model == "l2", then this should work. That looks a lot better. Let's go ahead and remove the gray box around these facet labels, so it's not so obvious what we did. I'm also going to change the theming. As you probably know by now, I am not a fan of the default theme, so I'll go ahead in here and do theme_classic(), which gives us a white background and solid axis lines. Then I'll also do strip.background = element_blank(). One of the things theme_classic() does is draw a black border around those facet labels; element_blank() makes sure there's no box there. Wonderful. Now we see that we've got nice clean axes, and we see this clustering that we'd hoped to get, right: L2-regularized logistic regression on the left, and random forest on the right. I'm happy with the way that looks. The next thing I want to change is my y-axis. I'd like it to go from 0.5 to 1, because that's kind of the natural range for area under the curve. I'll also go ahead and spell out what AUC stands for. So we can do y = "Area under the receiver operating characteristic curve", and I'll put a line break in there. And for our limits, we can do lims(y = c(0.5, 1)). Good.
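Here's how the faceting, label, and theme pieces fit together in one place. This is a sketch on toy data, with the relabeling already applied, so the plot builds standalone:

```r
library(tidyverse)

# Toy long-format data with the pretty labels already applied
long <- tibble(model = rep(c("L2-regularized\nlogistic regression",
                             "Random forest"), each = 4),
               method = rep(rep(c("Genus", "Genus + FIT"), each = 2), 2),
               training_testing = rep(c("training", "testing"), 4),
               AUC = c(0.68, 0.71, 0.70, 0.73, 0.74, 0.76, 0.73, 0.78))

p <- ggplot(long, aes(x = method, y = AUC, color = training_testing)) +
  geom_boxplot() +
  facet_wrap(~model, nrow = 1,
             scales = "free_x",            # each facet shows only its methods
             strip.position = "bottom") +  # facet labels below the panel
  labs(x = NULL,
       y = "Area under the receiver\noperating characteristic curve") +
  lims(y = c(0.5, 1)) +
  theme_classic() +
  theme(strip.placement = "outside",             # strips beyond axis labels
        strip.background = element_blank())      # no box around strip text
```

The key pairing is strip.position = "bottom" in facet_wrap() with strip.placement = "outside" in theme(): the first moves the strips, the second pushes them past the x-axis tick labels.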
So we're going from 0.5 to 1, and we've got a nice y-axis label there. I'm happy with that. The next thing I'll do is change "training" and "testing" to be capitalized, get rid of this legend label, and fix the colors. I can do all of that, believe it or not, with one function, which is scale_color_manual(). I can say name = NULL; that will get rid of the legend name. I can then do breaks = c("training", "testing"), then labels = c("Training", "Testing"), and then values = c("gray", "dodgerblue"). Add a plus sign to that, and you see that got rid of the name of our legend, capitalized Training and Testing, and changed the colors. So I'm pretty happy with the way this looks. One thing I might want to do is scooch it in a little bit; things seem pretty far apart. I can change that by making my output a little more narrow, so I'll do a width of 5 and a height of 4. Again, that brings things in a little bit more and makes it a little more compact. One last thing I'd like to do in here is put in a dashed line to indicate where the FIT result AUROC would be. To do that, we can come back up here, and I'll put that first, before stat_summary(), because I want that line to be behind any of those confidence intervals. So I'll do geom_hline(yintercept = 0.65). I'm hard-coding that number. Again, if I were doing this for research purposes, for publication, I would call the function that calculated that value and have it inserted automatically, so that if the data are updated I don't have to worry about that number being wrong. Anyway, I'll do color = "gray" and linetype = "dashed". Very good. We now have that dashed line to indicate where the area under the receiver operating characteristic curve would be for using the FIT result on its own. And as we can see, our models do, I would say, as good or better than FIT.
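The legend cleanup and reference line look like this in isolation, again on toy data. The 0.65 FIT-alone AUC is hard-coded here, exactly as flagged above:

```r
library(tidyverse)

# Toy data (made-up values)
long <- tibble(method = rep(c("Genus", "Genus + FIT"), each = 4),
               training_testing = rep(c("training", "testing"), 4),
               AUC = c(0.68, 0.71, 0.70, 0.73, 0.74, 0.76, 0.73, 0.78))

p <- ggplot(long, aes(x = method, y = AUC, color = training_testing)) +
  # drawn first so the boxes/points sit on top of the reference line
  geom_hline(yintercept = 0.65, color = "gray", linetype = "dashed") +
  geom_boxplot() +
  scale_color_manual(name = NULL,                      # no legend title
                     breaks = c("training", "testing"),
                     labels = c("Training", "Testing"),# capitalized labels
                     values = c("gray", "dodgerblue")) # the two colors
# ggsave("figures/model_compare.tiff", p, width = 5, height = 4)
```

Layer order in ggplot2 is draw order, which is why geom_hline() goes before the summary layer: anything added later is painted on top of it.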
Clearly, genus plus FIT does better than FIT alone, which does better than genus alone. And what we can also see is that random forest does better than the L2-regularized logistic regression. So I'm really happy with the way this looks. An AUROC at or above 0.8, I would say, is winning, and with some more tweaks and ways of improving the model, we could perhaps push that even higher. So something we might do to try to improve this might be to try a different modeling approach, right? With that in mind, maybe we'd want to try L1-regularized logistic regression, maybe a decision tree, maybe XGBoost; this plot could get bigger, going from two models to four to five to six, or however many different models Begüm had in that paper that we published in 2020. But the way we have this coded already is not DRY, because if I add another model, I'm going to have to add more lines, and that's just going to get pretty tedious. So what I'd like to do in the closing moments with you here is see how we can take this code that we constructed, which works, it's fine, but it's not DRY; we're repeating ourselves many times. When I see code like this, repeated many times except for one thing, that screams "function" to me. We can create a function to do this for us. So coming back to this list.files() function, we see that list.files() outputs just the names of the files. To get the full path, including the processed_data directory, we can add an argument to list.files(), which is full.names = TRUE. And now we see we've got the full paths, right? I'll go ahead and create a function that will allow me to read in each of the file names that I get out of list.files(). So I'll do read_performance <- function(x), where x will be the file name; maybe I'll just call it file_name.
And then I've got my curly braces, right? I'll do read_tsv(file_name), and then we'll pipe that into a mutate() to change the method. So we'll do method = str_replace() on file_name. I'm going to take the file name, which will look something like this, and I need a pattern using a regular expression. So I'll do ".*/"; that will match everything from the first character up through that forward slash, which matches what I've got highlighted here. Then what I want to capture, I'll put in parentheses: everything from after that forward slash to the _performance.tsv. And then I need to give it something to replace the match with, and so we'll replace it with "\\1". That "\\1" is the value that was matched in those parentheses. Okay, so we can go ahead and load that. And then, like I said, we've got list.files(), and what we can do, it's pretty slick actually, is pipe all this into map_dfr(). So map_dfr() takes the values of this vector, these four values, which we can indicate with the period, and sends each of them to read_performance(). Right. So we run that, and it creates a data frame that's got 400 rows. And if we're not sure that worked, we can always pipe it to count(method): we've got our four methods, each of them with the 100 seeds. So that is in good shape, and I can actually remove that check. I'm going to go ahead and delete those earlier lines, and I don't need this bind_rows() anymore, right? And then I can pipe this straight in. So again, this is now far more DRY. I can add another performance file to my processed_data directory, and it will automatically get included in my figure. And I get the same exact figure that I had before, without all the repetitive code.
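The whole DRY refactor can be sketched end to end. To keep this self-contained, I'm writing two toy performance files into a temp directory (the column names match the episode; the values are made up), then reading them all back with one function:

```r
library(tidyverse)

# Create a temp processed_data with two toy performance files
dir <- file.path(tempdir(), "processed_data")
dir.create(dir, showWarnings = FALSE)
write_tsv(tibble(seed = 1:2, cv_metric_AUC = c(0.68, 0.70), AUC = c(0.71, 0.69)),
          file.path(dir, "l2_genus_performance.tsv"))
write_tsv(tibble(seed = 1:2, cv_metric_AUC = c(0.74, 0.73), AUC = c(0.76, 0.75)),
          file.path(dir, "rf_genus_performance.tsv"))

read_performance <- function(file_name) {
  read_tsv(file_name, show_col_types = FALSE) %>%
    # ".*/" eats everything through the slash; the "(.*)" captures the
    # method name; "\\1" substitutes whatever the parentheses matched
    mutate(method = str_replace(file_name, ".*/(.*)_performance.tsv", "\\1"))
}

performance <- list.files(dir, pattern = "performance.tsv",
                          full.names = TRUE) %>%
  map_dfr(read_performance)           # read each file, row-bind the results

count(performance, method)
```

Dropping another *_performance.tsv file into the directory is now all it takes to add a model to the combined data frame; no new read or mutate lines needed.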
And again, the beauty of using list.files() and map_dfr() with my read_performance() function is that I can add another performance file to that processed_data directory, and it will automatically get picked up in here. The one caveat, of course, is that I did add some special code to map "l2" to L2-regularized logistic regression (that's glmnet under the hood) and "rf" to random forest. So if I added, say, rpart, I think that's the decision tree, I would need a similar mapping of rpart to "decision tree", or xgboost to "XGBoost", or whatever. So there's a little bit of futzing that would still need to be done to get the labels to look right. But again, that level of automation really helps with reproducibility, and it makes for a much nicer programming experience. So again, looking at this figure, I'm filled with optimism that with some more tweaking, perhaps some better algorithms, perhaps changes in how we preprocess the data, perhaps changes in the features that we include in our model, we can make a non-invasive diagnostic of screen-relevant neoplasias that incorporates the human microbiome. I think that's really cool. To me, this is an objective way of showing that maybe the microbiome is actually good for something. You never know. We did some really cool things in this episode, and I want to pause and help you take stock of them so you appreciate what happened. I showed you how we can make multiple layers of x-axis labels: we have the Genus and Genus + FIT labels, we have the model labels that came from the facets, and then we had the overall x-axis label that I ultimately removed with that x = NULL in the labs() function, with training and testing moved off to the legend. We saw how we could add a geom_hline(), that dashed line, to show where the FIT result was. We also saw how we could replace a box plot, which is a little bit kludgey, I think, and just not so clear.
People perhaps overinterpret box plots a little, so we swapped in the point range using stat_summary(), which we've seen before. And then finally, what we saw in our code, way up here, was that we can make our code DRY by creating a function that replaces those four repetitive chunks of code for reading in our performance TSV files. What I'd like you to do is go out into the wild of your own code and your own data and see if you can't incorporate some of these concepts into your work this week. Let me know how it goes and what you find. Share your experiences with everybody down below in the comments, and we'll talk to you next time for another episode of Code Club.