Life would be so easy if we all had the same data, the same questions, and the same goals, right? Fortunately, we don't. But because we're all trying to do different things, when we build statistical models, and especially machine learning models, we need to adjust parameters to customize the model to our data and our questions. In today's episode, we'll talk about how to adjust those parameters, which we call hyperparameters.

Hey, folks, I'm Pat Schloss and this is Code Club. In the last episode, we talked about preprocessing our data as a way of cleaning up the dataset, rescaling and recentering everything so the analysis runs more quickly and is more robust. In today's episode, we're going to talk about another set of adjustments, to what are called the hyperparameters. Hyperparameters are the parameters that go into the model and affect how the model is trained and fit to the data. One thing to appreciate is that we could keep adding more and more parameters until the model fits our data really well, but when that model then sees new data, it will perform horribly, because it was overtrained to the original data.

The model we've been using to motivate our progress through learning about machine learning with the mikropml package is logistic regression. One of the techniques we can use with logistic regression is regularization. We've been using L2-regularized logistic regression, and with that model comes a hyperparameter called lambda, which controls the regularization. It affects how many features from the full set effectively make it into the final model; it adjusts the weighting, if you will. That all goes into getting a more reliable fit of the data. mikropml comes with some preset lambda values that it automatically tries when fitting the data, and it outputs an area under the curve value for each of those lambda values. So what we're going to do today is see how well the default lambda values fit our data, see how we can adjust those lambda values when they don't fit the data very well, and then talk about how we can visualize the results to get a better sense of where the optimal lambda value is for our data. To get going with that, let's head over to RStudio.

Here we are in our genusml.R script. You'll recall that this reads in the genus-level relative abundance data. The dataset I'm working with was previously published by my lab, by a former graduate student named Niel Baxter, who was looking at variation in the gut microbiota by colon cancer status. What we're trying to do is predict whether or not somebody has a screen-relevant neoplasia, or SRN, based on the structure of their gut microbiota. So we read in all that genus data and get it tidy and looking good, we load our mikropml package, we get everything formatted correctly, and then we can preprocess the data as we discussed in the last episode. The default preprocessing removes any columns where there's no variation, or very low variation, in the data. It also centers and scales all of our relative abundance data, so the mean is zero, one represents one standard deviation above the mean, and minus one represents one standard deviation below the mean, right?
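For reference, the setup described so far might look something like the following minimal sketch. The file name and the outcome column name ("srn") are assumptions for illustration, and the tidying step is omitted; swap in whatever your data actually uses:

```r
library(tidyverse)
library(mikropml)

# Read in the genus-level relative abundance data (hypothetical file name)
srn_genus_data <- read_csv("raw_data/baxter_genus.csv")

# Default preprocessing: drops zero- and near-zero-variance columns, then
# centers and scales the remaining features (mean 0, sd 1)
srn_genus_preprocess <- preprocess_data(srn_genus_data,
                                        outcome_colname = "srn")$dat_transformed
```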
Then we have our run_ml call, which, as we see where we're doing method = "glmnet", goes ahead and does that logistic regression. Let's go ahead and run this. As we've seen in the past, it takes about 90 seconds to run.

As we look at the output from run_ml for our set of parameters, up at the top there's an object in srn_genus_results called trained_model that holds more information than is actually being shown on the screen here. What you'll see is a table where the first column is lambda, and then there are several more columns you can use to assess which lambda value is best. The one we're going to look at is the AUC column, which is the area under the receiver operating characteristic curve, or AUROC curve. What we see is that right at about 1, or maybe 10, we have a peak AUC value.

A couple of things to keep in mind. First of all, this is for one split of the data, right? We take 80% of the data, fit the parameters with five-fold cross-validation, and then test on the held-out 20%; that 80% is where these lambda values are being evaluated. If we had done another 80/20 split, we'd get different AUC values. So what we'll want to think about is doing this a bunch of times, so we can get a better handle on our AUC values and on what the optimal lambda value is.

The other thing we see is that the tuning parameter alpha was held constant at a value of zero. glmnet actually has two hyperparameters we can set: alpha and lambda. Lambda, as I mentioned, affects the regularization and how the weightings are performed when fitting the data. An alpha of zero indicates that we did L2 regularization, which is also called ridge regression. If we had used a value of one, that would have been L1 regularization, also called lasso regression. We could also set alpha to a value between zero and one, which is called an elastic net. I encourage you to go read more about regularization and machine learning algorithms to get a better sense of what's going on there. What we're looking at here, with alpha set to zero, is L2 regularization.

We've got these built-in lambda values, and we'd like to think about how we can change them. We can give run_ml our own hyperparameters with the hyperparameters argument, and we need to give that a list indicating the values of the hyperparameters we want it to evaluate. So I'm going to go ahead here and put test_hp, and then I need to define test_hp. That needs to be a list, so we'll do test_hp <- list(...). A list is a data type in R that you can think of as accumulating, or pulling together, different types of data. I can put in alpha = 0, which, again, is for our L2 regularization, and then I can give lambda a vector of values, like 0.1, 1, 3, 5, 10. Something to realize is that for each combination of alpha and lambda, we're going to run the five-fold cross-validation 100 times. So I could put in ten lambdas instead of five, but that would effectively double the time it takes to fit all those parameters. So let's start with these five values of lambda.
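Put together, that call might look like the following sketch; the object names carry over from the setup above, and the seed is an arbitrary placeholder:

```r
# Our own hyperparameter grid: alpha = 0 keeps L2 (ridge) regularization,
# and lambda gets the five values we want run_ml to evaluate
test_hp <- list(alpha = 0,
                lambda = c(0.1, 1, 3, 5, 10))

srn_genus_results <- run_ml(srn_genus_preprocess,
                            method = "glmnet",
                            outcome_colname = "srn",
                            hyperparameters = test_hp,
                            seed = 1)

# The lambda-by-AUC table lives in the trained model
srn_genus_results$trained_model
```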
This is actually a somewhat adjusted range: instead of going from 10^-4 up to 10 like the defaults, we're looking at 0.1 to 10, with more granular steps in between. So we'll go ahead and load test_hp, and then we'll run our model and see what the output looks like.

All right, if we look at srn_genus_results, we again get a similar type of output, but what we're looking for is that data frame with the five lambda values we set. We see that the AUC does seem to go up from 0.1 to 1 to 3, and then it kind of falls off. But man, those differences are out in the ten-thousandths place of the decimal, so I suspect it comes up and then is basically flat. I think we've got a pretty good range here from 0.1 to 10.

We are still doing one split, though. So maybe what we'd like to do is, say, three splits, just to get a sense of what's going on. In this episode, I don't want to do a deep dive on how we do tons of splits or the different ways we can make that more efficient, but I want to give you a sense of how we can begin to think about which lambda value is best across multiple splits. So I'm going to turn my run_ml call, with all its arguments, into its own function, which I'll call get_srn_genus_results. I'll say function, and I'll give it an argument called seed, because we'll give a different seed to run_ml every time we run this; so down in run_ml we say seed = seed. That looks good. Now if I do get_srn_genus_results(1), it will use 1 as the seed for my random number generator. Again, this takes a minute and a half or so to run; I just want to double-check that everything looks good. That looks good.

Now I'll write my map call: map(c(1, 2, 3), get_srn_genus_results). Those are the three seeds I'm going to send to get_srn_genus_results, so this will run it three times, the first time with a seed of 1, the second time with a seed of 2, and the third time with a seed of 3. I'm going to call the output iterative_run_ml_results; it will be a list holding the results from running run_ml three times.

So that ran through. Something I just want to briefly comment on is that we're still getting this warning message about things not converging or working well. We're not totally sure what this means; the models seem to perform pretty well in spite of the warning, so I'm going to ignore it for now, and maybe in future releases of mikropml we'll output a little more information about what's going on here to you, the users.

All right. If we look at iterative_run_ml_results, we see that it's a list of three elements, one for each of the three runs of the model. What I'm interested in is the trained_model element within each element of the list. I can extract that by taking iterative_run_ml_results and piping it to the map function, where I use the pluck function; pluck is from purrr. That will pull the trained_model element out of each element of the list; pluck will basically pluck trained_model out of each seat, if you will, in the list.
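Here's roughly what that refactoring looks like, as a sketch building on the objects defined above:

```r
library(purrr)

# Wrap run_ml() so the only thing that changes between runs is the seed
get_srn_genus_results <- function(seed) {
  run_ml(srn_genus_preprocess,
         method = "glmnet",
         outcome_colname = "srn",
         hyperparameters = test_hp,
         seed = seed)
}

# Run the model three times, once with each seed
iterative_run_ml_results <- map(c(1, 2, 3), get_srn_genus_results)

# Pluck the trained_model element out of each run's results
trained_models <- iterative_run_ml_results %>%
  map(pluck, "trained_model")
```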
And so if we look at this now, we see that we've got the third trained model, the second, and the first. We can then pipe this output to the combine_hp_performance() function; let me go ahead and put these on different lines. What we get out is an object whose $dat element is a data frame with our alpha value, our lambda values, and the AUCs. I'm going to call this performance.

mikropml has some helper functions built in to do things like helping you visualize the performance of your models across different hyperparameters. So we'll do plot_hp_performance(), and we'll give it performance$dat, because performance is a list and the $dat gets us the data frame out of the list. Then we give it the hyperparameter we want on the x axis, lambda, and the metric we want on the y axis, AUC. What we get out, as I described, is lambda on the x axis and the mean AUC on the y axis. The range of our mean AUC values is pretty small, especially compared to the standard deviation; the error bars here represent plus or minus one standard deviation in the data. Again, we're only doing three iterations, so it's probably not that reliable to read too much into this; we'd like to do maybe a hundred splits to get a better sense of the variation. But you can get the sense that it does peak around 3 and then fall off on either side.

I might want to come back and use lambda values of 0.1, 1, 2, 3, 4, 5, and 10 to get a better sense of that variation in the data. So let's go ahead and do that: we can come back up and set lambda to c(0.1, 1, 2, 3, 4, 5, 10). This gives us more of a grid, more granularity, in our lambda hyperparameter. From the plot that's output, we see that we do have better coverage, if you will, between 0.1 and 5 now, but it's not totally clear which value is the peak.

So, like we've seen in the past, we can of course take a look at performance$dat and work with it on its own. This is a data frame, right? So we can do all of the great group_by and summarize things that we've done in the past (sketched below) to get a sense of what the optimal alpha and lambda are, and we can get the mean AUC. So let's do that: let's get the mean AUC. We do group_by(alpha, lambda), and then we do a summarize; hopefully this feels a little more comfortable to those of you who aren't so sure about this machine learning stuff yet. We do mean_AUC = mean(AUC), and now we get our mean AUC values; I'll go ahead and add .groups = "drop". Then we can pipe this to top_n, with n = 1 and mean_AUC, to get out the top AUC. We see that the top mean AUC is 0.631, with a lambda of 3; again, we set alpha to zero, so there's no change there.

Okay, so this is great: we have a variety of ways of looking at the optimal lambda, or whatever the hyperparameter is, and getting back the mean AUC. Not that we would necessarily want to do this, but let's see what happens if we modify alpha so that we also look at, say, 0.5 and 1. This is taking a little longer to run than I'd really like it to, so let me also change my cv_times from 100 to 10.
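To make that concrete, the summary pipeline just described might look like this sketch, starting from the three trained models plucked out above:

```r
library(dplyr)

# Combine the cross-validation results from the three trained models
performance <- combine_hp_performance(trained_models)

# Mean AUC for each alpha/lambda combination, averaged across the splits
mean_performance <- performance$dat %>%
  group_by(alpha, lambda) %>%
  summarize(mean_AUC = mean(AUC), .groups = "drop")

# Pull out the combination with the highest mean AUC
mean_performance %>%
  top_n(n = 1, mean_AUC)
```

And the tweak just described, widening alpha and trimming cv_times, might look like this; the cv_times = 10 would go into the run_ml() call inside get_srn_genus_results():

```r
# Compare ridge (alpha = 0), an elastic net (alpha = 0.5), and lasso (alpha = 1)
test_hp <- list(alpha = c(0, 0.5, 1),
                lambda = c(0.1, 1, 2, 3, 4, 5, 10))
```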
I just want to do a demo here to see what the output would look like if we were tuning two hyperparameters. At this point, plot_hp_performance() will only plot one hyperparameter, lambda, across the x axis. I guess I could do plot_hp_performance with alpha instead, and what you'll see is that we get a drop in AUC as alpha goes from 0 to 0.5 to 1. But of course there are two hyperparameters varying there, so it's not totally a head-to-head comparison.

If we look at performance$dat a little closer, we can see that we ran this three times, and each of the three times we got alphas of 0, 0.5, and 1, along with the lambda values of 0.1, 1, 2, 3, 4, 5, and 10, and the output AUC values. For whatever reason, a lot of the 0.5 alphas give an AUC of 0.5, which is effectively random. Anyway, let's look at top_n with n = 3 and see what we get for the three conditions that gave us the highest mean AUCs. Not much of a surprise that the alpha is 0, and then our three lambdas of right around 2, 3, and 4 gave us the highest mean AUCs.

We saw some problems with plotting the performance, so let's take another look at that. Instead of looking at the top_n, I'll comment that out for now, and let's do ggplot. In the aes, I'll put lambda on the x axis, because that's the variable I'm most interested in, mean_AUC on the y, and alpha as the color, and we'll add geom_line. Now, one of the problems with this is that alpha is a numeric value, so ggplot is going to treat color as a continuous variable, and it's going to look really funky. So let's do as.character() on alpha; that will basically treat each alpha, so 0, 0.5, and 1, as three discrete values. Now we can see that an alpha of 0 has the highest mean AUC, while 0.5 peaks at a lambda of about 0.1, as does the L1 regularization with an alpha of 1.

Now, it's a little bit deceiving, because our smallest lambda value was 0.1. If we went to smaller lambda values, the alphas of 0.5 and 1 might actually want a much smaller lambda, and their AUCs might be better than what we see at a lambda of 3 for an alpha of 0, right? So again, these are the things you can mess with, or tune, I guess, better than mess with: the parameters you can tune to get the best area under the curve. I'm focusing on L2 regularization through this series of episodes to kind of get our feet under us, so to speak, and learn how mikropml works before branching out and looking at different modeling approaches.

Something you might be wondering about is how you know what the different hyperparameter options are, and what the default values are for the various modeling approaches. Well, we have a helper function for that: get_hyperparams_list(). You give it the name of the data frame that you're running through the model, which for us was srn_genus_preprocess, followed by the name of the modeling approach. So we could do get_hyperparams_list(srn_genus_preprocess, "glmnet") to see the lambda values and the alpha values. We could also do "rf", which is for random forest, which we'll see soon enough, don't worry; there are three mtry values, mtry being the only hyperparameter that our version of random forest will use.
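That plot might be built like this; a sketch using the mean_performance tibble from the earlier summary step:

```r
library(ggplot2)

# as.character() turns alpha into a discrete variable, so each value of
# alpha gets its own line color instead of a continuous color gradient
mean_performance %>%
  ggplot(aes(x = lambda, y = mean_AUC, color = as.character(alpha))) +
  geom_line()
```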
Another one you might think about would be svmRadial, to get its variety of hyperparameters. Some of the hyperparameters depend on the data: for example, the mtry values depend on how many features you have, whereas for something like lambda, it doesn't really care; those are baked-in hyperparameter values. For a decision tree, we would use "rpart2", and we see there's one hyperparameter, maxdepth; again, this is also dependent on the number of features that you have. And then finally, we can put in xgboost, which is "xgbTree". This one has the most hyperparameters available for tuning; it builds boosted trees, a variation on random forest.

So let's go ahead and clean up our code a little bit to get it ready for the next episode, because something we'd like to do is go ahead and do 100 splits, so we can get the most robust fit of the hyperparameters, of that lambda value. I'm going to turn this alpha back down to zero. These all look pretty good, and I think we're in good shape. I'm going to go ahead and save this, and I will commit it after I finish saying goodbye to you all.

Again, I hope this drives home the point, if it wasn't already clear, that machine learning algorithms are not simply plug and play; you can't just plug in your data, let it chug, and get out meaningful results. That might happen sometimes, but more often than not, some tuning needs to be done: some careful thought in picking hyperparameters, picking your modeling approach, and figuring out how you know you have a robust model. That's what we're slowly working through here in these episodes, so I encourage you to be patient. We are obviously learning a lot about mikropml, but as we also saw today, I showed you the pluck function combined with map, which lets us get specific elements out of each value in a list. We'll see more in the next episode as we scale up what we've done here to 100 splits and see how we can use the map function, with and without parallelization, to run that. We're slowly working up through a bunch of other really advanced topics; I say advanced, but they're really necessary, right? Thinking about how you know if one model is better than another, how you evaluate a model, how you figure out which features are most important in a model. That's all coming in future episodes of Code Club. So thanks for making it this far; I encourage you to be patient and stick with us. I've got a lot of great stuff coming along, and we'll see you next time for another episode of Code Club.