If you've been following along in my recent episodes of Code Club, you know that I've been building up to a reveal of a new R package my lab's developed called mikropml. mikropml allows you to easily implement a variety of machine learning algorithms to identify features that allow you to make classifications, as well as to predict continuous variables using regression. We're going to get going learning about mikropml right now. Hey folks, I'm Pat Schloss and this is Code Club. One of our motivations for mikropml was that in my lab we have started to use machine learning algorithms a lot as a way to identify markers within a microbial community that we can associate with health and disease. One of the problems is that in every paper we seem to be using different pipelines. And so with an eye towards reproducibility, I told the former postdoc in my lab, Begüm, hey, Begüm, it'd be awesome if we had a single pipeline that we could release and that we could use ourselves to improve the reproducibility of our machine learning practices. That's what mikropml is, and over the coming 10 or so episodes, I'm going to tell you all about mikropml and we're going to use it to analyze the data that I've been showing you in the past few episodes, from a project that a former graduate student in my lab named Niel Baxter originally developed. Anyway, let me take a step back and tell you why I think machine learning is where it's at. As I showed in the last two episodes, we can identify individual features, such as microbial populations within a community, that are significantly different between healthy people and people with disease. In my application for this current series of videos, that means looking for screen-relevant neoplasias, a special type of advanced lesion within people's colons that is typically indicative of colon cancer.
We can identify these populations that are associated with these SRNs, screen-relevant neoplasias, but at the same time, when we then use those individual populations to ask, can we classify individuals as having an SRN or not, the answer is no. They actually do quite poorly on their own. One reason for that is that the distribution of microbial taxa across people is very patchy. As we've seen, the number of people that might have any one of the populations associated with SRNs or associated with health is very low; generally, less than 50% of people have that biomarker of interest. That causes problems for our statistical tools, and also for thinking about classification. It's also a problem because it plays into a very reductionist mindset that is growing within the microbiome field: that there is one bug, or maybe two, that is associated with disease. The reality is that we're most likely looking at dozens of organisms that are associated with disease. For heaven's sake, we're looking at a community of bacteria. And so to point at an individual organism and say, this is the one that's driving disease, just doesn't sit right with me. The benefit of machine learning is that we can instead look at collections of organisms, collections of features, biomarkers, whatever you want to call them, that you can then use to make predictions or classifications of different types of samples based on what's in there. The example that I like to use in my talks is of an email spam filter: on the back end, my Gmail account has some type of algorithm running that allows it to classify individual emails as being legit, which go to my inbox, or as spam, which then go to my spam folder and I never see them again. So if it sees something like Nigeria, million dollars, bank account, it says, that's probably spam. But if it sees million dollars alone, well, that could be legit, because that could be an email from NIH announcing some great new award that I might want to apply for.
And it might even have the word Nigeria in there, because it's for collaborations with Nigeria. But there are going to be a lot of features that Gmail has to use to figure out what's spam and what's legit. Now, that's very much like what I'm trying to do with the microbiome: I'm trying to build a cancer filter, if you will, a filter that looks at different components of the microbiome to determine whether a sample comes from somebody with colon cancer or from somebody who is healthy. And I think there are a couple of benefits to this. First of all, again, it takes a community, kind of ensemble-based, approach, and it allows for that very patchy distribution of the community. The other thing it allows us to do is get somewhere practical with the microbiome, right? I mean, basic science is great, and I do a lot of basic science myself. But at the end of the day, we need some type of deliverable from the microbiome field, and the place where I think we are going to make the first advances in the human microbiome, in getting something deliverable to improve human health, is with a diagnostic, right? If I could take a stool sample from somebody, run that through my sequencer, and say, aha, this person is at a high risk of having colon cancer, we need to get them to have a colonoscopy, that would be a win, right? We already have some really good noninvasive diagnostics like the fecal immunochemical test, FIT, which I talked about in the last episode, and there are others that have kind of built upon FIT. Well, what if we were to add the microbiome onto FIT? Could we make an even better predictor or diagnostic of colon cancer? Of course, getting a colonoscopy is what you should be doing if you're over the age of 50. But only 30% of people adhere to the recommendations for screening. And part of that is because colonoscopies, they suck, right? They're pretty awkward. They're expensive. It's just not fun, right?
So if we can make a better noninvasive diagnostic, that would be a win. Those are the reasons that my lab is really investing heavily in developing and applying machine learning pipelines and algorithms to microbiome data: so that we can study our communities as communities, as collections of organisms rather than individual organisms, and so that we can perhaps get somewhere practical in coming up with a deliverable from the human microbiome. Now, a lot of what I'm going to be talking about over this episode and coming episodes is based on work that my lab did in this paper, published in mBio. We published this because we're really interested in trying to have a conversation with the field about using machine learning approaches to study the microbiome. People are doing all sorts of things that are probably really just not appropriate, things like neural nets that really require thousands of observations to build an intelligent model; nobody has that many samples. So what we really wanted to do was to start a discussion with the field about how we should be using machine learning with microbiome data. We then took the code that we used to write this paper and put that into a package called mikropml. Although our genesis was with microbiome data, that doesn't mean it can only be used with microbiome data. People have used it with genomics data; we've used it with publishing data; all sorts of different types of data. What we think is unique about mikropml is that it implements this pipeline. And so I want to spend a little bit of time going through this with you, but know that we're going to come back to it periodically as we go through the coming episodes. In general, we have this big outer box that you can think of as representing all of our data. In the current data set, we have 490 samples.
Let's just say we have 500 samples, okay, just to make round numbers. What happens is that we initially split the data into a training and validation set that's 80% of the data, so 400 samples, and then we hold out 20% of the data, about 100 samples, as our testing data. We're actually going to repeat that 100 times, but we'll hold on to that, right? So we have this 80/20 split. With the 80% that we're holding out for training and validating, we then do another 80/20 split. This is called cross-validation: we do a five-fold cross-validation on that initial 80% split. We take that second 80%, the 80% of the 80%, so it's about 320 samples, I think we're at, and then we train this using a bunch of different parameters, different characteristics or weightings or elements of the algorithms, to build the best possible model. And then with all those different trained models, we attempt to validate on that other 20%. At this level, that's about 80 samples, right? We then repeat this 100 times: within that initial 80%, we do that five-fold cross-validation, and we'll repeat it 100 times. Then, for each parameter setting, or what we call hyperparameters, we get an average cross-validated area under the curve (AUC) as a metric of our performance. And so we come up with that mean AUC and the best possible hyperparameters, and we then train the full 80% with those parameters. Then we take that 20% and run it through the model; for each iteration through this big outer box, we get one test AUC. Now, we repeat this 100 times because we could get lucky on that initial 80/20 split. By doing it 100 times, or 1,000 times, or whatever you want to do, you can then get a mean test AUC, as well as a standard deviation or a confidence interval or whatever you want. This is what mikropml really excels at. Each iteration of mikropml does one iteration of this initial 80/20 split.
It then goes in and does the five-fold cross-validation 100 times, or however many times you want to do it, and then reports the mean training AUC as well as the observed test AUC. And again, what we'll see as we go over many episodes is how we can repeat this 100 times to get our 100 test AUCs to model and look at performance. So I know that's a lot to take in. Don't worry, we will see this flow chart many times over the coming episodes. This is really a framework, right? This is a pipeline. There is no model baked into it. And so one of the benefits of mikropml is that it allows you to use many different machine learning algorithms, whether that's logistic regression, SVMs, random forest, XGBoost, decision trees, whatever, that you can then plug into this framework to get the best possible model. In fact, one of the arguments that we made in this paper was that authors really should be evaluating a range of different models to figure out which one gives the best performance. And that's again what we're going to be doing, not just today but over coming episodes: building machine learning models with this framework and finding the model that does the best job of modeling the data. Why would you do that? Well, because, as we'll see, there are trade-offs in the interpretability of the different models. Something like logistic regression is far easier to interpret than something like random forest. And so if we can get by with a simpler model, we should. At the same time, if we need a more complicated model, then we should definitely use that as well, even though it might lack some interpretability. Interpretability is kind of like, you know, when they ask people at Facebook, why do I see what I see on my feed? And they say, well, we don't know. That's interpretability. They can't interpret their own model. They just know it works, right?
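To make the arithmetic of that nested split concrete, here is a minimal sketch in R. The sample counts are the rounded numbers from above (500 standing in for the 490 samples); this is just illustration of the split sizes, not anything mikropml computes for you:

```r
# Hypothetical arithmetic for the nested 80/20 splits described above.
n_total <- 500                           # round number standing in for 490 samples

n_train_validate <- 0.8 * n_total        # outer 80%: training/validation set
n_test           <- 0.2 * n_total        # outer 20%: held-out testing set

# Within the outer 80%, five-fold cross-validation holds out one fold at a
# time, which is itself an 80/20 split: train on 4 folds, validate on the 5th.
n_cv_train    <- 0.8 * n_train_validate  # the "80% of the 80%"
n_cv_validate <- 0.2 * n_train_validate  # the fold held out for validation

c(n_train_validate, n_test, n_cv_train, n_cv_validate)
#> [1] 400 100 320  80
```

Those are exactly the 400/100 and 320/80 counts from the flow chart.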
They know that people keep coming back, and so it works. And so interpretability is a tradeoff; it's not everything. But as basic scientists, we want to know: what are the features, what are the characteristics of the community, that are driving the classification? Because I would like to think that if those are associated with a good classification, then maybe we should follow up on them and better understand their biology and any role they might have in driving someone's normal colon towards something like colon cancer. So my goal for today is to introduce us to mikropml, maybe run a few commands and get a sense of the inputs and outputs, as well as get our project set up so that we are ready to use mikropml. The documentation for mikropml is off of my lab's website, at schlosslab.org/mikropml. mikropml is on CRAN, so you can install it directly from R. You can also get the bleeding-edge version of mikropml by downloading it from GitHub. I'm going to be using the version that is on CRAN because that's the easiest for everyone to get. If you are interested in reading more about the package itself, it has been published in the Journal of Open Source Software, which you can see here. We even have a nice hex sticker, which makes it like a legit package, right? Anyway, it's a very short paper, and if you use mikropml, we kindly ask that you cite it, so all the great people that helped us work on it get the credit that they deserve. As always, I'm here in our studio and I need to get mikropml installed. So I'll go to Packages and then install mikropml. So what does mikropml mean? Begüm is Turkish, and "mikrop" is Turkish for microbe. So go ahead and install that. There might be some other dependencies that we need to install along the way, but it should be pretty straightforward to install if you've been following along in recent episodes.
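If you'd rather install from the console than the Packages pane, it looks like this. The commented GitHub line is one common way to get the development version and assumes you have the remotes package; treat it as a sketch:

```r
# Install the released version of mikropml from CRAN
install.packages("mikropml")

# Or, for the bleeding-edge development version from GitHub
# (assumes the remotes package is installed):
# remotes::install_github("SchlossLab/mikropml")

# Once installed, load it for the session
library(mikropml)
```

Either way, you only install once; after that, `library(mikropml)` at the top of your script is all you need.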
I'm going to go ahead and create a new R script that I will save into my code directory as genus_ml.R. So we are good there. I'm going to come back to my genus_analysis.R script, and I'm going to take some of the code that I generated for genus_analysis.R and put it into another R script to help keep my code DRY. You'll recall that I have these libraries; I read in the shared file from mothur, my taxonomy data, and my metadata, and I generate this composite data frame. So I'm going to cut that and plop it into a new, untitled R script, and I will call this genus_process.R. And so now what I should be able to do is source("code/genus_process.R"), and the genus_analysis.R script should all work; this is where we generated the strip chart as well as the ROC curves in the two previous episodes. So I'll go ahead and source that. That all runs smoothly. Next, I'm going to rename the genus_analysis.R script to genus_by_genus_analysis.R to indicate that this is really comparing each genus separate from the others. And then I've got my genus_ml.R script separately, and I will go ahead and put source("code/genus_process.R") into my genus_ml.R. So again, running that loads and gets me this composite data frame. I then have a data frame with the group (the sample ID from each participant in the study), the taxonomy, the relative abundance, the fit_result (which is the amount of blood in the stool), and a bunch of other metadata on all of the subjects. Looking at my Git tab up here in the upper right, I see that it's got a D for code/genus_analysis.R because I renamed it, so it looks like it deleted it; but then I replaced it with code/genus_by_genus_analysis.R. If I click Staged for both of those files, it then shows that it renamed it. I'll also go ahead and click Staged for genus_process.R.
And I'll go ahead and commit this with the message "reorganize code to make DRY", right? Because otherwise I'd be taking that same code that's now in genus_process.R and copying and pasting it over and over again. Again, we have this composite data frame, and we're now ready to feed it into mikropml. But first I need to do library(mikropml). To get composite to work with mikropml, I need to have each of my features as a separate column in the data frame. When I say features, that's anything that goes into the model. So I might decide to put in the fit_result, I might put in the person's sex, I might put in their age, but I'm also going to put in the genus-level relative abundance data, right? For now, though, I'm only going to look at the microbial features of the community. So again, we'll take composite and pipe that into a select where I get group, taxonomy, and rel_abund. I also need SRN, because I'm going to be predicting whether the person has a screen-relevant neoplasia. This of course gets me my data frame with group, taxonomy, rel_abund, and SRN. I'm now going to make it wider with pivot_wider. The columns that I'm going to pivot wider with are taxonomy and rel_abund, so I'll do names_from = taxonomy, and then values_from will be rel_abund. This now gives us a fairly wide data frame, where we've got 490 rows for 490 subjects and 282 columns. I also now want to get rid of the group column, because I don't want that to be a feature in my model, so I will do select(-group). And again, we now see that we've lost that. One thing that is kind of particular about the machine learning tools is that I can't have TRUE and FALSE as levels for the predicting variable, the output variable, right? So I need to rename the values in SRN from TRUE and FALSE to something else. I'll go ahead then and do a mutate on SRN with an if_else.
Because SRN is already logical, if SRN is TRUE, then I'm going to put in, in quotes, "srn"; otherwise, if it's FALSE, I'm going to say "healthy". This then gets me healthy and srn as the values in that SRN column. Now, SRN is already the first column, but for my own peace of mind I'm going to use the select function to determine the order of the columns. Something you may not have noticed is that the order of the columns in that select function determines the order in which they are output. So I'll do SRN and then everything(); the everything() function gets everything else, right? And so again, this doesn't change the output. But to give you a sense of what might have changed: if I had done, on this second select here on line five, SRN, group, taxonomy, rel_abund, then what you'd see is that the first column is SRN, right? Whereas if I moved SRN back to the end, then my last column would be SRN. I'm going to go ahead and call this srn_genus_data. Now I'm ready to run the run_ml() function from mikropml. So I'll do run_ml(), and give it srn_genus_data, and I will say method = "glmnet". glmnet will do a logistic regression, and we'll see other algorithms that we can use in future episodes. Logistic regression is fairly quick to run and generally does a fairly decent job, so we're going to do a bit of work with logistic regression to help learn some of the nuances of how mikropml works. Then I'll do outcome_colname = and then, in quotes, "srn". That tells run_ml() what it's trying to predict: the values in the SRN column. By default, run_ml() will use the first column of the data frame, but as you've seen in past episodes, I like to be crystal clear in my code about what's going on, so I think specifying outcome_colname = "srn" is the way to go. And then I'll also set a seed.
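Pulling those wrangling steps together, here's a sketch of the pipeline on a tiny made-up stand-in for the composite data frame (the toy values, and the lowercase srn column naming, are my assumptions for illustration; in the real script, source("code/genus_process.R") builds composite):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the composite data frame from genus_process.R:
# one row per sample-by-genus combination, plus the logical SRN outcome.
composite <- tibble(
  group     = rep(c("A", "B"), each = 2),            # sample IDs
  taxonomy  = rep(c("Blautia", "Prevotella"), 2),    # genus names
  rel_abund = c(0.1, 0.4, 0.3, 0.2),                 # relative abundances
  srn       = rep(c(TRUE, FALSE), each = 2)          # has an SRN?
)

srn_genus_data <- composite %>%
  select(group, taxonomy, rel_abund, srn) %>%
  pivot_wider(names_from = taxonomy,                 # one column per genus
              values_from = rel_abund) %>%
  select(-group) %>%                                 # drop the ID column
  mutate(srn = if_else(srn, "srn", "healthy")) %>%   # no TRUE/FALSE labels
  select(srn, everything())                          # outcome column first

# The model call would then look like this (commented so the sketch
# runs without mikropml installed):
# library(mikropml)
# srn_genus_results <- run_ml(srn_genus_data,
#                             method = "glmnet",
#                             outcome_colname = "srn",
#                             seed = 19760620)
```

With the real data, the piped result is the 490-row, 282-column data frame described above; here it's just 2 rows and 3 columns.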
Again, run_ml() uses a random number generator when it makes those splits in the cross-validation, so by using the same seed, I'll get the same output every time. I'll use 19760620; I had a birthday a couple of weeks ago, and some of you remembered. Thank you for wishing me a happy birthday. We'll go ahead and run this then. Again, what this is doing is that first iteration of the 80/20 split. With that 80%, it then does another 80/20 split, where it builds a model and develops kind of an intuition about the parameters, the hyperparameters, for the model, and then validates on the 20% that was held out. That five-fold cross-validation repeats 100 times. And then it takes the best model and applies it to the 20% that was held out at the initial split. This is going to take a few moments to run, but I'll show you what we get when we come back. Very good. It took about two minutes to run. I'll go ahead and maximize this output window. One of the first things to notice is that I got a couple of warning messages. Again, a warning isn't the end of the world, but it tells you some information about how the algorithm, or how the function, ran. It says that this warning usually means that the model didn't converge in some of the cross-validation folds, because it's predicting something close to a constant. This generally suggests that the hyperparameters we used weren't chosen very well. Again, we'll come back and deal with that in another episode. But I want to scroll way back up here to the top of the output. So, yeah, here's what I entered. It gives us some information about the trained model: that it used glmnet, the number of samples, the number of predictors, the two classes it was trying to predict, and that it used five-fold cross-validation. Remember, the cross-validation happens within that second 80/20 split.
What it's doing with that 80% is building models using different hyperparameters. In this case that's lambda, the regularization parameter, which ranged from 10^-4 up to 10, and we get a variety of AUCs. What it's trying to do is maximize the AUC as a function of lambda. And so what we try to do is pick hyperparameter values that bracket the maximum of the AUC. What you can see here is that the AUC keeps going up, so what we would like to do is extend our lambdas to larger values, so that they allow us to see a peak in the AUC. Again, we'll talk about this in another episode, but this is showing us the output of those cross-validations. It also tells us that the tuning parameter alpha was held constant at a value of zero. An alpha of zero is for L2 regularization, or ridge regression; we could also have used alpha = 1 for lasso regression, and if you use a value in between zero and one, that's elastic net regression. Anyway, we'll come back and talk about that another time. The output then shows us the test data that were input into run_ml(), and then we get some performance metrics, right? We get the cross-validated AUC of 0.636, and the test AUC on that 20% held out was 0.714. So that's good, but again, that's for only one 80/20 split. Ideally, we would run this 100 times, so we can get a distribution of both the CV AUC and the test AUC. There's also the ability to do feature importance, and as it says, we skipped feature importance. Calculating feature importance, which again helps with the interpretability of the model, makes the model run a lot slower, so I'd like to figure out all the other things upstream first.
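For reference, handing run_ml() your own hyperparameter grid looks roughly like this. The specific lambda values here are made up for illustration; the point is extending the range upward so the AUC-versus-lambda curve shows a peak rather than still climbing at the largest value tried:

```r
# Hypothetical hyperparameter grid for glmnet:
#   alpha = 0 is ridge (L2), alpha = 1 is lasso, in between is elastic net.
#   lambda extends beyond the default range so we can bracket the AUC maximum.
new_hp <- list(alpha  = 0,
               lambda = c(1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100))

# Passed via the hyperparameters argument (commented so the sketch runs
# without mikropml and the real data):
# srn_genus_results <- run_ml(srn_genus_data,
#                             method = "glmnet",
#                             outcome_colname = "srn",
#                             hyperparameters = new_hp,
#                             seed = 19760620)
```

We'll tune this grid properly in a later episode; this is just the shape of the knob.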
Things like figuring out the right hyperparameters, and whether we can preprocess the data a bit to make the model perform better and faster, before we do something really time consuming like feature importance. Something I should have done was assign the output of run_ml() to a variable name, so I'll call it srn_genus_results. That way we have srn_genus_results as a variable that we can work with, and I'll show you how we might do that once this reruns. Again, I can look at srn_genus_results and get all that output, or I can do srn_genus_results$ and then I have a variety of different things I can access. If I do $performance, I get out that table of performance metrics. If I do $trained_model, I get all that model information out. I should also add that if we wrap the names() function around srn_genus_results$trained_model, we'll see a variety of different information generated by a package called caret; caret is actually what mikropml is running under the hood. So there's all sorts of other information that we could extract from the trained_model part of srn_genus_results. I also want to show you a few ways that we can customize how run_ml() works. We have kfold, which by default is 5; that's the five-fold cross-validation. We have cv_times, where the default is 100. And we have training_frac, which defaults to 0.8; that's the 80%/20% split we had at that initial separation of the data, right? So those are the defaults. We could change it to, say, two-fold cross-validation, and maybe we only want to do 10 cross-validations, and perhaps we want 50% of the data to go into the training fraction, right? Running this should go a lot faster because we're doing fewer iterations of everything. And so that was like, bam, while I was still talking, it finished. And if we look at srn_genus_results, again, we see all this great stuff.
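Putting those pieces together, a sketch of pulling results out of the object and dialing the knobs down for a quick run might look like this. It assumes srn_genus_data exists from the wrangling above and that a mikropml run has been assigned to srn_genus_results, so treat it as illustration rather than a standalone script:

```r
library(mikropml)

# Pieces of the run_ml() output (assumes srn_genus_results exists):
srn_genus_results$performance           # cross-validated and test AUCs
srn_genus_results$trained_model         # the underlying caret model object
names(srn_genus_results$trained_model)  # everything caret stored about the fit

# A quick-and-dirty run with the knobs turned down (faster, less robust):
fast_results <- run_ml(srn_genus_data,
                       method = "glmnet",
                       outcome_colname = "srn",
                       kfold = 2,            # two-fold instead of five-fold CV
                       cv_times = 10,        # repeat the CV 10 times, not 100
                       training_frac = 0.5,  # 50/50 instead of 80/20 split
                       seed = 19760620)
```

With fewer folds, fewer CV repeats, and less training data, this finishes in seconds rather than minutes, at the cost of noisier performance estimates.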
I think our performance, our areas under the curve, is fairly comparable to what we had before, but perhaps not as robust as using the defaults of 5, 100, and 0.8. So I'll go ahead and put those back to what we had. Okay, I'll save this, stage code/genus_ml.R, and commit it with the message "roughed in logistic regression code with mikropml", then push both of those commits up to GitHub, and we'll be good to go. As a reminder of what we've done so far with mikropml: again, mikropml allows us to run this full framework with a multitude of different machine learning methods. We can do things like logistic regression, we can do decision trees, we can do SVMs, we can do random forest, and we can do XGBoost. But whatever method you use doesn't matter so much as this framework. What we just saw, actually, was that we can change the amount that's held out for testing. I did 50/50, but the default is 80%/20%. And then with that 80%, what's happening within run_ml() is that it splits the data further into 80% for training and 20% for validating; that's the five-fold cross-validation. So again, if you made it two-fold cross-validation, then it would be a 50/50 train/validate data set. There we are training the model with different hyperparameters, and so we saw how we could change that regularization parameter, the lambda; or I guess we didn't show you how you could, just that it does. All of the models come pre-packaged with a range of hyperparameter values that mikropml will try to evaluate for you. Later on, we'll see how we can expand that range, because, as we saw with ours, the lambdas didn't allow us to see the maximum area under the curve value, and so we'll have to modify that range a bit. So again, that is what's happening in run_ml(); it's kind of what's going on within this outer gray box.
And as we'll see later, we'll want to iterate that, say, 100 times to get 100 test AUC values, so we can look at the distribution there, as well as the distribution for that mean cross-validated AUC. We can then compare those values, and ultimately compare models, and ultimately then, like, change the world and save lives, right? Isn't that what we all want to do? Anyway, I hope you find this really interesting, and I strongly, strongly encourage you to be sure you thumbs up this video as well as subscribe to the channel and click the bell. That is the best way to make sure all your bases are covered, so you know when the next video drops. Again, we are going to be working with mikropml for maybe the next eight to ten episodes. I'm going to show you all the ins and outs of how we would use mikropml in my lab to build a better, more robust model for classifying people as either being healthy or having an SRN. My hope is that after you've seen how I do it, you can then take what I'm doing and apply it to your own data, so that you can use it in your papers and you can cite my paper, and then we're just all happy, right? Anyway, again, it doesn't have to be microbiome data; it can be a multitude of different types of data that you would like to use with machine learning. And really, the great thing about mikropml is that it provides us this framework for fitting hyperparameters and for training, testing, and validating our models. What you put into it isn't so important as that you do it, that you use this framework. Anyway, keep practicing with this and I'll see you next time for another episode of Code Club.