As someone who's made their career, in part at least, on developing and supporting a software package, one of the things I've noticed about users (and users being you and, yeah, me too) is that we love tools where you put data in and you get back results in a nice shiny package. But as your analysis gets more sophisticated, that's not always possible. We can't just shove data in and get results back. Well, sure, you can, but the results you're going to get back might be pretty sketchy.

Hey, folks, I'm Pat Schloss and this is Code Club. We're in the midst of a series of episodes where I'm talking about a new package that my lab has developed called mikropml. mikropml is an R package, as I've mentioned, that provides a framework for engaging in reproducible machine learning-based analysis of anything you want. Because I study microbiomes, I'm using it with microbiome data, but we've also used it with publication data, and other people have used it with genomics data. You can use it with anything. What's really key about mikropml is that it gives you a framework that allows you to do things like cross-validation, training and testing splits, and hyperparameter tuning, as well as pre-processing of your data.

Now, all these things, pre-processing your data, tuning, figuring out how to split your data and whatnot, don't lend themselves very well to defaults, right? And that's where things get messy in engaging with the package. It's not as simple as shoving the data in and getting back a result. So the problem we're going to tackle in today's episode is how we can customize mikropml to pre-process our data to get the best possible analysis.

So why would we want to pre-process our data? Well, for a couple of reasons, but the big one I can think of is that training and validating a machine learning model is actually pretty slow, right? Even when these processes are relatively fast, they can still take hours or possibly a day or more to run. And part of the reason they take so long is having a lot of features: the model will run much faster if I have five features than if I have a thousand. So one type of pre-processing is to remove features. When I say feature, think of something like a genus, the person's sex, or the person's age. If a feature has no variation in it, we'd want to remove it. If everybody in our study was 25 years old, age is not going to be a relevant feature in our model. So instead of forcing the model to figure that out on its own, we can pre-process the data and remove that age column, because there's no variation in the data.

How else might we want to pre-process our data? Well, say we have missing data. Imagine we're collecting the weights of all the subjects in the study, but there are two or three subjects we don't have weights for. Do we remove that person's entire record from the study? Or do we impute a value instead of throwing out their whole file, so to speak? Alternatively, we might have categorical variables with three or four different levels. In this study, we have four different sites, research centers where patients were recruited, and perhaps we want to use that research site as a variable.
And so we need to pre-process MD Anderson, Toronto, U Michigan, and I forget the fourth place, but we'd want to pre-process those site names into variables that mikropml and the code under the hood, so to speak, knows what to do with. So we're going to talk about how we can do all these pre-processing steps in mikropml with the data set that we have, and to demonstrate some of the other bells and whistles, we'll make up a little bit of data to show what's possible.

Now, if you want to do a deeper dive on what I'm talking about, I strongly encourage you to go to the mikropml documentation page. I've got the URL down below; it's at schlosslab.org/mikropml. There's a vignette there with a whole page on all the different ways you can pre-process your data. I'm going to try to stay close to the types of pre-processing steps I would use for the data we have. Again, depending on the study, you might need other pre-processing steps. So let's head over to our studio now and get going with today's code.

So I've got my genus_ml.R script here. I'm going to run these first nine lines of code, make sure everything works, and let's look at srn_genus_data. Here we get a lot of output. We've got the SRN status, whether the person is healthy or has an SRN. SRN is shorthand for screen-relevant neoplasia; these are people with advanced adenomas and carcinomas. Again, the mindset here is that we're developing a non-invasive diagnostic: people who come back positive for SRN would then go forward with a colonoscopy. And then we have about 280 other columns, one for each genus represented in the data set, and there are 490 subjects in the study.

To demonstrate some of the features of mikropml, I'm going to add some extra columns to the data set. What we have here only contains the SRN status and the relative abundances of these 280 or so different genera. If we look at composite, we see that data frame has 21 different columns: group is the subject's ID, we have columns for the taxonomy and relative abundance, and we also have a bunch of other metadata. As we saw, srn_genus_data really only contains information from those first three columns, group, taxonomy, and relative abundance, as well as the SRN status. Actually, it doesn't have the group column; we left that off because the SRN column is the outcome we're predicting.

To give us a data frame we can use to play around with the different options available in the pre-processing function, I'm going to take the composite data frame and select some columns of different types to play with. I'll do select, with group, and then fit_result, which is a continuous variable. Let's do site, which is a categorical variable with the four different research centers where subjects were recruited. Let's also put in gender, and srn, which will be a categorical variable as well, and weight. Now, composite is a long data frame with many rows per subject because each genus is on a separate row. So I'll do distinct, and this should give us 490 rows, which it does, and our six different columns. Something else I recall about this data set is that some of the weights in here are zeros. Let me do a summary on the output of this.
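To recap that pipeline in code, here's a sketch of how I'm building the practice data frame. The column names (fit_result, site, gender, srn, weight) are my best reading of the walkthrough, so treat this as an approximation of the actual script rather than a verbatim copy:

```r
# Sketch of building the practice data frame from the long composite data
# frame; column names are inferred from the walkthrough
library(tidyverse)

composite %>%
  select(group, fit_result, site, gender, srn, weight) %>% # keep a mix of column types
  distinct() %>%                                           # collapse to one row per subject (490 rows)
  summary()                                                # inspect the ranges; weights of 0 show up here
```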
Sure enough, I've got a minimum weight of zero in here. So, again just to give us something more to play with, I'll do mutate with weight equals na_if on weight: if it's zero, I want that to become an NA value. Piping that through summary, we now see that we've got two NAs and the minimum weight is no longer zero. Then I'll replace that summary with select(-group) to get rid of the group column. I'm not going to use group in a model, and it doesn't make sense to include something in pre-processing that I wouldn't ultimately include in a model anyway. So now if we look at this, we've got fit_result, site, gender, srn, and weight, with 490 rows and five columns. We're in good shape. I'll call this data frame practice, and we're good to go to play with pre-processing.

So I'll do preprocess_data on practice, and my outcome column name is srn. I don't want it to mess with that SRN column, because that's what we're trying to classify the different rows, the different samples, by. If I run this, I get some output that just flew by the screen. It tells me it's using SRN as the outcome column name, and that two missing continuous values were imputed using the median of the feature. It doesn't tell me which column that was (that message is probably coming from caret), but I know it's the weight column, because when we ran that summary function we saw two NA values there.

What we see in this dat_transformed data frame is that, again, we have our SRN column; we also now have fit_result scaled, and weight scaled as well. Then we have gender M: if the subject had an M in the gender column, they get a one, otherwise a zero. So preprocess_data is converting those character strings into numerical values, and actually into binary outcomes. We can see columns here for site Dana Farber and site MD Anderson: because there are four different sites, it makes a separate zero-or-one column for each site. If there's a one in the site Dana Farber column, that sample is from Dana Farber, and it does the same for MD Anderson, Toronto, and U Michigan. It also removed the site U of Michigan column, and I'm pretty sure that's because it doesn't contain any useful or novel information.

So what we can do is go back to our code. I'll call the output preprocessed, take preprocessed$dat_transformed, and send it to summary. Good. This is useful output that tells us a bit more about what's going on. For the continuous variables like fit_result and weight, it does scaling and centering. Centering means it makes the mean of each of those columns zero, and scaling means that one is one standard deviation above the mean and minus one is one standard deviation below the mean. So now we have a much more compact range of values than we did with the raw data, and that's ultimately why we do this scaling and centering. Otherwise, fit_result might go from zero to a couple thousand and weight might go from, say, 40 to 140.
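Here's a sketch of that cleanup and the first preprocess_data() call. The outcome column name "srn" is my guess at how it's spelled in the actual script:

```r
# Sketch of cleaning the practice data frame and pre-processing it;
# preprocess_data() comes from mikropml, the rest from the tidyverse
library(mikropml)

practice <- composite %>%
  select(group, fit_result, site, gender, srn, weight) %>%
  distinct() %>%
  mutate(weight = na_if(weight, 0)) %>% # recode weights of 0 as missing
  select(-group)                        # drop the subject ID before modeling

preprocessed <- preprocess_data(practice, outcome_colname = "srn")

# continuous columns are centered/scaled, NAs are imputed with the median,
# and categorical columns are expanded into 0/1 dummy columns
summary(preprocessed$dat_transformed)
```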
You might even have relative abundance data that goes from, say, 0.001 to 0.053. By putting everything on the same scale, the same basis, we effectively treat all of the features more equally. If you want to change how that scaling is done when pre-processing continuous data, you can definitely do that with mikropml: you give it parameters that get passed along to caret, which handles the scaling under the hood. Again, I'd refer you back to the mikropml documentation. Scaling and centering is probably the best option for 95 to 99% of everything you're going to want to do, so if you want to deviate from that, go check out the documentation to see how to customize it further.

The other thing we see is for that weight column, where we had two NA values. Although the output doesn't tell us which column had the missing values, we knew from the summary output earlier that weight had two NAs. Instead of leaving the NA values in, or removing those entire rows from the study, it replaces those two NAs with the median value. That's its way of imputing missing data. If you want to use different approaches to impute missing data, I'd again refer you to the documentation, but for the most part, using the median is where it's at.

Two other aspects of pre-processing data are going to be important to us. Let's look at the first one, which is removing data with near-zero variance. Near-zero variance means there's no variation in the data, or nearly none. To demonstrate this, let me add a column to my practice data frame: I'll do mutate with near_zero_variance equals 23. Now if we look at practice, we see that it's got 23 in every row of that column. If we rerun preprocess_data and look at the output, we see there are two columns it would remove: site U of Michigan, as well as my near_zero_variance column, because there's no variance in that data. If I take the sd of practice$near_zero_variance, it comes back as zero. So it removes that column, because it doesn't contain any information that would improve the classification. If everything has the same value, it's not going to add anything; it would never end up in a model because there's no signal there, it's just flat. So that's a good thing to be removing, and again, that's baked into mikropml's defaults.

The final thing I want to touch on for pre-processing our data is removing columns that are perfectly correlated with each other. If two columns are perfectly correlated, the second column doesn't add any extra information and can actually obscure the signal. If the algorithm is randomly picking columns, in one case it picks column A, in other cases it picks column B, but A and B are perfectly correlated with each other, then we're dividing the signal of the importance of A or B between them. So we really only want to work with one of those two perfectly correlated columns. To demonstrate this, let's do mutate with perfect_cor equals fit_result (see the sketch below). And so now we have practice.
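Here's a sketch of both demos, the constant column and the duplicated column. The names near_zero_variance and perfect_cor are just the labels I'm using for this walkthrough:

```r
# Demo 1: a constant column gets flagged for near-zero variance
practice_nzv <- practice %>%
  mutate(near_zero_variance = 23)   # same value in every row, so sd() is 0

preprocess_data(practice_nzv, outcome_colname = "srn")$removed_feats
# near_zero_variance shows up among the removed features

# Demo 2: an exact copy of fit_result gets grouped as perfectly correlated
practice_cor <- practice %>%
  mutate(perfect_cor = fit_result)  # cor(fit_result, perfect_cor) is 1
```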
If I do cor on practice$fit_result and practice$perfect_cor, I get a correlation of one. And now if I run this through preprocess_data and look at the output, we see something different: we get an element called grouped features, grp_feats. We have weight, gender, and the different site columns, as well as a grouped feature, group 1, and what it's doing there is grouping perfect_cor and fit_result together. That way, whenever the algorithms are randomly picking features, they pick fit_result and perfect_cor together, which effectively merges those perfectly correlated features so we don't dilute the signal that either of those two columns carries. The other thing I want to show you in the transformed data is that we no longer have a column for fit_result or for perfect_cor; we now have a single column called group1, which holds that grouped data.

Again, there's a variety of different tools within preprocess_data that you can use, and all of those tools come to us from the caret package. So I'd certainly encourage you to check out the mikropml documentation if you want to learn more about using those different caret functionalities. Also, realize that you can take your own data frame, pre-process it however you want, without caret and without mikropml, and then use that as input to the run_ml function.

So let's come back to our script. I'm going to get rid of this practice data frame, and I'll rename preprocessed to something like srn_genus_preprocess and give it srn_genus_data, again with srn as the outcome column. Let me double-check that srn_genus_data is what I want: the SRN status and then the relative abundances of all my genera. That looks good. I can then look at srn_genus_preprocess. It uses SRN as the outcome column. Good.

So we see we've got 180 features here; let's scroll up to the top to see what these actually are. These are the removed features, the things that have near-zero variance. Nothing was grouped together as being perfectly correlated, and that's fine. In my experience, as you drill down to finer and finer taxonomic levels, you tend to see more things being grouped together because they're perfectly correlated with each other. But again, these removed features are perhaps only present in one sample, or they're present but have really low variation across the samples. That's cool, because it simplifies the data frame quite a bit: we now have 490 rows and 101 columns, whereas before I believe we had 490 rows and 281 columns. We made our data frame a lot more compact by removing columns that just don't contain any useful information.

So returning to my code, let me minimize this for now. I'm going to tack $dat_transformed onto the end of my call here, and I'll get rid of that summary statement. Basically, I'm taking the dat_transformed data frame and assigning it to srn_genus_preprocess.
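Here's a sketch of that step applied to the real genus data; the variable name mirrors how I'm saying it out loud:

```r
# Pre-process the full genus data set and keep only the transformed data
# frame to feed into run_ml()
srn_genus_preprocess <- preprocess_data(srn_genus_data,
                                        outcome_colname = "srn")$dat_transformed

dim(srn_genus_preprocess) # 490 x 101, down from 490 x 281
```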
I can take this srn_genus_preprocess and use it in place of srn_genus_data. I'm going to run this with the defaults. Running with the defaults on srn_genus_data in the previous episode, or whenever that was, took about a minute and a half to two minutes, so let's see how fast this runs; it should be a lot faster. So that took about a minute and a half to run. I'm going to rerun it, creating a different variable that I'll call srn_genus_results_nopp, no pp for no pre-processing, and I'll use srn_genus_data. Let's see how fast it runs and whether there's any difference in the output. That took about two minutes to run.

Let's look at the performance element of the output to see if there are any obvious differences. We'll do srn_genus_results$performance, and we get a testing AUC of 0.68 and a cross-validated AUC of 0.614. We can do the same with the no-pp version, and we see a very slight difference in the performance. Again, this was just a single 80/20 split from the pipeline I showed in the last episode. If we ran this 100 times, we'd want to see whether the pre-processing meaningfully changes the performance, makes it better or worse. We saw here that we can center and scale; perhaps we'd want to just scale, or perhaps we'd want to range things to give a linear range between zero and one, or between negative one and one. So there's a lot more we can do. But before we can dig into that and see whether differences in how we pre-process actually make a difference, we need to learn more about tuning our hyperparameters, about running more of those splits, and about how we can test whether our models are better.

Hopefully, this is a good introduction to showing how we can change the pre-processing in our pipeline: in this case, taking our continuous variables, centering them to all have a mean of zero, scaling them so values run from minus one to one standard deviation, and converting any categorical data to zeros and ones for true or false, or whatever the values of that column are. We'll learn more as we go along. There's a lot to do here. I know it's perhaps frustrating that it's not as easy as dumping data in and getting data out, but it is kind of garbage in, garbage out, right? If we're using the wrong parameters, and we're not being methodical about how we set them, we're going to get garbage out. And that's why I'm taking my time going through these episodes, showing you how we can tune the different parameters to get the best, most robust model we can for the data.

Anyway, be sure to hit the thumbs up button, and please subscribe and click that bell so you receive all the beautiful notifications and know when the next episode comes. I'm trying to put these out every Monday, Wednesday, and Friday; we'll see how well I do with that. I'd love to have you join me next time, and even better would be if you brought a friend. So please tell everyone in your research group, or anyone you collaborate with, about what we're doing here stepping through mikropml. I think you'll really find a lot of benefit in seeing how I work with my data using mikropml so that you can take it and apply it to your own data.
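To recap today's comparison in code, here's a sketch. The method and seed are illustrative stand-ins, not necessarily what my script uses; run_ml()'s defaults from the previous episode would work just as well:

```r
# Sketch of the head-to-head comparison: pre-processed vs. raw input;
# method and seed here are illustrative placeholders
srn_genus_results <- run_ml(srn_genus_preprocess,
                            method = "glmnet",
                            outcome_colname = "srn",
                            seed = 2019)

srn_genus_results_nopp <- run_ml(srn_genus_data,
                                 method = "glmnet",
                                 outcome_colname = "srn",
                                 seed = 2019)

srn_genus_results$performance      # e.g. testing AUC ~0.68, cv AUC ~0.614
srn_genus_results_nopp$performance # very similar for a single 80/20 split
```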
All right, so keep practicing with this and we'll see you next time for another episode of Code Club.