When you have big data sets and you have ambitions for doing big analyses, there's really no way around it: things are going to take big time. That's right, come here anytime you want your favorite dad jokes and bad puns. Hey folks, I'm Pat Schloss and this is Code Club. In recent episodes, we've been stepping through how we can use a new package developed by my lab called mikropml to apply machine learning algorithms to microbiome data. We've been working with L2-regularized logistic regression, which, of the different approaches that are possible, is about as fast as they come. As we move along, we might want to add more sophistication to our modeling. We might, say, want to do random forest. Random forest can take considerably longer than L2 logistic regression, and the hope is that the performance will be quite a bit better, because there are perhaps advantages there. There are also other things we'd like to do even with our logistic regression, like figuring out what features, what taxa, in our model are driving the classification. All these things take a lot of time. That's why I'm spending some time going through different ways we can speed up the analysis and implementation of our model. In the last episode, I showed you how we can use functions from the furrr package, which is a play on purrr, the package that gets us all those nice map functions. The furrr package gives us future_map and related functions so that we can parallelize all those map operations. Now I want to take a step back. Instead of doing, say, 100 splits in one script, I want to have one script do one split. What that gets me is that I can now use my favorite tool in the whole wide world, which is make. GNU make works on the command line with Mac and Linux. You can also get it for Windows; it's just a little bit heavier of a lift.
What make allows you to do is keep track of dependencies. I like make because I think of building out a pipeline and a data analysis as a series of dependencies, and make works really nicely on my laptop, but it works really well on a high-performance computer, too, where I can send off 100 different jobs to all run at the same time and then come back and pull that information together. So I'm going to get there eventually, to this big vision of a Makefile that I fire off on my high-performance computer, and I'll show you how my lab uses Slurm on the University of Michigan's high-performance computer. But before we do that, I want to show you how I would use make to break this down into simple steps so that we can then pull it all together. This will also be a framework that we can use up on a high-performance computer. Anyway, let's head over to RStudio. I've already got a Makefile created that populates my raw data directory with the files we've been using through these tutorials. I'm going to create another R script to use with my Makefile. This isn't totally how I would do it in real life, but I want to keep these other standalone files that work more natively in RStudio in the repository, so we'll do a few things that aren't quite DRY. Hopefully you'll forgive me. The first thing I want is code that takes the data from the three Baxter files in the raw data directory — the shared file, the taxonomy file, and the metadata — and pulls that all together, and that then also does the preprocessing step for us. This R script is going to be the stand-in for my genus_ml.R script. If I make this a little bit bigger, I'll go ahead and, for now, copy this over to my untitled script.
Let me briefly explain what's going on in here. We source the code/genus_process.R script, we load mikropml, and then we load furrr — I'm going to go ahead and get rid of all the furrr stuff, because this isn't going to be parallelized within the script. It then reads in the data that was output from code/genus_process.R. So code/genus_process.R takes the shared file, the taxonomy file, and the metadata file and blends that all together. Then this code chunk from lines four through nine gets it into a format that works well with preprocess_data, which removes any columns — for this data, those columns are our relative abundance data — for those taxa that have near-zero variance. It will also center and scale the data so that each column of the scaled data has a mean of zero and a standard deviation of one. We then have our test hyperparameters for the L2-regularized logistic regression. I'm going to remove that 10, because the values that were optimal were around two and three; we'll leave the rest for the test hyperparameters. We then run get_srn_genus_results with the seed, and I'm going to remove all this other stuff, because it's not so important to me right now — it's the after-the-fact code that takes all the splits and joins them together so that we can do inference on those splits to figure out which hyperparameter, which lambda value, was optimal. I'll also go ahead and remove this function. I want this to be an executable script, so I'll add my shebang line: #!/usr/bin/env Rscript. That makes it so we can run it from the command line, as long as the script is executable. So we'll do chmod +x on code/run_split.R, and if we look at the contents of code, we now see that run_split.R has those x's, which means it's executable.
But we still need to get the parameters from the command line. We can do that with the commandArgs function, with trailingOnly = TRUE, and we'll call the result seed. I think commandArgs is going to come in as a character, however, so to make sure this is numeric I'll wrap it in as.numeric and save the script. This all looks good. The one thing we don't have, however, is any type of output. But let's not get too far ahead of ourselves. I'm going to run this from the command line with a seed of one: ./code/run_split.R 1. Hopefully this will run and work from the command line; we'll make this bigger so we can see all the gory details of what's going on. Wonderful — this ran through and gave me the output I was expecting. As I mentioned, I didn't save any of this to disk, so that's something I certainly need to add to my script. I'm going to save the output of run_ml to an object I'll call model. Again, model will be a list, and it's got tons of stuff in it. I'm going to save the model object from every execution of this file as an RDS file — RDS is a special R format — and I can do that with saveRDS(model, ...). The file is going to go into processed_data, and I need to give it a clever name; let's do l2_genus and then put in the seed, which I'll insert with glue. We've seen glue in previous projects, so we'll say library(glue) and make sure that's good. Then let's test this out: say seed equals 23, and if we run glue on this — yeah, it inserts that 23. But I also need to add the extension, which is .Rds. Let's double check that all works. Good. So now it should save the model object to a file in processed_data, one of l2_genus_1.Rds through l2_genus_100.Rds depending on the seed. We'll go ahead and save that, come back to our terminal, and rerun ./code/run_split.R with that seed of one.
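Pulling the pieces just described together, here is a minimal sketch of what code/run_split.R might look like. The helper script genus_process.R, the data frame it produces, the wrapper function get_srn_genus_results(), and the exact hyperparameter values all come from earlier episodes, so treat those names and values as placeholders for your own project:

```r
#!/usr/bin/env Rscript
# Sketch of code/run_split.R. The helper genus_process.R, the `composite`
# data frame with its `srn` outcome column, and get_srn_genus_results()
# follow this walkthrough and are placeholders, not a definitive API.

library(mikropml)
library(glue)

source("code/genus_process.R") # blends the shared, taxonomy, and metadata files

# the seed arrives from the command line as a character string
seed <- as.numeric(commandArgs(trailingOnly = TRUE))

# drop near-zero-variance taxa, then center and scale so each column has
# mean 0 and standard deviation 1
srn_genus_data <- preprocess_data(composite, outcome_colname = "srn")$dat_transformed

# optimal lambdas were around 2-3, so the larger values were dropped;
# these would be passed through to run_ml() inside the wrapper
test_hp <- list(alpha = 0, lambda = c(0.1, 1, 2, 3, 4, 5))

model <- get_srn_genus_results(seed) # wraps run_ml() for one 80/20 split

# save the whole run_ml() output list, one RDS file per seed
saveRDS(model, glue("processed_data/l2_genus_{seed}.Rds"))
```

With the shebang line in place and the file marked executable (chmod +x code/run_split.R), this can be run as ./code/run_split.R 1 from the command line.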
We'll look to see if the output is put in processed_data. That ran, so we can do ls on processed_data and see that there's actually something there. Let's do ls -lth so we can see how big it is: 8.4 megabytes, so 100 of these could total 840 megabytes. That's not that big, right? Ideally, we might only save the chunks of the RDS file that we really want, but I'm not sure what I want at this point, so I'm going to leave it with the whole object and call it good. Okay. The next step is to link this into our Makefile. The way these make rules work is that there's a target — the file you want to create — then a colon, and then the dependencies on the right side of the colon. Below that are the instructions on how to build that target. So we'll do processed_data/l2_genus_1.Rds, then the colon, and then we've got the files we need to be sure are in here. We need code/genus_process.R, and we also need run_split.R, which is also in code. Those are the two script dependencies, as are the three Baxter files, so we also put in raw_data/baxter.metadata.tsv. We can put these on separate lines by ending each line with a backslash. And the kind of annoying thing about make is that tabs are used instead of spaces for the recipe line. So after the metadata file we add raw_data/baxter.cons.taxonomy, another backslash, and raw_data/baxter.subsample.shared. Then the recipe is ./code/run_split.R 1, for that one seed. We'll go ahead and run make on that — it says it's up to date. So if we rm the output file and rerun make, let's double check that it works. I think it's going to go through just fine. Yeah, that works great. Now we want to generalize this so we can give it any seed.
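Put together, the single-seed rule might look like this in the Makefile. The file and directory names follow this walkthrough, and the recipe line must start with a tab, not spaces:

```make
processed_data/l2_genus_1.Rds : code/run_split.R\
        code/genus_process.R\
        raw_data/baxter.metadata.tsv\
        raw_data/baxter.cons.taxonomy\
        raw_data/baxter.subsample.shared
	./code/run_split.R 1
```

If the target file is missing or older than any of those five prerequisites, make reruns the recipe; otherwise it reports that the target is up to date.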
So what we'll try is putting a percent sign in the target, which is a pattern wildcard, and then we can use the automatic variable $* in the recipe. Let's change the seed to two and see if this works. We see the recipe that make actually used was ./code/run_split.R 2, so that worked great. What it did is take the number that matched the percent sign and use it, via $*, as the argument to code/run_split.R. We're in good shape with that. If I do ls -lth on processed_data, I now have both of those RDS files. The next step, of course, is that we'd like to do all 100 seeds. But before I write the rule for that, I'd like to write the next rule and the next R script, the one that combines the output from the 100 RDS files. Currently we have two RDS files, so let's start there, because that's a lot easier than 100. I'm going to create a new R script, combine_models.R, in code. I'll grab my shebang line, and we'll also load mikropml. Good. What we want this script to do is read in all of the RDS files that we give it. Again, we'll use commandArgs with trailingOnly = TRUE, and these will be our RDS files — a vector of RDS file names that we're going to give it from the command line. Then we'll use map, because map can read these in and create a list of model objects: map over rds_files with readRDS. I think that should work. To test this, let's create a dummy vector that I'll call rds_files, containing processed_data/l2_genus_1.Rds and then the same thing with a two. So now we have rds_files. Good. And if we read them in with map — it's complaining that it can't find the function map, and that's because I need to load library(purrr).
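The generalized pattern rule described above might be sketched like this: the % in the target matches the seed, and $* in the recipe expands to whatever % matched (again, file names follow this walkthrough):

```make
processed_data/l2_genus_%.Rds : code/run_split.R\
        code/genus_process.R\
        raw_data/baxter.metadata.tsv\
        raw_data/baxter.cons.taxonomy\
        raw_data/baxter.subsample.shared
	./code/run_split.R $*
```

Asking make for processed_data/l2_genus_2.Rds then runs ./code/run_split.R 2, and the same single rule covers any seed.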
I suppose I could load the tidyverse, but let's see if we can get away with just purrr. This then uses the map function to read all of that in, and we now see that we have our two RDS files read in as a list. If we go back to our genus ML script, towards the end you'll recall that I basically have this iterative_run_ml_results object. Let's grab that — again, I'll call it iterative_run_ml_results. If I run that, it pulls that in here... "could not find function combine_hp_performance." Maybe I didn't load mikropml. Okay, we'll run that, and now we see we've got our combined data. Then let's do a pluck to get the dat object out of this list. And we have that. Good. Then we can do write_tsv, and I'll put this in processed_data and call it l2_genus_pooled_hp.tsv. I think I'll go ahead and load the tidyverse rather than purrr, because we'll also need the readr package for write_tsv. Now if we write that out and look in processed_data, I see that I've got my l2_genus_pooled_hp.tsv, and that looks good. Let's also get the performance. So instead of trained_model, I'll pluck performance, and we see that we've got the two performance values. That's good. What happens if we do map_dfr with that? Then we get a data frame, like that, and we'll pipe that out to write_tsv with the same file name, except instead of pooled_hp we'll do pooled_performance. And if we look at our processed_data directory again, we've got that. Good. So we're making two files here. Let's go back to our Makefile. We've got the pooled HP TSV as a target as well as our pooled performance TSV, and I'll put that backslash in there. This depends on code/combine_models.R — which I'm going to need to make executable, so I'll do chmod +x on that — and we'll also need all of those RDS files, which I'll put in a variable I'll call L2_RDS.
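Assembled, combine_models.R might look something like this sketch. The output file names follow this walkthrough, and pulling "performance" out of each model assumes each RDS file holds the list that run_ml() returns:

```r
#!/usr/bin/env Rscript
# Sketch of code/combine_models.R: read every per-seed RDS file given on the
# command line, pool the hyperparameter and performance results, and write
# them out as TSVs. File names follow this walkthrough.

library(mikropml)
library(tidyverse)

rds_files <- commandArgs(trailingOnly = TRUE)

# one list element per seed; each element is the list returned by run_ml()
models <- map(rds_files, readRDS)

# pool the cross-validation results across seeds and keep the dat object
combine_hp_performance(models)$dat %>%
  write_tsv("processed_data/l2_genus_pooled_hp.tsv")

# stack each model's performance data frame into one table, one row per seed
map_dfr(models, pluck, "performance") %>%
  write_tsv("processed_data/l2_genus_pooled_performance.tsv")
```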
Because the dependencies are in the order that we might run them from the command line, we can use the automatic variable $^. Again, if this make stuff is new to you, don't worry — I have a previous episode, which I'll link to, describing some of the cool things you can do with make. This does get a little into the weeds, I know, but it's tremendously powerful. Alternatively, instead of that automatic variable, you could write the arguments out right there; it's a little cleaner to put the $^, although perhaps it's not as easy to read. I think I can also clean up my targets a little by putting another percent sign in there as a pattern wildcard, so the targets aren't so long. I did kind of the same thing up here, but here I'm not going to use $* — that's another automatic variable, kind of like $^. Anyway, it cleans things up a little. Now I need to create this L2_RDS variable, and for that I need to do some more fancy GNU make stuff, so bear with me. I can create a variable SEEDS that is $(shell seq 1 100), which should give values going from 1 to 100 by ones. I can then do patsubst: percent, comma, processed_data/l2_genus_%.Rds, comma, $(SEEDS). That should create all of the L2 RDS file names. I don't want to do 100 just yet, so let's do two. Let's give this a shot: we'll make that performance TSV. That ran through without any problems, and I notice that the code/combine_models.R call does give those individual TSV files. So that's great — I can then increase my seeds to 100. At this point, we can take two different approaches.
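The GNU make machinery just described might be sketched like this. Variable and file names follow this walkthrough; because the script is the first prerequisite, $^ expands to exactly the command line we want:

```make
SEEDS = $(shell seq 1 100)
L2_RDS = $(patsubst %,processed_data/l2_genus_%.Rds,$(SEEDS))

# $^ lists all prerequisites in order, so the recipe becomes:
#   code/combine_models.R processed_data/l2_genus_1.Rds ... l2_genus_100.Rds
processed_data/l2_genus_pooled_%.tsv : code/combine_models.R $(L2_RDS)
	$^
```

You would then build everything with something like `make -j 16 processed_data/l2_genus_pooled_performance.tsv`, letting make run up to 16 of the per-seed recipes in parallel.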
If I were working up on a high-performance computer using something like PBS or Slurm — Slurm is what we use here at the University of Michigan — I could create what's called an array job. I could create an array of 100 different tasks, call make with the name of my target, and use the Slurm syntax to indicate where that array value goes to build out all 100 of the targets. Alternatively, and this is what I'm going to do here, I'll rebuild this target but add an argument to what we used last time: -j 16. The -j flag tells make to use multiple processors, so I'm going to give it all 16 of mine. We'll see how long this takes — whether it goes faster than the roughly 25 minutes we saw last episode with future_map. It ran through in about 25 minutes, the same amount of time we saw with the future_map function. Again, there are other things running on the computer in the background, like recording this video, sapping some of the resources. If I redo the make call, it tells me everything's up to date, and we're in good shape. I know this is probably going long, but I've got to get us to something practical at the end of this. So why don't we take this genus_ml.R script — the stuff we had at the end — and paste it into a new R script that I'll call process_pooled_data.R, and then we'll do library(tidyverse) so we have all that goodness loaded for us. We can then, of course, do read_tsv on processed_data, looking at the L2 genus pooled files — let's start with the hyperparameters — and read that in. Again, this is the same as that dat object from before, so I'm going to remove this and pipe the stuff we read in into plot_hp_performance.
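The start of process_pooled_data.R might look like this sketch, assuming the pooled HP file written above and that the table has lambda and AUC columns:

```r
library(tidyverse)
library(mikropml)

# read the pooled cross-validation results and plot mean AUC (with its
# plus/minus one standard deviation band) as a function of lambda
read_tsv("processed_data/l2_genus_pooled_hp.tsv") %>%
  plot_hp_performance(lambda, AUC)
```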
And we see the plot that we had from the future_map version: things peak at a lambda of around two or three, with really wide confidence intervals — again, these intervals are plus or minus a standard deviation in the data. I can get a more precise feel for the data by taking what we read in and plugging it into this group_by/summarize pipeline. Again, I see that my three lambdas — two, three, and four — give me effectively the same mean AUC, lower quartile, and upper quartile. It's performing really well. The final thing I'd like to do is read in the other TSV that we generated, the pooled performance file, and I'll get rid of this final pipe. What we see now is a table of 100 lines with our cv_metric_AUC and our AUC. The cv_metric_AUC is from the training, and the AUC is from the test, right? What I'd like to do with this is make a box-and-whisker plot for the CV metric from the cross-validation as well as the test-set AUC. I'll do a select on the seed, the cv_metric_AUC, and the AUC, so again we've got our 100 seeds and our metrics. Then I'll do a pivot_longer with cols = -seed, names_to = "metric" (in quotes), and values_to = "AUC". Now we have it in a long format, and we can do ggplot with aes(x = metric, y = AUC) and then geom_boxplot. In this low-frills box plot, on the right we have the cross-validation AUC, and on the left we have the held-out test AUC. What we're looking for is for the medians of those two distributions to be really similar to each other, and we can see that those are bang on. If they were out of whack with each other, that would be an indicator that the data were overfit in the cross-validation when evaluated against the held-out data.
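The select/pivot/plot pipeline just walked through might look like this sketch (the file name and the seed, cv_metric_AUC, and AUC column names follow this walkthrough):

```r
library(tidyverse)

# compare the cross-validation AUC to the held-out test AUC across the
# 100 data splits with a box-and-whisker plot
read_tsv("processed_data/l2_genus_pooled_performance.tsv") %>%
  select(seed, cv_metric_AUC, AUC) %>%
  pivot_longer(cols = -seed, names_to = "metric", values_to = "AUC") %>%
  ggplot(aes(x = metric, y = AUC)) +
  geom_boxplot()
```

If the two boxes sit at about the same median, the cross-validation estimate generalized to the held-out data; a cross-validation box well above the test box would suggest overfitting.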
So we're in good shape. And again, we get an AUC that I think I said was about 0.63 or so, which is not phenomenal, but is pretty good. That's a starting point that we can go forward from by adding other types of features, perhaps the metadata, or perhaps by looking at other modeling approaches like random forest. There's a lot we can do with and build from here going forward, as we try to make this a noninvasive diagnostic of screen-relevant neoplasias based on information from the human microbiome. There's more on that in the next episode. So be sure that you've hit that thumbs-up button, that you've subscribed, you've hit the bell icon — you've done all the things to tell the algorithm you want more Code Club. But I'd also love it if you told more of your friends about what we're doing here. As you can see, I packed a lot more in here than just mikropml; we covered so much today that I wonder what I was thinking in trying to cover so much. But anyway, there's a lot here. Feel free to watch it a few times to really digest it, and if you have any questions, by all means ask them down below in the comments. And I'll be sure that somewhere along here I provide a link to the earlier videos where I talked about using make. Anyway, we'll see you next time for another episode of Code Club.