You're going to want to watch all the way through this episode because we've got tictoc, we've got the future, we've got parallelization, we've got machine learning. What more could you want? Hey, folks, I'm Pat Schloss. Don't worry, folks, we're not really going to make a TikTok video, although wouldn't that be really cool? I think that would totally embarrass my kids. That's just the reason to do it, isn't it? Anyway, we're going to use the tictoc package to help us figure out how long it takes to do various iterations of a machine learning step using code from the mikropml package.

If you've been watching recent episodes, you know that I've been slowly working through using the mikropml package to fit an L2-regularized logistic regression model to some data that my lab has previously published, looking at the variation in the gut microbiota and how that relates to whether or not somebody has colon cancer. We are looking at genus-level relative abundances from about 490 subjects who vary in whether or not they have a screen-relevant neoplasia, an SRN. People with SRNs have advanced adenomas and carcinomas, and so we'd like to think about creating a non-invasive diagnostic, right? If I could look at someone's stool sample, look at the types of bacteria there, and say, ah, you might have an SRN, you need to go get a colonoscopy, that could be a pretty sweet non-invasive diagnostic driven by the microbiome. But to do that, as we've been talking about in recent episodes, we need to think long and hard about how we pre-process our data and how we tune the parameters, or hyperparameters as they're called, that go into our models.

What we've been doing in recent episodes is one split of our data. The way the flow through this pipeline works with mikropml is that we do an initial split of the data: 80% of the data goes to training and 20% is held out for testing. That 80% is then used to fit hyperparameters and to evaluate the model, and that model is then used with the 20% held-out data set, right? But we've only done one split, or in the last episode we did three splits. Ideally, we'd like to do 100 splits. In today's episode, we are going to scale up our one or three splits to 100 splits.

Now, why is this a big deal? Well, it's an issue because these things take time. We've been doing the L2 logistic regression because it's the fastest of the algorithms, and if we scale up to something like random forest, it's going to take days, right? I don't have that time. So we want to think about how we can parallelize the execution of those 100 splits to make this run faster. And in the end, we will see what else we can do with 100 splits that we couldn't do with one or three splits.

I have my genus_ml.R script here. Up across the top here, I'll put a link to a video that shows where you can get this code and how you can get set up with your project to follow along. Also down below in the description is a link to a blog post where you will see a link to the project as it stands right now and what it looks like at the end of the episode. Anyway, what's going on here, just as a refresher, is we read in the data, we clean it up, and we preprocess the data to remove any features that have a variance near zero across all subjects. We also remove any features that are perfectly correlated with each other.
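Just to sketch what that preprocessing step looks like in code, here's roughly the shape of it. This is a sketch rather than the exact script; I'm assuming the raw data live in a data frame called srn_genus_data with the diagnosis in a column called srn, so the names may not match what's in the actual project:

    library(mikropml)

    # Assumed: srn_genus_data has one row per subject, an outcome column
    # ("srn"), and genus-level relative abundance columns as features
    srn_genus_preprocessed <- preprocess_data(srn_genus_data,
                                              outcome_colname = "srn")

    # preprocess_data() drops near-zero-variance features, collapses
    # perfectly correlated features, and centers and scales the rest
    srn_genus_data_processed <- srn_genus_preprocessed$dat_transformed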
So our data set doesn't have anything that's perfectly correlated between features. And then we also center and scale our data. Our data are all relative abundances, so they're centered so that the mean is zero and scaled so that the standard deviation is one. That puts all of the features on the same basis. Again, the data we're looking at are relative abundances; in the future, we'll think about adding categorical data and other continuous variables, but for now, we're happy to just look at the relative abundance data.

We then set a range of hyperparameters. We're using an alpha of zero, which gives us the L2 regularization, and we're using a range of lambda values from 0.1 to 10, really hitting the range between one and five pretty hard. In the last episode, we saw that three seemed to be the best, but really the variation wasn't that large between the different lambda values; it seemed quite large between reps of running this next function, which we called get_srn_genus_results. This is a function that we wrote; it's a wrapper function around run_ml that takes in a seed and then runs run_ml. So every time get_srn_genus_results is run with a different seed, we get a different output. In the last episode, we used map with one, two, and three as the three seeds being fed to get_srn_genus_results, and that then was fed into iterative_run_ml_results.

So what I'm going to do is go ahead and run everything from lines 1 to 28, and I'm going to load a library called tictoc. So it's not like the website, sorry, it's not that cool. But what we can do is use two functions, tic and toc, before and after a function we're interested in to see how long it takes to run that function. So I'll put a tic() before it and a toc() after it, and we'll run these lines to get a sense of how long it takes to run three reps of get_srn_genus_results for the number of hyperparameters that we're looking at.

You see that took 270 seconds to run three seeds. So if we did 270 divided by three, times 100, that'd be 9,000 seconds. Divide that by 3,600 and we get about two and a half hours to run all 100 seeds. Again, we're working with the L2-regularized logistic regression because it's pretty quick to run, and it gives us good material to play with as we're learning about mikropml and everything that goes into building a machine learning model. If we now move to something like random forest, which is considerably more complex and time intensive, that two and a half hours may turn into two and a half days, maybe even two and a half weeks. So we need something to speed this up, right?

Those 100 seeds are going to be running the same exact function, just with a different seed. So what we'd like to do is parallelize what's going on here on line 32, and we can do that using functions from the furrr package. In a previous episode, many moons ago, I introduced you to the future_map function, which allows us to parallelize what's going on. The future_ functions are all kind of stand-ins for your standard map functions. To get the future_ functions to work, we need to come back up here to the top, and I'll do library(furrr) to get that loaded. Then we need to add a call to plan. plan is a function that sets up your computer to basically enable parallelization, and there are a couple of different ways that you can do this.
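Just so you can see them side by side, here's a quick sketch of the plan() calls we're about to walk through:

    library(furrr)

    # The three plans, side by side:
    plan(sequential)    # the default: serial, no parallelization
    # plan(multicore)   # forked R processes; not supported on Windows or in RStudio
    plan(multisession)  # separate background R sessions; works everywhere

    availableCores()    # how many cores you have to work with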
The default is plan(sequential), which is not parallelized; that's serial processing, not parallel. Another one, which we won't run, is plan(multicore). That one does not work on Windows or in RStudio, but if you're using a Mac or Linux and you're running R in a command-line terminal environment, multicore should work, and it might actually run a little bit faster than what we will use, which is plan(multisession). multicore and multisession have different ways of assigning jobs to the processors. So we'll go ahead and run plan(multisession). You can see how many cores you have available by running availableCores(), and you'll see that my computer has 16.

So we'll go ahead and run this now. Again, it took about 90 seconds per seed to run previously. In an ideal world, it would only take 90 seconds to run all three of these instead of 270 seconds. But this isn't an ideal world; there's all sorts of other things that have to go on besides just running get_srn_genus_results. The data have to be sent out to the processors, the results have to come back, and it all has to be synthesized together to make this iterative_run_ml_results object. Anyway, we'll run this and see what we get. There's no reason to speculate when we can use tictoc to figure it out for us. So I'll go ahead and run this, and we'll see how long it takes.

You know, it worked. It took 93 seconds to run those three seeds, whereas running it without future_map took about 270 seconds. Now, we are getting the same warning message as we got before, telling us that future_map is a little bit unreliable if we are using a random number generator, which we are. To fix that, we can add .options = furrr_options(seed = TRUE) to the future_map call (I'll put the whole call together in a sketch in a minute). Then we can rerun this, and that warning message should go away. Ha, so that ran even faster, and we didn't get the warning message about unreliable random number generation: 88 seconds.

If we then scale this up to do our 100 seeds spread across 16 processors, that's 6.25 sets of 16 seeds. Call it 100 seconds per set, just to be generous, and divide that by 60: it should take about 10 minutes to do all 100 splits for those 100 seeds. Again, I think that's a little bit generous. Something to also keep in mind is that I have 16 processors on my computer, so if I'm using all 16 of them and then I go do something, like watching YouTube or checking my email, the performance is going to decline because I'll be using those processors for other things. So I think what I'll do is change this to 1:100 to do all 100 seeds. I will walk away and go refill my water; hopefully it'll be done when I get back.

So it finally finished, and it didn't take the 10 minutes I was expecting: 2,151 seconds divided by 60 is about 35 minutes. Something I noticed when I looked at the load on the different processors, which you can check with top (I'll show this right now), is that as it was running, it wasn't using 100% for each of the R processes. And then I remembered: I'm capturing the screen, right? The software I'm using to record the screen, the video, and the audio was all running too, along with other stuff in the background.
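To pull the whole parallelized loop together in one place, here's roughly the shape of it. Again, this is a hedged sketch: the wrapper is reconstructed from what we've said in earlier episodes, and the data frame name and exact lambda values are assumptions rather than copies from the script:

    library(tictoc)
    library(furrr)
    library(mikropml)

    plan(multisession)

    # The wrapper from earlier episodes (sketched from memory): the only
    # thing that changes between iterations is the seed fed to run_ml()
    get_srn_genus_results <- function(seed) {
      run_ml(srn_genus_data_processed,           # assumed data frame name
             method = "glmnet",
             outcome_colname = "srn",            # assumed outcome column
             hyperparameters = list(alpha = 0,   # alpha = 0 gives L2 (ridge)
                                    lambda = c(0.1, 0.5, 1, 2, 3, 4, 5, 10)),
             seed = seed)
    }

    tic()
    # future_map() is a drop-in replacement for map(); furrr_options(seed = TRUE)
    # gives each worker a proper, reproducible random number stream
    iterative_run_ml_results <- future_map(1:100, get_srn_genus_results,
                                           .options = furrr_options(seed = TRUE))
    toc()

    plan(sequential)  # turn parallelization back off when you're done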
So if something like Time Machine kicked in, or something else kicked in, that's also going to sap the resources going to each of the R processes. Anyway, 35 minutes is still a lot better than two and a half hours, and I'm pretty happy with that. It kind of underscores the value of being able to parallelize things.

One other thing I wanted to show you is that with plan(multisession), we used all of the cores that were available. We could modify this to say workers = 8, and then we'd use eight processors. That way, I think we would be using 100% of each of those eight processors. It might take about the same amount of time, it might even be a little bit slower, who knows? I'm cool with using the full shebang, but if you're running this on your computer and you don't want to use all 16 processors, you can set it to eight or two or whatever you want. One thing that they do recommend is that at the end you run plan(sequential) to turn things back to using a single processor, so that you're not running everything in parallel mode for the rest of what's going on. I don't know that it's such a big deal, but it's something to keep in mind.

So again, we now have iterative_run_ml_results, and we get this list that's got, basically, 100 copies of the output of running get_srn_genus_results, one for each of the seeds. Okay, I already have some good code here for looking at the performance of the different lambda values, so let's go ahead and run these and plot the hyperparameter performance with lambda and the AUC. Again, the dot represents the mean and the bar represents plus or minus one standard deviation, which should include about two-thirds of the variation in the data. You can see that it does kind of climb up, and I'm not sure which is better, two or three, but the variation in the data is pretty striking. Looking at the top three AUC values, we see that two, three, and four have basically the same mean AUC. It's not showing us all the significant digits to the right of the decimal, but it looks like two is probably the optimum, though again, not by very much.

Something that I could throw in here with our summarize would be an interquartile range. So we could add a lower quartile at a probability of 0.25, and I'll copy this to get the upper quartile, which would be 0.75 (there's a sketch of the kind of summarize call I mean down below). And again, we see that the variation is kind of broad, though not nearly as broad as what we see here. But I guess the interquartile range is like a 50% confidence level, and like I said, plus or minus one standard deviation covers about two-thirds of the data. Anyway, there's a lot of variation between each of our hundred splits, and that's why we do those hundred splits, because there's a pretty wide variation in the data. But then again, the difference between something like 0.62 and 0.65 isn't that big.

The point that I wanted to drive home in this episode, of course, was the value of using something like the furrr package to modify our map functions to make them parallelized. I'm pretty happy with the way this went. At the same time, I know that as we scale up to something like random forest, or to looking at the importance of each of our different features, this is only going to take longer to run. So that's something to consider.
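Here's the kind of summarize call I have in mind for those quartiles. It's a sketch, assuming the cross-validation AUCs can be stacked up with mikropml's combine_hp_performance(), and the column names here are my guesses rather than the script's:

    library(dplyr)
    library(mikropml)

    # Pull the trained model out of each of the 100 run_ml() results and
    # stack their cross-validation results; combine_hp_performance()$dat
    # should have one row per lambda value per seed with the mean CV AUC
    models <- lapply(iterative_run_ml_results, function(x) x$trained_model)
    hp_performance <- combine_hp_performance(models)$dat

    hp_performance %>%
      group_by(lambda) %>%
      summarize(mean_AUC = mean(AUC),
                lquartile = quantile(AUC, probs = 0.25),  # lower quartile
                uquartile = quantile(AUC, probs = 0.75))  # upper quartile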
Anyway, I think this episode does underscore the value of being able to use the furrr package, and it's something that you can use with other functions in R; it doesn't have to be with mikropml or machine learning. We used it in an earlier episode, although I forget what we used it for. But it's really handy wherever you're using a map function with a lot of iterations, to be able to distribute that work across the different processors on your computer.

Give this a shot with your own data. I'd be curious to see where you find value in using the furrr functions. If you're using a Windows computer, I would love to hear what your experience is with future_map on your Windows computer. I haven't done it myself, being a Mac user here, and I'd be interested in hearing what your experience is using that plan function on Windows. So anyway, let me know how it goes, drop me a line down below in the comments, and we'll see you next time for another episode of Code Club.