Do you ever have an analysis where you're running the same function over and over again, with maybe not much changing between those runs? And you've thought to yourself, you know, there's no reason I have to run these in series; it'd be great if I could run them in parallel. Yeah, me too. Well, in today's episode, I'll show you how to do just that using the furrr package in R.

Hey, folks, I'm Pat Schloss, and this is Code Club. In each episode of Code Club, I apply principles and practices from reproducible research to interesting biological questions. Over the past couple dozen episodes, we've been looking at the sensitivity and specificity of amplicon sequence variants (ASVs), a way of analyzing 16S rRNA gene sequences that's all the rage these days in microbial ecology. Now, don't go away. Even if you don't have a clue what microbial ecology or microbes or 16S or ASVs are, stay tuned, because in this episode I'm going to show you something really useful in R: how to parallelize your analysis.

Over the past couple of episodes, we've been trying to build a confusion matrix, and we found initially that it was going to take days, maybe weeks, to run that analysis. And who has time for that, right? So over the past several episodes, we've systematically gone about speeding it up. Now, I want to make something crystal clear: most of the time, R is fast enough for what we want to do. We're in a situation right now, though, where we're comparing tens of thousands of sequences (I think 20,000) to each other, at different regions within those sequences and using different thresholds. So we have 56 different combinations of thresholds and regions within the 16S gene, each run against those 20,000 sequences. And, oh yeah, we need to repeat this analysis 100 times, because we're taking one genome from each species, since we're not exactly unbiased, let's say, in how we sequence genomes across different bacterial species. So this gets really slow.

Two episodes ago, we used Rcpp to write a C++ function that builds a confusion matrix for one combination of threshold and region. Then, in the last episode, we deployed that function across all 56 combinations of regions and thresholds using nest(), mutate(), and unnest(). Now we need to take the next step and repeat that 100 times. Not much changes between those 100 runs except which genome gets sampled from each species. We could run them in series, one after another, until we've gone through all 100 iterations. Or, since my computer has about 16 processors, why not take those 100 iterations and distribute them across the 16 processors so the iterations run in parallel, and then pull everything back together at the end? This is actually something you can do in R pretty easily using a package called furrr.

So in today's episode we'll do just that. I'll build up from running the analysis in series to running it in parallel. Along the way, I'll show you some tricks and some problems we actually run into with RStudio, and we'll see if we can't make this faster. Something to note: if we run it in series, I think it should take about 16 hours. With 16 processors, it should take about an hour to run these 100 iterations, which is far, far faster than anything we've done before.
Of course, along the way we've learned a lot, and you might be questioning whether it was worth learning all this extra material to save maybe 15 or 20 hours of processing time. I'd say it was, because in the long run you'll use these tricks again, and the expense of learning this now is a savings you'll realize again and again in the future.

So let's go over to our terminal. I'll move to the project's root directory and open the project. Over here, we'll go to get_roc_data.R (and I don't know why it doesn't just put me at the top of the script every time I open it). You'll see at the top we have our libraries; we have get_confusion_matrix(), the C++ function we wrote to build the confusion matrix; we set our seed; we read in our ASVs; and we get our metadata. You'll notice that in the metadata step we use slice_sample() to get one genome from each species. We then join those together and generate the confusion matrix using nest(), mutate(), and unnest(). That's where we're at, and now we're ready to replicate this 100 times.

We've done something like this in the past, where we read in a bunch of files with map_dfr(). We're going to do the same thing to get going here, before we transition to a parallelized format. OK, I'm going to come back up here and create a function that builds my confusion matrix. We'll name the function get_iteration(), and it doesn't need any arguments. I'll indent all this other good stuff a bit so I know it's part of the same function, and I'll close it with a closing brace. Then I'll highlight all of this and load it into my R session so I can show that this get_iteration() function works, and we can time it to see how long it takes to run. So we can do system.time(confusion <- get_iteration()), and that should be good.

I'm going to run this; I think it'll take about nine or ten minutes. I want to see a couple of things: how long it takes to run, and what the output in confusion looks like, to make sure that looks good. Then we'll probably run it again to confirm we get slightly different numbers, because we're using slice_sample() to get a different genome from each species each time. So we'll do this for two iterations to see how things look.

That ran wonderfully. Let's see how long it took: 549 seconds, so just over nine minutes. Great. If we then iterate this 100 times, nine-plus minutes times 100 is about 916 minutes; divide by 60 minutes per hour and that gets us about 15 hours. So even with something else running, I could theoretically get through this whole thing in about 15 hours. But if it's parallelized, I could get through it in about one hour, which would be really nice. All right. The other thing I wanted to show is that if we print confusion, we can see what the output looks like. So we'll save this, and I'll run it again, and we'll see that the numbers change ever so slightly, because again, we're taking a different genome from each species.
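To make that concrete, here's a minimal sketch of what the wrapper might look like. I'm assuming the objects built earlier in the script (which I'm calling metadata and asv_counts here) and the Rcpp-based get_confusion_matrix() from two episodes ago; the column names are illustrative guesses, not the script's actual names.

```r
library(tidyverse)

get_iteration <- function() {

  # draw one random genome per species; this is the stochastic step
  # that makes each iteration come out slightly different
  metadata %>%
    group_by(species) %>%
    slice_sample(n = 1) %>%
    ungroup() %>%
    # join to the ASV data, then build one confusion matrix per
    # region/threshold combination with the nest/mutate/unnest pattern
    inner_join(asv_counts, by = "genome_id") %>%
    nest(data = -c(region, threshold)) %>%
    mutate(confusion = map(data, get_confusion_matrix)) %>%
    select(-data) %>%
    unnest(confusion)
}

# time a single iteration; in this episode it takes roughly nine minutes
system.time(confusion <- get_iteration())
```

The key design point is that everything stochastic and expensive now lives behind a single zero-argument function we can call over and over.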
And again, that variation is why we want to do this 100 times and then look at a median or a mean over those 100 iterations. So let's go ahead and rerun the system.time() call. I'll see you in about nine minutes of wall time, but in YouTube time it'll be about one second.

So the second run finished in 564 seconds, pretty close to what we had before, still right around nine minutes. And if we look at confusion (let me pull this back up), at least for this last row, and it's easy to always look at the last row of a data frame to compare things, we see that the numbers are pretty close to each other, with subtle differences. Again, that's because we're taking one genome from each species, and because that uses a random number generation process, we get subtle differences each time. That's the whole premise here: we want to run this same get_iteration() function 100 times and then merge all the data together.

OK, so the next thing we want to do is build up to those 100 iterations, and I'll build a little pipeline here. The function we're going to use, which we've seen before, is map_dfr(). We're still doing this in series, but it'll help get us to doing things in parallel. What we're doing here is actually very similar to what we did in a previous episode, where we had a bunch of files we wanted to read in and concatenate together, so we used map_dfr() with, I think, read_tsv() or read_csv() over all those files. What we'll eventually do is something like map_dfr(1:100, ~ get_iteration()), writing the function in formula format with that tilde. The other thing we can add, as a bit of eye candy, is a column called iteration, using the .id argument. The first time through, the data frame gets an extra iteration column with the value 1; the next time through it's 2, then 3, and by the 100th iteration that column's value will be 100. I call it eye candy because R doesn't really care that you've got the same threshold and region repeated over and over again; it's more useful for me as the programmer, to see how things change across iterations.

Anyway, we can run this. I'm not going to do 100 iterations yet, though, because I'm still testing things out, so I'll leave it at two. We'll see that this runs in series, that it should take about 18 minutes, and that we should again see different values for the same threshold and region in our data frame.
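Here's roughly what that serial test looks like, along with the sanity checks we'll get to in a moment; the exact coding of the region column (for example, "v4") is my assumption.

```r
library(tidyverse)

# serial version: run get_iteration() twice and row-bind the results;
# .id adds an "iteration" column ("1", "2", ...) so the replicates are
# distinguishable. Swap 1:2 for 1:100 when scaling up.
system.time(
  confusion <- map_dfr(1:2, ~ get_iteration(), .id = "iteration")
)

# expect 56 rows per iteration, with slightly different counts for the
# same region/threshold combination across iterations
count(confusion, iteration)
filter(confusion, region == "v4", threshold == 0.03)
```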
As this runs, one thing I want to point out is the output of top. If you go to your terminal and type top, it lists all the processes running on your computer, by default in descending order of CPU usage. The first line in my top output is the R session I'm running. It's been running for almost an hour now, not just on this process but overall, since I fired up R. You'll see that it's using 100% of a CPU, and that there's only one R row here. Later, when we use the furrr package, we'll see multiple lines here, but I wanted to show you what this looks like with one process running.

Wonderful, it finished in 1116 seconds. Divide that by 60 and we get right around 18 minutes, dead on what we'd expect for running a nine-minute process twice. If we look at confusion, we see that we have a column for the iteration, the region, the threshold, and then our confusion matrix counts. If I do count(confusion, iteration), we see that we've got 56 rows for each of the two iterations; we could also have seen that from the 112 rows in our table. And if I do filter(confusion, region == "v4", threshold == 0.03), we get two rows, and we can see that the values are slightly different.

OK. Now, if I take that nine minutes and multiply it out 100 times, it'll take about 15 or 16 hours, as we saw. What we'd like to do now is take the analysis we've run in series and run it in parallel using the furrr package. With map_dfr(), as you've seen, we ran two iterations in series, one after the other. What we'd like instead is to run each iteration on a separate processor, so that instead of finishing in 18 minutes we finish in nine, because we're running twice as fast on two processors.

Unfortunately, I've found that the furrr package is just really painful to use in RStudio. Instead of getting super frustrated about that, we're going to treat this as an opportunity to learn how to run R from the command prompt in the terminal. The downside is that I think this might be where we lose some of our Windows users. I'll show you Windows users how you might still be able to get this to work in your terminal: I think that if you use the Linux subsystem within Windows 10, it should work. Let us know down below whether any Windows users have been able to get furrr to parallelize on Windows using that embedded Linux system.

Before we leave the cozy confines of RStudio, let's go over to the Packages pane and make sure we have furrr installed. If you type furrr into the search box, it should pop up if it's installed. If it's not installed, click the Install button in the Packages pane, type furrr (three r's, not two), and click Install. I've already got it installed, so we're in good shape.
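If you'd rather install from the R console than click through the Packages pane, the equivalent is the usual one-liner:

```r
# one-time setup: install furrr (three r's) from CRAN if it isn't
# already available; it brings along the future package it's built on
if (!requireNamespace("furrr", quietly = TRUE)) {
  install.packages("furrr")
}
```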
My get_roc_data.R is saved and in good shape, so I'm going to quit RStudio. I've opened up my project in Atom, and you can see get_roc_data.R is highlighted in yellow. That's something nice Atom does for me: it's aware of the repository and the state of the repository, and it knows that I've been making changes since my last commit. And if I come down, I can see a yellow band here on the left side of the screen, telling me where I've been mucking around, if you will. Everything we have here is what we had in RStudio; it just looks a little different. One thing I'll mention that I have installed, which is really nice for working with R at the command line, is an Atom package called r-exec. It allows me to highlight lines, hit command-enter, and run them in the terminal, a lot like what we did in RStudio.

OK, and again, we're doing this at the command line because there are problems, unfortunately, in using furrr in RStudio. So let's go to the terminal. I'll start R by typing R at the command prompt, and we have our nice, cozy, comfortable R prompt there; even though it's not in RStudio, it's the same prompt. I'm going to scroll back up to the top of the script and add library(furrr). Then I'll highlight everything, and if you don't have that r-exec package installed in Atom or in your text editor of choice, you can always do command-C, command-V into the terminal, but again, I can hit command-enter, and this runs everything in R. It should load without any problems, and we're right back at our prompt.

Now I'll come back to my system.time() call with map_dfr(), and we're going to modify it for use with furrr. One function you might be interested in is availableCores(); it comes from the future package, which furrr loads, and it tells you how many cores, how many processors, you have available. I have 16 available; depending on your computer, you might have more or fewer. The other thing I need to do is set my plan. So I'm going to say plan(multicore), with multicore not in quotes. This tells future to look at my system, see how many cores I have, and get ready to do what's called a fork. Think of a fork as having a bunch of tines at the end: each processor gets a different tine of my fork to run the different things that I'm going to map to it.

So we run plan(multicore), and then in my system.time() call I change map_dfr() to future_map_dfr(), and it should just work. Again, we'll highlight and run this and see if we get any error messages; if not, it should take about nine minutes to run. One thing I'll do is open up a new tab here and run top, and what you'll see at the top is that I have two R processes running. I'm not doing any sleight of hand here; you can see these two processes just started 21 seconds ago. These are my two iterations running in parallel. Later, when we scale up to the full 100 iterations, we'll see 16 R sessions going, because I have those 16 processors. So we'll let this run for another eight or nine minutes, and then we'll come back and look at the results.
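Put together, the parallel version looks roughly like this; it's a sketch that assumes the get_iteration() function from earlier in the episode.

```r
library(furrr)   # attaches future, which provides plan() and availableCores()

availableCores() # 16 on the machine used in this episode

# multicore forks the running R process across cores; it works from a
# terminal on macOS/Linux, but not on Windows and not reliably in RStudio
plan(multicore)

# drop-in parallel replacement for map_dfr(); as written, this is the
# version that triggers the "UNRELIABLE VALUE" warning about random
# numbers, which we'll fix next with furrr_options()
system.time(
  confusion <- future_map_dfr(1:2, ~ get_iteration(), .id = "iteration")
)
```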
Wonderful. That ran through in 591 seconds, which is just under 10 minutes: 9.9 minutes. A little slower than what we'd seen before for an individual run; that could be because of other things I was doing while it ran, but who knows. It's still much faster than 18 or 19 minutes. We do get these warning messages, though: "UNRELIABLE VALUE," saying a future unexpectedly generated random numbers without specifying an argument for the seed. You'll recall that in our code up here we use slice_sample(), which draws on the seed we set with set.seed() at the top. So let's look at the arguments for future_map_dfr(). There's a .options argument that allows us to set the seed, and you'll see in the first example of the help page that you can give .options = furrr_options(seed = ...) with the value of the seed. I'm going to copy that, because it's something I don't use very often, and I lose track of things; it's hard for me to remember things that I don't use very often.

Let's go ahead and put each of the arguments onto its own line, and I'll put this .options argument on its own line too. I'll use my preferred seed, my birthday: 19760620. That is good, and we can then remove the set.seed() call from up top. The other thing I'm going to do is send the output to write_tsv() and write it to data/processed/rrndb.roc.tsv. That will write the large table we're generating out to a TSV file that we can use later for aggregating, making plots, and all sorts of other analyses. So let's go ahead and rerun this; now we should not get that warning. This time it completed without the warning messages, so we're in good shape. Again, if we look at filter(confusion, region == "v4", threshold == 0.03), we see that we've got two iterations and slight differences in the counts in our confusion matrix. So we're in good shape.

One other thing I wanted to show you is that if you're on Windows, it might be possible to use multisession instead of multicore. I think it runs a little slower because it doesn't fork; it creates more R sessions in the background, and it's supposed to be a bit slower because of how it does that parallelization as well as how it handles memory. multisession is also supposed to work in RStudio, but like I said, I just couldn't get it to work reliably. So I'm going to run this and see how long it takes relative to plan(multicore): does it take nine minutes, longer, maybe even shorter? We'll see. Let's look at what top says. It complained. Yeah, something weird is happening with multisession that I don't really understand; maybe multisession was the problem I was running into in RStudio. This is the same error message I got in RStudio, and I'm not really sure what it means, because the code works as is, so I'm not going to worry about it. There's more you can do with future_map_dfr(): you can figure out how to put in a progress bar and all sorts of other bells and whistles that are nice.

I feel a little bit bad that I've pulled you out of RStudio to run this in the terminal. At the same time, though, remember that we're going to embed this in a Makefile so that it gets run from the command line anyway. So we're actually gaining functionality by moving furrr to the command line like this, out of RStudio, and that's really the environment we're going to use to make things reproducible anyway, right? It's not reproducible or automated to highlight a bunch of lines of code and run them in RStudio or in the terminal.

What we've got set up, again, is our R script. So I'm going to go ahead and remove the system.time() call and put the plan back to multicore; I'll get rid of that extra closing parenthesis and bump this over. I'm going to up this to 100 iterations, and I'll get rid of this last step, the plot. That should work. Oh, one thing I want to double-check is the output file, to make sure we got output there. Let me open up a new tab and move to my project root. Look at that, it's there. I look at the head of the file, and we see all our great values.
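So, pulling the pieces together, the core of the finalized script, as I've reconstructed it here (the output path and the seed are the ones from this episode; the exact assembly is my sketch), looks roughly like this:

```r
library(tidyverse)
library(furrr)

plan(multicore)

confusion <- future_map_dfr(1:100,
                            ~ get_iteration(),
                            .id = "iteration",
                            .options = furrr_options(seed = 19760620))

# 100 iterations x 56 region/threshold combinations = 5,600 rows plus header
write_tsv(confusion, "data/processed/rrndb.roc.tsv")
```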
I can run wc -l on the file: 113 lines. One of those is the header, and then there are 112 data rows; 56 plus 56 from our two test iterations gets us to 113. That's great, and that file is our target. So I'm going to create a rule for it in my Makefile; it's been a long time since we looked at the Makefile. We'll come down here: that file is going to be my target, the script is code/get_roc_data.R, and the other inputs are the two files we've been using in our exploratory data analyses, the genome ID taxonomy and the rrnDB ASV count table. Then I'll use $^ in the recipe to run code/get_roc_data.R, which reminds me that I need to make the script executable. So I'll quit out of R and run chmod +x code/get_roc_data.R. That should be good.
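The rule I added looks roughly like the sketch below. The two data prerequisites are my approximations of the file names described on screen, so treat them as placeholders. Note that the recipe line has to be indented with a tab, and that $^ expands to all of the prerequisites, so the executable script runs with the data files as trailing (ignored) arguments.

```make
data/processed/rrndb.roc.tsv : code/get_roc_data.R\
		data/references/genome_id_taxonomy.tsv\
		data/processed/rrndb_asv.count_tibble
	$^
```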
So let's see what happens when we run make data/processed/rrndb.roc.tsv. It should work, and it should take just about an hour to run. We'll see. So that's firing up, loading the packages. I'm going to go look at top and, bam, you see all these R sessions fire up. ScreenFlow, which I'm using to record the screen here, looks like it's taking up about two or three processors, so this isn't going to be using the processors completely at 100%, but it should be pretty good. Maybe it'll take an hour and a half to run with me recording the screen in the background. OK, I'll be back in just a bit, and we'll figure out how long this actually took to run.

All right. The day got a little away from me, but the job was running in the background while I was on my Zoom calls and various meetings today. Surprisingly, it took three hours to run. I'm not sure why it took three times longer than I thought it would. Doing a little Googling on the future_map functions, it seems like they do take some time to get up and running, but that still seems like a lot of overhead. Regardless, it's still a lot faster than the 15 or 16 hours I anticipated it would take with a vanilla map_dfr() instead of future_map_dfr(). So I'm happy that it worked. It works, and that's good enough for me. Maybe furrr is one of those packages that's still a little experimental; I don't know. But it worked well.

Something I remembered is that I need to update my README file to say that I used library(furrr), and to add it as a dependency in the README; maybe I'll put it up here next to Rcpp. That is good, and I'll go ahead and make my README, if I spell it right; it's the end of the day and my fingers are tired. Again, this rebuilds because we updated some dependencies that were in there. If we look at git status, we see that we've updated the Makefile, the READMEs, and code/get_roc_data.R.

I also want to go ahead and commit that output file. Let's do wc -l on data/processed/rrndb.roc.tsv: 5601 lines. One of those lines is the header, and then we had 100 iterations of those 56 combinations, 5600 data rows, so this makes sense. I'm going to go ahead and add this file to the repository. Maybe I should double-check how big it is first: ls -lth data/processed shows it's 228 KB. That's pretty small, so we can post it up to GitHub; we just want to be careful about posting really big files there. I think this would be a really nice file to have that other people might want to work with, and I certainly want to keep it under version control.

So I'll do git add data/processed/rrndb.roc.tsv, and I'm going to have to add -f, because I'm ignoring the data directory, and -f forces git to add the file anyway. I'll also add README.Rmd, README.md, and code/get_roc_data.R, and that's all good. I guess I forgot to add the Makefile, so git add Makefile, and that's in good shape. Then we'll do git commit -m with a message along the lines of "deploy generation of ROC and confusion matrices for 100 iterations." Great.

So again, that didn't turn out exactly as I expected, but it was still quite a bit faster. The goal is to get the right answer, not necessarily in the fastest time possible. In this episode, as well as two episodes ago when we talked about Rcpp, we've gone a little into the weeds of R. You don't always need to optimize things to be as fast as possible on your computer, but it's nice to know that you have tools like Rcpp to speed things up with C++ code, and the future_map functions to parallelize things. So give these a look the next time you're using a map function that you think might take a long time to run; perhaps you'll find that by parallelizing it on your computer, you can get it to go quite a bit faster.

Anyway, thanks as always for all the kind comments that people have tweeted at me or emailed me; those definitely make me feel really good. I got a grant rejected today with some really harsh comments, and almost instantly I got an email from somebody telling me how much they enjoy these videos. That made me feel so much better that I almost don't care about the grant proposal. So please keep practicing, tell your friends about Code Club, and be sure that you've subscribed and clicked on that bell icon so you know about all the exciting things that are coming our way. I'm really excited about this series, and I think it's going to culminate in some type of manuscript that we'll submit. I'll show you all of that process as we go, for those of you who have perhaps never submitted a paper and don't know what it's like. And we'll do this all with a mindset of being reproducible and keeping things automated, as much as possible. So keep practicing, and I'll see you next time for another episode of Code Club.