Have you ever screwed something up? If you're like me, you screw something up just about every day, right? Well, over the past couple of episodes of Code Club, I've screwed up. Stay tuned and I'll tell you why and how we're going to fix it.

Hey folks, I'm Pat Schloss and this is Code Club. In every episode of Code Club, I try to apply reproducible research practices to an interesting biological question. Over the past couple dozen episodes, we've been digging into the topic of amplicon sequence variants (ASVs). Stay tuned even if you don't know what amplicon sequence variants are or why you should care, because in today's episode we're going to be talking about for loops: how we can make them more efficient, and what practices to avoid.

This episode is going to fit a little weirdly into the series, because I've already recorded the two episodes that are supposed to follow the one I did a couple of days ago, and I don't know when I'll broadcast this one. So if you've seen the episode on expand_grid and the nested for loops, this is where we are right now, and when you watch the next episode on Rcpp, it might seem a little out of place.

Anyway, today we're going to dig into my claim that we should avoid for loops in R, and I'll give you the spoiler right now: I was wrong, and I'll show you why. Thanks to Dan Sobel on Twitter, who asked a really nice question: hey, I noticed that in the book R for Data Science, which is a great book, by the way, you should really check it out, they say that for loops are not a problem. Well, if you watched the last episode about nested for loops, you saw they were something like a thousand times slower. It turns out that nested for loops really are slow. So in today's episode, I'm going to show you how we can systematically work through that nested for loop structure and make it more efficient, so we can still use a for loop but make it much faster. It's even going to be faster than what we did with expand_grid, or anything we might do with the tidyverse. And this is without knowing any C++ or any real special tricks.

So I'm going to come over to the command prompt and go to my project's root directory. You'll see I'm on my branch issue-37. What I'm going to do is set up a branch off of this branch: issue-37 is off of master, and I'm going to make another branch off of issue-37. I'll show you why in a second, but I'll call it issue-37-for-loops. On this branch, I'm going to get an old version of get_roc_data.R. To figure out which version we want and how to get it, I'm going to run git log. Looking at the git log output, you see the history of all my commit messages over the life of the project. Because I've done a few episodes since we talked about for loops, I'm going to come back to this commit from November 27. The string I have highlighted here is called a SHA, a unique combination of letters and numbers; it's hexadecimal, but who cares, really. All you need to copy is maybe the first five or six characters, but because it's easy enough to double click on it and get the whole thing, I'll do that and copy the full SHA.
To get out of that output, I can hit the Q key. Now I'll use git checkout. You may recall that before, when I'd accidentally deleted a file, I did git checkout -- followed by the name of the file to throw away my changes. Well, -- stands in for the current commit, so instead I can give it the SHA of that old commit and then the path to the file, something like git checkout <SHA> code/get_roc_data.R. That gets me the old version of get_roc_data.R on this new branch, so we're in good shape. I'll open up my R project file in RStudio, and we'll get to work systematically improving those nested for loops.

So here's my get_roc_data.R, here's my code, and here are my beautiful for loops. Once upon a time, I think I did a project where I had triple for loops, i, j, and k craziness. All right. First I need to come back up to the top, highlight all the code from line 82 up, and run it so that everything it creates is stored in memory; that's what we'll be working with today. And again, if we look at metadata_esv, we have our columns: the genome ID, the ESV ID, and then an index. We don't really need the index, but I'm not going to bother removing it right now.

The first thing I want to look at is the expand_grid version. You'll notice I'm missing some code, so I'm going to come back and add it: I'll add the mutate, cut and paste this code up here, and remove those comments. This gives us the columns corresponding to true positive, true negative, false negative, and false positive; they're columns of logicals, TRUEs and FALSEs. I'll pipe that output along; I guess I don't need that parenthesis because I've got it up on line 98. Then I'll do summarize(true_pos = sum(tp), true_neg = sum(tn), false_neg = sum(fn), false_pos = sum(fp)). Good. I'm going to call the result confusion, and I'm going to wrap the whole thing in a system.time() call, so that we generate the confusion matrix and also benchmark how long it takes to run. I think I forgot an E up here, so let's fix that and run it. It's probably going to take about 86 seconds or so, so I'll do some editing and stitch it all back together.

All right, we see expand_grid took about 70 seconds of wall time; that's elapsed time, the time that would pass if I were looking at my watch. I'm going to copy and paste this output into my issue tracker so I can keep track of what's going on, note that this run used expand_grid, and put the confusion matrix data at the top, making sure it's formatted nicely. Great. That's in seconds, of course, so it's a little faster than what we saw in previous episodes; perhaps I don't have as much garbage running in the background of my computer right now.

Next we want to dig into that double for loop. As I look at it, I notice I had a small bug: my inner loop has j going from 2 to n_operons, but I don't want it to start at 2, I want it to start at i + 1.
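Before we dig into the loop itself, here's a minimal sketch of what I mean by the expand_grid baseline. The column names (genome_id, esv), the variable names, and the exact pipeline structure are my reconstruction of the idea, not the literal contents of get_roc_data.R, and I'm using a tiny toy metadata_esv so the sketch runs on its own:

```r
library(tidyverse)

# Toy stand-in for metadata_esv: one row per rRNA operon, with the genome
# it came from and the ESV it was assigned to (column names are guesses).
metadata_esv <- tibble(
  genome_id = c("A", "A", "B", "B", "C"),
  esv       = c("esv1", "esv1", "esv2", "esv3", "esv1")
)

system.time(
  confusion <- expand_grid(
    i = 1:nrow(metadata_esv),
    j = 1:nrow(metadata_esv)
  ) %>%
    filter(i < j) %>% # upper triangle: each pair of operons exactly once
    mutate(
      tp = metadata_esv$genome_id[i] == metadata_esv$genome_id[j] &
           metadata_esv$esv[i] == metadata_esv$esv[j],
      tn = metadata_esv$genome_id[i] != metadata_esv$genome_id[j] &
           metadata_esv$esv[i] != metadata_esv$esv[j],
      fn = metadata_esv$genome_id[i] == metadata_esv$genome_id[j] &
           metadata_esv$esv[i] != metadata_esv$esv[j],
      fp = metadata_esv$genome_id[i] != metadata_esv$genome_id[j] &
           metadata_esv$esv[i] == metadata_esv$esv[j]
    ) %>%
    summarize(
      true_pos = sum(tp), true_neg = sum(tn),
      false_neg = sum(fn), false_pos = sum(fp)
    )
)
```

Note that the expanded grid holds every pair of rows at once, which is exactly why this approach eats so much RAM on the real data. The later sketches in this episode build on this same toy metadata_esv.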
So, back to that j bug: it might actually have been making things a little slower too. For comparison, let's see how it runs with the bug still in there. I'm going to set the timer on my phone for one minute and ten seconds, start it at the same moment I start the run, and kill the run when the timer goes off. While this is running would be a great time to make sure you're subscribed to the channel and have clicked the bell icon, so you know when the next episodes come out. We're going to do a few great episodes before the holiday break, and after the holidays we're going to talk about how to write a scientific paper, as we pull all this information together and tell a story that we can submit as a preprint and perhaps even for peer review.

All right, we kill that, and we total the counts with sum(unlist(confusion)), and we see it got through 235,000 comparisons. That's still pretty slow, right? That's the speed that's something like a thousand times slower. Now let's change the for loop so that j starts at i + 1: because we want to look at a triangle, we always want j to be greater than i. Let's run it again, and I'll time it, and we'll see how far we get in one minute and ten seconds. On your marks, get set, go. All right, we sum it up again, and we've got 227,000 comparisons. So it's actually a little slower, believe it or not, and I think that's because now we're computing i + 1 at the top of the inner loop every single time through; that extra addition has a cost. So it's actually a little slower without the bug, believe it or not. Wow. Okay.

Now, again, for loops are slow because you're doing lots of little computations. As I think I mentioned, we've got four additions in here, and if I removed two of those lines, it might run about twice as fast. One thing we're repeating that we don't need to repeat is getting access to this value, the i-th row of the genome_id column of metadata_esv. Every time through the j loop, even though this value doesn't change, we're still reaching into metadata_esv to get it, and I think that's pretty slow. So I'll pull it out: it changes in the i loop, but not in the j loop. I'll call it i_genome and assign it there, then put i_genome here, and do the same for i_esv. My fingers are a little thicker than normal today. If I set i to 1 and look at what i_genome contains, we see it's actually a data frame with one row and one column. What I'd rather have is an individual value, and to get that I can use double square brackets, here and there as well. Now when this comes through, if I look at the value, it's a character vector of length one. So let's again run this for one minute and ten seconds and see how many comparisons it gets through. I'll get my timer ready, and on your marks, get set, go. Looking at sum(unlist(confusion)) again, it got quite a bit further: 419,000 comparisons. So that's with the j bug fixed and with the i lookups pulled out of the inner loop, 419,000 comparisons in 70 seconds. Needless to say, we're making improvements, but it's still really slow.
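Here's a sketch of where the nested loop stands at this point, continuing with the toy metadata_esv from the first sketch; the names i_genome, i_esv, and n_operons are my reconstruction of what's in the script:

```r
# Pull the values that depend only on i out of the j loop, and use [[ ]]
# so we get bare values rather than 1x1 tibbles.
n_operons <- nrow(metadata_esv)
confusion <- c(tp = 0, tn = 0, fn = 0, fp = 0)

for (i in 1:(n_operons - 1)) {
  i_genome <- metadata_esv[[i, "genome_id"]] # fetched once per i, not per j
  i_esv <- metadata_esv[[i, "esv"]]

  for (j in (i + 1):n_operons) {
    same_genome <- i_genome == metadata_esv[[j, "genome_id"]]
    same_esv <- i_esv == metadata_esv[[j, "esv"]]

    confusion["tp"] <- confusion["tp"] + (same_genome && same_esv)
    confusion["tn"] <- confusion["tn"] + (!same_genome && !same_esv)
    confusion["fn"] <- confusion["fn"] + (same_genome && !same_esv)
    confusion["fp"] <- confusion["fp"] + (!same_genome && same_esv)
  }
}
```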
One of the things R is really good at is dealing with whole vectors of data at once. If I make the vector 1:10, I get the numbers from one to ten, and if I do sqrt(1:10), whammo, I get the square root of every number from one to ten. R is really fast at these kinds of operations; that's called vectorization. It's not so good at handling individual values one at a time. So let's think about how we can vectorize our inner for loop.

We're going to use a function called slice. We've seen slice previously, when we used it along with slice_sample to get one genome from every species; that was our way of controlling for the uneven distribution of genomes. We can take metadata_esv and pipe it to slice, and what slice takes is a vector of the row numbers we want to extract. It's like filter, except we give it row numbers. I want rows (i + 1):n_operons, which you can see is very similar to what we had up here, only now we're going to build a pipeline with it. From that pipeline we want a mutate that creates values for true positive, true negative, false negative, and false positive. True positive is i_genome == genome_id & i_esv == esv. True negative has the same shape, so I'll copy this line down a few times, and I don't need the final comma. True negative is where both are different; false negative is where they have different ESVs but are in the same genome; and false positive is where they're in different genomes but share the same ESV. We can then pipe that into a summarize to get true_pos = sum(tp), true_neg = sum(tn), false_neg = sum(fn), and false_pos = sum(fp). That's great, and we can get rid of all the other stuff. So now we have a pipeline with a single for loop indexing over i, and that all looks good.

Oh, one thing I forgot: we need to call this result i_confusion, and then add i_confusion onto confusion each time through. Confusion is currently a list; you'll recall that was a little trick we did to try to speed things up, so that R wouldn't have to find new memory space for confusion on every pass through the loop. But this is a pretty small vector, so I don't really care, and it's kind of hard to add lists together. So we'll run this, wrapped in system.time() so we get good timing, and see whether it's any faster than the double for loop. I'm sure it's going to be blazingly fast by comparison.

All right, that went way too fast; let's see what happened. We have an error message: problem with summarize input true_pos, object tp not found. What is that about? sum(tp) is right there. Let me set i to 1 and step through: i_genome and i_esv are both good. The slice gives us everything except row one, which is what we want.
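Once that bug gets sorted out, here's a sketch of what this vectorized inner loop is heading toward; same reconstructed names, continuing with the toy metadata_esv, and keeping confusion as a one-row tibble so each pass can simply be added on:

```r
n_operons <- nrow(metadata_esv)
confusion <- tibble(true_pos = 0, true_neg = 0, false_neg = 0, false_pos = 0)

for (i in 1:(n_operons - 1)) {
  i_genome <- metadata_esv[[i, "genome_id"]]
  i_esv <- metadata_esv[[i, "esv"]]

  i_confusion <- metadata_esv %>%
    slice((i + 1):n_operons) %>% # rows below i: the upper triangle
    mutate(
      tp = i_genome == genome_id & i_esv == esv,
      tn = i_genome != genome_id & i_esv != esv,
      fn = i_genome == genome_id & i_esv != esv,
      fp = i_genome != genome_id & i_esv == esv
    ) %>%
    summarize(
      true_pos = sum(tp), true_neg = sum(tn),
      false_neg = sum(fn), false_pos = sum(fp)
    )

  confusion <- confusion + i_confusion # element-wise data frame addition
}
```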
And then, oh, I know what the problem is: I've got assignment arrows inside the mutate. You all should have yelled at me: Pat, don't put assignment arrows inside of mutate. Did I do that in the summarize too? No, I did the summarize right. All right, let's run the system.time() chunk again, and hopefully, yeah, now it goes. Okay, that was so much faster. Removing that inner for loop got us down to about two minutes: 125.949 seconds. Not nearly as fast as expand_grid yet, but wait, we're going to get there, and I think we might even be able to do better.

The next thing I notice in looking at our code is that maybe I don't really need this slice. What the slice does is restrict us to the upper triangle of the comparisons. But let me back up: doing that slice is probably pretty slow, because every time through we're doing that i + 1 arithmetic, carving out a new data frame, and finding new memory space for it. What it gains us is cutting the number of computations in half, because we only look at half the comparisons. Maybe if we look at the full data frame, even though we'll double count things and do twice the compute, it will still be faster, because we skip the slice. So let's remove this line, which reminds me that I want my index to go from 1 to n_operons, not n_operons minus 1. I need to get rid of that extra closing parenthesis, and this all works. Then at the end we need to correct the counts: confusion's true positive count becomes true_pos minus n_operons, because every operon trivially matches itself, and then the whole thing gets divided by two, because we're double counting everything else. Let's run it and see how long it takes.

We see it took about 110 seconds to run, but it didn't like what I did after the fact to fix up the confusion matrix, so we'll come back to that. Still, by removing that slice, even though we're doing twice the number of computations, we're 16 seconds faster. That's a win. Okay, let's come back and fix the confusion part. What does confusion look like? Like that, and I think this has already been, no, it hasn't been saved. Okay, confusion is a data frame, not a vector. So we can pipe confusion to mutate, where, wait, I even used the wrong column names, what was I doing: mutate(true_pos = (true_pos - n_operons) / 2). I want true_pos, not true_positive; y'all are supposed to be yelling at me when I do things wrong. Then we take all the other values in the confusion matrix and divide them by two: true_neg divided by two, false_neg divided by two, false_pos divided by two, and we can close that. Now when we run this, oops, too much, we get the output we had before: 4861 and 4861, so we're in good shape.

Okay, something else I notice in our code is that every time through the loop over the operons, we're basically adding columns onto a data frame, creating a new data frame on each pass.
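Before changing that, here's a sketch of the slice-free stage we just timed, correction included; same caveats about reconstructed names. The correction assumes every operon matches itself once (one extra true positive apiece) and every real pair gets counted twice:

```r
n_operons <- nrow(metadata_esv)
confusion <- tibble(true_pos = 0, true_neg = 0, false_neg = 0, false_pos = 0)

for (i in 1:n_operons) {
  i_genome <- metadata_esv[[i, "genome_id"]]
  i_esv <- metadata_esv[[i, "esv"]]

  i_confusion <- metadata_esv %>% # full frame: no slice, twice the compares
    mutate(
      tp = i_genome == genome_id & i_esv == esv,
      tn = i_genome != genome_id & i_esv != esv,
      fn = i_genome == genome_id & i_esv != esv,
      fp = i_genome != genome_id & i_esv == esv
    ) %>%
    summarize(
      true_pos = sum(tp), true_neg = sum(tn),
      false_neg = sum(fn), false_pos = sum(fp)
    )

  confusion <- confusion + i_confusion
}

confusion <- confusion %>%
  mutate(
    true_pos = (true_pos - n_operons) / 2, # drop self-matches, halve pairs
    true_neg = true_neg / 2,
    false_neg = false_neg / 2,
    false_pos = false_pos / 2
  )
```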
I wonder if we can go more directly to creating a data frame, without tacking columns onto an existing one, and whether that solves some of the speed issues. So instead of starting from metadata_esv, let's call tibble(), the function for creating a data frame, and give it the true positive, true negative, false negative, and false positive columns directly, referring to metadata_esv's columns inside each expression just like before. We can then tack this onto our running totals. I'll save this and run it, and we'll see what we get. It screwed up the output as it ran through, so I'm going to trust that I got the right values; looking at what I highlighted, I think I forgot to reset confusion back to zero. Again, that's why you need to initialize these things. Anyway, the time is about 110 seconds; it doesn't really change anything.

I think we can learn from what we did earlier, up on lines 127 and 128, where we used the double square brackets. If I put double square brackets here as-is, it's going to complain, because to pull a whole column from a data frame it doesn't want the comma; it's metadata_esv[["genome_id"]] to get the genome_id column. Alternatively, instead of square brackets you can use a dollar sign. Some people like dollar signs, some people like square brackets, but they're doing basically the same thing: making a vector out of the column, in this case esv or genome_id. Okay, let me fix all of these up and we'll see if it speeds things up. Again, what these double square brackets do is pull the genome_id or esv column out of metadata_esv as a plain vector. That looks great, and I'll be sure to include line 22 up here. Let's rerun it and see how long it takes.

All right, we got the right output for confusion in 90 seconds. Wow, we broke the 100-second mark! That's an improvement, right? Looking values up inside a data frame is slow, so if we can work with those columns as vectors, we speed things up.

Okay, looking at our code now, what else can we improve? You know what, I don't know that I even need to create a tibble. Couldn't I take this code and plop each expression right inside a sum(), and then I don't even need the summarize, because I can create a vector directly that holds all these values? Let's give that a shot. I can use c() to create a named vector, with elements true_pos, true_neg, false_neg, and false_pos, each wrapping a sum(), and I don't want to include that trailing comma. Typing is hard; those of you who have watched me long enough know that I struggle with typing. Let me make sure I have the closing parentheses on all of these. That all looks good. So this creates a named vector called i_confusion, and we add it onto our running totals. This should be great. I still need to do those corrections at the end, because we're still double counting. Let's give it a run and see how long it takes. All right, I'm getting a complaint. Oh, it's complaining about all the other downstream stuff, because confusion is now a vector, not a data frame.
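Before we look at the timing, here's my reconstruction of where the code has now landed. I've hoisted the two [[ ]] column extractions above the loop, which is the same idea taken one small step further, and the names remain my guesses:

```r
genomes <- metadata_esv[["genome_id"]] # plain character vectors, extracted
esvs    <- metadata_esv[["esv"]]       # from the data frame just once
n_operons <- length(genomes)

confusion <- c(true_pos = 0, true_neg = 0, false_neg = 0, false_pos = 0)

for (i in 1:n_operons) {
  same_genome <- genomes[i] == genomes # vectorized comparison to every row
  same_esv <- esvs[i] == esvs

  confusion <- confusion + c(
    true_pos = sum(same_genome & same_esv),
    true_neg = sum(!same_genome & !same_esv),
    false_neg = sum(same_genome & !same_esv),
    false_pos = sum(!same_genome & same_esv)
  )
}

confusion["true_pos"] <- confusion["true_pos"] - n_operons # self-matches
confusion <- confusion / 2                                 # double counting
```

The design point here is that everything inside the loop now operates on bare atomic vectors, which is where R's vectorized operations are fastest.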
28 seconds. That is so fast. That's so much faster than even expand_grid; I can't get over it, more than twice as fast as expand_grid, and something like 2,000 times faster than where we started with that nested double for loop. Oh my goodness. So that's great.

Now, we do have a problem here, in that I'm still treating confusion like it's a data frame rather than a vector, so let's fix that quickly. I think I had this code earlier; sometimes RStudio's automatic quote-closing is a little too helpful. We set confusion["true_pos"] to confusion["true_pos"] minus n_operons, and then, rather than dividing each element separately, we can just do confusion / 2 and be in good shape, getting rid of all the mutate stuff. We run that, and it's not happy, because up in the loop I named the element tp instead of true_pos. So let's rename those to true_pos, true_neg, and so on. We were getting an NA because, well, I don't know exactly what it ended up doing; I think it was trying to add together things that didn't belong together. Something weird happened. Let's run this again; it's only going to take less than 30 seconds, and we'll make sure it gives us the right values. Great, we get the same timing and the same values for our confusion matrix, so we're in good shape. I'm really just over the moon that we got it down to 28 seconds. That's really fast.

There's a spoiler alert here: 28 seconds is much faster than the roughly 70 seconds we got with expand_grid, but if you watch the next episode on Rcpp, the C++ version is going to be even faster. So be sure you've subscribed and know when that episode comes out, so you can see how I got this to run faster still.

Before we end today, though, I want to do one more thing and ask: just how sinister is this for loop? Can we remove it entirely, and will that affect performance? What I'm going to do is convert the guts of the loop into a function, and then use the map_dfr() function from purrr to map that function across all the rows of the data frame. So we create a function I'll call index_confusion. It takes i, the row we want to analyze; its body is all that junk we just wrote; and instead of assigning to i_confusion, it simply returns the contents of that c() vector. If I run that and then call index_confusion(1), asking what the counts are for row one, we see two true positives, a bunch of true negatives, and some false positives. That works great. We can then apply this over all the rows. Let's see, I should be careful what I'm deleting here. I'm going to replace the for loop with map_dfr(1:n_operons, index_confusion). That outputs a data frame, which for now I'll call confusion, whose columns are true_pos, true_neg, false_neg, and false_pos, and I then need to sum over those columns with summarize: true_pos = sum(true_pos), true_neg = sum(true_neg), false_neg = sum(false_neg), and false_pos = sum(false_pos).
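Sketched out, with the double-count correction I'm about to mention folded into the summarize, that map_dfr version looks something like this; same reconstructed names and toy data as the earlier sketches:

```r
library(tidyverse) # purrr for map_dfr(), dplyr for summarize()

genomes <- metadata_esv[["genome_id"]]
esvs <- metadata_esv[["esv"]]
n_operons <- length(genomes)

# Counts for a single row i, returned as a named vector; map_dfr() binds
# one such vector per row into a data frame.
index_confusion <- function(i) {
  same_genome <- genomes[i] == genomes
  same_esv <- esvs[i] == esvs
  c(true_pos = sum(same_genome & same_esv),
    true_neg = sum(!same_genome & !same_esv),
    false_neg = sum(same_genome & !same_esv),
    false_pos = sum(!same_genome & same_esv))
}

confusion <- map_dfr(1:n_operons, index_confusion) %>%
  summarize(true_pos = (sum(true_pos) - n_operons) / 2, # self/double counts
            true_neg = sum(true_neg) / 2,
            false_neg = sum(false_neg) / 2,
            false_pos = sum(false_pos) / 2)
```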
And I'm reminding myself that we have that double count problem still, true neg equals sum, true neg, and then false neg must be tired, my typings get even worse. And then false pause, some false pause. Okay. And what we want to do though, to correct this is to remove an operands, and we'll then divide that by two, as well as all these other values, and we'll be in good shape. And I can get rid of all that stuff. And I think I need a closing parentheses for my system time. I don't need to initialize this, because we're going to sum it across all those columns. Now, we can run this. And the question then, is this going to take the same length of time, so it can take 28 seconds? Or is it going to be faster, right? So if you think for loops are the worst thing ever, then this should run faster than our for loop. But I have a suspicion it's not going to be much different. So we'll run this. And we'll see if it's done in 28 seconds. And more importantly, does it give us the right answer? There you go, 28 seconds. What do you know? Not much different in speed. So I think we've we've shown that for loops are not the worst thing ever. But double nested for loops, those probably are the worst thing ever. And again, looking at our code and where we've come from from that double nested for loop to even when we had a single for loop, we've really sped things up. And the key to this was to remove things from that inner loop that didn't need to be in the inner loop. So keep things at the most exterior loop that we can, and also make use of vectorization. So again, one of the things I want to emphasize, of course, is that throughout the whole process, well, except for that one case where we had the bug, we had the same result each time. And what we're trying to do is speed things up. And the only reason I really care to speed things up than the only reason I'm spending time with this is because what we've done here in this 28 seconds, I'm going to then multiply by four regions and 14 different thresholds. That's 56 combinations. So that's, let's look at the math, right? So if we think 28.265 times four regions times 14 thresholds times 100 iterations, right? So that's all in seconds divided by 3600. So it's like 44 hours to run still, right? And so by making it twice as fast, we've taken it from I think it was like four days down to two days to run. And again, I'm sorry about the spoiler, the C++ version will be even faster. And soon, we'll see a way that we can parallelize this to make it even faster. Besides the speed, the other advantage that we get by this approach is that it's a much smaller memory footprint. So you'll remember that with that expand grid, that we were sucking up gigabytes of RAM. Here, we're using kilobytes, even maybe bytes of RAM to store the information we need to generate this confusion matrix. It's a lot more efficient. It's a lot better. But again, we still get the same result. So I would encourage you to think about the different steps that I use through here to try to optimize my code. One of the things I also notice is that we started with a very tidy verse, lots of dplyr, lots of piping of things. And what we've got here, aside from this map DFR, but you know, if we had that for loop in here still, this is what I would call base R, we've gone from the elegance and the simplicity and writing of dplyr pipelines and workflows to base R. 
Coming back to that last point: there are a lot of nice features in the tidyverse that make code easy to write, but we might be taking a performance hit by using those functions. I don't think the base R here is that much harder to write than those dplyr pipelines, but it's perhaps not quite as easy, and there's certainly no piping, which I find to be a really nice convenience. Anyway, I'm going to save this. I'm not going to merge this branch into issue-37, but I am going to commit it; I'll wait until I'm finished with you all to do that. Think about your own pipelines and try to figure out where the bottlenecks are. If they're big bottlenecks that are slowing you down on the order of hours or days, see if you can overcome them with things like vectorization, and by checking whether you're reaching into a data frame for an individual cell over and over, like we were here. Stay tuned for the next episode, where we talk about using C++ code inside of R; even if you don't know C++, it'll give you insight into how you can optimize code beyond the great steps we've taken today. As always, keep practicing, and we'll see you next time for another episode of Code Club.