Has anyone ever told you that you should avoid for loops in R? Have you ever wondered why? Well, in today's episode, we'll dig into this issue. Hey folks, I'm Pat Schloss, and in each episode of Code Club, I try to apply principles from performing reproducible research to an interesting biological question. Over the past couple dozen episodes, I've been digging into the question of amplicon sequence variants, also called ASVs, to better understand the sensitivity and specificity of their use in defining bacterial taxa. Now, in the last episode, we started to build what we call a confusion matrix, where we look at how well the assignment of a 16S rRNA gene to an amplicon sequence variant corresponds with the bacterial taxon or species that the sequence came from. We know that if we use an ASV definition that's too narrow, then we run the risk of taking a single genome and splitting it apart into multiple bins. That's because there might be seven, maybe even 20, copies of the 16S gene in a genome. But of course, what we've also seen is that as you define that ASV more broadly, you run the risk of pulling together, or merging, or lumping different bacterial species into a single taxonomic unit. Which one should you prefer? Well, I think I prefer lumping over splitting, but we'll get into that later. In this episode, we're going to take the code that we generated last week, which used expand_grid to create an all-versus-all comparison to calculate our confusion matrix and measure sensitivity and specificity, and we're going to convert it to a nested, double for loop. Why would we do this? Well, it's going to help us get somewhere in the next episode. But it also gives us an opportunity to see why for loops have such a bad rap in R, and perhaps how we can use them to get better performance. So I'm going to go ahead and move to my project root directory. You'll see that we're working on issue 37.
Looking at my issue tracker, you'll recall that we were making this confusion matrix: if two operons came from the same species and got the same ASV, they're a true positive; if they came from different taxa and got different ASVs, they're a true negative. We can also think about false positives and false negatives, and then use all four counts to calculate sensitivity and specificity. One of the considerations we had, though (the second consideration), is that we might want to look into parallelizing things, or at least think about how we can speed things up. I'm not totally ready to parallelize the calculations, but I think it took about a minute and a half or so to run the code we generated last time. So let's go ahead and quantify how long that takes. I'm going to open up RStudio, and you'll recall that our script was in code/ and was called get_roc_data, for receiver operating characteristic curve. I'm going to run a bunch of this; let's see, I'll copy everything down through n_operons. This takes a second to run. The data we have at this point is metadata_easv, a data frame with three columns: the genome ID, the ASV, and the index, or row, within the data frame. You'll see the first two rows are the same, and that's because this ASV shows up twice in this genome. What we did with this all-vs-all comparison was to call expand_grid, do the inner joins, and then go ahead and generate the confusion matrix, right? So to clock how long this takes to run, I can come back to my terminal, run chmod +x on code/get_roc_data, and then run the script under the time command. We'll let this run; it's going to take a minute or so. But this will give us a reference for how long it currently takes using the expand_grid approach.
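To make the starting point concrete, here is a minimal sketch of the expand_grid approach described above. The object and column names (metadata_easv, genome_id, easv, index) are my reconstruction from the episode's narration, not a verbatim copy of the script, and the three-row tibble is a toy stand-in for the real ~20,000-row data frame.

```r
library(tidyverse)

# Toy stand-in for metadata_easv: each row is one 16S operon.
metadata_easv <- tibble(
  genome_id = c("A", "A", "B"),
  easv      = c("asv1", "asv1", "asv2"),
  index     = 1:3
)

# All-vs-all comparison: every pair of rows, each pair once, no self-comparisons.
all_v_all <- expand_grid(x = metadata_easv$index,
                         y = metadata_easv$index) %>%
  filter(x < y) %>%
  inner_join(metadata_easv, by = c("x" = "index")) %>%
  inner_join(metadata_easv, by = c("y" = "index"), suffix = c("_x", "_y"))

# Confusion matrix: count each of the four outcomes across all pairs.
confusion <- all_v_all %>%
  summarize(true_pos  = sum(genome_id_x == genome_id_y & easv_x == easv_y),
            true_neg  = sum(genome_id_x != genome_id_y & easv_x != easv_y),
            false_pos = sum(genome_id_x != genome_id_y & easv_x == easv_y),
            false_neg = sum(genome_id_x == genome_id_y & easv_x != easv_y))
```

The key point for what follows is that expand_grid materializes every pair as a row, which is fast but is also why the intermediate object gets so large.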
As always, as we wait, please be sure that you're subscribed to the channel and have clicked the bell icon so you're alerted when the next episode is released. We've got some really cool episodes coming up in the next few sessions that I'm really excited about; they build on this work and will get us toward writing and submitting a manuscript. I'll go ahead and copy these timings into my R script so that I have a reference for how long it took to run using the expand_grid approach. I do want to run all of this in my R session too. Again, this might take a moment. One of the reasons we might be willing to take a performance hit from a set of for loops is that we can write those loops in a way that doesn't hog much memory. Last time we saw that the data frame I generated after this step was, I think, about 16 or so gigabytes of RAM. My computer has about 32 gigabytes of RAM, so we're getting close to the capacity of what the computer can handle. We might run slower using a for loop but use less RAM; that might be a tradeoff we're willing to accept. A useful package for benchmarking memory usage and memory allocation is called pryr, and we'll use a couple of its functions in today's episode. I can call object_size() on the all-vs-all comparison object; let's see how big this bad boy is. If I look at top, I see that the full amount of memory being used is about 6.4 gigabytes, though that's for the whole R session, not just the object. It's taking a long time to calculate how big the object is; R is not very happy about it, no doubt. One of the reasons this is relevant is that I might want to parallelize the process, putting different regions or different thresholds on different processors. But if each of these all-vs-all comparison objects takes, say, 6 gigabytes, then I can maybe only run four or five processes at the same time before I use all the RAM in my computer.
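For reference, here is roughly how those pryr calls look. object_size() and mem_used() are real pryr functions; the vector below is just a small stand-in for the multi-gigabyte all-vs-all object in the episode.

```r
library(pryr)

x <- runif(1e6)   # one million doubles, roughly 8 MB

object_size(x)    # memory taken by this one object
mem_used()        # memory used by everything in the R session
```

The distinction matters here: top reports the whole session, mem_used() reports R's total allocations, and object_size() isolates the single object you care about.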
Again, this is one of the reasons a for loop might be preferable: it could theoretically use a lot less RAM, even though it might take longer to run, and then I could parallelize it. You see that the object size there is about 6.4 gigabytes. The other thing to note about the processing time is that the real time, or wall time as we call it, that it took to run the entire script was about 85 seconds. So 85 seconds, times 52 combinations of regions and thresholds, times 100 iterations, is about 442,000 seconds. If we divide that by 3,600 seconds in an hour, that's about 122 hours, which is 5.1 days, right? Like I said last time, I've got other stuff I want to do with my laptop. I'm not sure I want to tie it up for over five days running just this analysis. So we'd like to find a way to speed it up, or to use a lot less RAM so that it's easier to parallelize what's going on. Keeping that 85 seconds in mind, we're now going to rewrite this all-vs-all comparison, as well as the confusion matrix section, as a for loop. A for loop in R is written for(i in 1:n), where n is however far we want the loop to run, and we use curly braces to define the body of the loop. We have n_operons rows, and inside the outer loop we'll nest a second loop, for(j in (i + 1):n_operons). Starting j at i + 1 keeps us from comparing row one to row one; we compare row one to row two, and then all the others. One thing I want to note is that the outer i loop shouldn't go all the way to n_operons, the end of the data frame; it should stop one row before the end, so that when i is at the next-to-last row, j will be the final row. Okay. So that's good.
Now we can think again about the code we had down here, which I'll copy up here for now to remind me what I want to compute. You'll recall that our object is called metadata_easv. So I'll create a logical called same_genome. Here we'll use square brackets, which we haven't used in previous episodes of Code Club; they let you give a row and a column designation inside the brackets, separated by a comma. So: same_genome is metadata_easv[i, "genome_id"] equals metadata_easv[j, "genome_id"]. If the i-th and j-th rows have the same genome ID, same_genome will be TRUE. (I didn't mean to run that.) Then we'll create same_easv the same way, using the easv column instead of genome_id. Next, I'm going to create a named vector that I'll call confusion, with true_pos = 0, true_neg = 0, false_pos = 0, and false_neg = 0. I'm defining these as zero from the beginning because I'm going to add one every time we see one of these cases. Then we'll build an if/else block. If same_genome and same_easv are both true, we set confusion's true_pos element equal to true_pos plus one. I'll copy this line a couple of times and add else if to each. If the rows have different genomes and different ASVs, that's a true negative. If they have the same genome ID but different ASVs, that's a false negative. And if they have different genomes but the same ASV, that's a false positive.
And we're adding one to each of those counts. To make this a little easier, and so I'm not scrolling off the edge of the screen, I'll convert the names to two-letter abbreviations: tp, tn, fp, and fn. Okay. So that's going to generate my confusion matrix, right? Good. We can go ahead and save that. This is a nested for loop, and at the end of it all we can output confusion, which gives us what we want. The problem is that this is going to take a long time to run. Okay. So I'm going to run this for 85 seconds, timed on my phone, and when it gets to 85 seconds I'll kill it and we'll see how far it got. Okay, so we'll go ahead and run this. And I see it didn't get very far at all: metadata_easv not found. Why would that be? I think maybe I forgot to run all this other stuff up above. So I'll run that. I thought I'd already run it, but it turns out I'd typed the object name wrong. Spelling is hard. Okay, let's get this to work. There we go. It ran for 85 seconds, so let's look at how far confusion got. We can sum across confusion and see that it made about 218,000 comparisons. You'll remember from last time that the total number of comparisons was about 200 million, so this is really only a thousandth of the way there. If I type i, I see it only got to the 11th row of about 20,000 total. So it didn't get very far, and this is considerably slower than what we saw before. It might take a thousand times longer to run, which is not good, right? We want to do better than that. Let me copy this first before we get into why. So let's say iteration one got us this timing and this total.
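Pulling together the pieces dictated above, the nested for loop looks roughly like this. The data frame and column names are assumptions carried over from the episode's earlier code, and the three-row data frame here is a toy stand-in so the sketch is self-contained.

```r
# Toy stand-in data; the real metadata_easv has roughly 20,000 rows.
metadata_easv <- data.frame(genome_id = c("A", "A", "B"),
                            easv      = c("asv1", "asv1", "asv2"))
n_operons <- nrow(metadata_easv)

# Named vector of counts, initialized to zero.
confusion <- c(tp = 0, tn = 0, fp = 0, fn = 0)

# Outer loop stops one row early so j can always start at i + 1.
for (i in 1:(n_operons - 1)) {
  for (j in (i + 1):n_operons) {
    same_genome <- metadata_easv[i, "genome_id"] == metadata_easv[j, "genome_id"]
    same_easv   <- metadata_easv[i, "easv"] == metadata_easv[j, "easv"]

    if (same_genome && same_easv) {
      confusion["tp"] <- confusion["tp"] + 1   # same genome, same ASV
    } else if (!same_genome && !same_easv) {
      confusion["tn"] <- confusion["tn"] + 1   # different genome, different ASV
    } else if (!same_genome && same_easv) {
      confusion["fp"] <- confusion["fp"] + 1   # different genome, same ASV
    } else {
      confusion["fn"] <- confusion["fn"] + 1   # same genome, different ASV
    }
  }
}

confusion
```

On the toy data this counts one true positive (rows 1 and 2) and two true negatives, matching what the all-vs-all version would give.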
So one of the reasons that for loops are so slow in R is because of what happens on a line like this one. We take the value of an element in confusion, add one, and assign the result back, and every time R does that, it has to copy the vector and find a new place in memory to put it. In this case it's not that big a deal, because the vector is small; it's only got four elements, right? But imagine if we had to copy a big chunk of data every time, over thousands of iterations of a loop. That would take really long. The fix is what's called modification in place, and right now we are not modifying in place. If we instead wanted to modify in place, one thing we could do is convert confusion to a list. Let me back up a little, and I'm going to set n_operons to five so we have something easy to work with. I'll use the address() function from pryr (however you pronounce that) and say address(confusion), so that it outputs the address of confusion every time it goes through the loop. Ah, sorry, I need to say print(address(confusion)); print is a useful function for outputting the value of an object from within a loop or a function. So these are the memory addresses for those first five iterations of processing the data, and you'll see they're all different, right? R is updating confusion and then writing it somewhere else in memory each time. That's really slow. The alternative, like I said, is to convert confusion to a list. So we can wrap all those values, and maybe I'll use as.list(). Then confusion, we see, is a list with those same values. We haven't talked about lists before either; I usually don't use lists very much in my own work, but this is one case where it might be useful to use one.
And that, again, is because of modification in place. Where we used single square brackets to access the values of confusion when it was a vector, as a list we need double square brackets. So I'm going to put an extra pair of square brackets on all of these confusion accesses. Then we'll look at the address and see whether it changes each time we go through our nested for loops. Okay, now let's run this. And sure enough, every time through the loop the address is the same: R is modifying the list in place, and we're not asking it to find a new place to put this vector. So I think that's great. What I'll do now is repeat what we did before. I'll set the loop bounds back to n_operons, and I'll get rid of that print statement, because we don't want all those addresses flying at us. And I'll again run this for 85 seconds and we'll see how far it gets, if it gets any farther at all. Again, the vector isn't that big to have to rewrite every time, but you never know; a little bit might go a long way. All right, at a minute 25 we stop and look at confusion. sum() won't work directly on a list, but unlist(confusion) gets us a vector, and we can sum that. We get 220,000, which I think is a little further than we got before; let me see, that was 218,000. Yeah, it's probably not a super meaningful difference. So iteration one was confusion as a vector, and iteration two was confusion as a list, and the list got a little bit further along. One thing I'm noticing, actually, is that it doesn't seem to have found any false negatives, and I wonder why. I suspect it just didn't get deep enough into the data frame.
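Here is a condensed sketch of that vector-versus-list experiment, using pryr::address() to watch where confusion lives in memory. Exact copy behavior depends on your R version, so treat the printed addresses as illustrative rather than guaranteed.

```r
library(pryr)

# Same counts as before, but stored as a list instead of a named vector.
confusion <- as.list(c(tp = 0, tn = 0, fp = 0, fn = 0))

for (i in 1:5) {
  confusion[["tp"]] <- confusion[["tp"]] + 1  # double brackets to reach into the list
  print(address(confusion))                   # in the episode, this stayed constant
}

confusion[["tp"]]   # 5 after the loop
```

With the named vector version, the address printed on each pass was different, which is the copying behavior that makes the loop slow.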
If we look at i to see how many rows it got through, it only got 12 rows in; it only compared 12 sequences against every other sequence, and it may just not have seen any false negatives at that point. All right, so again, this question of modification in place is one of the big reasons for loops are slow in R. The use case I tried to motivate this with was that the data frame we generated using expand_grid and those inner joins was massive; it was like six and a half gigabytes of RAM, just too big, right? And the payoff here is memory: the only object the for loop keeps around is confusion, and if I run object_size(confusion), it's only 720 bytes. That's not even a kilobyte; it's really, really small, just four numbers. So the tradeoff could have been that the for loop runs slower but uses far less RAM. What we see, though, is that the loop itself is a bit of a dud. Even though we thought the expand_grid approach was slow, it's nowhere near as slow as the double for loop. Perhaps if we'd been keeping track of a larger data structure than just this confusion vector, we would have seen a bigger benefit. Now, I don't want you to think this was a waste of time. In the next episode we're going to take this code and do something cool: we're going to convert it to C++. Don't worry if you don't know C++; we're only going to use a very elementary part of it, along with a really cool package called Rcpp that allows us to write a function in R as C++ code. And I can guarantee you it's going to run so much faster than anything we've seen yet.
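As a teaser for that next episode, here is roughly what the same double loop can look like when pushed down into C++ with Rcpp::cppFunction(). This is my own hedged sketch, not the code from the upcoming episode; the function name, arguments, and toy inputs are all assumptions for illustration.

```r
library(Rcpp)

# Compile a C++ function from inside R. The loop logic mirrors the R version
# above: compare every pair of operons once and tally the four outcomes.
cppFunction('
  IntegerVector get_confusion(std::vector<std::string> genomes,
                              std::vector<std::string> easvs) {
    int n = genomes.size();
    IntegerVector confusion = IntegerVector::create(
      Named("tp") = 0, Named("tn") = 0, Named("fp") = 0, Named("fn") = 0);

    for (int i = 0; i < n - 1; i++) {
      for (int j = i + 1; j < n; j++) {
        bool same_genome = genomes[i] == genomes[j];
        bool same_easv   = easvs[i] == easvs[j];
        if (same_genome && same_easv)        confusion["tp"] = confusion["tp"] + 1;
        else if (!same_genome && !same_easv) confusion["tn"] = confusion["tn"] + 1;
        else if (!same_genome && same_easv)  confusion["fp"] = confusion["fp"] + 1;
        else                                 confusion["fn"] = confusion["fn"] + 1;
      }
    }
    return confusion;
  }
')

result <- get_confusion(c("A", "A", "B"), c("asv1", "asv1", "asv2"))
```

The loop structure is identical to the R version; the speedup comes entirely from the body being compiled rather than interpreted.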
And I'm leaving this code here, because it really is the logic for what we're going to do in C++. In C++ we'll run the same double for loop, and it will run faster because it's compiled; R itself is largely written in C, which is why built-in, vectorized operations run so much faster than the for loops we wrote. Anyway, if you're ever tempted to write a for loop, or heaven forbid a double for loop like I've done here, I hope this is a little cautionary tale. If you have to, then use lists rather than vectors, and check that you're not copying your data using the address() function from pryr. And if at all possible, try alternative approaches: things like the map functions, or strategies like the one we used with expand_grid. Anyway, again, this was not a waste. It will help us in the next episode, and hopefully it will be a reminder in the future if you ever think you really need to use a for loop. Keep practicing with this. I know this stuff is hard and sometimes a little bit confusing, but I know you can do it. I have confidence in you and your abilities. Hey, if I can learn this stuff, I think anybody else can as well. So please tell your friends about these Code Club episodes, be sure you subscribe to the channel, and click that bell icon so you know when the exciting next episode is released. And we'll see you next time for another episode of Code Club.