 Have you ever worked on a data analysis project where you have one step that just takes forever to run? It's so slow. Yeah, well, we talked about one of those in the last episode. Well, when all your other strategies fail, there's usually one more choice that will really help to accelerate your data analysis. That's embedding C++ code into your R code. And that's the topic of today's episode of Code Club. Hey, folks, I'm Pat Schloss and this is Code Club. In each episode, I try to work with you to identify ways to incorporate reputable research practices into solving an interesting biological problem. Over the past several episodes, we've been talking about analyzing what are called Amplicon Sequencer answer ASVs and their propensity to split individual genomes into multiple bins or to perhaps lump together different bacterial species. Now, this might sound esoteric to a lot of you, but I guarantee that if you stay tuned, you'll learn a lot. And in this episode, we'll talk about using C++ code in your R code to really accelerate those bottleneck steps. Two episodes ago, I started working with you on an issue, issue 37, where I was trying to build confusion matrices to describe, again, this propensity to split apart genomes or to lump together different species using Amplicon sequence variance. We know that individual bacterial genomes might have six, seven, 20 copies of the 16S RNA gene and that they're not identical. And so if we use identical sequence variants or ESVs, or even variants that allow for a little bit of slop or room to define an Amplicon sequence variant, we run the risk of taking an individual genome and splitting it apart into multiple taxonomic units or operational taxonomic units or ASVs, or whatever the heck you want to call them, right? The flip side of that is that as you expand that slop or that definition of what you will accept for an ASV, then you run the risk of merging together different bacterial species. An obvious example of this might be something like Bacillus Sirius, Bacillus Thuringiensis, and Bacillus Anthracis. They're three different species, yet by 16S, they are all the same thing, right? You would have to sequence their genomes to tell them apart. Well, how you define that threshold for how similar 16S sequences need to be to be in the same group affects whether you're splitting or lumping things together. I guess I should use my hands right whether you're splitting things or lumping things together, right? Okay? So we set about for this issue to create a confusion matrix so that we could look at how that confusion matrix changes as we change the threshold. Now what's a confusion matrix, you ask? Good. So it is a matrix that allows us to look at the rates of true positives, true negatives, false positives, and false negatives. What do all those mean? Well, a true positive is when you have two operands of the 16S sequence that are from the same species, and they wind up in the same amplicon sequence, right? That's a true positive. A true negative is two operands that are in different species and in different amplicon sequence frames, okay? Now, we also have false positives. And so these are two operands that are in different species, but the same ASV, okay? So we're saying they're the same, but really, they're different. And a false negative is where we're saying they're different, but they're really the same. So two operands that are in the same genome, but have two different ASVs, okay? So that's a lot of mumbo jumbo about how we're going to create or how we're going to define our confusion matrix. So how do we generate that confusion matrix? That's what we're working on, okay? So two episodes ago, we talked about using expand grid to generate this, and for one iteration, it took about a minute and a half to run. And you might say, that's really fast, Pat. Who cares? Well, the deal that we care, because we have 52 combinations of regions and different thresholds we want to test, okay? And we want, we're taking one genome per species. And so because we're randomly sampling genomes per species, we need to repeat this analysis, say 100 times, so we're not kind of hitting a jackpot or getting unlucky in which genomes we're picking. So we'll want to repeat it 100 times, right? So if we take that minute and a half and then multiply it by 52 combinations times 100 iterations, it's going to take about five and a half days to run on my laptop, okay? You might say, well, why don't we parallelize it? Well, we could, but each instance takes up most of the RAM on my computer. And so it's not really tractable to distribute that across multiple processors on my computer to speed it up. And even if we could speed it up, it's still going to take like hours, right? Maybe 12 hours, eight hours to run. And that's still longer than I would like. So one of the bottlenecks there was memory as well as speed. And so in the last episode, we talked about perhaps using for loops instead of expand grid, because expand grid generated a data frame that had about 200 million rows, right? It's huge. Whereas if we used for loops, then all we need to do is keep track of our confusion matrix as it as we add observations to it. And that's very small, right? We're only keeping track of four numbers. But the for loop approach was about 1000 times slower than the expand grid approach. So what we're going to talk about today is how we can use C++ code embedded in R using a special package called RCPP to carry this out. And I guarantee you it's going to be a hell of a lot faster. So again, even if you don't know anything about microbial ecology, genomics, 16s RNA gene sequences, Amplicant Sequencer, ESVs, all this during if you have no idea what it means, stay tuned, because I'm going to show you a real example of how to embed C++ code into an R script. There's a lot of examples out there we might call them toy examples of how to embed C++ code into R that don't really do anything useful. Right? Well, here I'm going to show you a real in the wild application of RCPP in an R script. And you'll see just how much faster it really is. All right, so I'll go ahead to my issued branch issue 37. We're good. I'll go ahead and open up Schloss, our project file in our studio. And what we were working on last time was get rock data dot R, because again, we want to use the confusion matrix data to generate a receiver operator characteristic curve. This is my code here. I've got a lot of the stuff that's here is also in the issue tracker that you can go check out. You'll recall my plan of attack. And where we're at is thinking about these steps four and five. So four is to create a data frame that compares each operon to every other operon. We're doing an all the all comparison using expand grid. It took about a minute and 25 seconds using the nested for loops. We were able to get through about 1000 1000th of the comparisons that we needed to in that minute and 25 seconds. So this is the code that we generated last time for generating the confusion matrix. So what we'd like to do is replace this with C plus plus code. Well, I'm not going to teach you C plus plus here, but I'm going to use some elementary C plus plus approaches that are not super advanced. And so again, what I'm trying to show you here is how to incorporate C plus plus code into your R code, not to teach you C plus plus C plus plus probably shouldn't be the first language you learn. But it can be a really useful tool for speeding up your code. Most of the tidy verse is actually written in C plus plus, which is why it's so fast, right? C plus plus runs a heck of a lot faster than R. But R is a lot easier to write. We talked about this in a previous episode where I kind of contrasted developer time versus execution time. R is great for developer time. C plus plus is great at execution time. Okay. So because I don't do this a lot, I don't use C plus plus in my code a lot because ours generally fast enough for most of what I do, I have two resources that I like to go to. The first is this chapter from Hadley Wickham's book, Advanced R. It's available online. I'll put it down below in the show notes, a link to it. He has a chapter on re rewriting R code in C plus plus. And there's another online book called RCPP for everyone by Masaki Suda is a whole book online about using C plus plus. Both are great resources. So again, the first step in putting C plus plus code into your code is loading the library C plus plus RCPP. So up here where I did library tidy verse, I'll do library RCPP that R is capitalized. And you might come over to packages and make sure you have RCPP installed. I believe that if you have tidy verse, you should also have RCPP. Okay, so let's go ahead and load that library. Everything is good there. So we'll come back down to our code. And the next step is that we want to use a function called CPP function to then describe the header of our function. And again, this is kind of getting into the nitty gritty of C plus plus, but stay with me. So I'm going to go ahead and comment out these lines through here. We'll probably come back and end up deleting them. But I want to leave them because this code here really represents the logic that we want to put into our C plus plus code. So I'm going to start my function by doing CPP function. And then in quotes, I'm going to put the data type that I want it to return. So I'll say return type. Ultimately, I'm going to want this to be a row of a data frame. But we have to figure out how to do that. And then the name of the function. So I'll say get confusion matrix. And then in parentheses, we want the argument. And I'm pretty sure this is going to be data frame with capital D, capital F. And I will then say ESV matrix. And let's see. So here we will put the closed parentheses on the arguments, and then an open brace and a closed brace there. And we're in good shape. So what we want to do is this return type is also going to be data frame. And let me look back at my resources here to see if Hadley talks about returning a data frame. Maybe I'll do a search for data frames, lists and data frames. Let's see, does this help me any provides a list and data frame classes. But they're more often more useful for output than for input. Because lists and data frames can contain arbitrary classes. But C plus plus needs to know their classes in advance. Okay, we're going to take that input. And we are going to create EASV. I'll say EASVs. And it's going to be an equal sign. And we'll say EASV matrix zero. And so important thing to know is that C plus plus is zero index. So it starts at zero rather than one. And so I also need to define this as a character, character vector for EASVs. And then our lines need to end in semicolons. It's another C plus plus thing. And then we need another character vector, character vector genomes equals EASV matrix one. So the first column are going to be our EASVs. And our second column is going to be the genomes. And that reminds me I should probably go ahead and run all this stuff. So that I get everything I need loaded. And so if I look at metadata EASV, I see my first column is the genome ID. The second column is the EASV. So I actually need to flip those. Good thing I looked. Okay, so where was my code? Good. So this is going to be one and the genomes are in the zero slot. So I'll switch these orders, the order of these two lines, so it makes more sense. And I'm going to go ahead and call this genomes EASVs. So it's a little bit more descriptive of what it contains. So that's great. And I'm also going to define an int for true positive to be zero, false positive zero, true negative to be zero, and then false negative to be zero. And I'm initializing these to zero, because if I don't initialize them, then I, you know, C plus plus isn't on the up and up for me, then it can actually initialize these with any values. And so it's the safest operation to initialize these to zero. And that's kind of what we had seen up ahead here when we ran the for loops in the previous episode. Okay. So now what we're ready to do is we're ready to index through all of our rows of genomes EASVs are these two columns, right. And so what we'll do is for int I equals zero, I less than genomes EASVs zero, and I'm going to define this as int n operons. And that's going to be the length of genomes. So this will be for I less than n operons. And you'll recall from the last episode, we didn't go all the way to n operons, we did n operons minus one. And so then J, so this will be at a curly brace. And then we'll do for int J equals one, J less than an operons, and then J plus plus, and I forgot the I plus plus up here. It really is remarkable how if you don't use a language in a while, you kind of forget some of the subtleties of the syntax. So like we had up here, if we have this, if we've got the same genome, then we want that to be true. Right. So I'm also up here going to define a Boolean for the same genome and same EASV. So same genome if genomes I equals genomes J, and then same EASVs I equals EASVs J. So again, the syntax is or the logic at least is very similar to what we had up above here for our for loop. And so now we can plug this all in where again, a lot of this is going to be the same. And maybe I'll go ahead and remove the comments and put this into place. And so we'll have this. But instead of all this, we can do TP plus plus semicolon. So that's a unary addition operator that adds one. So if you've noticed C plus plus is C plus one, right. And actually, R is a little bit of a play on that, because R is based on S, because it didn't have a unary operator. I didn't mean FS, I meant FN. They subtracted one. So there's a little bit of tongue and cheek there in how R was named in reference to this operator. Okay. All right. So this then is going to scream through. And it is going to generate the values for the confusion matrix. But I need to output a confusion matrix. And it's going to be of type data frame. So how do I do that? Well, let's go back and look at our resources here. Let me make this a little bit bigger. And let's see, how do we output a data frame? And so this isn't super helpful. But I know that RCPP has a section I believe on data frame. So we see data frame here. Yeah, that's not quite what I wanted. Here we go. Here's chapter 12 on data frames. And we can use data frame create to create a data frame object, also use named or underscore square brackets, if you want to specify column names when creating a data frame object, great. So we are going to create use this example here, we're going to give names to our columns. And let's see, let's see, we'll put that there. And our data frame is going to be called confusion. And we will create. So this is going to be TP, TP equals TP. And then we're going to replicate this for the other three values. And I think I can probably break this across multiple lines, and close that off with, yeah, I think that's the closing parentheses there. And instead of TP for everything, I'm going to do false positive, false positive, true negative, true negative, and then false negative equals false negative. That should work. And then we will say return confusion. Okay. And that with semicolon. And so that should be good. And let's go ahead and load this. And what it does is it actually compiles it and it'll tell us if there are any problems. So expect it a semicolon at the end of the declaration. So I wonder if I do need to keep all this on the same line. So let's, let's give that a shot. So we'll go ahead and rerun this. It's still unhappy. So named TN equals TN, named fn equals fn. So let me go back to my resource here and see the, huh, so data frame create, and then the parentheses, do I have a semicolon at the end? I do. Oh, I've got an extra closed parentheses here after the FP. And you know what? If I look at my error message, yeah, it tells me that it's expecting a semicolon right here. And that's where my extra closed parentheses was. Okay. So we should be in good shape. So I'll go ahead and reload this function and hopefully it'll go through without an issue. Boom, it worked really well. Now what I can do is I can use this function like it's a normal function, right? So I can use get confusion matrix. And I can then put in what was it? It was metadata ESV. I spent a little while since I looked at this. Yeah, metadata ESV. And if everything works, I shouldn't get any error messages, I should get an output data frame. So that went pretty quickly, much faster than a minute and a half for sure. And so that looks really good. So what I'm going to do is I'm going to go ahead and wrap this in system time. And you will call last time it took about a minute and 25 seconds to run. And so we'll see how fast this is. So it took about 10 seconds to run. So that's pretty fast. It's certainly a lot faster than what we saw before. It's about eight or nine times faster than using expand grid. Okay, so that's great. And so again, using this bit of C++ embedded in our code makes that a lot faster. And part of the reason it's faster is because C++ is a kind of one way of thinking is like perhaps a more basic programming language, and that it's kind of much closer to the processor than say R is R is written in C and C++. And so it's kind of abstracted away a level. And that level of abstraction tends to slow our down. Okay, so let's go ahead and clean up our code a little bit here to kind of remove all this old detritus of code that we had from those previous episodes. One thing also I can do is I can get rid of this index column. And I don't think I need this and operands thing because this isn't used anywhere. I define my own and operands inside of this get confusion matrix function. And this all we can revisit now, right? And so let's remove that and come down. And our confusion matrix is going to be basically get confusion matrix ESV, I'll go ahead and remove these comments, remove those comments. And I think this should all be integrated now. So if I run this confusion matrix as part of the whole pipeline, it should give me the confusion matrix as output, as well as those extra metrics that I was measuring like sensitivity specificity. And sure enough, we see region threshold true positive true neg false neg false pause sensitivity specificity, we're in good shape. Great. And so now what we can do is we can think about how we would perhaps take this and map it over all of our regions and all of our iterations to tie it all together. So I think that's enough for today. We will come back to this in the next episode to see how we would tie it together to go over all the combinations of regions that we're looking at the different thresholds, as well as those hundred iterations. And so what we should see now is that if it takes 10 seconds per iteration times 52 combinations times 100 iterations, divided by 3600 seconds, that gets us 14 hours 14 or 15 hours. That's a lot faster than five and a half days. And also, because we're not storing that massive 200 million row data frame, it's going to use a lot less RAM. And you know what, we might be able to speed it up even faster by using some of the parallelization options that come with some of the R packages. So stay tuned for that for a future episode. But so that you know when that future episode comes, please be sure that you subscribe to the channel and that you're getting alerts for new episodes being released. So let's go ahead and I'm going to go ahead and commit this issue. I'll commit at it, I get commit to say use RCPP to speed up all the all comparison. And something that I'm thinking about is that I should probably add to my read me that I'm using RCPP as a dependency. And if I come to my read me dot RMD file, that let's see, we're going to do library, RCPP. And then down here, we'll copy that and instead of our markdown, I'll say RCPP, and get that. And so that's good. And I can go ahead and update with the dash amend. And we're good. Great. So if not the next episode, the episode after that will finally close out this issue by being able to construct our receiver operator characteristic curve. Please think about if there are steps in your pipelines where there's a real bottleneck that takes forever to run. And you know that might be a good motivation to start exploring how to learn C++. Not asking you or suggesting that you write an entire packet using C++, but perhaps there's functions that you could write in C++ to really speed things up. That's the approach we took here. Anyway, C++ certainly shouldn't be the first language you learn if you're getting started. Feel comfortable with R and know that to go that next level to learn C++ to speed things up, there is quite a bit of a hurdle to get over to learn that new skill. So anyway, give it a shot, see how far you get, have some fun with it. It is kind of cool to be able to run C++ from within R. This is a relatively new package for R in the last, I would say it's become popular really in the last five or six years. Give it a shot and see what you think. Have you used RCPP in any of your previous analysis? Let us know down below in the comments. I'd love to see where you've used it and perhaps what might be some of the points or friction for incorporating C++ in your own analyses. So until next time, keep practicing and we'll see you next time for another episode of Code Club.