Hey friends, have you ever heard that you're supposed to avoid using for loops in R? Well, if you don't use a for loop, then what else should you use? In previous episodes, we talked about using map functions. In today's episode, we'll see how we can do an all-versus-all comparison of values in a data frame using expand_grid along with inner_join. I'm Pat Schloss, and this is Code Club. In each episode, I try to take reproducible research practices and apply them to an interesting biological question. Well, I think they're interesting, at least; I don't know about you. Over the past couple dozen episodes, we've been looking at using these practices to better understand the sensitivity and specificity of amplicon sequence variants (ASVs). In today's episode, we're actually going to quantify sensitivity and specificity. Even if you don't know what bacteria are, or 16S genes, or amplicon sequence variants, or why you should even care, I know that you'll get a lot out of these episodes. Please be sure that you subscribe to the channel, click that bell icon, and tell your friends so that we can spread the word about all these great practices.

We've probably all heard that in R, for loops are to be avoided. One of the nice things about R is that it's what's called vectorized: you can run sqrt on a vector of numbers and get back the square root of every number, right? Or you can add two vectors of numbers together without really having to think about it. You could have a vector called x and a vector called y; you do x + y, and voila, you have a new vector that's the element-wise sum of those two vectors. It's great, right? But R has performance problems when it comes to for loops. Things get really slow, and that's a challenge. You'll notice that all through my past R code, I've never used a for loop. That's not going to change in this episode; instead, we're going to learn another strategy for avoiding for loops, using expand_grid, to solve a particular problem: making comparisons between all possible pairs of rows in a data frame. You can think about this in terms of calculating distances. Say you had a table with 1,000 sequences in it, and you wanted to calculate the distance between all 1,000 sequences. You could write a for loop where you compare the first one to the second, the third, the fourth, the fifth, and then come back to the second and compare it to the third, fourth, fifth, all the way down. But again, that tends to be very slow in R. So to avoid that, we'll use expand_grid from the tidyverse to create a new data frame that has two columns with all combinations of row indices from the data frame we want to do our all-versus-all comparison on, and then we'll use inner_join to bring the data back in. All of this is working towards calculating the sensitivity and specificity of amplicon sequence variants at any threshold for defining ASVs, and for any region of the 16S rRNA gene that we're interested in.

As I mentioned in the introductory comments, in this episode and several subsequent episodes, we're going to be looking at how we can use R to take our ASV data and generate a receiver operating characteristic, or ROC, curve. To do that, I have created issue 37 here on the issue tracker. To give you a brief overview of the idea: we want to create what's called a confusion matrix.
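Before we dig into that, a quick aside to make the vectorization point from a minute ago concrete. This is a minimal sketch in base R with made-up values, not part of the project code:

```r
# Vectorized operations work on whole vectors at once -- no loop needed
x <- c(1, 4, 9, 16, 25)
y <- c(5, 4, 3, 2, 1)

sqrt(x) # the square root of every element: 1 2 3 4 5
x + y   # element-wise sum of the two vectors: 6 8 12 18 26

# The equivalent for loop is more code and typically slower in R
z <- numeric(length(x))
for (i in seq_along(x)) {
  z[i] <- x[i] + y[i]
}
z       # same result as x + y
```

Okay, back to the confusion matrix.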
A confusion matrix allows you to quantify the number of true positives, true negatives, false positives, and false negatives, as well as associated statistics — things like sensitivity, specificity, accuracy, precision, and recall. We'll talk about those in a bit. Confusion matrices and ROC curves allow you to look at the trade-off between sensitivity and specificity for a classification system as a function of the different thresholds you use. That sounds a lot like what we've been doing, doesn't it? We can look at the sensitivity and specificity — or, as we said in the last episode, the lumping and splitting — as a function of the threshold we use, and that will allow us to generate a ROC curve.

Converting our data into a confusion matrix is going to take a little bit of thinking. The mental model I'm working with is that we've got two classification systems: we have the Linnaean taxonomy, the bacterial taxonomy with its species names, and we have our amplicon sequence variant classification. We want to know how well those match each other. If we take two operons from the 16S genes of all of our bacteria, and those two operons come from the same taxon, the same species, and have the same sequence, then that's a true positive: they're in the same species and they have the same sequence, so those two operons go together. If they're from different taxa and they're different ASVs, then that's a true negative: those are two things that are different, and they're called different. If the two operons are from different taxa but have the same ASV, that's what we called lumping in the last episode, and what we'll call here a false positive. If they're from the same taxon — say the two operons come from the same genome — but they have different amplicon sequence variant assignments, then we'll call them false negatives: the ASVs say they're different, even though they're the same thing. That, again, is what we called splitting last time. And as we saw last time, we can look at lumping versus splitting as a function of the threshold on the x-axis. We'll do the same type of thing here, where, as I say down here, we will end up building the ROC curve, which shows you the sensitivity — the true positive rate — versus the false positive rate on the x-axis, which is one minus the specificity. We can do this for each region, and as we draw that curve, each point on the curve is going to come from a different threshold that we use to define an amplicon sequence variant.

So I've created a plan of attack, and I've gone ahead and copied this text into an R script that will be the basis of issue 37 as we get going on today's episode. I'll bring us into our project root directory and do git checkout -b issue_37. Good. If I do git status, you'll see it already shows up in red because, again, I already put that get_roc_data file in place. I'll fire up RStudio for us to do our work in. When the Files pane comes up, I'm going to be working out of the code directory rather than the exploratory directory, because I can already see that this is going to be an analysis that gets kind of complicated — perhaps more complicated than we really want to put into an R Markdown document. Perhaps we'll make an exploratory document once we're all done, to summarize what we found.
But the bulk of the work we do is going to be done in this get_roc_data.R. As you can see, I've already copied a lot of stuff in here, including the shebang line, so we can make this code executable and run it from the command line, as well as an initial setup of our inputs and outputs, again copying over what I had in the issue tracker. I also put in a couple of considerations that we'll want to think about. We may need to generate data for additional thresholds, depending on the shapes of the curves: if there's a big jump in a curve, that would indicate that we need better resolution between threshold values, so we might have to go back and get more points. Also, it might be ideal to look into how we can parallelize our steps to speed things up. I suspect this analysis is going to be really hard for R to do. It's going to be a lot of computation, perhaps more than I really want to run on my laptop; it might suck up a ton of RAM and take up a ton of processing power. So in this episode and several subsequent episodes, we'll be looking at how we can speed this up, and that's something I'm already trying to think ahead about.

So, our plan of attack. Again, I like writing out this pseudocode, this outline, for myself, so that I know what I need to do to achieve my goals. I need to read in my ASVs and subset them to get the separate region and threshold combinations. I need to read in the metadata and subsample it to get one genome per species, so that we have the genome ID, ASV, and count columns for each taxon. And of course, we then want to join those together. These first two steps I'm going to borrow from last episode's code, because why rewrite it? Then we want to generate the data frame that contains a separate row for each operon in the dataset by replicating each genome ID / ASV row count times. In other words, I need to create that all-versus-all comparison, where I compare every ASV to every other ASV to see whether they're the same ASV or different, and whether or not they came from the same genome or species. That will give me the data frame I can use to generate my confusion matrix: I can look at whether the ASV columns are the same or different, and whether the species columns are the same or different, to determine if a pair is a true positive, true negative, false positive, or false negative. I can then generate the associated statistics for each region, ASV threshold, and iteration. I say iteration because, as I note up here, we are going to subsample, and so we might want to run this 100 times for each region and each threshold. I think that's really where this is going to get computationally demanding. And probably not today, but eventually, we'll be plotting our ROC curve and perhaps some of these other statistics across thresholds.

So let's get to it. I'm going to come back to my exploratory directory and open up my last episode's document. In this code, let's see, I'm going to copy all this stuff, and that covers those first steps. Let me move some of this around so I get it in the order of my comments here, my outline.
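Since we're borrowing those first two steps from last episode's code, here's roughly what they look like. This is a sketch: the file paths and exact column names are stand-ins I'm using for illustration, not necessarily the ones in the project:

```r
#!/usr/bin/env Rscript

library(tidyverse)

# Step 1: read in the ASVs
# (placeholder path; expected columns: asv, genome, count, region, threshold)
asvs <- read_tsv("data/processed/asvs.tsv")

# Step 2: read in the metadata and subsample one genome per species
# (placeholder path; expected columns include genome_id and species)
metadata <- read_tsv("data/references/genome_metadata.tsv") %>%
  select(genome_id, species) %>%
  group_by(species) %>%
  slice_sample(n = 1) %>% # randomly keep one genome per species
  ungroup()
```

Because slice_sample picks a genome at random, this is the step that makes each iteration different when we eventually repeat the analysis.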
It's nice when you've got code already, because then you can copy and paste it. The downside of copying and pasting is that if you have an error once, you propagate it all the way through. We've talked about DRY — don't repeat yourself — in the past, and this definitely is not DRY. But although we're writing this in an R script, we're still kind of in exploratory mode, and we haven't really nailed down our analysis. Okay, so I'm going to do the join here, and let's make sure our code does what we want it to do. I'm going to remove these here() statements, because that was something we used within the R Markdown document, and I'll remove them from down here in the metadata section as well. Okay. We want to get the separate region and ASV threshold combinations, so I'm going to come up here and define the threshold — I'll say 0.01 — and the region — I'll do "v4". I don't know why I used the equals sign there instead of the arrow. Wow, such a noob. All right. What I can then do is take the code I had down here with the threshold and tack it onto the end of the step where I'm reading in the ASVs. Let me check — I guess that indenting is right. I'll name these desired_threshold and desired_region, and then filter with threshold == desired_threshold and region == desired_region. Let me start running some of this stuff and make sure it does what I want. I look at the ASV data frame, and I see that I've got the ASV, genome, count, region, and threshold columns. I don't need the region and threshold columns anymore, so I'll drop those. I could do minus region, minus threshold, but I'll keep it positive: let's select the ASV, genome, and count columns. And that looks good, right? That's what we wanted: read in the ASVs and subset to get the separate region and ASV threshold combinations. We're good. Read in the metadata and subsample to get one genome per species, so that we have the genome ID, ASV, and count columns for each taxon — I think that all looks good here as well. Running the metadata, I then have my genome ID and my species, so we're in good shape. Next: generate the data frame that contains a separate row for each operon in the dataset by replicating each genome ID / ASV row count times. If I run this inner_join — ah, you'll notice I forgot to rerun the ASV step where I removed the region and threshold columns. Okay, let's try that again. Good. So now we have our genome ID, our species, and our ASV — I really only need the genome ID or the species; it doesn't really matter which — and then the count. This is the number of operons that this ASV appears in within this genome, right? What I really want is to pull that apart, because this is on an ASV basis, not an operon basis. You'll recall that we got that count column many episodes ago using the count function, and we can use a related function called uncount. So I'll uncount on the count column, and this creates a separate row for each operon, for all of our ASVs and genomes. Whereas before, ASV 972 appeared twice in this Catharanthus roseus genome, here we now have it duplicated across two rows. Okay, that looks really nice. We now have a row for each operon in the dataset. Let's save that as metadata_asv, and we're in good shape.
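Pulling that all together, the pipeline for this part looks something like the following sketch (the column names, again, are my placeholders):

```r
desired_threshold <- 0.01
desired_region <- "v4"

asv <- asvs %>%
  filter(threshold == desired_threshold,
         region == desired_region) %>%
  select(asv, genome, count) # region and threshold are now constant, so drop them

# Join to the subsampled metadata, then expand to one row per operon.
# uncount() is the inverse of count(): a row with count = 2 becomes two rows.
metadata_asv <- inner_join(metadata, asv, by = c("genome_id" = "genome")) %>%
  select(genome_id, species, asv, count) %>%
  uncount(count)
```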
Okay. Now we want to create a data frame that compares each operon to every other operon. One thing we could do, which I'm trying desperately to avoid, is write what's called a for loop, where we compare this first line to the second line, the third line, the fourth line, all the way through — and there are almost 20,000 different operons in our dataset — and then come back to the second one, compare it to the remainder, and keep going like that. In R, that tends to be rather slow. So I'd like to try this without a for loop. We might come back to a for loop, because in some contexts it might actually be a little faster. There's a function in the tidyverse called expand_grid. I can do, say, one colon five, one colon five — and if I run that, it's not happy. I think I need to do x = 1:5, y = 1:5. Run that, and what we see in the output is that it replicates the x column, one through five, five times: x is one while y runs 1, 2, 3, 4, 5; then x is two while y runs 1, 2, 3, 4, 5; and so on. Then we can do a trick: pipe that to filter and say we want x less than y. That gives us the unique combinations of x and y, and we also don't get the self-comparisons, right? So here's the strategy I want to use: if I have an index column — a column that numbers all of the rows in metadata_asv — I can do an expand_grid over the number of rows in that data frame, filter to get those rows where x is less than y, and then do an inner_join between this expanded grid and metadata_asv. To get this to work, I need to modify things a little and do a mutate. I'll say index = row_number(), and that adds a column to metadata_asv, which we see over here as the final column, called index, which numbers every row in the data frame. The other thing I can do is say n_operons is nrow(metadata_asv). That should be close to 20,000 — it should be 19,996. n_operons: 19,996. Bingo. Okay, so instead of one to five, we'll do x = 1:n_operons, y = 1:n_operons, and then filter to get x less than y. And this is where we start to see things slowing down — so while this is thinking, go ahead and make sure that you're subscribed to the channel and that you've liked this episode, if you don't mind. We see that we have 199,910,010 rows — it's a very large data frame. The next thing we want to do is pipe this into an inner_join, where we take what's coming through the pipeline and inner join it to metadata_asv by x = index, and then we'll add another line where we join again by y = index. Okay, we'll give this a run; it's going to be a little slow. I'm going to call the result all_all_comparison, and we'll let that rip. I'll do some editing to speed this up a little. Alright, so that ran through, and if I look now at all_all_comparison, it's big — yeah, I can tell R is not really happy here. Anyway, we see x, y, genome_id.x, species.x, asv.x, genome_id.y, species.y, and asv.y. My sense is that this object is probably really big.
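Here's that whole expand_grid strategy in one place, as a sketch:

```r
# Toy example: all unique, non-self pairs of the values 1 through 5
expand_grid(x = 1:5, y = 1:5) %>%
  filter(x < y)

# The real thing: number the rows, enumerate all pairs of row indices,
# then join the operon data back on for both sides of each pair
metadata_asv <- metadata_asv %>%
  mutate(index = row_number())

n_operons <- nrow(metadata_asv) # 19,996 operons in this dataset

all_all_comparison <- expand_grid(x = 1:n_operons,
                                  y = 1:n_operons) %>%
  filter(x < y) %>% # 199,910,010 unique pairs
  inner_join(metadata_asv, by = c("x" = "index")) %>%
  inner_join(metadata_asv, by = c("y" = "index"))
# the duplicated column names pick up .x and .y suffixes from the two joins
```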
One thing I can do is return to my command line and run top, which shows me which processes on my computer are using the most CPU or the most memory. If I hit the o key and then type mem in all caps, it sorts everything by the memory being used. And I see that my R session is using 18 gigs of RAM. I forget what my Mac has — I have 32 gigs of RAM, right? So I'm not using it all, but I'm using a lot. One thing I could do to free up some space is get rid of x, y, and probably species.x and species.y, because all I really need is genome_id.x, asv.x, genome_id.y, and asv.y. If I remove those four columns, I can free up some space. All right, let's do that. Up earlier in the pipeline, I'll do a select with minus species, and that gets rid of what would become species.x and species.y. If I run this, that's pretty quick. Then n_operons, and down here I'll do select(-x, -y). Give that a run — again, this is going to take a moment, and I'll speed things up with the editing. Now it's run, and if I look at all_all_comparison, I of course see that I've got genome_id.x, asv.x, genome_id.y, and asv.y. Okay, so what I want to do is compare, for each row — where x and y are the two different operons I'm comparing — do they come from the same genome? In this case, yes, they do. Do they have the same ASV? In this case, yes, they do. So this would be a true positive, right? We can look at the next case, where they come from different genomes and have different ASVs — that's a true negative. So I can use logic within a mutate to define our true positives and true negatives. Let's see, I can add this here and do a mutate. Because I easily get confused, I'm going to grab the text from my comment up above and plop it down here, so that I don't forget what goes in my confusion matrix. Okay, so we have tp for true positive, and that's going to be genome_id.x equals genome_id.y — I'll put that in parentheses — and asv.x equals asv.y. I think that's good. If I run that, it creates a column called tp, and as we see, that first row is a true positive. Great. I can then add tn the same way; I'll copy this line down. Oops, must have hit the wrong key somewhere — I've got a semicolon in there. So a true negative is where the genome IDs are different and the ASVs are different. We can look at this and see that we get a column called tn, and most of these rows are actually true for true negative. That's great. Then we can do the false negative, and I'll again copy this logic down. We said that a false negative is the same taxon — the genome IDs are equal — but the ASVs are different. And a false positive is where the genomes are different but the ASVs are the same. We'll remove that, run it. When you're working with this many rows, everything gets really, really time consuming. So now we've got our columns of true positives, true negatives, false negatives, and false positives, and we can summarize: true_pos is the sum of the tp column, true_neg is the sum of the tn column, false_neg is the sum of fn, and false_pos is the sum of fp. If we run all this, we get our confusion matrix.
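Here's that classification and tallying logic, following the definitions from earlier — a sketch that assumes the species columns and the x/y index columns have already been dropped upstream, as we just did:

```r
confusion_matrix <- all_all_comparison %>%
  mutate(
    # true positive: same genome (and therefore species, since we kept
    # one genome per species) and same ASV
    tp = (genome_id.x == genome_id.y) & (asv.x == asv.y),
    # true negative: different genomes and different ASVs
    tn = (genome_id.x != genome_id.y) & (asv.x != asv.y),
    # false negative: same genome but different ASVs -- "splitting"
    fn = (genome_id.x == genome_id.y) & (asv.x != asv.y),
    # false positive: different genomes but the same ASV -- "lumping"
    fp = (genome_id.x != genome_id.y) & (asv.x == asv.y)
  ) %>%
  summarize(true_pos = sum(tp),
            true_neg = sum(tn),
            false_neg = sum(fn),
            false_pos = sum(fp))
```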
So I'm going to call this confusion_matrix and let that rip. Again, it's going to take a moment or two, because we're looking at close to 200 million rows in this data frame. If I look at confusion_matrix, I see I've got about 48,681 true positives, close to 199 million true negatives, about 2,300 false negatives, and about 515,000 false positives. These false negatives, again, are splitting — false negatives are splitting species — and false positives are lumping species. Okay. So this is our confusion matrix, and it works pretty nicely. We can compute some other summary statistics with a mutate — we can measure sensitivity and specificity. Because I always forget the correct formulas for these things, I embedded a link in my issue for the associated statistics, which I'll open up; it goes to a page on everyone's favorite, Wikipedia. So the sensitivity, the true positive rate, is the number of true positives divided by the true positives plus the false negatives — TP over TP plus FN. And the specificity is going to be, I believe, the true negatives divided by the true negatives plus the false positives. We run that — ah, I misspelled false — and we see that our sensitivity is about 95 or 96%, and our specificity is about 99.7%. Again, this was at a threshold of 0.01, and the region was V4. To tack that onto the output, I can say region = desired_region and threshold = desired_threshold. Run that, and we then have all of our confusion matrix values, as well as our region and threshold. I can reorganize this output by doing a select with region and threshold, and instead of listing everything else out, I can use everything() — with open and close parentheses, as a function — and that puts my region and my threshold first, then my confusion matrix counts, and then all my other metrics after that. This gets us, I think, into a really good place. Let's call the result confusion. Actually, you know, I'm going to append these pipelines together. We'll run this — and it's not happy. What did I do wrong? I've got that plus sign in the RStudio console; I'll get out of that by hitting the escape key. Run this again. This runs, and I think it gets us what we want. Again, this is rather slow — I mean, I think it's slow; it takes a minute or two to run. And we have 52 combinations of regions and thresholds, so take this and multiply it by 52, and then multiply it by 100, because we want to do 100 iterations — since we're doing that slice_sample function. That's going to take a long time running on my computer. Perhaps we run it once and get it over with, and it's not that big of a deal, but it's still going to take a long time, and I might want to do other things with my computer than watch RStudio just wait and wait and wait. Okay, so I think this is a good place to stop for now. We have a really good scaffolding for the project, and we've gotten through our whole plan of attack.
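Before we wrap up, here's that summary statistics step as a sketch, with the formulas from that Wikipedia page:

```r
confusion <- confusion_matrix %>%
  mutate(sensitivity = true_pos / (true_pos + false_neg), # true positive rate
         specificity = true_neg / (true_neg + false_pos),
         region = desired_region,
         threshold = desired_threshold) %>%
  select(region, threshold, everything()) # region and threshold up front

confusion
```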
Something we might think about next: how do we iterate over the different thresholds we have? How do we iterate over the different desired regions? And then how do we replicate all of this, say, 100 times? But for now, I'm going to keep it to one threshold, one region, and one iteration, because in the next episodes we want to think about how we can actually speed this up quite a bit. So I think, again, we're in a good place. I'll go ahead and do a git status, git add that get_roc_data file in code, and commit what we did in this episode. This episode was kind of unique — again, we're seeing how we can use our code to work through a complicated problem. The other thing we did to avoid for loops was to use the expand_grid function along with inner joins. Some of that is slow; it remains to be seen whether it's actually slower than doing for loops, or than some other approaches we might pursue. Again, expand_grid is really nice if you want to get all combinations of two or more variables. We used it here with the filter function to get the combinations of index values that — if you think of them as representing cells within a square matrix of comparisons — sit in the lower triangle or the upper triangle; I don't know which, but one of those two triangles. Then we created our columns of true positives, true negatives, false positives, and false negatives, used the summarize function to aggregate those together and add up the number of cases, and then measured things like sensitivity and specificity. Once we're all done with that, we'll be able to make our ROC plot and any other figures we might be interested in out of this analysis. Okay, so play around with that expand_grid function — it's a strategy you can use to avoid for loops. Think about your own code: are there places where you're using for loops? We've seen before how to use the map functions, and here we saw how to use expand_grid with joins to avoid for loops. Do you still have for loops after trying these strategies? Where are you still using them? Let me know down below in the comments, and maybe in a future episode we can think about how to go about avoiding those too. I also hope that in a future episode we can look at for loops themselves and think about the places where for loops are fine and the places where they're definitely to be avoided. Anyway, keep practicing with that. Please let your friends know about Code Club, and we'll see you next time for another episode.