In many data analysis pipelines, there comes a point where you have to decide whether to tune your analysis to be more sensitive or more specific. Sensitivity and specificity are held in tension because they're inversely related to each other. So how do you decide how to balance them? Well, that's exactly what we'll talk about in today's episode of Code Club, using R. Hey, folks, I'm Pat Schloss and this is Code Club. In each episode of Code Club, I try to apply principles of reproducible research to a data analysis question. Over the past couple dozen episodes, we've been looking at the sensitivity and specificity of amplicon sequence variants (ASVs). Finally, we might be getting toward our answer. In today's episode, we'll look at different threshold definitions for creating amplicon sequence variants and see what impact those thresholds have on the sensitivity and specificity of how we define an ASV. Okay, don't go anywhere. Trust me, even if you don't know what amplicon sequence variants or 16S rRNA gene sequences are, I know you'll get a lot out of today's video. We're going to talk about making receiver operating characteristic curves, called ROC curves, in R, using the data we've been generating over the past few episodes as we've built up a confusion matrix. Okay, so once we've got our ROC curve, how do we interpret it? In today's episode, I'll show you a couple of different ways to look at that plot to balance the sensitivity and specificity of how we define an ASV. You'll recall that if we set the threshold for defining an ASV too narrow, we risk splitting one genome into many different ASVs, or many different bins. Yet if we set that threshold too wide, then yes, all the operons, all those 16S genes from one genome, coalesce into one ASV, but we also run the risk of having multiple species cluster together. We called that lumping, whereas defining it too narrowly is splitting. And so we're going to try to do something that's rarely done in taxonomy, which is to balance lumping and splitting. I'm only kind of joking. All right, let's head over to our terminal, and I'll go to our project root directory. You'll see that my issue tag here is red because I've already put in a file for our exploratory data analysis. I'll now open my .Rproj in RStudio. And if we look at the Files tab under exploratory... you know, I wonder why RStudio, when I open it as a project, doesn't put me into my project root directory in the Files tab. I find that weird. I wonder if there's a setting I'm missing. If you know the setting I'm missing, please tell me in the comments below. All right. So we're working in 2020-12-21-rockcurve.Rmd, which I've already populated with some commentary: an overview, a definition of sensitivity (the true positive rate) and of specificity (the true negative rate), and a series of questions and topics I want to look at. We're going to start by building that receiver operating characteristic curve, then look at the sensitivity and specificity of a 3% difference, which is commonly used to define operational taxonomic units, or OTUs. We'll then find the threshold that provides an equal balance of sensitivity and specificity: if I want the sensitivity and the specificity to be the same, what should that threshold be?
And then finally, we'll look at the point on our receiver operating characteristic curve that is closest to perfect classification, where you have a perfect sensitivity of one and a perfect specificity of one, and ask what threshold gets us closest to that state. All right. So we'll start by loading our libraries. Then I'm going to create confusion_data, which holds the data we've been generating over the past several episodes, using read_tsv. The file is going to be in data/processed. Let me make sure I've got my path right. Actually, you know what, I want to wrap this in here(), because that helps me with my paths: data/processed, and then, after a false start, rrnDB.roc.tsv. All right. I think I'll also set some column types: col_types = cols(.default = col_double(), region = col_character()). cols() is a function, so it needs parentheses, and then we need a closing parenthesis on our read_tsv (why is RStudio putting it down there?). All right, so we'll run that. If we look at confusion_data, we see the iteration column, the region, the threshold, and then the true positive, false positive, true negative, and false negative counts for all permutations of region and threshold, across our 100 iterations. So there are 56 combinations of region and threshold, each with 100 iterations. We're in good shape; everything read in nicely, with no warning messages. Warning messages on read-in usually don't mean anything, but I just don't like seeing them. Next we want to generate columns for sensitivity and specificity, and then get the average sensitivity and specificity across all those iterations. So I'm going to add a mutate to my pipeline. Sensitivity is the true positives divided by the true positives plus the false negatives; I think I've got that formula right. And then we also want our specificity, which is the true negatives divided by the true negatives plus the false positives. Double-check that formula: yes, true negatives plus false positives. Great. We can run this, look at confusion_data, and see that we now have columns for sensitivity and specificity, and those look good. What we want to do next is group_by region and threshold, because we want to draw one line per region, and get the average, or rather the median, sensitivity and specificity for each threshold and region. Then we can summarize with sensitivity = median(sensitivity) and specificity = median(specificity). I'm using the median rather than the mean because if the variation isn't normally distributed, the mean might give us a slightly wrong impression. One other thing I want, just so I know it, is sensitivity_iqr = IQR(sensitivity) and a matching specificity_iqr. The IQR is the difference between, say, the sensitivity at the 25th percentile and at the 75th percentile, and I want to know how large that spread is. The other thing I'm going to do here is remove my iter column.
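If you'd like to follow along, here's a minimal sketch of the pipeline to this point. The file path and the here() wrapper follow the narration, but the confusion matrix column names (tp, fp, tn, fn for the true/false positives and negatives) are my guesses, not necessarily what's in the actual repository.

```r
library(tidyverse)
library(here) # project-root-relative paths

# Sketch of the pipeline as narrated; tp/fp/tn/fn are assumed column names
confusion_data <- read_tsv(here("data/processed/rrnDB.roc.tsv"),
                           col_types = cols(.default = col_double(),
                                            region = col_character())) %>%
  mutate(sensitivity = tp / (tp + fn),     # true positive rate
         specificity = tn / (tn + fp)) %>% # true negative rate
  group_by(region, threshold) %>%
  # summarize() evaluates its arguments in order, so compute the IQRs
  # before overwriting the per-iteration columns with their medians
  summarize(sensitivity_iqr = IQR(sensitivity),
            specificity_iqr = IQR(specificity),
            sensitivity = median(sensitivity),
            specificity = median(specificity),
            .groups = "drop")
```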
So I'm going to select region and threshold; I don't need the iteration, and I don't need the confusion matrix counts, but I would like the sensitivity, the sensitivity IQR, the specificity, and the specificity IQR. I also want to set .groups = "drop". Give that a shot. It's unhappy with me. Why are you unhappy with me? Oh, because I think IQR is supposed to be all caps. Sometimes I wonder. All right. Now if we look at confusion_data, we see our threshold, region, sensitivity, and specificity, and the IQRs look really small; they're quite tight. Let me filter confusion_data on, say, threshold > 0.01. And yeah, these IQR values are really small. So I am going to keep the sensitivity and specificity, but I'm not going to worry about the variation; I've got the median, and that's good enough. I don't need to plot error bars, because with an interquartile range that small, you wouldn't even see them; they'd basically be no wider than the thickness of the line. Okay, so let's go ahead, and I've got to drop the IQR columns from my select. All right. So again, we've got confusion_data, and you know what, I'm going to rename this data frame sensitivity_specificity. I think I spelled that right. And so we've got our region, our threshold, and the sensitivity and specificity, all in great shape. Okay. The first thing we want to do is create a receiver operating characteristic curve. We'll take sensitivity_specificity and pipe it to ggplot. In aes(), to set our aesthetics, the x axis for a ROC curve is actually one minus the specificity, and the y axis is our sensitivity. I'm then going to add geom_line, and actually, I need another aesthetic up here: color = region, so we get a different line for each region. We'll add geom_line, and I'm going to use theme_classic; I really don't like the default ggplot look. Running this, we see our four regions and our receiver operating characteristic curve. One thing I notice is that in a lot of the work I do in my research, the sensitivity runs from zero to one and the specificity runs from zero to one, but this plot is really crunched in. If I add coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)), you'll see the data are so compressed into the upper left corner that you can't see anything. So I'm going to remove that coord_cartesian line; it gives you a sense of the full ROC curve space by blowing the plot out like that, but I don't think that's super helpful, so we'll stick with the zoomed-in view for now. I think that looks pretty good. So that is a ROC curve.
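Here's roughly where the code stands at this point, again as a sketch continuing from the block above:

```r
# Keep just the medians; the IQRs were so tight that error bars would be
# thinner than the lines anyway
sensitivity_specificity <- confusion_data %>%
  select(region, threshold, sensitivity, specificity)

# ROC curve: false positive rate (1 - specificity) on x, true positive
# rate (sensitivity) on y, one line per region
sensitivity_specificity %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = region)) +
  geom_line() +
  theme_classic()
# Adding coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) would show the
# full ROC space, but it crushes the curves into the upper left corner
```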
Okay, so the next question is: what's the sensitivity and specificity for a threshold of 3%? We can again take our sensitivity_specificity data and filter on threshold == 0.03, and we get what we'd expect, right? Hopefully the filter function looks familiar by now: we set that threshold, and we see very good sensitivity and specificity at the 3% level. But where does that fit on the curve? If you look closely, there are little bumps in each curve, and each of those bumps is a different point, but it's not quite clear which one is 0.03. Down here, the left end of each curve is a threshold of zero, and the right end is 5%, or 0.05. So where's 0.03? I'm going to assign that filtered data frame to a variable called three, take my ROC curve code, and add points for the data at a 3% threshold. To do this, as we discussed many episodes ago when we talked about combining different plot types, we can add a geom_point whose data come from the data frame three, with aes() again mapping x to one minus specificity, y to sensitivity, and color to region. We can run that. "Object 'three' not found": I don't think I actually ran the assignment, so it's not happy with me. And why else isn't it happy? I think this needs to be data = three; I forgot the data argument. There we go. Now you can see the balls on the lines that correspond to a 3% OTU definition, right? Here's v19, this green one is v34, and then v4 and v45 are further out here. Okay, that looks good. One thing that's a little hard to see on this plot, because it's so smushed in, is its overall shape. Usually, when you're dealing with the full range of sensitivities and specificities, the window is a square running from (0, 0) to (1, 1). So I'd like to add a geom_abline, which lets us specify the intercept and slope of a line to draw on the plot. My intercept is going to be one, for the upper left corner, and my slope is going to be negative one, since the line runs diagonally down from there. I'll make its color gray, and I'm making it the first geom because I want it in the background. Let's go ahead and run that. You see the line looks basically flat at this zoom, right? That diagonal is where sensitivity and specificity are in perfect balance.
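Here's a sketch of the plot with the 3% points and the gray diagonal layered in:

```r
# Points at the conventional 3% OTU threshold
three <- sensitivity_specificity %>%
  filter(threshold == 0.03)

sensitivity_specificity %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = region)) +
  # the diagonal goes first so it sits behind the curves; it drops from
  # the upper left corner (intercept 1, slope -1) and marks the points
  # where sensitivity equals specificity
  geom_abline(intercept = 1, slope = -1, color = "gray") +
  geom_line() +
  geom_point(data = three,
             aes(x = 1 - specificity, y = sensitivity, color = region)) +
  theme_classic()
```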
And so that's the next question we want to ask: what are the values along that gray line? I'm going to copy my code down, because we'll reuse it. We want to find the threshold where the sensitivity and specificity are as close to each other as we can get them; I'll call the result balance. Before naming it, let's take sensitivity_specificity and add a mutate where diff is the absolute value of sensitivity minus specificity. Okay. That creates a column of differences. Way down here, at the first point on the red curve for v19, the difference is 0.565, and as you come up the curve, closer to where that diagonal crosses, diff goes down. The next thing I want to do is group_by region and then summarize. We'll do min_diff = min(diff). I'd also like to know the threshold, the sensitivity, and the specificity where we see that minimum difference, and we can use a trick here: the which.min function. We can say threshold = threshold[which.min(diff)], using square brackets, which are a way to index into a vector; I think we talked about that a couple of episodes ago when we discussed for loops, vectorizing things, and different ways to access elements of a vector. We can then copy that logic for the sensitivity and specificity, and we'll also add .groups = "drop". And so we get our tibble, right? We find that the point with the best balance of sensitivity and specificity is actually at the far end, at 5%, for v19 and v34, whereas for v4 and v45 it's at 4%. Something worth talking about is that I had front-loaded all the points: I have a lot of thresholds between 0 and 0.01 and not much between 0.01 and 0.05. Now that I look at these results, I'm wondering whether I need more points between 0.01 and 0.05, and also whether I need to go out to 0.1, that is, out to 10%, to get a better shape on my curve. But again, this shows the data. I'm going to call this balance, and instead of three, I'll use balance to show where those points fall on the curve. They should be pretty close to where that diagonal crosses (I've got to run the assignment first). If we had more points in between, the plotted points would probably land even closer to that line, but you can already see that, for example, v45, this purple one, is basically right on the line, because its difference is very small. Great. Something else I'd like to do is output each of these data frames, because when I render this document, I want it to print balance, and I also want it to print three, so I have the tables along with the figures. Okay, we're marching right along. So: which threshold provides the sensitivity and specificity closest to perfect classification? Perfect classification is right here, where sensitivity is one and specificity is one. And remember, we're plotting one minus specificity on the x axis, so one minus one is zero, and we want to know which point on our ROC curves is closest to that corner. Let's take sensitivity_specificity and pipe it to a mutate that creates a distance column, calculating the distance between each sensitivity/specificity point and that corner with something like the Pythagorean theorem: square the x and y distances, add them, and take the square root. So: the square root of (specificity minus one) squared plus (sensitivity minus one) squared. I needed another set of parentheses, and I put too many i's in sensitivity, but there we go. And what we see, again, is that down here the points are really far from that corner, whereas as we come up the curve, the distance gets a lot shorter. Next we do the same type of thing we did above, where we found the minimum difference between sensitivity and specificity. I'm going to cheat a bit by copying that code down and tacking it onto the end of this pipeline, so I'll get rid of this extra tab, do min_distance = min(distance) instead of min_diff, and replace diff with distance throughout. That should all be good. And I now see the point with the minimum distance for each region. It's not perfect, meaning the distance isn't zero, because we don't have perfect classification. But I find that v4 and v45 give me a threshold of 3%, v34 gives me a threshold of 4%, and v19 5%.
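Here's a sketch of both summaries, the balance between sensitivity and specificity and the distance to perfect classification; the which.min() indexing trick is the same in each:

```r
# Threshold where sensitivity and specificity are closest to equal
balance <- sensitivity_specificity %>%
  mutate(diff = abs(sensitivity - specificity)) %>%
  group_by(region) %>%
  # which.min(diff) gives the row index of the smallest difference, and
  # square-bracket indexing pulls the matching value from each column
  summarize(min_diff = min(diff),
            threshold = threshold[which.min(diff)],
            sensitivity = sensitivity[which.min(diff)],
            specificity = specificity[which.min(diff)],
            .groups = "drop")

# Threshold closest to perfect classification at (0, 1); note that
# (1 - specificity)^2 equals (specificity - 1)^2, so either form works
distance <- sensitivity_specificity %>%
  mutate(distance = sqrt((specificity - 1)^2 + (sensitivity - 1)^2)) %>%
  group_by(region) %>%
  summarize(min_distance = min(distance),
            threshold = threshold[which.min(distance)],
            sensitivity = sensitivity[which.min(distance)],
            specificity = specificity[which.min(distance)],
            .groups = "drop")
```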
So again, I find that very interesting. I didn't really know what to expect; I actually thought it might come out closer to the 0.01 end of things, or perhaps even smaller. For it to come out at 3% for v4 and v45 is pretty interesting. So let's go ahead and copy our plot again. We'll call this one distance. Great. I'll put the data frame here too, so it gets output when I render the document, and instead of balance, I'll plot distance. That should be good. And now we see our balls here indicating the thresholds of 0.03, 0.04, and 0.05: this one at 5% is on the red v19 line, then v45 and v4 are on the left at 0.03, and this one is at 4%. So I think I really would like to go out to maybe 0.1. I don't know that I'd want to use 0.1, but it helps me frame the boundaries on my parameters, right? If I pick a parameter that's at the edge of the range I tested, then you have to wonder what, say, 0.06 would look like. But if we see that the distance goes up as I go further out, then I can say 0.05 was a good choice. At the same time, I really would like points between, say, 0.02 and 0.05, like 0.025, 0.035, and 0.045, so we can get more definition on our ROC curve and a more precise answer as to which threshold gives us the minimum distance, or the best balance between sensitivity and specificity. Okay, so the conclusions for now: we need more points going out to 0.10; we need more points between the percent thresholds, e.g. 0.025; but it appears that "larger" thresholds (in quotes, because it's all relative) provide the best balance and overall classification of ASVs relative to genomes or species. We'll save that, and I'll now go over to Atom to open up my Makefile. I'm going to come back down to my exploratory analysis targets, and the new target is, nope, not data, it's going to be exploratory/2020-12-21-rockcurve.md. That's going to depend on the R Markdown file as well as this file, my processed rrnDB.roc.tsv, and I'll borrow the R rendering call from the other exploratory targets. Let's go ahead and make this thing on the command line. Excellent. Look at our status. Again, we can open up our exploratory files. What are the other ones? If I hit tab a few times, I see there's a 5-1 figure file, and that's the plot of the shortest distance to (0, 1). Okay, good. So again, I run git status and see the files that have changed. Now I'll go back to my Makefile, because I want to add some points here. I guess I already had 0.025, so let's do 0.035 and 0.045, then 0.055, 0.06, 0.065, 0.07, 0.075, and then on out to 0.1. Okay. The other thing I notice is a .SECONDARY line I had in here that I don't want, so I'll go ahead and delete it. And when I come back down, I'm not going to run make straight away; I'll do make -n, a dry run. One of the beautiful things about make is that I can add targets, as I just did by adding values to the threshold list, and it sees all the other files it now needs to generate and already has instructions for building them; I don't have to change any of the code. It will come back, combine all those count table files, which is great, and then regenerate that ROC curve data.
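The Makefile itself isn't shown on screen, but the pattern being described, where the thresholds live in a variable that expands into one target per region/threshold combination, might look something like this sketch. Every name here (THRESHOLDS, REGIONS, the file pattern) is hypothetical:

```make
# Hypothetical sketch: adding a value to THRESHOLDS automatically
# creates new count table targets for make to build
THRESHOLDS = 0.000 0.005 0.010 0.015 0.020 0.025 0.035 0.045 0.055 \
             0.060 0.065 0.070 0.075 0.100
REGIONS = v19 v34 v4 v45

# one count table per region/threshold combination
COUNT_TIBBLES = $(foreach R,$(REGIONS),\
                  $(foreach T,$(THRESHOLDS),\
                    data/processed/$(R).$(T).count_tibble))

# `make -n exploratory` is the dry run: it prints every command that
# would be run for the new targets without executing anything
```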
So it's probably going to take more than three hours to run now, because I'm adding more thresholds to be combined, and then it will regenerate the file I just created. I'm going to go ahead and kick this off. I've got a variety of meetings during the day today, so I can set this up, let it fire away in the background, and come back and show you the results at the end of my day. As this was running, though, I had a thought: did I actually calculate distances going out to 0.1? Turns out I didn't. Looking at code/get_distances.sh, I see my cutoff was 0.05; I need to change that to 0.1. So I'll save this and relaunch the whole thing. And again, if I do make exploratory, now that I've updated the distances file, it's going to regenerate everything, unfortunately, which is fine; it'll just take a little longer. So we'll do that, and we'll be back. So, I regenerated all those count table files that were the dependencies for the individual regions and thresholds; we more than doubled, I think, the total number of region/threshold combinations we were generating. Make then coalesced all of those into the big count table file, and I updated all of the exploratory markdown files. Let's go look at those over here in Atom. The file we created yesterday was 2020-12-21-rockcurve.Rmd, and this is the rendered version of that file. As we scroll down, we see where the different figures would be inserted, as well as the tables. This first table shows the sensitivity and specificity at a 3% threshold, so I wouldn't expect it to change at all. Then this was balance, looking at the conditions where our sensitivity and specificity were as close to equal as possible. And we see that the additional resolution from those intermediate thresholds did help: v4 gets the best balance of sensitivity and specificity at 0.045, v19 at 6%, v34 at 5.5%, and v45 is still at 4%. That's interesting; it confirms what we saw before, but gives us a little better resolution on what we're looking at. The next condition, you'll recall, was the distance to perfect classification. Here we saw that the minimum distance to that upper left corner gave us thresholds of 5.5% for v19 (you'll remember that was at 0.05 before), 0.045 for v34, 3.5% for v4, and 3.5% for v45. So that all looks good. Let me briefly open up the figures we generated; those will be in exploratory/2020-12-21-rockcurve_files, and I'll put a star on that so they all open. This is obviously not what I'd use for publication, since the lines are really thin relative to the size of the plot, but you do see that the curves are a little smoother for all of these, and that they roughly line up on top of each other. This was the next one. Yeah, these all look pretty good. Once I push these up to GitHub, you'll be able to go into the GitHub repository for this project and see the exploratory file we just looked at, with those figures embedded. It will also be useful for thinking about the direction of any paper we might write. So I think we're in good shape. I'll go ahead and do a git status. Again, I re-rendered with make using the exploratory phony target, and that went through and regenerated the other files too, for things like lumping and splitting and the thresholds needed to drop ASVs. So I'll go ahead and commit all of these; the full git sequence ends up looking roughly like the sketch that follows.
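As narrated, the sequence is approximately this; the exact file and branch names are my reconstruction from the audio:

```bash
# Stage the rendered results, data, code, and Makefile
git add exploratory/*
git add data/processed/rrnDB.roc.tsv code/get_distances.sh Makefile
git add -f data/processed/rrnDB.esv.count_tibble  # force past .gitignore
git commit -m "Build ROC curves and update data, closes #37"
git checkout master
git merge issue_37   # branch name assumed from the narration
git status
git push
```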
Walking through it: I do a git add on exploratory with a star, to get everything in that directory, as well as data/processed/rrnDB.roc.tsv. I also want to add data/processed/rrnDB.esv.count_tibble, that big data file, as well as code/get_distances.sh, since I recall I was previously only calculating distances out to 5% and this updated it to 10%, and then the Makefile. Git is upset with me; it's not happy about that esv.count_tibble file, so I'll go ahead and do a git add -f to force it. Now we see everything is staged, and I'll do a git commit -m to build the ROC curves and update the data, closing issue 37. Finally, after several episodes! We'll then do git checkout master and git merge on the issue 37 branch to bring all that work over into master. I'll double-check with git status, then git push, and finally close out issue 37: building that receiver operating characteristic curve from our ASVs and our genome, or species, information. So this is great. I think we've learned a lot about the effect of the threshold on the lumping and splitting of ASVs across different species. GitHub is upset that I pushed up a big file; it's overall not that big of a repository, so I'm not going to worry about it too much. I probably shouldn't have done that, but we'll be okay. Anyway, like I was saying, I think we're in good shape. This episode is coming out on the Monday of the week of Christmas, and I'm going to try to post something on Wednesday that summarizes where we're at, to help us think forward into the new year and toward perhaps the first paper I'll work on in the new year, pulling this all together and trying to tell a story. And I'm going to do that here on YouTube, sharing my writing process with everybody. I know everyone has their own idiosyncrasies in how they write, and I want to share how I do it, with the added layer of doing it all in R Markdown; I think that will be a unique approach. I'll also walk us through things like: how do you pick a journal? How do you outline a paper? How do you even submit a paper to a journal? Preprints, getting figures together, all that stuff. That will probably be the plan for most of January and a little bit of February. So if you want to know when that comes out, and perhaps you're taking the holidays off like you should, be sure you've subscribed to this channel and clicked the bell icon so you know when things are back up and running in the new year. Anyway, look forward to the next video on Wednesday, summarizing all the results, and then we'll see you in the new year. All right, take care, and we'll see you next time for another episode of Code Club.