 Often when I'm analyzing data, I find that there's some operation that I want to repeat multiple times, perhaps for multiple variables or for multiple colors or for for many different things, right? Well, that repetition, it works, but it can cause problems. Stay tuned for the rest of the episode and I'll show you why it can cause problems and what we can do about it. In recent episodes of Code Club, we've been building out different versions of Figure 1A from this paper that my lab published a number of years ago by Alex Schubert in Mbio, and I think we've made it look a lot better. We're going to move on from that now and I want to look at is how we might go about constructing curves like we see in BCND. These are receiver operator characteristic curves looking at the value of using inverse Simpson index as a tool to dichotomize people as either being having a case or being non-diarrheal control or diarrheal control versus non-diarrheal control or a case versus non-diarrheal control. We have those three comparisons and BCND are those three different comparisons. We look at the blue line for each of these plots that is using the diversity information that is in A. In some ways, we could get a significant B value for comparing these three different groups, but then at the end of the day, it's like, okay, it's statistically significant, but is it biologically or practically significant? Could we use inverse Simpson index as a diagnostic marker to indicate whether or not somebody has Cdiff perhaps? And so that's what I want to do today is I want to convert what we have in figure 1a, the data that's represented here, into what we see in these blue lines in BCND. To do that, again, what we'll need is the sensitivity data as well as one minus the specificity so that we can make these rock curves. I'm going to start where we've started many of our previous episodes with this code chunk that, again, you can find what I'm starting with at a link down below in the description to a blog post where, again, you can grab this chunk of code. Also across the top here now is a link to a video episode I made about installing our studio, getting the tidy verse in the data that I'm working with. This initial chunk of code reads in the loads of the libraries, reads in the metadata, gets the alpha diversity data, the Simpson diversity index, and then joins those all together so that we have a variable now called disease inverse Simpson that contains our disease status, as well as the inverse Simpson value for each of those disease statuses. I'm going to walk you through my thought process in building out the code to generate the data for the rock curve. I know that eventually I'm going to want to make all three pairwise comparisons between these three different disease status groups. I'm going to start by defining two variables, negative, which for now I will call a non-diarrheal control and positive, which I'll call case. And again, I'm going to create these variables and then use those variables to create a function that will then make the comparison between non-diarrheal control and case. And so that later, thinking ahead, I can give it other values of negative and positive to make those pairwise comparisons as well. I'll start with disease inverse Simpson and piping that to filter. And I'm going to get those rows where disease stat is either negative or positive, right? And again, I run that and it's yelling at me because I have not defined or loaded already positive and negative. And so now I get back data frame that's got my two disease statuses and I can double check that with count on disease stat. And I see I've got my two disease stats, which is good. I now want to, I'd like to recode these two values, the negative and positive value to being true and false. If it's positive, I want it to be true. If it's negative, I want it to be false. So I'll do mutate disease stat and I'll use a recode and recode is a function that allows us to take a current value in that column disease stats or need to say disease stat and recode that value to something else. So I would like to do negative equals true and positive equals positive equals true and negative equals false. And if I run that, it complains or it doesn't complain, it gives me all NA values. And that's because negative as a word does not exist in that column. Again, I'm trying to use a variable because at some point this is all going to be encapsulated within a function that is going to bring in a value of negative, bring in a value of positive. And so I need to say that. And so we're going to use a friend that we learned about in a recent episode, which is the glue package. And you'll recall that we can put negative in quotes in curly braces to then get the value of negative in there for what we want this to be. And to do something special, to assign that we need to do some funny notation with the colon equal sign. So we run that and now we see all of our disease stats are true or false. And again, we can then count on disease stat and we get the counts that we had before 155 negatives. These are the healthy individuals in this case in 94 trues, or these are the people with diarrhea and C difficile. And so again, the glue function here comes in really nicely where we can define dynamically the value that we want to recode as false and dynamically the value we want to be true. We can do that with glue again, using the quotes and the curly braces. But we also need that colon equal sign to assign this dynamically defined variable to false or to true. Go ahead and remove that. So again, at this point, we have a data frame that has true or false for the disease status, as well as the inverse Simpson index. And what I'd like to do is to be able to step through each of my inverse Simpson values and see whether or not that value or lower is indicative of a case or a control. And so the disease stat is the true the observed diagnosis. We want to use the inverse Simpson then to predict the disease status. So the next step is to create a column using the mutate function that I'll call send spec. And this will contain the sensitivity specificity information. And because I want to operate row wise going through each value of inverse Simpson, I'm going to use the map function. And so the map function then is going to operate on each row. And the column I wanted to operate on is the inverse Simpson column. Now, if I were to do in Simpson, I think I'm going to get an error. But I will then pass this to get sends spec, which doesn't exist yet. So wait for it, as well as the incoming data. And so I'll show you why I've got that and what this all means. There's like three things going on at one time. And I have to maybe step back now show you where we're going and how we get there. So I will create a function send spec. And so that's the name of my function. The syntax now is to use the assignment operator, the arrow to the keyword function. In the parentheses, I'm going to use x for my threshold and data for the data frame that we're going to work through here. And then you use curly braces to define the body of what's in the function, what the function is doing. For now, I'm going to have it do something really simple, which is to say, sum up the number of cases where x is greater than data dollar sign inverse Simpson. And the function will return the last value that it generated. And so if I run this, this loads my function. And I suppose I could test this right with this with this code chunk that I have up here. So let me define this as test. And we'll run that to test. And then we can do get sends spec. And if I do zero comma, test, then so nothing, right? Because x is greater is x greater than inverse Simpson. No, everything is less than zero, right? So x here is zero. But if I were to say, let's use 50, then everything is less than 250 for the inverse Simpson. So the function seems to work pretty well. We could do more library testing, but that's not really what I want to do today. Anyway, for now, I'm going to go ahead and remove that test. And bring this back. And I'm now going to load this. And let me plop in here. So we'll go ahead and run this code chunk and see if it works. Something that's important to remember is that before you can load the function, run the function, you first need to load the function. And so general, I would move all my functions to the top so that I'm sure that my functions get loaded before they're called. That being said, R is pretty smart. And that if you have functions inside of functions, then it loads all the functions before it tries to execute the functions. Anyway, so that works pretty well. What we're seeing as output here is a list. And so I can use map DBL. And it then outputs the number of rows in the data frame that have the number of rows that have an inverse Simpson that are less than 2.52. So we have a pretty good framework here to work with. And again, we have this function that now I want to make more complicated to calculate the sensitivity and specificity. So I'm going to go ahead and turn this to be predicted as a variable, if it's going to be true, we'll predict true that it's a case, if the value in inverse Simpson is less than my threshold x, right? And what we can then do is we can say tp for a true positive, if predicted was true, and data dollar sign disease stat is also true. So if a row is predicted to be true, meaning that inverse Simpson is less than my threshold, then predicted will be true. And if the observed value, so the column disease stat and data is also true. So if both of these are true, then tp now will be a series of truths and falses. So what I want to do is I want to wrap this in a sum function, because a true value is one and a false is zero. And if we add up those truths and falses, we'll get the total number of true positives. And again, to test this out, I can do return tp. So return forces the function to return that value tp. Now I've changed get sense back. So I will load that and then rerun this packet this function, this code block, and we get what we had seen before. So things are working. So I'm going to go ahead and copy this a few times, because I want to get the true positive true negative false positive false negative. So true negatives will be those were both predicted and observed are false. We also want false positives and false negatives. So our false positives are going to be predicted true, but actually false false negatives are going to be false predicted false, but actually true. Excellent. And I'm just going to introduce some white space here to clean things up. This looks good. Now I want to get my specificity and my sensitivity. My specificity is going to be the number of true negatives divided by the number of true negatives plus the number of false positives. My sensitivity is going to be the number of true positives divided by the number of true positives plus the number of false negatives. And again, this will run. And for now, let me go ahead and return the specificity. And we'll again load this function so that it works and then rerun the code block that I have up ahead. And so now we see our specificity values here. But we want both the specificity and the sensitivity. What we could do is we could return the specificity. I'll do sensitivity first because I have that in the name. Specificity, sensitivity, run that. I got to load the whole function. And then rerun the code block. And it's complaining. Result one must be a single double, not a double vector of length two. So the output that's coming out here is of length two. And it really wants things to be of length one, if I'm using map double, what I will do is I think I will return this. I don't need to use the return. Necessarily I can call that function or call the variable name and it will output it. I'm actually going to save this as a table. I'm going to output a data frame that only has one row. And I will call this sensitivity. I'm going to spell it right. Sensitivity. And then specificity equals that. And I think for good measure, I'll put the names in quotes and go ahead and rerun this. And so this is then going to output a single object, a table, for each level or each value of my inverse Simpson. And again, max map double, it's not happy with that. So I'll remove the double and go back to plain old map. And now what you see is that I've got disease stat inverse Simpson and the send spec column. And so this is a column of data frames, which is kind of weird, right? And we can then output this to do a unnest. So you might see nest used with map nesting and nesting with map, we can use it here to unnest our send spec column. Voila, we get our disease status, our inverse Simpson, our sensitivity and specificity, and we are in great shape. We have this function that if we give it a data frame of predict of observed values, and our marker like inverse Simpson in this case, we will then get out a table of sensitivity and specificities for each value of our X or the threshold, right? That's great. I'm going to add a column to this. So I'll do a mutate comparison equals, and I'm going to again use glue, where I will put a negative underscore positive. So that now if I look at the output again, I see that I've got that. And it output it, like it says, because I didn't include glue. So we'll add the glue, get that. And now we have non-diarrheal control underscore case, and we're in good shape. Now again, this worked for non-diarrheal control and case, right? I could repeat this three times, changing the value of negative and positive for each of those three comparisons, that wouldn't be dry, right? We would be repeating ourselves. So we don't repeat yourself. I can now change this into a function. I'll call this function get rock data. And we'll assign it to the function keyword. And it's going to bring in values negative and positive. I mean, an open curly brace and a closing curly brace. And I could then assign this to a variable and then return that variable. But again, the function will return the last value that is output. So everything that I've been running and outputting to the screen, the function will then return for me. And again, I can do get rock data. And I could give it non-diarrheal control. I need to put in quotes though, non-diarrheal. And then case, or let's do diarrheal controls, comparing non-diarrheal control to diarrheal control, I run that. And then I get the comparison of non-diarrheal control to diarrheal control. And I also see that I have the sensitivity and specificity values, comparing those two groups to each other, again, according to the inverse Simpson index. And so we have that function now that we can use to generate the data for those different comparisons. And so we are in good shape. And now all we have to do is merge those three comparisons together or generate the data for those three comparisons and pull them together. And so down here I'm going to steal this get rock data. And I can now call this function three times. And each time I'm going to make a different comparison, right? So diarrheal, non-diarrheal control to diarrheal control, non-diarrheal control to case. And then we also want diarrheal control to case. And again, if I run these three functions, I get the three bits of output. But what I'd like to do is I want to create this as a variable that I can then use for plotting the data. So I can do bind rows. And that is a function that will take three data frames and bind them together, assuming that the columns all have the same names. And I need commas after each of these because these are the three data frames that it's going to bind, if you will. So with bind rows, I can then run this, and I can then output the full data frame that you'll see now has 676 total rows, which has all possible comparisons. And again, I could then do count running that, I then see the number of comparisons that I have for those three different comparisons, right? And so here I will put rock data for that, and I'll remove this final pipe. And now we will be ready to plot the data, right? So we could do rock data, type that to ggplot. And then on my x axis, I want one minus the specificity. And on the y axis, I want my sensitivity. And I'm going to do color by comparison. I'm getting here that I can't find sensitivity because within ggplot, I need to say AES. And I also need to give it a geom. I don't know why I was expecting it to just work. We'll do geom line. I don't think we've seen geom line in a recent episode. And I will save this to do g save Schubert rock curve.tiff. And this needs to be wrapped in quotes. And very good. We've got our three rock curves. I'm noticing a little bit of jitter in the points. And that's because it is plotting the one minus specificity in order of the data. And so something that we could do would be to say do arrange inverse Simpson. And then we find that we get that stair step. There's other things that we could do in geom line to not get that kind of jittered look. But again, what what I did here was to reorder the points. So that kind of going down the data frame, you get the ordered inverse Simpson values and the corresponding specificity and sensitivity. Okay, so this doesn't look amazing. There's a lot we could do to clean this up. But I think this looks pretty good. We've come a long ways in just a short amount of time. I'll do a theme classic. One other thing that we like to do would be say geom AB line. And I will give geom AB line a slope of one and an intercept of zero. And now we see that we have that diagonal line showing the kind of random classification. I'd like to put this behind my geom AB line. And I'm going to go ahead and make the color gray. I'd like to do a couple other things in the time remaining. The first is I'm going to set my width. I'm going to set that to let's do seven and height four. We see our rock curve looks pretty good. We can get with this dimensions kind of square or equal length axes. This legend leaves a little bit to be desired. Again, looking at the rock curve that we have here for these three comparisons looks a lot like the blue lines that we have in BC and D. One difference that I do notice is the comparison between case and diarrheal controls that has to do with the direction of the comparison. We've been doing greater than indicating health and less than indicating disease for the comparison of case and diarrheal control. Although it's not statistically significant, we find that the mean and the median case inverse Simpson is actually larger than the value for the diarrheal control. So that explains why the line was flipped. And again, you just kind of need to know the direction that the comparison is creating here. Anyway, I think we're in good position. What we'll come back to in the next episode is how to clean up these line plots, show you a way that we can get away from having to use that arrange function. Something that I'm also kind of wrestling with is our coloring scheme that we have individual colors for the three different groups. But how can we bring that to our three different comparisons? Do we come up with a whole new coloring scheme? Can we do something with dashing or with shapes or something else? And then of course, we'll come back and we'll play with the labels on our axes. Now that we've gotten to the end of the episode, you should know actually that there are built in functions and there are our packages out there that will build these types of rock curves for you that will generate the sensitivity and specificity values that will do the modeling all that stuff. And that's actually what we did in the original paper. I think our collaborator actually used STATA rather than R. But again, it's the same idea. But what I wanted to drive home in this episode is how we can create our own functions to then replicate analyses many times. We used one function and replicated that for every value of the inverse Simpson index. That would have been a slog, right? If we'd have had to repeat that function, you know, 300 some times. Also, we create another function that we then repeated three times for those three different comparisons that then made it very easy to bind those three resulting data frames together. Along the way, we learned some new cool tricks with glue that allowed us to dynamically set what values we wanted to change from different rows in our data frame. And we again used glue to create a column in the resulting data frame indicating the comparison that we were making. All right. Well, I hope you found this interesting. I really want to encourage you to go out and write your own functions, deploy them into your analyses. A really useful exercise is to take code that you have that works and then dry it out, as we say, to take that make a copy of the file and then find places where you're replicating your code and encapsulate that into functions that you can then dry your code so that you're not repeating yourself and wherever you're changing things, you're calling that function. I didn't show it here. But if I wanted to change the sign of the comparison that I was making, I only only would need to change it once in that function from greater than the less than and rerun everything and I'm good to go. If I hadn't written in dry code, I would have to replace every greater than sign with the less than sign. And that's just horrible to maintain. That's a great benefit of having dry code. Anyway, like I said, keep practicing with this. I really appreciate your time and your enthusiasm that the community built coming up around these videos has been showing me. I'd love to see more of it. Please let me know if there's other things you'd like to learn about as we go along. I really hope this is helping you to strengthen your R skills. As always, please tell your friends about Code Club and I'll see you next time for another episode.