When we have a data frame in R and we're working in the tidyverse, it's common practice to group by one or more variables and then use the summarize function to generate a column of summary statistics. But what do we do when we want to summarize, but instead of generating one value from a function, we want to generate multiple values? That's exactly what we're going to talk about in today's episode of Code Club. Hey there, I'm Pat Schloss, and this is Code Club. In each episode, I try to apply reproducible research practices to an interesting biological question. Over the past couple dozen episodes, we've been looking at the sensitivity and specificity of amplicon sequence variants. Now, hold on. Even if you don't know what an amplicon sequence variant is, or what a 16S rRNA gene is, or even what bacteria are, I know you'll get a lot out of today's episode. This problem of generating summary statistics when a function returns multiple output values is very common in data analysis. So instead of going through some tutorial with a silly example, let's look at an example using real data to answer a real question that at least I'm interested in, even if you're not. Stay tuned and I'll show you how we get this done. In our most recent episodes, we've been building toward a receiver operating characteristic (ROC) curve using our amplicon sequence variant data, as well as the species of bacteria those sequences come from. So we're working in issue 37. I'm going to go ahead and navigate to my project root directory. You'll see from my tag here that I am on issue 37. I'm going to fire up RStudio, and we'll get right into the analysis. Again, the file we've been working on is get_roc_data.R in our code directory. You can look back at the previous two or three episodes to see what's going on in all of this; I think it's pretty well documented.
Where we're at is that in the last episode, we generated a function written in C++ using the Rcpp package, and it runs rather quickly. It took about 10 seconds to run the whole analysis. What it's doing is comparing everything to everything: we've got about 20,000 operons that we're comparing to each other, so that's about 200 million total comparisons. If that takes 10 seconds, we can live with it. The challenge, though, is that we have something like 56 different conditions. Previously I said 52; forgive me, it's 56, I was off by four. So what we need is a way to apply this function across all of the different regions we're looking at, as well as the different thresholds we're using to define an amplicon sequence variant. I'm going to highlight all this code and run it so that it's loaded in my current R session. Hopefully there won't be any error messages. Right now it's compiling my C++ code, and we're in good shape. One of the first things I'd like to do is organize my code a little better. I like to put my functions at the very top of my R scripts, so I'm going to take this C++ code and move it up to the top, right below my library calls. We've got the Rcpp call, and this is running off the edge of the page, so I'm going to clean up the code a little so I've got nice line wrapping. Again, if I run this, it should compile with no problems. Great. So we've got our set.seed, our desired threshold, and our desired region. In today's episode, we're actually going to get rid of those two lines, so remind me to come back and delete them. You'll see that we read in our easv file and our metadata, which tells us what genomes are in our data set, what species they belong to, and what easvs they have. We then put these together to generate the metadata_easv data frame.
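The actual C++ source isn't shown in this transcript, so here is a minimal sketch of what a pairwise confusion-matrix counter written with Rcpp might look like. The function name and the true_pos/false_pos/true_neg/false_neg column names mirror the episode, but the body and the pairwise comparison logic are illustrative assumptions, not the episode's code.

```r
library(Rcpp)

# Hypothetical sketch: counts agreement between genome labels and easv
# assignments over all pairs of sequences. The semantics are assumed.
cppFunction('
DataFrame get_confusion_matrix(std::vector<std::string> genomes,
                               std::vector<std::string> easvs) {
  int n = genomes.size();
  int true_pos = 0, false_pos = 0, true_neg = 0, false_neg = 0;

  // All-vs-all pairwise comparisons (~n^2/2 of them)
  for (int i = 0; i < n; i++) {
    for (int j = i + 1; j < n; j++) {
      bool same_genome = genomes[i] == genomes[j];
      bool same_easv   = easvs[i] == easvs[j];
      if (same_genome && same_easv)       true_pos++;
      else if (!same_genome && same_easv) false_pos++;
      else if (same_genome && !same_easv) false_neg++;
      else                                true_neg++;
    }
  }

  // Return a one-row data frame; this shape matters later when we
  // combine the function with mutate() and map()
  return DataFrame::create(Named("true_pos")  = true_pos,
                           Named("false_pos") = false_pos,
                           Named("true_neg")  = true_neg,
                           Named("false_neg") = false_neg);
}
')
```

Because the loop body is compiled machine code rather than interpreted R, the all-vs-all comparison runs orders of magnitude faster than the equivalent R loop.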
If we look at metadata_easv, we'll see it has two columns: genome ID and easv. I pause there for a second because I'm like, where did all the data go? Then I remember that up here, where I define my easvs (that's for exact sequence variants as well as amplicon sequence variants), I had this filter function. I want to be able to look at all of my regions and all of my thresholds, and that comes back to those two lines you were supposed to remind me to remove. I'll go ahead and remove them now, and I'm going to remove this filter line as well. I also noticed that before, we had removed the threshold column and the region column, so in here I want to be sure to keep my region and my threshold. Right? If we run that code block, as well as this next one to read in the metadata, we can then regenerate metadata_easv, and we see that we've got our genome ID, the easv, the region, and the threshold. I'm going to move this comment back up top to my C++ code. The number corresponds to where this step falls in my data analysis plan; it's probably not that important to have the number, but I'll leave it there for now. Okay, and I'll give myself a little more breathing room to look at the code. We see that we have our metadata_easv, and we have this confusion matrix stuff. One thing I noticed when I was editing the last video was that I still had the summarize function in my code. That's a legacy of maybe two or three episodes ago, so I can go ahead and remove it. But for this to work, I need to have true_pos, true_neg, false_neg, and false_pos as the names of the columns output by my get_confusion_matrix C++ code. That was back up here, so in here I'm going to put true_pos, false_pos, true_neg, and false_neg, and again I'm going to compile this. One thing to notice is that C++ code is compiled, whereas R code is interpreted.
That's again part of the reason why C++ is faster than R. It's not a big deal. We come back down here, and we've got get_confusion_matrix, and then we look at the region, the threshold, the sensitivity, the specificity, and all of that. Actually, I'm going to set this aside for now and come back to it. What we're going to do is take this metadata_easv. I know I just told you that we wanted to get rid of that filter function, but I'm going to insert a new filter function to make our data frame a little smaller so it's easier to work with. If I ran all 56 combinations, that might take 40 minutes or so, so let's only look at one region and two thresholds. I'll do region == "v4", and threshold == 0.03 or threshold == 0.00. Okay, let's make sure that works. We see that we've got our region v4 and our thresholds of zero and 0.03. What I'd like to do, if we're thinking about the typical group_by/summarize idiom, is group_by region and threshold, and then summarize, where we might say confusion_matrix = get_confusion_matrix() and pass it some value. The problem is that for summarize to work, this function needs to return a single value, not four values as in our case, the four values of the confusion matrix. So we need a different approach. Actually, it's not slightly different, it's quite different, but it's the same kind of idea. Let's remember what happens when we look at our filtered metadata with group_by. The output tells us that it's a data frame that is grouped by region and threshold. Normally, summarize would take each of those groupings, and there are two groupings in this case because of how I filtered it.
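To make the limitation concrete, here's a toy illustration of the classic group_by/summarize idiom, with hypothetical column names standing in for our real data. It works cleanly when the summary function returns one value per group; a function that returns a one-row, four-column data frame doesn't fit that contract, which is what pushes us toward nesting.

```r
library(dplyr)

# Toy data standing in for the filtered metadata_easv (names are assumptions)
df <- data.frame(region    = rep("v4", 4),
                 threshold = rep(c(0, 0.03), each = 2),
                 value     = 1:4)

# The familiar idiom: one value per group, one flat column out
df %>%
  group_by(region, threshold) %>%
  summarize(total = sum(value), .groups = "drop")

# A get_confusion_matrix()-style function returns one row with FOUR columns,
# which classic summarize() was not designed to accept as a single summary,
# so we need the nest/map approach instead.
```

(Recent dplyr versions have loosened this restriction somewhat, but the nest/map idiom shown in this episode works the same way across versions and keeps the intermediate data inspectable.)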
For each of those groupings, summarize would then compute the summary values. Okay, so hold on to that thought. What we're going to do instead is use what's called nest. We will nest the data frame: we'll nest everything that's not region and threshold. I'm going to remove that pipe because it's going to annoy me. What we get is a really interesting new data frame. We get a column for the region, a column for the threshold, and then a column called data. The data column is of type list, which we really haven't talked much about, and it contains two data frames. Each row of this column holds a data frame with 20,041 rows and two columns. So that's a bit different. What we can then do is pipe this to the next step, which will be a mutate, and say mutate confusion. So we're going to create a column; remember, if you see mutate, that means create a column. We will create the confusion column, and we will populate it using a map function. We've talked about map before. I believe we used it previously when we were reading in a bunch of data frames that we wanted to concatenate together, where we used map_dfr. Here we're going to use map, and the input to it is going to be the data column. So we're going to take the data from the data column for each of these rows and give that data to our function. A little funny piece of syntax that I can't remember if we've talked about before: we write a tilde (~), then get_confusion_matrix, open parenthesis, .x, close parenthesis. The .x is the value of data; it's the variable that data is representing. You can figure that out if you do ?map: if you look at the usage, you'll see that map takes .x and .f as its arguments. So .x is the data and .f is the function.
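The tilde shorthand described above is purrr's formula syntax for anonymous functions, and it's worth seeing in isolation before we apply it to the nested data. A quick sketch with throwaway inputs:

```r
library(purrr)

# map(.x, .f) applies .f to each element of .x and returns a list.
# The ~ shorthand builds an anonymous function on the fly; inside it,
# .x stands for the current element.
map(list(1:3, 1:5), ~ length(.x))

# Equivalent, with the argument names spelled out:
map(.x = list(1:3, 1:5), .f = length)
```

Both calls return a two-element list holding the lengths 3 and 5; the formula version is just a terser way to wrap a function call around each element.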
We're being a little lazy by not putting in the argument names, but this works pretty well. Now, if I run this, hopefully without any error messages, we should get a new column called confusion that will also be of type list, and each of those rows will contain tibbles themselves. Actually, they're data frames, not quite tibbles. What we see is that confusion is a column of type list, like data was, and each cell is a data frame with one row and four columns. Those four columns correspond to the true positives, true negatives, false positives, and false negatives. Great. I'm going to remove this summarize line because we don't need it anymore. What we can then do with the output is unnest: we call unnest and give it the column we want to unnest, which is confusion. This works really nicely because it's a column of single-row data frames that we're unnesting. We run this, and it's going to take a little bit of time, but we will then get output with region, threshold, and data (still in that list format), and then the confusion column will go away and be replaced by four columns, true_pos, true_neg, false_pos, and false_neg, with the values for these two thresholds in the V4 region. And sure enough, that's exactly what we get. Isn't that cool? We can then pipe this onward to get rid of that data column with select(-data). We run that, and again it takes a little time, but what we get out is a really compact data frame containing the region, the threshold, and the four values of the confusion matrix. That works really well. We can, of course, then call ungroup to remove the grouping by region and threshold, and we'll be in good shape. I'll go ahead and call this confusion_matrix, assign it, run it, and we'll be ready to move on to the next step. Alright, so that ran.
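The whole pipeline just described can be sketched end to end on toy data. The structure mirrors the episode's code, but toy_confusion() stands in for the compiled get_confusion_matrix(), and the data frame contents are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Stand-in for the compiled get_confusion_matrix(): takes a data frame,
# returns a ONE-ROW data frame (column names mirror the episode)
toy_confusion <- function(d) {
  data.frame(true_pos = nrow(d), false_pos = 0L,
             true_neg = 0L, false_neg = 0L)
}

metadata_easv <- data.frame(
  region    = "v4",
  threshold = rep(c(0, 0.03), each = 3),
  genome_id = rep(c("g1", "g2", "g3"), 2),
  easv      = c("a", "a", "b", "a", "b", "c")
)

confusion_matrix <- metadata_easv %>%
  group_by(region, threshold) %>%
  nest() %>%                                      # one row per group; `data` is a list-column
  mutate(confusion = map(data, ~ toy_confusion(.x))) %>%
  unnest(confusion) %>%                           # one-row data frames expand into four columns
  select(-data) %>%                               # drop the nested input data
  ungroup()

confusion_matrix
```

The result has one row per region/threshold combination, with the four confusion-matrix counts as ordinary flat columns.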
One thing I want to remind myself to do, because I always seem to forget, is to remove my filter line. I'm actually going to cut this code and paste it down here, where I defined confusion_matrix before, so this is exactly what we want, where we want it. I'll remove some of this extra space. I'm curious how long this will take to run for all 56 combinations. If each condition takes 10 seconds or so, then for 56 combinations, dividing by 60 seconds per minute, it should take about nine or ten minutes. Let's go ahead and run this, and let me see if I can run it within system.time(). For some reason I think it won't like that, but you never know. We'll see. It likes it. Alright, so in about 10 minutes we'll check back and see where this is at. So that completed, and it took about 628 seconds, which is just over 10 minutes, so my estimate was fairly close. The thing to keep in mind is that we're going to repeat this 100 times to get 100 iterations, because if you noticed up above, we're taking one genome per species, and we want to replicate that a bunch of times to get an average over many iterations. If we run this 100 times, that's going to be about 1,047 minutes, which divided by 60 is about 17 hours. That's a long time; certainly not as long as it would have been if we hadn't rewritten that code in C++, but still a long time. So stay tuned for the next episode. And this reminds me: go ahead and subscribe to the channel and click on the notification bell so you know when that next episode drops. We'll see how we can use map functions from another package that will make this a lot faster to run. But before we move on to that and close out today's episode, I want to briefly go back through this code. I'll go ahead and remove my system.time() call.
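For reference, system.time() just wraps any expression and reports timing, which makes the back-of-the-envelope scaling above easy to check. A minimal sketch (the timed expression here is a placeholder, not the episode's pipeline):

```r
# system.time() evaluates an expression and reports user, system, and
# elapsed (wall-clock) time in seconds
timing <- system.time({
  x <- sum(sqrt(seq_len(1e6)))   # placeholder workload
})
timing["elapsed"]

# Back-of-the-envelope scaling from the episode's numbers:
one_run_seconds <- 628
iterations <- 100
(one_run_seconds * iterations) / 3600   # roughly 17 hours
```

Timing one representative run and multiplying out like this is usually enough to decide whether a job needs to be parallelized before committing to it.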
To remind you what we did: we had this metadata_easv data frame with our genome ID, easv, region, and threshold. We used group_by to segment our data frame by region and threshold. We then used the nest function to create a data frame with a column called data, where each row of that column is itself a data frame. Very Inception-esque. We could then use mutate to create a new column called confusion. Remember, we had to use mutate because the summarize function only allows functions that produce one value to populate a column. The map function is used to create values as you go row-wise, or group-wise. What map does is take the data from the data column, and for each row, which is basically its own data frame, it runs get_confusion_matrix on that row's data. (You could probably give that column a different name in nest(), but who cares.) That creates a new column, in this case called confusion, of type list, where again each row is a different data frame. We could then use unnest on the confusion column so that we get four separate columns for the four elements of the confusion matrix. Then we get rid of the original data column from the nesting procedure, remove the grouping variables, and voilà. In the end, we have a data frame, confusion_matrix, with true_pos, false_pos, true_neg, and false_neg for all combinations of region and threshold. Again, this is only for one iteration, one sampling of genomes per species. We'd like to repeat this 100 times and then get some mean or median representation of these values. Alright, that's going to close out the code for today. Let me make sure I save that, and I'll go ahead and commit this change.
I'll do git add code/get_roc_data.R, and then git commit with a message of, let's see, "apply get_confusion_matrix to all regions and thresholds". Great. So again, the next step is to deploy this for 100 iterations and find a way to do it quickly. We'll use another map function to do that, and as I alluded to, we'll use a package called future that will allow us to parallelize this process. If my computer has 16 processors and I want to do 100 iterations, it'll divide those 100 iterations across those 16 processors. So stay tuned for that. In the meantime, please be sure to tell your friends about Code Club, and keep practicing with this material. Have you used the nest/mutate/map/unnest idiom before? See what you think about it and whether it might make your pipelines a little more streamlined and coherent. It's similar to but different from the group_by/summarize idiom, and it's another useful tool to have in your collection for working in R and the tidyverse. Those map functions come from the purrr package (p-u-r-r-r), which is part of the tidyverse. Alright, keep practicing, and we'll see you next time for another episode of Code Club.
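As a preview of the parallelization teased for the next episode: the future package manages the worker processes, and the furrr package (my assumption for the map-style interface; the episode only names future) provides future_map() as a drop-in replacement for purrr::map(). A minimal sketch, with run_one_iteration() as a placeholder for one genome-sampling pass:

```r
library(future)
library(furrr)

# Spread work across multiple R processes (worker count is illustrative)
plan(multisession, workers = 4)

run_one_iteration <- function(i) {
  # placeholder for: sample one genome per species, then compute the
  # confusion matrix across all regions and thresholds
  sqrt(i)
}

# Same interface as purrr::map(), but iterations run on the workers
results <- future_map(1:8, run_one_iteration)

plan(sequential)   # reset to single-process evaluation when done
```

With 16 workers and 100 iterations, each worker handles six or seven iterations, so the 17-hour serial estimate should drop to an hour or two, minus some scheduling overhead.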