Hey folks, about 18 or 20 years ago I was a postdoc going out on the job market for the first time. I was coming from engineering and trying to get into a biology department. One of the things I did as part of my job talk was to bring paper bags of M&M's, and I would ask people to tell me how many different types of candies they had in their bag. People would get into groups; it was an active learning exercise in the middle of a research seminar. Yes, this was very weird, but I think it's still one of the coolest things I've ever done. I still get people who come back to me 20 years later and say, "Did you do this thing with the M&M's?" And I'd say, "That was me." And they'd say, "That was so cool." Anyway, the task was for people to count the number of types of candies in their brown paper bag. They had to sit around, devise a scheme for how they would define a type, and then sample. That's the fun part, right? They would pull out a candy and say, "Oh, that's a blue M&M," set it aside, pull out another and say, "That's a green one." They'd keep sampling, keeping track of the total number of candies they had sampled and, on the y-axis, the total number of different types of candies they had seen. As they worked through the bag, they would trace out a curve. Everybody had the same candy, but perhaps the frequencies were a little different between bags, and certainly a group could have put all their candy back into the bag and done it again. What I'm describing here is a collecting process, and if you were to plot it, you would get what's called a collector's curve. So how do we go from a collector's curve to a rarefaction curve? And why would we? That's the topic of today's episode of Code Club. A collector's curve is really just one random sampling of your community.
And, lucky or not, you get one particular ordering of the different entities in your community, and that ordering is no more special than what you'd get by throwing everything back in the bag and trying again. We can get lucky or unlucky in how we pull candies, or taxa, out of a community when we sample them. What rarefaction does is recognize that random process and say: let's repeat that collecting process, say, a thousand times. Put the candies back in, pull them out, put them back, pull them out, and repeat a thousand times, each time keeping track of the total number of different candies seen for the total number of candies sampled. I'm using my hands here and thinking in terms of a plot, because normally we'd put the number of individuals sampled on the x-axis and the total number of different types of individuals seen on the y-axis. Each set of draws will differ from any other set of draws; it's the same individuals, just in a different order. What we'd like is the average curve across those, say, thousand different collector's curves. That average curve is the rarefaction curve, and that's what we're going to make in today's episode of Code Club. Here in our studio I have my rarefaction.R script. It has code for reading OTU or taxa count data into a tidy format, where we have a column for the sample ID, a column for the taxon or OTU name, and a column for the number of individuals from each OTU or taxon in each sample. I also have my rarefaction function in here, which does the exact calculation of the expected number of taxa we would see for a given sampling depth from a community. I'll go ahead and load all of this so we can get going. And if we look at shared, we see the three-column data frame I just described.
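To make that data shape concrete, here's a minimal sketch of the tidy format I'm describing. The object and column names (shared, group, name, value) match what I use in the episode, but the counts here are made-up toy numbers, not the real data:

```r
library(tidyverse)

# Toy stand-in for the real shared data frame: one row per
# sample/OTU pair, with value = number of individuals observed
# for that OTU in that sample.
shared <- tibble(
  group = "F3D0",
  name  = c("Otu0001", "Otu0002", "Otu0003", "Otu0004"),
  value = c(115, 87, 41, 3)
)

shared
```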
Of course, if you want your own copy of rarefaction.R and the code I'm working with, there's a link to a blog post down below that will get you all the information you need; follow the instructions I have up here in this video to go to GitHub and get caught up. I strongly encourage people to follow along as I go through these tutorials so you can get the most out of the experience. Okay, I'm going to look at a single sample, F3D0, as I go through this analysis. We'll start with shared and pipe it into filter() to get group == "F3D0". Now we have our OTUs and our values; the value is the number of times I saw each OTU in F3D0. For example, I saw OTU 1 115 times. I've already collapsed together all of the observations, so to generate a collector's curve, which is what I'd like to do, I need to pull this apart. As you've seen me do in previous episodes, I'll use the uncount() function: uncount(value). We still have the group name, which isn't really relevant because I'm only working with one group, and we have the OTU name. Now these rows are in OTU order, and what I'd rather do is randomize the order. To do that, I can use sample_n(), which randomly samples n rows from my data frame, and since I want to grab all the rows, I'll use sample_n(n()). Running that, I get back a data frame that still has 6,229 rows, as before, but the OTUs are now in a random order. This is my collecting process for one iteration: I've drawn 6,229 individuals out of sample F3D0, and I know the type each one belongs to. Next, let me use mutate() to create a column that indicates which draw each individual represents; it should run from 1 up to 6,229. So I'll say observation = row_number().
row_number() is a function that returns the number of each row, so now I can see values 1 through 10, and if I piped this into tail() to look at the bottom, I'd see 6,229. What I need to know is: after I've sampled, say, five individuals, how many OTUs have I seen? So I need another column with the cumulative count of distinct OTUs. There is an n_distinct() function, but it only applies across an entire column of observations. To build my own cumulative n_distinct, I'm going to take this data frame and arrange it, that is, sort it, by the name column. And you're probably thinking, "Pat, you just randomized it. Why are you going back?" Well, if I organize it by OTU, I can identify the first row for each group and flag that as the first new individual, then reorder everything by observation number and take a cumulative count across all the observations. Hold on, and I'll show you what I mean. I'll arrange() by name, so now all of each OTU's rows are together, and just in case, I'll also arrange by observation so that within each group the rows are ordered by draw. Now I'll group_by() the name, which gives me 167 different groups. Within each group, I'll do a mutate() to make a new column I'll call distinct, and set it to row_number() == 1. For each group, the first row is row one, so if it's the first row, distinct will be TRUE. And sure enough, we see that the first individual of each OTU is TRUE. Similarly, if I filter on name == "Otu00002", I see that its first row, observation 23, is also TRUE. Great. So we have this data frame where things are still sorted by name, and it's also still grouped by name.
So I'm going to do an ungroup() to remove the grouping, and then arrange() by observation to go back to the ordering we had from back up on line 35. Now things are reordered, and you'll see that the first 10 values of the distinct column are all TRUE. If I run this a couple of times, we'll get different OTUs and different TRUE and FALSE values. In this iteration, observation three was OTU 19, but we had already seen OTU 19 at observation one, so that is not a new or distinct observation. I can then feed this into another mutate(): I'll call the column s and use the cumsum() function on distinct. Remember that distinct is logical, where TRUE counts as one and FALSE as zero, so cumsum() gives us the cumulative sum over all of the distinct values. You'll see here that OTU 5 was not distinct, because it had been seen before; if we scroll back up, the very first observation was OTU 5. Now I can do a select() to get my observation and s columns. Every time I run this, because sample_n() is in the pipeline, it starts over with a new random draw. I can then feed this into ggplot() with x = observation and y = s, add a geom_line(), and now I've got my collector's curve. If I run this multiple times, we see different iterations of the same collecting process, each a random sampling of the overall community. Because no one of these is any better than the others, we might want to repeat the process 10, 100, or 1,000 times and then take the average to generate our rarefaction curve. So how would we go about doing that? Well, for now I'll remove the plotting code and turn this into a function, which I'll call collect.
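Pulling the last few steps together, the collector's-curve pipeline I just walked through looks roughly like this. This is my reconstruction from the narration, run on the toy counts rather than the real 6,229-sequence sample, so treat it as a sketch rather than the exact script:

```r
library(tidyverse)

set.seed(1)  # only so the sketch is reproducible

# Toy stand-in for the real shared data frame
shared <- tibble(
  group = "F3D0",
  name  = c("Otu0001", "Otu0002", "Otu0003", "Otu0004"),
  value = c(115, 87, 41, 3)
)

collector_curve <- shared %>%
  filter(group == "F3D0") %>%
  uncount(value) %>%                        # one row per individual
  sample_n(n()) %>%                         # shuffle: one random collecting pass
  mutate(observation = row_number()) %>%    # draw number, 1..total
  arrange(name, observation) %>%            # put each OTU's rows together
  group_by(name) %>%
  mutate(distinct = row_number() == 1) %>%  # TRUE at first sighting of each OTU
  ungroup() %>%
  arrange(observation) %>%                  # back to draw order
  mutate(s = cumsum(distinct)) %>%          # running count of distinct OTUs
  select(observation, s)

collector_curve %>%
  ggplot(aes(x = observation, y = s)) +
  geom_line()
```

Note that sample_n(n()) has since been superseded in dplyr; slice_sample(prop = 1) is the modern equivalent of shuffling all the rows.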
The function will take a data frame and a group, with the pipeline we just built inside it as the special sauce; then, closing the parentheses, instead of shared I'll use data, and instead of "F3D0" I'll use group. We'll go ahead and load that. Again, we can run collect() on shared and "F3D0", and, since I deleted the plotting code a little too quickly, rebuild it: ggplot() with aes(x = observation, y = s), plus geom_line(). Oops, I misspelled shared. There we go; this function works and generates our collector's curve. What we can now do is feed this into map_dfr() to iterate as many times as we want. So I'll do map_dfr() with 1:10, so we get 10 collector's curves, then run the collect() function, with a tilde in front of it, and set .id = "iteration" so that I have a separate column for the iteration. This generates a data frame with 62,290 rows: the 6,229 rows iterated 10 times, with a column for the iteration. Now let's feed this into ggplot() and add an aesthetic, group = iteration. And so now we see these 10 collector's curves stacked on top of each other. I'll assign this to collect_curves, and I'll do a little bit more. I now want the average of those 10 collector's curves, so I'll bring that down and group_by(observation): for each observation on the x-axis, I want the mean of s, so we'll do summarize(r = mean(s)). And now, for each observation, we have the average number of taxa seen at that sampling depth. Cool. I'll call this rarefaction_curve. Now we can pull this all together to make a plot: I want to show all the collector's curves with the rarefaction curve on top.
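Wrapped up as a function and iterated, the steps above might look like the sketch below, again on the toy data. The names collect, collect_curves, and rarefaction_curve follow the episode; one assumption of mine is calling the second argument group_name rather than group, so it doesn't collide with the group column inside filter():

```r
library(tidyverse)

shared <- tibble(
  group = "F3D0",
  name  = c("Otu0001", "Otu0002", "Otu0003", "Otu0004"),
  value = c(115, 87, 41, 3)
)

# One random collecting pass through a single sample
collect <- function(data, group_name) {
  data %>%
    filter(group == group_name) %>%
    uncount(value) %>%
    sample_n(n()) %>%
    mutate(observation = row_number()) %>%
    arrange(name, observation) %>%
    group_by(name) %>%
    mutate(distinct = row_number() == 1) %>%
    ungroup() %>%
    arrange(observation) %>%
    mutate(s = cumsum(distinct)) %>%
    select(observation, s)
}

# Repeat the collecting process 10 times, tagging each run
collect_curves <- map_dfr(1:10, ~ collect(shared, "F3D0"),
                          .id = "iteration")

# Average across iterations: the rarefaction curve
rarefaction_curve <- collect_curves %>%
  group_by(observation) %>%
  summarize(r = mean(s))
```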
This will allow us to bring together two, and eventually three, data frames so we can look at these different pieces of data in the same plot. So again, we'll pipe collect_curves in here; this generates our 10 collector's curves. Don't worry, I'll scale this up to a thousand or so before we're done. Then I want to add an additional geom_line(), and I can give it a different data frame: data = rarefaction_curve, with aes() where x is observation and y is r. I also want inherit.aes = FALSE, because this data frame doesn't have an iteration column, and ggplot will complain if I don't do that. For the collector's curve lines, I want color = "gray", and for the rarefaction line, color = "black" (I know it's already black) and size = 1, so it's a little thicker than the default and easier to see against those collector's curve lines. I'm going to add theme_light(). And so now what you can see is the thicker black line, the rarefaction curve, on top of those 10 different collector's curves. Pretty cool, eh? So I'm going to scale this up to 100 so we can have smoother lines, and there's our rarefaction curve on top of those 100 collector's curves. One thing you'll notice, as I toggle back and forth between the version with 10 and the version with 100, is that the black line sits in basically the same position, but with 10 it seems a little more jagged than it did with 100, and if we went up to 1,000, the curve would be even smoother. As you increase the number of collector's curves going into the rarefaction curve, you get closer and closer to the true value we would get from our rarefy() function. So let's go ahead and add a dot for our rarefy() function: I'm going to create a third data frame that holds the rarefied value computed with the mathematical formula.
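The layered plot I just described might be written roughly like this; it assumes the collect_curves and rarefaction_curve data frames built earlier in the episode:

```r
# Gray lines: the individual collector's curves, one per iteration.
# Black line: their average, the rarefaction curve.
collect_curves %>%
  ggplot(aes(x = observation, y = s, group = iteration)) +
  geom_line(color = "gray") +
  geom_line(data = rarefaction_curve,
            aes(x = observation, y = r),
            inherit.aes = FALSE,   # rarefaction_curve has no iteration column
            color = "black", size = 1) +
  theme_light()
```

In current ggplot2, linewidth = 1 is preferred over size = 1 for lines, though size still works.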
To get that, let's go ahead and do shared, then filter(group == "F3D0"). Good, we have F3D0 again. We can pipe that into summarize(), and I'll call the result m, for math: m = rarefy(value, 1828), where the data coming in is the value column and 1,828 is the depth we want to rarefy down to. That gets us 126. I'll call this data frame math. Great. Now I want to add a red point on my rarefaction curve corresponding to that value. Again, I could hard-code it directly, but I like using this data-plus-aes approach because it's more explicit about what's going on and generalizes if things change. So I'll add geom_point() with data = math and aes() where x is 1828 and y is m, and again inherit.aes = FALSE, because this data frame doesn't have a group column. Then color = "red" in quotes and size = 2, and add that to everything else. And there we go: our red point sits right on that black line, and it looks pretty good, right? With the math before, using the formula, we gave it a specific number of sequences, or observations, and it returned the expected richness. Here, with the collector's curves, we're doing it empirically: we randomize a thousand collector's curves, and then we can pick any number of observations and ask what total number of OTUs or taxa we'd expect to see. We could have done the same thing with the mathematical formula; these are just two different ways to get a sense of how close things are. Why don't we go back to our rarefaction_curve data frame: we can filter for observation == 1828, and we get 126; let's go ahead and pull(r), and we get 126.2. For comparison, our math version printed 126, but it wasn't showing the decimal places because of how tibbles format things, so let's pull m as well.
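For reference, the exact calculation behind that red point can be sketched as below. This is my reconstruction of a rarefy()-style function using the standard hypergeometric rarefaction formula; I've used lchoose() on the log scale for numerical stability rather than raw choose() and factorial() calls, and the counts are toy numbers:

```r
# Expected number of taxa when drawing `depth` individuals without
# replacement from a community described by a vector of counts.
rarefy_sketch <- function(counts, depth) {
  total <- sum(counts)
  # P(taxon i entirely missed) = C(total - n_i, depth) / C(total, depth);
  # lchoose() returns -Inf when total - n_i < depth, so exp() gives 0 there.
  p_missed <- exp(lchoose(total - counts, depth) - lchoose(total, depth))
  sum(1 - p_missed)
}

rarefy_sketch(c(115, 87, 41, 3), 100)
```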
And so we see 126.0468 versus, again, 126.2. If I were to increase the number of iterations, the empirical value would get closer and closer to the true value. I'm going to go ahead and run that just so you can see what I mean: we went from 126.2 to 126.01, which is now a little lower than the math version. But you get the idea: as we increase the number of iterations, we converge toward the true value we got by using the choose() and factorial() functions. What I was trying to demonstrate here is generating 100 or 1,000 collector's curves, representing them by their average, which is the rarefaction curve, and showing that as we increase the number of iterations, that curve gets smoother and our estimate at any given observation gets closer and closer to what we'd get with the pure math approach. So I encourage you to play around with these functions and this approach, maybe with your own data. Here's something that would be a cool exercise for you to do (I'm not going to do it): take this rarefaction curve that I generated empirically and generate the same rarefaction curve using the mathematical formula, then see what the deviation looks like between those two lines and how that deviation changes as you increase the number of iterations. I'll leave that for you to work on. And, as always, feel free to play with this using your own data; that will make it come alive all the more. Keep practicing, and we'll see you next time for another episode of Code Club.
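As a closing point of reference, the "pure math" value that the iterations converge to is the standard rarefaction expectation. Writing N for the total number of individuals in the sample, N_i for the count of taxon i, S_obs for the observed number of taxa, and n for the sampling depth, the expected richness is:

```latex
E[S_n] \;=\; \sum_{i=1}^{S_{\mathrm{obs}}} \left[ 1 - \frac{\binom{N - N_i}{n}}{\binom{N}{n}} \right]
```

Each term is one minus the probability that taxon i is missed entirely when drawing n individuals without replacement, which is where the choose() and factorial() calls in the exact calculation come from.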