One of the critiques against rarefaction is that you have to rarefy to some arbitrary level of sampling coverage, some arbitrary number of sequences to base your rarefaction on. Is that true? Not in my case. I don't feel like it's arbitrary. Maybe someone on the outside looking in at what we're doing might say, well, why did you pick 3,000 sequences? That seems arbitrary. In this episode of Code Club, I am going to remove the mystery. If there ever was a mystery, I don't know about that. But anyway, we're going to remove the mystery, and I'm going to share my thought process for how I determine the number of sequences that I will rarefy my data to. We're going to use R, we're going to use some logic, and I'm going to give you some principles that you can take and apply to your own project. As always, I'm working out of RStudio. I've got a script already set up here called alpha_coverage.R. This is basically what we've seen over the past several episodes as a starting point. We load vegan and the tidyverse as libraries. I don't think I'm actually going to use vegan, but we'll leave it in there, along with the set.seed call. It's been a while since I talked about this pipeline. The days_wanted vector holds the days that I want to extract data from my study. My study was originally published many years ago using 454 sequencing technology; later, when we were developing our MiSeq pipeline, we resequenced all those samples. Those samples originated from a collection of mice that my lab had in its mouse breeding facility. What I did was sample a fecal pellet from each of about a dozen mice every day for about half a year. My question was: how variable is the murine gut microbiota? It turns out, not very; it's pretty steady. There was a bit of a shift about 10 or 12 days after weaning, and that's what we discovered in this study. Each day represents the number of days after we weaned the mouse from its mom, so I really concentrated my sampling effort between days 0 and 9 post-weaning and days 141 to 150 post-weaning. We generated all this data on a MiSeq and then ran it through the mothur software package that my lab develops. In this pipeline, we then read in a shared file. Across the columns, the shared file contains the different taxa, or operational taxonomic units (OTUs, as we call them), and the rows represent the different samples. You'll see a variety of steps in here that we use to get the shared file into a tibble with three columns: one indicating the group, or the sample the data came from, one indicating the OTU, and one indicating the number of times we saw that OTU in that sample. Now, there are a couple of steps in here where I filtered the data. I want to remove those steps so that we can see the full data set, because those filtering steps were, in a way, me applying my logic to focus on the samples that I wanted to look at. The first is a filter step where I only kept the days where I had this heavy level of sampling, between days 0 and 9 and days 141 and 150. We also collected other samples in between those points so we could figure out where the shift in the community occurs. We also have a filter line, on line 16 of the script, where we filter out any sample that contains fewer than 1,800 sequences. That is what I want to talk about today. Why did I decide on 1,800? Let's go ahead and load all this so we have some data to play with.
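For context, here is a minimal sketch of the kind of setup described above. The file name, seed value, and exact column handling are assumptions, and the real alpha_coverage.R may differ; the two filter steps are left out, as described.

```r
library(vegan)       # loaded but not actually used in this episode
library(tidyverse)

set.seed(20230101)   # placeholder seed; any fixed value keeps the jitter/rarefaction reproducible

days_wanted <- c(0:9, 141:150)   # days of concentrated sampling post-weaning

# read the mothur shared file and reshape it into a three-column tibble:
# group (the sample), otu, and value (count of that OTU in that sample)
shared <- read_tsv("data/mice.shared") %>%            # hypothetical file path
  select(group = Group, starts_with("Otu")) %>%       # drop the label and numOtus bookkeeping columns
  pivot_longer(-group, names_to = "otu", values_to = "value")
```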
Again, if we look at shared, we see this three-column data frame with the sample ID, the taxonomic name, and the number of times we saw that OTU in that sample. I want to add up the counts of the different OTUs so I can find the number of sequences in each of the samples. What I'll do is take shared and group_by group, which is the name of that first column. Then we'll do a summarize to get the number of sequences, n_seqs, as the sum of value. Now we have the number of sequences for each of our samples; let's call this sampling_coverage. Let's make a few plots so we can visualize the distribution of the number of sequences in each of our samples. We'll pipe sampling_coverage to ggplot, with n_seqs on the x aesthetic. Let's start by making a density plot with geom_density. There's our density plot. One of the nice things about a density plot is that it makes the curve nice and smooth and pretty. The bad thing about a density plot is that it makes it nice and smooth and pretty. I also don't get a sense of the n on the y-axis: I get a density but not a count. If I want the count, then perhaps the better tool would be geom_histogram. As we see with the histogram, the nice thing is that on the y-axis I get the count of samples that fall into each of these frequency bins. On the left side there's a bin near zero, so those are samples that have close to zero sequences; there are about six or seven there. In the bin to the right of that there are about five. All the way out on the right of the x-axis there's a bin at around 30,000, and it looks like there are maybe one or two samples there. So one thing I get a sense of is that there are some samples out on the left tail of my distribution, so maybe there is a break in my data. As I go through these different visuals, what I want to find is whether there are any natural breaks in the data that would help me decide where to cut my samples, where to draw the line that I will rarefy my data to. We could zoom in on our histogram; let's see how we might do that. We could use coord_cartesian with an xlim of zero to, say, 5,000, and here we get pretty wide bins. We can make the bins a bit finer; it tells us that it's using 30 bins by default, so we could set binwidth to, say, 500. Now we can see that right around zero we have some samples, and it isn't until we get to about 2,000 or 3,000 that the number of samples in each bin starts to accumulate and become substantial. Okay, so let's move from looking at the distribution of the number of sequences to looking at individual samples as points, in something like a scatter plot. I'll grab these lines with my sampling_coverage and my ggplot, put the value 1 on the x-axis and n_seqs on the y-axis, and then use geom_jitter. What this has done is create a jittered plot: the position on the x-axis is randomized so the points don't overlap each other so much, and the y-axis is the actual number of sequences.
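As a rough sketch of those steps, assuming the three-column tibble is named shared with columns group, otu, and value:

```r
# total sequences per sample
sampling_coverage <- shared %>%
  group_by(group) %>%
  summarize(n_seqs = sum(value))

# histogram of sequencing depth, zoomed in on the low end with finer bins
sampling_coverage %>%
  ggplot(aes(x = n_seqs)) +
  geom_histogram(binwidth = 500) +
  coord_cartesian(xlim = c(0, 5000))

# each sample as a jittered point
sampling_coverage %>%
  ggplot(aes(x = 1, y = n_seqs)) +
  geom_jitter()
```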
Here we see that there are some samples way up at the top that effectively push everything else down, so one way of looking at this a little differently would be to add scale_y_log10, which puts the y-axis on a log scale. Then I can see this nice grid line at about a thousand, and there's a handful of samples below a thousand. Those points stick out as having quite a bit fewer reads than all the others, so I might start thinking about a threshold of a thousand sequences. Again, this is getting me thinking about where the critical mass of sequences is. When we looked at our histogram, we certainly had a nice spike, perhaps between 3,000 and 7,000 sequences, and we certainly see that here too, although with the log scale things get a little bit distorted. It's hard to visually integrate the data in a jitter plot like this, so it might be preferable to use something like a box plot. Let's replace geom_jitter with geom_boxplot. With geom_boxplot we basically see what we already saw; let me remove the scale_y_log10 and see what this looks like. We don't get a whole lot of information, because there's just too much summarization going on here: the solid line across the middle of the rectangle is the median, the bottom edge of the rectangle is the 25th percentile, and the top edge is the 75th percentile. An alternative to geom_boxplot would be geom_violin. This gives you the shape of the distribution, so where it's widest we have the most data, and where it's narrowest, like up at 30,000, we have the least data. It also smooths things out a lot, like we saw with geom_density, so I don't feel like it gives me much help in thinking about where I might want to draw that threshold. I think I'll go back to my geom_jitter with scale_y_log10, and again that gives me a sense that I'm looking right around a thousand, maybe up to two thousand, for where I want to draw my threshold in the data. Another approach I want to take is to plot the data in order of abundance. If I look at sampling_coverage I get my data frame, but I can also sort it by n_seqs with arrange(n_seqs), which gives me the samples in order of the number of sequences. Then I can pipe that to ggplot, and on the x-axis I'll put 1 to nrow of the data frame coming in, which basically gives me 1, 2, 3, 4, 5, and so on, all the way down to the 300 or so samples that we had. On the y-axis I'll put n_seqs, and I'll use geom_line. We now see in this plot the number of samples on the x-axis and the number of sequences in each of those samples. You can look at this curve and see where most of the samples lie. I'm feeling like, right at this minor grid line at about five thousand, wherever that shoulder is, is probably right around where I want to make my cut. I could use coord_cartesian with an xlim of 0 to 50 and a ylim of 0 to 5,000, and again, right about there is the shoulder that I see, and that's at about the 13th sample in.
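For reference, rough sketches of the log-scaled jitter plot and the ordered line plot described above, using the same sampling_coverage tibble and the limits mentioned in the walkthrough:

```r
# jittered points on a log10 y-axis to look for a break at the low end
sampling_coverage %>%
  ggplot(aes(x = 1, y = n_seqs)) +
  geom_jitter() +
  scale_y_log10()

# samples ordered by sequencing depth, zoomed in on the shoulder of the curve
sampling_coverage %>%
  arrange(n_seqs) %>%
  ggplot(aes(x = 1:nrow(.), y = n_seqs)) +   # row index as the x position (magrittr pipe keeps "." available)
  geom_line() +
  coord_cartesian(xlim = c(0, 50), ylim = c(0, 5000))
```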
So one thing we might do is come back to this pipeline where I had sampling_coverage arranged by n_seqs. By default it shows the first 10 rows, but I can pipe that to print with n = 20 and make this window a bit bigger. Looking at this ordered list, I see that, sure enough, sample 13 is F6D364: that was female mouse six, which we actually followed out to 364 days, almost a full year post-weaning, and we had 1,806 sequences for that sample. And again, there does appear to be a pretty nice break of a few hundred sequences between that sample and the next smallest sample. So we have a candidate threshold, say 1,806; 1,800 is what I think we used earlier. You then have to ask yourself two questions. Question one is: if I set this threshold, what samples am I going to lose, and how much do I care about those samples? If I set that threshold and remove these 12 other samples, what am I going to lose? The factors I like to consider are how important those samples are. Are there controls in there? Am I working with a paired data set, like a pre/post treatment design? Are there a lot of those pre samples, post samples, or controls in the pool of things that I'm tossing? That would be a concern. If I'm throwing away a lot of controls, or, say, the time-zero samples in a time course study, I wouldn't want those to be among the samples that I'm effectively going to throw away. So again, there are 12 samples in here, and it looks like there are maybe one or two points that aren't from those periods of concentrated sampling, so there are 10 points from the time frame that I'm interested in. The next thing I might ask is whether these are all coming from the same animal. If I look at these, I see that I've got three or four from male 1, one from male 2, five from male 3, and one from female 5, and I'd have to judge how problematic that is. I seem to be losing quite a few samples from male 3 as well as male 1; there are perhaps four in here from male 1 and an equal number from male 3, and that's about half the samples from those animals. So I'd have to come to terms with whether I'm willing to sacrifice that many samples from those animals. Perhaps something happened when we pooled the samples before we sequenced, and we'd like to go back and resequence them so that we can get a larger sequencing depth; that's certainly something you could do. You could also just say, meh, I don't care. Or you could say, you know, these are a lot of samples I don't care about, and frankly, I set this threshold of 1,800, but maybe I don't care about some of these other samples either, and maybe I'd be willing to go up to 3,000. So I feel like finding the breaking point is a starting point for thinking about what samples you are losing and gaining by setting that threshold; it's a starting point. You might want to increase the number, or you might be happy to decrease it. I'm happy with using 1,806; I think that's about what we used in the study. In the end, I wasn't so concerned about losing those samples from male 1 or male 3 for the overall story that we told. So I said there are two things we need to worry about. The first, which I've mentioned, is what samples we are losing. The second is what we can do with the sampling depth that we have. This is a really challenging question to answer. Personally, I think way too much is made of the number of sequences you have per sample.
I would rather have 340 samples with 2,000 sequences each than 100 samples with 10,000 sequences each. I would far rather have breadth of sampling than depth of sampling. That's something you need to work out for yourself. I find that with greater sequencing depth we have improved sensitivity to rare taxa, and that's great, but between 2,000 and 10,000 sequences I don't know that the limit of detection changes so much that I really care, certainly not more than I would care about having more samples in my study. And if the number of sequences you have stresses you out, you can always go back and generate more sequence data for those samples. Now, if you're happy running with 1,800 sequences and your reviewers are giving you grief because they say that's not enough to do anything, well, I have a laundry list of papers I could point you to where we have used 500 sequences per sample and told perfectly adequate stories. Sometimes things are difficult to sequence. Sometimes when we sequence things, or do PCR, we get a lot of non-specific product that gets sequenced along with the target, and after we remove that stuff we're just stuck with a low sampling effort. I think I've talked about this in previous episodes: we once had a study comparing the lung microbiota to the mouth microbiota, and it's a perfect illustration of what I'm talking about today. When we sequenced the lung samples we got a lot of non-specific amplification products, and by the time we removed all that non-specific material we were left with a threshold of about 500 sequences per sample, while the mouth samples sequenced really well. If we had used a larger threshold, say 5,000, we probably would have lost all of the lung samples but kept all the mouth samples. By using the lower threshold, down around 500, we were able to make a direct comparison between each person's lung and each person's mouth. Also, if the reviewers are giving you grief, there's one other metric I'd like to share with you that you can give back to the reviewers to say, meh, I think we're okay, and that's Good's coverage. Good's coverage is the fraction of sequences that appear in an OTU that has been seen more than once; the formula is one minus the number of singletons divided by the total number of sequences. One thing I would encourage you to do is pause the video and see if you can't take the shared data frame and calculate Good's coverage for every sample in the data set. Go ahead and pause it, give it a shot, and then come back and I'll show you how I would do it. Hopefully you did try that. So take shared, and again we have this data frame with the group ID, the name of the OTU, and the value. We can pipe this into a group_by, grouping by group, and then do a summarize to get a few different values. I'll get n_seqs, which is the sum of the value column, so that's the number of sequences we have. I also want n_sings, the number of singletons, which is the sum of value == 1; these are the number of singletons in each of our samples. Then we can compute goods, Good's coverage, as one minus the number of singletons divided by the number of sequences. We could multiply this by 100 if we wanted, and now we've got our percent Good's coverage along with our number of sequences and our number of singletons.
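A minimal sketch of that Good's coverage calculation, assuming the same shared tibble with group, otu, and value columns:

```r
# Good's coverage per sample: 1 - (singletons / total sequences), as a percentage
shared %>%
  group_by(group) %>%
  summarize(n_seqs = sum(value),                   # total sequences in the sample
            n_sings = sum(value == 1),             # OTUs seen exactly once (singletons)
            goods = 100 * (1 - n_sings / n_seqs))  # percent Good's coverage
```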
Looking at these first 10 values, they're all above 98 percent, so that's a really good coverage value. Now let's plot this. We'll use ggplot with aes, putting n_seqs on the x-axis and goods on the y-axis, and then geom_point, and we see this hockey-stick-shaped curve. I think we've decided that we're going to go with 1,800 sequences, so let's come back here after our summarize and add filter(n_seqs > 1800). Now we see that all of our Good's coverage values are above about 97.5 percent. So why don't we grab this, and I'll call it coverage_stats; then we'll take coverage_stats and pipe that into here. And if we take coverage_stats and arrange it by goods, we find that the lowest coverage, if we go with 1,800 sequences, is 97.7 percent. I'm really happy with that; I think that's pretty good. As we can see, as we add more sequencing depth, our coverage goes up. One of the challenges, though, is that you might say, well, how do I get everything to 99 percent? Then, on average, we'd have to get everything up to about 5,000 sequences per sample, so we would need to more than double the number of sequences for the samples that are below 5,000 sequences. I'll leave it to you to figure out which samples have fewer than 5,000 sequences, or even how many there are. If you want to go up to 99.5 percent, then we're probably going to have to come up to about 15,000 sequences per sample, and you're just going to keep adding more and more sequences, and that means money. So again, you have to think about what it is worth to you to get that greater depth of sampling and what that improved coverage gets you. I think that if you go to most reviewers and say, yes, I know I only have 1,800 sequences per sample, but my coverage is 97.7 percent, we're in pretty good shape; I think they'll be satisfied by that type of argument. So again, far from being an arbitrary process, there really are some principles we can use to pick the threshold for the number of sequences we want to use in a study. Another principle we've used in the past is to ask how many sequences we have gotten from this type of sample before. For example, we have a data set looking at cancer and the microbiome where we had about 10,000 sequences per sample. We've gone back to those same samples and resequenced them, because we want to compare the two data sets and perhaps the storage conditions of the samples, and with the newer data we're trying to get 10,000 sequences so we can make an apples-to-apples comparison between the new data and the old data that had at least 10,000 sequences. Okay, so again, there definitely are some principles here; I wouldn't say it's arbitrary, and I don't think it's biased. Also, there's nothing to say that somebody couldn't come along later and repeat Pat's analysis using a lower or higher threshold and see how things change. If you're doing things within a reproducible framework, which I strongly encourage, then you, or anybody else, can come back and do that type of experiment to see whether your inferences change depending on the number of sequences you have for each of the samples.
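Pulling those coverage steps together, here is a rough sketch assuming the same shared tibble and the 1,800-sequence threshold discussed above:

```r
# per-sample coverage stats, keeping only samples with more than 1,800 sequences
coverage_stats <- shared %>%
  group_by(group) %>%
  summarize(n_seqs = sum(value),
            n_sings = sum(value == 1),
            goods = 100 * (1 - n_sings / n_seqs)) %>%
  filter(n_seqs > 1800)

# coverage versus sequencing depth: the hockey-stick-shaped curve
coverage_stats %>%
  ggplot(aes(x = n_seqs, y = goods)) +
  geom_point()

# the samples with the lowest coverage at this threshold appear first
coverage_stats %>%
  arrange(goods)
```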
I hope you find this helpful for thinking through how to pick a sampling threshold. I would encourage you to do this with your own samples: go through that logic, share it with your lab at a lab meeting, and see what kind of feedback you get. I would be really interested in hearing what other issues or factors your research group thinks about when you're picking that threshold. Let me know down below in the comments what you come up with, and we'll see you next time for another episode of Code Club.