 We often use the select function to select columns from a data frame, or the filter function to filter a data frame for rows or certain criteria are met based on the values in the data. But what if we want rows based on their placement in the data frame? Say we want the first 30th and 130th rows. What if we want five random rows from the data frame? To do these types of operations, we need to learn about the slice functions from the dplyr package. And today's episode of Code Club will do just that. Hi, I'm Pat Schloss, and this is Code Club, where we learn about reproducible practices to improve our data analysis skills. Please be sure to subscribe to the channel and smash that bell icon so you know when the next episode is released. In the last episode, we learned about writing pseudocode, which helped us to chart a path to measure the specificity of 16S RNA genes to allow us to differentiate between different bacterial and archaeal taxa at various taxonomic ranks. One problem with that analysis, however, was that some species in our dataset are represented by hundreds of genomes and versions of the 16S RNA gene along with those, and other species are represented by only a single genome. In today's episode, we're going to use those slice functions I talked about to measure the specificity of 16S RNA gene variants, often called Amplicon Sequence Rants, or just ASVs, using a common number of genomes from each species. Along the way, we'll see that these slice functions don't exactly do what we'd like them to do, so we'll have to do some troubleshooting that leverages functions we've seen in previous episodes, functions like group by, count, distinct, and inner join. Finally, because we're documenting this analysis in an R Markdown document that we're keeping under version control with Git, I'll show you how we can use Git-Diff to see how our results have changed by subsetting the data. Even if you're only watching this video to learn more about R and the slice functions and don't have a clue at a 16S RNA gene is or why you should care about Amplicon Sequence Rants, I'm sure you'll still get a lot out of today's episode. Please take the time to follow along on your own computer. If you haven't been following along with the previous episodes but would like to, welcome. Please be sure to check out the blog post that accompanies this video, where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's video is below in the notes. So I've added a couple comments to our issue on GitHub, saying that we need to control for uneven sampling of our species, and that previous analysis suggests the decent number of species had five or more genome sequences. If you recall a few episodes ago, we made a plot showing the number of species that had different numbers of genomes represented in the database, and I think five was pretty decent. Let's see what the results look like with five and see how that changes our results. At the same time, it would be nice to kind of be able to modify that threshold. So that's something that we'll keep in mind as we go through our analysis, because maybe we'd like to use two, right? If we go to our terminal, I'm in my project working directory. You'll see we're still on that issue 31. We've got the green, which means we're in good shape. I'm going to go ahead and open up our studio. And again, opening it with this our project file allows us to launch into our working directory. If I go over to files, you'll see that because last time I use this, I was in our exploratory directory. It's put me in there as well. And what we want to open is this 2020 1021 ASV taxa overlap file with the RMD at the end. And you'll recall that this was what we worked on in our last episode. Something that I've tried, but it just really frustrates me is this chunk output. So I'm going to have the chunk output into the console, not in line, because again, that just it's just a pain. So again, kind of walking through what we've talked about in the last few episodes on this are markdown document. You'll see this initial code chunk reads in the metadata the information about the taxonomy for each of our genomes, which we'll need to use as we kind of again sample from each of our species. The second data frame ASV tells us what ASVs are found and in what genomes and in what abundance, we then generate metadata ASV. I'll go ahead and run this code chunk. We now have this metadata ASV data frame. You'll also then recall that in the last episode, we generated this overlap data data frame, and then we then went ahead and plotted it. And we then also generated this data table, a table using cable. Let me go ahead and run this chunk so you can see where we're at and where we left off with the last episode. And you'll see over here on the bottom right that we have our kingdom through species with our four different regions. And as we saw last time, the full length of the one nine data has the least amount of overlap at the species level, meaning that if you take two full length sequences from different species, that there's about a 3.6% chance that one ASV could be found in two different species. So if we look at the sub regions like the V4, the V34 and V45, then we see between like, what do we say, between 8.8 and 12.6% of the ASVs are found in multiple species. And that number of course, as we'd expect falls off as we go to the genus family order and up. But the problem with this analysis is that it uses all of the genomes. And as we saw before, E. coli has something like 900 genomes represented in the database, whereas something else may only have one genome represented. In the database. So we'd like to do is control for that uneven sampling to help motivate that. What I'd like to do is show you down here in my console some new commands related to what we call the slice function slice is part of d plier. So if I do metadata ASV, we see our data frame, right. And the slice function allows us to give numbers and the numbers then refer to those rows that we want to get out of our data frame, right. So I could do 130 130. Let's do 1300. And so output should be four rows from 130 130 and 1300. And so we see then we get out an output of a data frame with those four rows. So that doesn't really help us with what we want. Perhaps we could do something like, you know, slice one through five and get the first five rows. That's not so great. That that slice one through five actually is its own function. And that is slice head. And we can then say n equals five to get the first five rows, or we could do say three and get the first three rows. The default if you don't give an argument is actually one row. And to complement head, there's also tail. Right, so we get the last three rows of the data frame. So we could use like head and equals five to get the first five genomes from every taxonomic group. So that would work. Or we could use one colon five or we could use tail and equals five, but that's not random. If I want to randomly select five rows, what I could do is again, as we've seen metadata pipe that to slice sample. We could then say n equals five. And this will then give us five random rows from the data frame. If I run it again, I get another five rows, right. So this is an argument replace, which by default is false. But if we said true, that this would sample with replacement. So without replacement means I'm going to draw a row. And I'm going to put it to the side, and then I'll sample another row and put it to the side. That's that's without replacement with replacement means I'm going to draw a row. I'm going to identify it and put it into my new data frame. But then I'm going to throw it back into the mix, right. And so if I drew another, I could actually sample that same row multiple times. It's very unlikely when we have as many rows as we have in metadata ASV. I realize here I'm using the metadata data frame to illustrate this. It doesn't matter. So replace equals false is the default. And again, that's that's what we really want. Of course, this is taking five random rows from across the whole data frame when we want is five from each species. So the other thing that you might notice in all this is that the five rows that you get is different than the five rows that I get. If I want to peg it so that we all have the same randomness, if you will, we need to use a function called set dot seed. And we give set dot seed a number, you could give it one, you could give it any number. I generally like to put my birth date in the ISO standard. It's 1976 06 20. And so then if I run my slice sample, then I get those five rows. And if I then rerun the set seed rerun the sample, you'll see that the five rows I get here are the same as the five rows I get there. And so setting the seed is really important for reproducibility. Just because the numbers that you get are still random or pseudo random as we call them. The seed tells the random number generator where to start. And if we all start at the same spot, then the random numbers that are generated by the random number generator should all be the same. Okay. So great. Now how do we apply this to our project? Well, keep in mind that when we do slice sample with n equals five, that we're taking any five rows from the data frame. What I really want is five rows from each of my species. Let's come back and we will see how we can use slice sample in this overall output overall pipeline. So for the most part, I think all of the code we have here is in really good shape. What I'd like us to think about is how do we get five rows from every, every species. And if you will, let me show you one small problem that we'll have to deal with when we do this, right? So if I do slice sample, say I did n equals five, and then I do slice sample n equals seven, right? What do you think is going to happen? What we would like it to do is return nothing, right? But what actually happens is that it returns as many rows as it can up to seven rows. And again, if we had done replacement, a replace, sorry, equals true, we would get seven because it's going to sample with replacement. But that's not what we want, right? We don't want to sample this, you know, if there's one genome in the species, we don't want to sample that genome five times set. We want to have it discarded from the analysis. So that's something we need to be mindful of as we kind of chart our plan of attack. So what are we thinking? Again, our metadata ASV has the information we want is the species and the genome ID. So what I want to do is I want to get a list of genome IDs so that I have five genomes per species say. And then I'm going to use that to select or filter the rows out of metadata ASV so that my overlap data then only contains five genomes per species. Okay. So we have all this information in metadata ASV and my point is we don't have to use all of it. So we'll get the species and genome ID. We will then say, well, what are the distinct or return the distinct rows? Because if you looked at metadata ASV, we'll see that, you know, we have basically the same row for all four regions. So these four rows are duplicates. They only vary really by the region. And so if we only look at the genome ID and species, then we're going to have a lot of duplicate information. And so we'll need to return the distinct rows, or perhaps you think of as the unique rows, right? We will then want to group by the species. And we will then want to do slice sample on each species for n genomes. Okay. So that's going to get us every species up to five genomes per species. The next thing that we'll want to think about is that we want to return species that have five or have the n genomes, right? So have exactly five genomes. We'll then want to perhaps say, going back to original list of species genomes, we will want to say, to filter out species with fewer than n genomes. Okay. So this is a plan of attack, my pseudo code. You'll see I've started incorporating some R logic in here. And again, it's not critical to put in the R code as we're developing our pseudo code, but it's helpful to have the pseudo code again to help us to separate the what we want to do from the how we're going to do it. So we'll do metadata ASV. And I'm going to then pipe that to select to get genome ID and species. And again, we've got our data frame. And because we see those duplicate rows, we'll want to go ahead and do distinct. And then we get unique rows. And we can then group by species. Right. And then we can then do a slice sample n equals five. And something that occurs to me is that I want to set my seed, right? I want to set seed 1970 6 0 6 20. And I'll put in a comment here that set our random number generator seed to Pat's birthday. All right, so I'll run that. And we will then run these lines. Okay. So what we're getting here then is the genome ID and the species, and it's grouped by the species. So at this point, I want to double check what's going on, right? So I'm going to, as a test, pipe this out to count species. And I'm going to arrange a descending order of the end column. And we'll see that the most abundant of any of the species has five copies. And if I do an ascending sort that we see we have one, right? And so I'm going to call I'm going to remove this test code that I put in here. And I'm going to call this a sub sample species. And what I want to do are these next steps, right? And so we have sub sample species. And I'm going to then I need to count number of genomes in each species. And we'll do count on species. And then we will then say filter N equals five. And so running that we see that everything has five copies. And I will call these then what will we call them call them good species can't come up with a better name. And I'm going to go ahead and have it return only this the species name. So I'll then say select species. And so good species has those listings of names. Okay, now we want to go back to that original listing of sub sampled names. The sub sampled species. And we see that we've got genome IDs and the species. And I see that I've already done this thing where I filtered out filtered out species with fewer than N genomes. And one other thing I noticed is I have hard coded in the end. So remind me to come back to that in a minute. So we want to go back to the original list of species genomes and return the genome IDs from species with at least N genomes or with with N genomes not at least with that many genomes. So I'll do inner join. And I will then do I can never remember my variable names sub sample. Sub sample species with good species. And I'm going to join by my species column. Got to spell that right. And we see that we then have genome ID and species. I see also that I'm still grouping by the species up here. So I will go ahead and do an ungroup on that. And so now if I run my inner join, I no longer have that never no longer have that grouping. Actually I do. Why do I have that? That's probably because I have group by up here as well. So I'll do an ungroup on this for my sub sample species. And I actually didn't have a group by here. So I don't need this ungroup here. All right. And then we do our inner join. This should work. Great. And we can already see that these are sorted by the species name. And that we see we have five haemophilus ducriae and five acetylbacter pasterinus. And so that is great. And this will then allow us to get our, let's see, our sub sampled genomes as that. And I'm going to pipe this out and I'm going to then do a select genome ID. And that then gives us sub sampled genomes. And these are the names of the genomes we want. And so what we can then do is modify this code so that we do an inner join between metadata ASV and sub sampled genomes. And we're going to do it by genome ID. Don't forget that closing parentheses. Run that. And then look at our plot. And we can maybe toggle back and forth here in our studio. So this was using all the genome data. And you'll notice that like the v4 line is over 12. If we only use five, we're just over eight. So it seems the numbers came down a little bit. What I'll do is again, because let's see, we're running this set seed. So let me run it again with that seed. And we see that again, we're just under eight. And you can perhaps recall this table over here. If I rerun everything except for the seed, let's see how much things change. And so things have changed a little bit, I can tell. It might be nice if we had a fixed y axis to kind of see how much things are changing here for like v4. We see we have 9.1 earlier. Where was it? We had 7.75. So there is a fair amount of variation depending on which five we grab. So something we'll talk about in the next episode is how can we perhaps run this block instead of twice or once? How can we run it maybe a hundred times and then get an average? So we're not so sensitive to the random number generator, right? Okay, so something that you were supposed to remind me of was that I want to create a variable for my threshold. So I'm going to say threshold. And for now I'll make that five. And so I'm going to go ahead and replace these fives that I hard coded in with threshold. And we'll replace this as well. And let's make sure everything goes well. And this looks like the plot that we had two versions ago. Yep. So that works well. What this allows us to do is I could say let's do two instead of five. And so now we rerun and we get the analysis again and that we're taking two genomes from every species. And the advantage of doing say two is that we include more species. The disadvantage perhaps of two is that we're not using as much information per species. So there's maybe a little bit of a trade off. And I think though that we would overcome that trade off if we did a bunch of iterations. And so again, we'll talk about that in the next episode, how we can perhaps scale this up to run it 100 or 1000 times and then output an average. That's not so sensitive to the random number generator. So I think I will leave this. Let's let's do three. Just, I don't know why. So we'll run this, we'll get a result for three. And let's go ahead and come back to our terminal. And I will then do make exploratory 2020 2010 21 ASV tax overlap MD. And I'm going to run this. And we see that it runs and we see actually as we expect that our issue 31 label there has turned red, telling us that something has changed. And we see of course that our AMD file changed, our markdown file changed as well as our figure. So something I'd like to show you is something that's really cool. I think about using version control and thinking about reproducibility is that if we have a report like this, where the main output that changed is in our markdown file thinking about that final table. Right. What I'd like to do is let's do use a new function called get diff. We can do exploratory 2020 10 21 overlap that MD. This then shows us everything that's different between our current version of the MD file, as well as the current version to the version that we last committed. And this output we've kind of seen before. You see a colon in the lower left corner if I hit my space, I go down a page. And what this allows me to do then is to very concisely compare my tables. And so the the minus and the red tells me what's been removed. So this was was in there for using all the data. And this table is the output from using three genomes per species. Right. And so we can kind of see how that changed. And again, that's a really nice feature of using version control. And using version control to say, you know, what did we add as we ran this, what what changed in the output. And so if you think about that, from a reproducibility perspective, if we run an analysis multiple times, this file shouldn't change right unless we've changed code or changed how things have have gone. So, again, these are the slice functions. I'm trying that the slice sample function is really useful for cases like this, where I want to get an even number of representatives from any group. Also, it could also be useful if I've got say a data frame with like a million rows, and I really only want to maybe test something because I know that if I use a million rows, it's going to take like five minutes to run. I couldn't perhaps use slice sample to test something on say a thousand of those rows as a way of kind of developing my pipeline that I can then go back, remove that slice sample and do the full thing. But in this case, where we're trying to get an even number of representatives from each of our species, this works great with the slice sample using n equals whatever value we want. Again, we'll come back and look at this in the next episode. So keep playing with these slice functions. I know sometimes people will ask, well, how do I get row five? Well, we've seen today that there's a slice function for that. Keep practicing. In the notes below this video, feel free to ask any questions you might have or tell me where you think you might end up using the slice functions. Have you ever used this slice sample function before? Really powerful again for randomly sub sampling different groups. Please be sure that you subscribe to this channel. I know many more people watch this that are actually subscribed. So go ahead and subscribe it. Click on that bell icon so you know when the next episode is released. And we'll see you next time for another episode of Code Club.