 Welcome back for another episode of Code Club. I'm your host Pat Schloss. I hope you're doing well. In the last few episodes, we've been working through how we can parse the taxonomy information for each genome in the RRNDB dataset that we've been working with. We finally decided to use NCBI's taxonomy because it allowed us to get a single taxonomy for each genome rather than possibly getting multiple taxonomies per genome, as was the case when we looked at the RDP taxonomy. Perhaps at this point, you're wondering why we went through all of that trouble. Well, in the next several episodes, we're going to update an analysis we did earlier when we looked at the number of ASVs per genome, as well as the number of ASVs that were found across genomes. Recall that ASVs are called Amplicon Sequence Variants, or thinking another way of each 16S sequence being its own entity, its own taxonomic unit. Today, we'll be using our files to get a sense of the number of taxa we have at each taxonomic rank, as well as the number of taxa in each rank that have more than one representative, more than one Amplicon Sequence Variant. One of the problems we identified before was that if E. coli have hundreds of genomes in the database, and another organism only has one, then we can't really make an apples to apples comparison, right? The E. coli one we'd expect to have a lot more Amplicon Sequence Variants than the one that only has one ASV. In this episode, however, we'll get a sense of one way that we might correct for overrepresentation of a taxa in the database. To help us along the way, we'll use the geomine function from ggplot to plot our data. We'll see how we can improve our taxonomic information for each genome using if-else, and we'll broach the scary subject of factors. To me, factors are the scariest part of R, at least so far in my understanding of R. Even if you're only watching this video to learn more about R and don't have a clue what a 16S-RRNA gene is, or an Amplicon Sequence Variant, for that matter, I'm sure you'll get a lot out of today's episode. Please take the time to follow along on your own computer. If you haven't been following along but would like to, welcome. Please be sure to check out the blog post that accompanies this video, where you'll find information on catching up. You'll also see reference notes and links to helpful material. The link to the blog post for today's episode is below in the notes. For today's episode, we're going to make another RMD, our Markdown document, to show our exploratory data analysis as we go through. You'll recall that several episodes ago we created one to show how the distribution of Amplicon Sequence Variants, ASVs, varied by genome, and then how specific an ASV was to each genome. So what we'd like to do now is kind of pull that back and start thinking about it at different taxonomic levels, say like species or genus or family or all the way up to say kingdom. So I've already gone ahead and put in an issue here. You see I did some editing already on this issue. I got a little bit over ambitious on how much we were going to do. We'll do some more stuff with another report to kind of explore these data in the next couple of episodes. But for today, what I want to do is find the number of taxa within each taxonomic rank that has a minimum number of genomes. Turning to our command line, you'll see that I'm already in my project root directory. I'm already on an issue branch. If I do my get status, you'll see that I've already created a template RMD file. And if I do open Schloss Rproj, this will launch RStudio. And for some reason it already put me into exploratory. But again, if I click on my project root directory here and then come down to exploratory, that brings me there to the listing of those files. But what I really want to make sure is that I'm in the correct working directory and getWD again shows me that I am in fact in my project root directory as my working directory. And you'll recall that using the here package. So I've got library tidy versus library here. I'll go ahead and run those because I always forget to load my packages. But you'll see for a title, we're analyzing the distribution of genomes across different taxonomic ranks. I'm the author. This is the date that I am working on this on. And then these were a variety of outputs, settings that I created, as well as editor options to output the output of my code to the console here. And if that looks foreign to you or doesn't ring any bells, just be sure you look back at the previous episode where I talked about making our markdown documents. Okay, so we have a metadata file, which if I look at my data, and I think it was in references me sort by time created, it's this genome ID taxonomy.tsv and that's in data references. So I can do read tsv here. And that, again, is the package we use to get the paths to work right within our markdown documents. So we don't have to use that nasty set WD function. And we'll do data references. And what was it genome ID taxonomy.tsv running that loads everything very nicely. And then we want ASVs, which will do read tsv here. And then it's going to be data processed. And then that is going to be surprised. It doesn't pop that up for me when I hit tab as LTH data processed is going to be R&D, and now looking at ASV, we see that we've got a region or ASV or genome and account. And what we should be able to think about is that we can do, we can define a variable metadata ASV, which can be interjoined. And we're going to use metadata on the left ASV on the right. And then we'll do buy and the metadata on the left has genome ID equals and then I believe it's genome for ASV. Yeah, genome. And that will join it all together. And now if we look at metadata ASV, voila, we got everything pulled together. Very nice. As I read these files in, it tells me the specifications that it parses the file with that output. It doesn't matter. It does a good job. But I like to be a little bit more explicit and tell it exactly how I want it read in. And so for metadata, where did I read that in? That was up here. So these are all read in as call character. So why don't we read everything in as call character and so that read TSV doesn't have to guess, which is what it typically does. It typically guesses based on what it sees in like the first thousand lines or so. So I can add as an argument here. I can do call types and that allows us to set the column types and then we'll say calls and then we can give it the type of columns that we want to set before we did this as a string right where we put in double quotes and then we put single characters to tell read TSV what to read each column in as well. All of these are call character. So what I'm going to do is I can use calls, which is a little bit different specification or very different specification than what we used before. I can say dot default. And so this is the default way to read it in is going to be call character, right? So we can do call character, read that in and you'll see now that it reads it in and it doesn't give us any verbose output about how it parsed the file. Now if we look at the reading in the count table, we see similar type of output. But we find that there's one column that is a double. So I'm going to again do default call character. And I will then also do count, which I will then say is call. I'm going to say integer. It's not a double their counts. They're all integers. So we'll run that. And again, you see that reads it in. So if we look at ASV that everything here is a character and that's a count integer. And again, we can merge these two together to have one big data frame before I do that though, one of the things I notice about metadata is that it's a for our purposes, it's a wide data frame. What I'd like to get is the number of genomes or the number of taxa actually the number of taxa we have at each taxonomic rank. So I'd like to eventually group by taxonomic rank and then count the number of things in each taxon for each taxonomic rank. The easiest way to do that will be to pull our columns together to pivot longer. And I can pipe this into pivot longer. And we will I'm actually going to get rid of the scientific name, get rid of that. And then we'll pivot longer. Everything except for genome ID and the names to will be rank. And then the values to will be taxon. We run that and look at metadata. We see that we've got genome ID rank and taxon and eventually like, like we did already here, we're going to join that back in with our ASV data. So there are a variety of values in here where there is no value for the taxon. We get an NA value. We can actually get rid of that by doing drop NA taxon. So before I think I had done a filter. So filter, you know, what was it like is that NA taxon. And I use an exclamation point to get rid of that. Drop NA is much more concise and easier to do. They both give us the same output, but I think drop NA is easier. So again, we now see that it's a cleaner data frame. It's got 108,604 rows. Whereas before we had 109,000 with those NA right. So it's not that big of a difference, but we don't want those NA values. And so they're gone now. Okay, let's go ahead and join this together. And what we're going to find now is that we have a, I don't want that double M. We now have a very long data frame, 780,000 rows. But we have our genome ID, our rank, our taxon, the region, the ASV, and the number of times that ASV shows up in this genome for that region. Okay. So what we'd like to do is find the number of genomes within each taxonomic rank. And I'm not sure I worded that quite right, but let's go ahead and we'll take metadata ASV. And I will then pipe that to a group by, and so I'm going to group by the region. I'm going to group by the rank and the taxon. And again, I want to know the number of genomes within each taxonomic rank. And I will then do, let's see, if we then summarize and do capital N, and if we can do N distinct as a function. This will then tell us how many distinct values there are if I give it genome ID. So how many unique genome IDs are there if I run all this and what we see for these different regions, ranks, taxon. This is the number of genomes that I have for each of these taxonomic ranks. Good. What I'd like to know then is how many taxa I have at each rank. So I want to know the number of taxa within each taxonomic rank, not necessarily the number of genomes that are there. Okay, so we can then pipe this to do another summarize. And so you'll see now that it's at region and rank and I can do summarize and I'll do another N, which will then be N distinct and I will then say taxon. Okay, so this will give us the number of distinct taxon taxa in each taxonomic rank for each region. And this also gets me thinking that it doesn't make sense to have different regions. I really only want to look at region v19. So let me do filter region equals equals v19, because again, it's going to be the same whether it's for v4, v35, whatever. Let's only focus on that. And let's not group by region. And let me kind of run back through where we were already. And again, we'll see our taxonomic rank, our taxon, and the number of genomes we have for each taxon. And now if I run the whole pipe, I then get the number of taxa at each taxonomic rank, right? And so one of the things that you'll perhaps see here is that it's an alphabetical order. And that's not the order I want things in. You know, perhaps in the battle days, you might, I may have named kingdom like one kingdom to phylum. I basically give them a order, a numerical order, or change their names to give them the order that I want them to come out in. And that's really a clue and is not really very helpful, right? And so one of the reasons why this matters is say I come here and I do n taxa per rank to define that as a variable. And now say I want to make a plot. And I could do G plot AES. And I can then say on X, I want the rank on Y, I want capital N. And say I'm going to do a geom line, or let me do geom point first, because I know geom line might give us a little bit of trouble. And so you'll see in my lower right corner here that I've got my taxonomic ranks, but they're not in the order I'd like it to be taxonomic rank really is an ordinal variable. So how do I fix this? And what I'm going to do is I'm going to come back to where I've defined metadata, and I'm going to add a mutate line. And so I'll say mutate, and then I will say rank equals, and I'm going to say factor rank. So if I run this and look at metadata, I will see that my rank has a different column type. Instead of CHR, it now says FCT. And if I then, so that's a factor. And a factor tells R that this is categorical data. Now we can make it ordinal data by setting certain levels to our factor. So we can say levels equals, and then we give it a vector of levels. I'm going to go ahead and put this on a separate line, and I'll put them in order now. So kingdom, phylum, class, order, family, genus, and then species. That's kind of running over the line. So I'll put that there. And again, if I look at metadata, the output isn't going to look meaningfully different to me. I need to go ahead and redo the join. But now when I run these two code chunks, I now see that I have kingdom, phylum, class, order, family, genus, species. Wonderful, right? That's really great. I'm going to go ahead and change this to a line. So do geom line, and I'll do theme classic. I like the clean classic look. And so it complains that each group consists of only one observation. Do you need to adjust the group aesthetic? No. So what I can do in here as an argument, as I can say group equals one. And now I get the nice line, right? And we could also talk about perhaps changing n to be something more meaningful here. So here I'll say maybe n taxa and rerun that. And we'll get, so now I have to update this to be n taxa as well. All right. So that looks a little bit better. And again, we get our different taxonomic levels. So something that's missing from here that I would really like to have is the number of strains. And that didn't come through in our data. I removed it from here, but there's also a line for, there's a column rather for scientific name. Let me go ahead and let's look at this and you'll recall that there's a column for scientific name. And if I do select scientific name comma species that you will see that. Well, these aren't doing it very nicely. Perhaps I'll look at the tail. And so here you can see, at least with these microplasmas, my coities that the species column doesn't have all this other stuff added onto it. So I'd like to do is to create a column before I do the pivot longer. So to create a column, I'll do mutate that all call strain. And so what I'm going to say is if the species is the same as the scientific name, then we don't have strain. If they're different, then we'll use the scientific name as the strain name. We can do that using the if else function. And we've seen this in previous episodes where in the first slot goes a question, and then if it's true, what value do you want for strain? And if it's false, what value do you want to be put in there? Right? So the question is going to be scientific name equals equals species. So if that's true, then the output is going to be for strain is going to be NA, right? Because we don't have a strain name if our species name is the same as the scientific name that there wasn't anything added on, right? So like there wasn't anything tacked on here. But if that's not true, then what I want to give for the strain name is the scientific name. And I can then pipe that out to select. And again, I'm going to remove the scientific name. I want to keep the species name. I'm going to run that. And I think this all looks good. And so the thing I need to add here to my output then would be strain. And now if I run that, it's complaining. So if that's if scientific name equals equals species must be a logical vector, not a character vector. You know what? It's complaining because this NA scientific name is a character and is a logical. So this needs to be NA character underscore, which is a character type of NA, but it's still an NA value. The if else function, make sure that your true and false that this particular if else function, make sure that the true value and the false value are the same type. And so in this case, because scientific name is a character, this NA is normally a logical and so forth to be a character, but still an NA, you need to have NA underscore a character underscore. All right, so now if we run this, we're good. No complaints. We again do our meta meta data ASV join. And we can generate our end taxa per rank. And it's complaining about those groups arguments. And so now you'll see that we've added strain to the very right side, kingdom phylum class order family genus strain. And what I noticed is that there are fewer strains than there are species. And to be honest, I don't know that I totally trust the strain level designations. I trust that something is called that strain is that strain, but I don't trust the things that aren't called a strain aren't necessarily a strain, right? So for example, someone might submit an E coli strain K 12, but they might not add the K 12 strain name. It might be E coli that they submitted ads. And so there might be strains in the species that aren't listed as strains, if that makes sense. So I think for my analysis, what I'm going to focus on is probably things from species back to kingdom. But for now, we'll go ahead and leave the strain. One of the things that also occurs to me is that for some of these taxa, there might be hundreds of genomes, right? So we know there's hundreds, maybe 800 E coli genomes in the data set. But for some other taxa, there might only be one species represented or one class represented with a gene or one genome per class, right? So it'd be good to get a sense of what that distribution looks like. And what I'd like to do is to perhaps modify what I have here with my n taxa per rank. And you'll recall if we run these first four lines, we get a table that has our taxa, our rank. And you'll notice that it now has our rank here in this first column by the order that we gave it when we defined the factor. But there's some in here that you'll see that only have one genome sequence. And what I will do is to say what? So I'm going to modify this n taxa. And I'm going to say n1. And this will then be, let's say, I'm going to do, yeah, so that's correct, right? The number of taxa that have at least one genome is n distinct taxon. The number that have two, I'm going to do some and I'm going to say n greater than or equal to two. And let's make sure this runs right. And if I again look at n taxa per. So this is the number of kingdoms with one genome to write. And so you can kind of see there's not as many species with two genomes as there are with one. So good. So I'm going to do a couple others here. So let's do three, four, five, and I'm going to add one last at 10. We'll do three, four, five, and 10. Okay, so that all looks good. And again, when we run this, we're going to get a wide data frame, right? So that was n taxa per rank. Again, we get this wide data frame to make it tidy. I'm going to go ahead and do pivot longer. Again, these pivot longer pivot wider functions are really key, really helpful. I will do everything but the rank is what I'm going to pivot to longer. And the names to will be minimum and genomes and values to will be n taxa. And again, if I do n taxa now per rank, you'll see that I've got kingdom, the minimum number of genomes and the number of taxa. And I can now come back to my plotting. And so I basically want a different line for each threshold. I can then do group equals minimum and genomes. And I'm going to remove this group equals one. I'm also going to do color equals minimum and genomes. Run all that. And we'll see that I've got really bad output, but I've got minimum and genomes. And I've got, it's an alphabetical order again, right? Not good. So let me go ahead and remove that minimum to clean up the output a little bit. And this needed to be n genomes here as well as here where I make the plot. All right, there we go. So that looks a little bit better. And I think it's good enough for our purposes here. And what I'm going to do at the end of this pipeline where I created n taxa per rank, I'm going to do mutate and I will then say n genomes. And again, I'm going to use factor, right? So do factor n genomes and I'll do levels equals. And because this is going to probably run off the line, I will do n one and two and three and four and five and n 10. And again, this is going to put my lines in order of minimum frequency of the number of genomes in each taxonomic group. And so I run that and bingo, it worked very well. Okay. So my interpretation that I'll leave you with here is that, you know, even if we require that every taxonomic group had at least five genomes, there would be a few hundred species represented. I'm also going to say, not sure I trust the strain level data as being complete. There may be taxa in the strain column, strain rank that are in the species, but the people, the data didn't indicate the correct strain name. Okay. So I trust the species name a lot better than I do the strain name. Okay. So I'm going to save that and knit it. And this all looks pretty good. There's some funky things maybe with the tabs, but I'm not going to worry about that for right now. And I'm going to go ahead back to, let's see, where did I have my exploratory analysis before down here. I'm going to copy that to give me a template. So it's going to be 2020 10 or 929. And this was called taxa representation. And I will copy the same thing over there. And I'm going to render that right. And the data as input are going to be this genome ID taxonomy.tsv file as well as this count tibble file. These are the backslashes. I'm going to go ahead and save that. And I'm going to go ahead and try to make it tells me it's up to date. Let me kind of force a change in here. Right here. There's an extra space. I'll remove that, save it, make it. And it's complaining. And that's because I didn't give the path to exploratory in my make file. All right. So this needs to be exploratory forward slash that. Now let's try it again. And this is working well. So I'm going to go ahead and close out this issue. I'll do git add make file. And then exploratory 2020, 2009, 29. I put a star to get everything that has that root. We're good. Get commit. Determine number of genomes and taxa represented at each taxonomic level. Closes number 27. Get push. Oh, I got to get checkout master. Get merge. And our studio is complaining. Yeah, I'll close that. And I'll do git merge issue 27. Okay. That all looks good. And I'll go ahead and push this up. And we'll close the issue. So again, the concepts that I wanted to drive home today in today's episode, we're thinking about creating factors so that we could get our lines in the correct order and our taxonomic levels in the correct order. This also advances our analysis because we can think about different ways that we might correct for overabundance of different taxa. Perhaps we would require everything have two or more, three or more, five or more genomes in the database for us to think we have a good sampling of that taxonomic group. And so this kind of also gives us a sense of how many species are represented or, you know, we see this fall off in the number of strains. So again, it's giving us a visual description of what's going on. And it's also allowing us to play with factors. It's also allowing us to play with if else. And it's also allowing us to play some more with these are marked down documents. So that if I return to my code and go to exploratory. And this was taxa representation.md. Yeah, those tabs get really screwed up. I'm not sure what that's about. But here we see our plot and it looks a little bit nicer than what I have down here in the lower right. And it all looks good. So in the next episode, we'll move forward in talking about how we can summarize the number of Amplicon sequence variants within a taxonomic group as well as how well they show up between or how often they show up between different taxonomic groups. And we'll move forward and we'll talk about this thing that you've seen me perhaps get hung up on before, which is the dot groups argument in summarize and talk about what's going on there and what might be some good practices in terms of using the dot groups argument. So keep practicing. Hopefully this made some sense to you. So you're not going to be changing names of your variables anymore to get them in the right order, but you feel confident in using the factor function to do that for you. So keep practicing. Be sure you like and subscribe the video. Tell your friends what we're doing here and we'll see you next time for another episode of Code Club.