Welcome back, folks, to another episode of Code Club. I'm your host, Pat Schloss. Thanks for coming back for another episode. In the last episode, we learned about the separate function from the tidyr package. We used that function to parse the RDP column from the rrnDB metadata file that we've been working with. This allowed us to create distinct columns in our data frame for each taxonomic level from kingdom to species. But along the way, we noticed that there are about 200 genomes or so that had multiple taxonomic assignments. This happened because the folks at the rrnDB used the RDP's classifier to classify each copy of the 16S rRNA gene in each genome.

As I've been suggesting throughout these episodes, this is part of the problem with amplicon sequence variants, also called ASVs. What's the problem? Well, 16S rRNA gene sequences from the same genome aren't identical. And by treating them as identical, we run the risk of splitting one genome into multiple taxonomic groups. Instead of having multiple taxonomies per genome, we'd really like to have a single classification for each genome. Taxonomies should be organism specific and not gene specific. It doesn't make sense for each gene to have a different taxonomy if it's coming from the same genome.

So after thinking about this issue a bit, I've decided I'm going to scrap what we've done for the RDP column and instead retrieve the NCBI taxonomy. Well, unfortunately, although the rrnDB metadata file has a column for the NCBI taxonomy, there's not actually anything in it. I suppose I could email them and ask them to put it in, but eh, that's such a bother. Therefore, what we need to do is to trace back the taxonomy of each sequence using some other files that we can download from the NCBI. You'll recall that our metadata file has a column for the tax ID for each of our genomes. So we can get a file that gives us the taxon ID of the parent node of each of the taxon IDs that we have for our genome sequences.
Then by getting the parent node of each parent node, we can trace each lineage back to the root of the tree. We can also get the scientific name for each taxon ID so that we can always bring a name back to the story. So how do we do this? Well, that's what we're going to talk about today.

We've previously talked about doing an inner join to join two data frames to each other. Well, we can also do an inner join to join a data frame to itself. This is called a self-join. Imagine doing an inner join between our metadata and the data frame that maps our nodes to their parent nodes. We could keep doing this, joining the output of that to the mapping file on and on and on until all of our parent nodes are at the root of the tree. Along the way in today's episode, we'll learn about the unite function for uniting two columns together. We'll also see some of our old friends like separate, anti_join, left_join, pivot_longer, and pivot_wider. You'll recall that I covered a couple of these functions in previous episodes without explicitly focusing in on them. And so what I'd like to do in today's episode is re-expose you to them and see how you can use them in different contexts.

Even if you're only watching this video to learn more about R and don't know what a 16S rRNA gene is or don't have a clue what an amplicon sequence variant is, I'm sure you'll get a lot out of today's video. Please take the time to follow along on your own computer. If you haven't been following along but would like to, welcome. Please be sure to check out the blog post that accompanies this video, where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's video is below in the notes.

So you'll recall that at the beginning of the last episode we actually made two issues. One was to generate a taxonomy string using the RDP. The other was for the NCBI.
We didn't know at the time that the RDP would have some issues with it. So it's good that we already filed this issue. And in this issue, you'll recall that we have a number of different links. Let me go ahead and start by opening up my RStudio project. Here I am in my project root directory. If we go ahead and scrap what we did here in get genome ID taxonomy, we probably don't really need a lot of the things that we had for the previous issue, right? So sadly, I'm gonna get rid of all this stuff. What we started with was the metadata, so that's gonna be good. The species subspecies lookup, I'm pretty sure we don't actually need. This make-up tax was there because we used these numbers that we had to add, since they weren't in the summary data from the rrnDB, which is also this stuff here. So I'm gonna go ahead and get rid of all of this stuff, and I don't think we need it. So that species subspecies lookup file isn't really needed. I'm gonna leave this test statement in here.

So if we return to our Makefile, you'll recall that we did have this rule to get the species subspecies lookup TSV, and we used this code of get species subspecies lookup. So I'm gonna change this file to get NCBI tax lookup. And so then we'll change this here, and I'm gonna go ahead and rename that file. You might be tempted to just rename it, but remember that this file is under version control with Git. And so we need to treat it specially. So we'll use git mv on code, get, what was it, species subspecies lookup. And we wanna change that to code, get, I think it was what I said, NCBI tax lookup dot sh. And so if we do git status, we see something new here, right? That it's been renamed.

And this reminds me also that I need to create a new branch, a new issue branch. So I'll do git branch issue 26. Oh, gotta spell it right. git checkout issue 26. Great. And then we do git status and we see that we're renaming the file.
Good. And if I now go look at get NCBI tax lookup, I don't want taxcat. I want taxdmp. And I'm not sure what output files I'm gonna want. So let me go ahead and run these two lines so I know what kind of output we're gonna get. So again, retrieve it from NCBI. This takes a couple of seconds. While this is downloading and opening up, it's a good opportunity to go ahead and like and subscribe to the Riffomonas channel. Be sure you click on the bell so you know when the next episode is released.

All right. So we see we get all these new dmp files outputted. The two that I'm most interested in right now are names.dmp and nodes.dmp. So let me copy these over and I'm gonna do mv on that. We're gonna rename these, and I copied the wrong thing. All right. And so I'm gonna name this NCBI nodes lookup dot dmp, and I'm gonna call the other one names lookup, not nodes. Well, I copy this down. Oh, I've got it. This is gonna be names lookup. Okay. And we can make these the new targets in our Makefile. All right. So this is gonna be, and I'm gonna make that TSV. I'll go back and change that. And this should be nodes lookup. And because we go across a line, we can put a backslash in there. And there's my backslash. Tab this in a little bit and that all looks good. Let me go ahead and come into this file and change that dmp to TSV. They're not really tab-separated values files, as we'll see here in a minute, but it makes me feel good to have it as a TSV. And that's important.

Okay, make that. So we already downloaded it. It opened that up. Let me look at the timestamp here. So 9:27. Yeah, for some reason, the timestamp is getting kind of funky here because this is actually older than what I already had. Let me look at code, 9:49. So this should be doing that, but I'm not sure what's going on. Let me go ahead and remove the zip file and anything that ends in dmp and rerun my make. And so this, yeah, let me look at the timestamps again. I see that taxdmp is 9:27.
So this is from a bit ago. I don't know why I'm getting an older timestamp. Yeah, I'm not really sure what's going on there, but this is not going to be an episode on debugging make. So we'll call this good for now, and later on we'll perhaps come back and figure out what's going on with that.

All right, so we've gotten our nodes lookup and our names lookup. Let's come back to RStudio, and we wanna go ahead and read in our nodes. So we'll do a read_delim on data, raw, NCBI nodes. No, references, so this is nodes lookup. It's gonna take a few steps to get this read in. I forgot to run all my other stuff, like loading the tidyverse, and I need to give it a delimiter. So if we look at head on the file in data references, we'll see that the delimiters are these vertical pipes. And so that's what we wanna use. And so we'll do delim equals the vertical pipe.

The other thing that we want: if we do this, we see that we get a lot of white space in our column names, right? So there's a tab, a column name, and then a tab, and that's a problem. There's also a problem that we don't actually have column headings, but we'll get to that in a second. So we can do trim_ws equals TRUE, so trim white space equals true, and you can already see from these column headings that it's removed the tabs around the names.

So we need to get these column names now. We can use col_names, and we're gonna feed that in as a vector. Let me get myself oriented here, back to data references, to the lowercase readme.txt, and these are the column headings. Really all we care about is tax ID, parent tax ID, and the rank, but we still need to give names to everything else. All right, so we'll go ahead and copy and paste these in. I'm gonna put the column headings in quotes. Really it doesn't matter what I do here because these are fields that I don't care about. All right, and then I wanna get rid of all this other stuff to the right and replace it with a comma.
So copying and pasting speeds things up, but sometimes you still have to clean up the text once you've pasted it. Just a few more here. Again, what we're doing is creating a vector that contains the names of the columns that we're gonna read in. And that's the last one. So we close that with a closing parenthesis, and these three are the ones I'm most interested in, so I'll probably just leave them on their own line for organization's sake. Just kind of make it look pretty. All right, so we can now run this read_delim. And I think I have one too many closing parentheses. Using RStudio is helpful for figuring out those little syntax things that you might screw up along the way. So this read things in pretty nicely.

All right, so we've got our nodes read in. Let's assign that to a variable, and we're gonna wanna get names. And it's gonna be fairly similar, where we're gonna do read_delim and then data references and then NCBI names lookup, and then delim equals the pipe, and then trim_ws equals TRUE. And I realize this now has gone off the side of the screen, so just tidy this up a bit, whatever. Can't worry too much about spaces. All right, white space is not the topic of today's episode. I have to remind myself of that. Get rid of that quote.

All right, and we wanna get in the column names. And so let me come and do head again. And okay, that's what it looks like. But again, we need to look at our readme file. Yep, I already have that open here. So names.dmp, that's what it was originally, has these fields. So we don't need that closing parenthesis, but then we need col_names equals c, open parenthesis, and we need tax ID, name text, unique name, and name class. We'll get rid of all this other stuff and make it a vector. And let's go ahead and run that, and it's complaining. Why is it complaining? Because I think I'm forgetting a closing parenthesis. Okay, so this reads in names.
And it's complaining because it expected four columns and I actually got five columns. Let's look back at this head. What you'll see is that if the pipe is the delimiter, then we've got a pipe at the very end of each line. And so there's actually something missing from my column names, because that trailing pipe is creating a fake column at the end. So I'm going to call that blank. Oh, not that one. Yeah, that one. So if we read this in, that doesn't give an error.

And now if we look at names, we see that we have a tax ID, we have the name, we have the unique name, which I'm not really sure what we're going to do with, and then we have the name class. So you'll see that for bacteria, for example, there's a scientific name and there's a blast name. Some of these have in-part names. What we really want is the scientific name. And so I'm going to do a filter so that name class equals scientific name. And again, if I do names, I see that I've got a scientific name for everything. Really all I want is the tax ID and the name text, and I can get rid of these other columns. So I'm going to go ahead and do a select on tax ID and names text. And this will then get rid of everything else. names_txt doesn't exist. It's name_txt. Again, all the typos are baked in. That's the way I program. So if you have typos, don't feel bad. So again, now we have a table that's got the tax ID and the name for that.

And coming back to nodes, I'm a bit surprised that when I ran nodes earlier, it didn't give me a similar error, because I would think that it would have also had, yeah, it gave me the same error, where it was looking for an extra column. So I'm going to add a blank to that as well. And let's rerun nodes, and it doesn't give us that error message, which is great. And what I want is to select the tax ID and the, I'm going to call it parent tax ID, and we want the rank. Now I need to rename parent tax ID from the backticked name, backtick parent space tax ID backtick.
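As a minimal sketch of the read-in steps described above: the snippet below writes three made-up rows in the pipe-delimited layout of names.dmp to a temporary file, reads them with read_delim, and keeps only the scientific names. The temporary file, the toy rows, and the object name names_lookup are assumptions for illustration; they are not the episode's actual files or variable names.

```r
library(readr)
library(dplyr)

# Three made-up rows in the names.dmp layout: fields are separated by
# tab-pipe-tab and every row ends with a trailing pipe.
dmp_lines <- c(
  "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|",
  "2\t|\teubacteria\t|\t\t|\tgenbank common name\t|",
  "1224\t|\tProteobacteria\t|\t\t|\tscientific name\t|"
)
tmp <- tempfile(fileext = ".dmp")
writeLines(dmp_lines, tmp)

names_lookup <- read_delim(
  tmp,
  delim = "|",
  trim_ws = TRUE,  # strips the tabs that surround each field
  # The trailing pipe produces a fifth, empty field, so it needs a name too.
  col_names = c("tax_id", "name_txt", "unique_name", "name_class", "blank")
) %>%
  filter(name_class == "scientific name") %>%
  select(tax_id, name_txt)

print(names_lookup)
```

Naming the empty fifth column explicitly is what avoids the "expected four columns, got five" complaint; the column is then dropped by select.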
I'm renaming this because I don't like having to type those backticks. I don't like spaces. And so if everything can be underscored without spaces, then our lives will be happier. We then run this. And if we then look at nodes, we see that we've got the tax ID, the parent tax ID, and the rank. And so what you can imagine is that we could join this data frame on itself, joining tax ID to parent tax ID, to kind of work back up the tree. And so that's what we need to do now with our metadata file. I'm going to go ahead and get rid of all of this for right now and save it. And so what we've got is our metadata, our nodes data, and our names data. And I think that's enough to regenerate the NCBI taxonomy for each of our genome sequences.

To illustrate what we're going to do with these self-joins, we're going to do inner_join on nodes comma metadata. And what are we going to join by? In nodes, you'll recall, I think that was the data frame I just had up, we have tax ID. And what we want to get for each of our genome sequences then is the parent tax ID. So we will do something equals something. And so we want nodes. No, we want tax ID. And in metadata, I think it's also called, well, I called that subspecies ID. That's fine. We could go back and change that to be tax ID, but whatever. So we're going to join those two together.

And you know what, I'm probably going to go ahead and clean that up, because there are a few things I want to do with metadata here. I don't need the RDP because we're going to scrap that. And while I'm here, I'm going to go ahead and call this tax ID instead of subspecies ID. And if I run all this now, right. It didn't like that because I removed it. So tax ID, we run that metadata. And so now we have the genome ID, the tax ID, and the scientific name.
But this reminds me that after running this, we have a column of our RDP data, and for these first couple hundred genome IDs, they're rrnDB numbers. They're not actually genome sequence numbers. And that's because the copy numbers were determined empirically. I think I mentioned that in the last episode. So the RDP column for those is an NA. So I'll do filter, not is.na of rdp. So I'm going to get rid of everything that has an NA for the RDP classification. And this then starts us with our genome IDs. Again, it cleans up the dataset a little bit. And then the output of all this, if I look at metadata, is that we have a genome ID, a tax ID, and a scientific name.

Now, if we come back to our inner join, we're going to take the nodes data and the metadata and we're going to join them on the tax ID column. I don't need to do tax ID equals tax ID, but we'll see where we're going as I run this. So what you'll see is that we have the tax ID, the parent tax ID, the rank, the genome ID, and the scientific name. Okay, that's great.

So what I'm going to do is take the tax ID and the rank and join those together to make a new column. That way, when I do my next join, I don't get confused between different tax IDs, right? So the next thing we're gonna do effectively is an inner join with nodes and what we've already got. And so we're gonna then do by equals tax ID equaling parent tax ID, right? And so this then starts getting really confusing and messy. I've got an extra parenthesis there. Let's see, let me try this again. And I'm forgetting a parenthesis here at the end, so I'll just put that there. And so you'll see that these column headings start getting .x and .y at the end. So again, to avoid that, what I'll do is unite, and that brings things together. And I'm going to call this t underscore, or tr, for tax ID and rank. And I will then say, what am I gonna join together? Let me get my column headings again. So tax ID and rank.
So this is the syntax. And I think I can do sep, yeah, sep, and I'm gonna use an underscore to separate those. So this should look familiar from last time, where we could put in a separator, and later we'll use separate to split by that separator. Now when I run this, you'll see that we've got a new column of tr. I'm gonna go ahead and call this tr underscore a because we're gonna do a bunch of these as we move back up the tree, right? So again, what we get is tr_a, and we have our parent tax ID, our genome ID, and our scientific name.

And so now when we do this next level of the join, we see that we have our tr_a, but then we again have our tax ID and our rank and our parent, right? And so we can repeat this unite to make tr_b on tax ID and rank, right? And we've gotta be sure to put a pipe at the end of that line. Ah, I've got too many pipes and not enough pipes. Okay. Again, we see the same type of thing. We've got tr_a, tr_b, and the parent tax ID.

What I'd like to do is keep repeating this a bunch of times and count the number of times each parent tax ID comes up. And we're gonna repeat it until we only have one parent tax ID, because that will indicate that every sequence has gone to the root, all right? So I'm gonna copy this down a bunch of times and I'm gonna change my tr to c, d, e, f. Let's see where that gets us. That's through six levels. And we see that we still have 85 parent levels. One is the root, two is bacteria. So many of our things are already back to the root. And I'm gonna again keep copying this down and updating my tr to be g, h, i. And we'll come back soon to clean up all these tr names. Okay, so we're down to five taxonomic levels, or five parent tax IDs. So we're getting close. We go out to k, l, m. Ha, we made it all the way. Let me remove one of these just to see if we actually needed the m. Yeah, and we do need the m. All right, good.
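The repeated join-then-unite pattern described above can be sketched on a toy tree. This is a minimal illustration under stated assumptions: the four-row nodes table and the single-genome metadata table are made up, and only three levels are needed here, whereas the episode needed thirteen (tr_a through tr_m) for the real data.

```r
library(dplyr)
library(tidyr)

# Toy parent-pointer table; the root (tax_id 1) is its own parent.
nodes <- tibble(
  tax_id        = c(1, 2, 1224, 562),
  parent_tax_id = c(1, 1, 2, 1224),
  rank          = c("no rank", "superkingdom", "phylum", "species")
)

# Hypothetical one-genome metadata table.
metadata <- tibble(genome_id = "genome_A", tax_id = 562)

tree <- inner_join(nodes, metadata, by = "tax_id") %>%
  # Fold tax_id and rank into one column so the next join can reuse the names.
  unite(tr_a, tax_id, rank, sep = "_") %>%
  # Self-join: look up each parent_tax_id as a tax_id in nodes.
  inner_join(nodes, ., by = c("tax_id" = "parent_tax_id")) %>%
  unite(tr_b, tax_id, rank, sep = "_") %>%
  inner_join(nodes, ., by = c("tax_id" = "parent_tax_id")) %>%
  unite(tr_c, tax_id, rank, sep = "_")

print(tree)

# When every lineage has reached the root, only one parent remains.
print(count(tree, parent_tax_id))
```

Because unite drops tax_id and rank after pasting them together, each new join brings in fresh tax_id and rank columns without any .x or .y suffixes.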
So again, what we did is we kept joining our data frames back to nodes, to that parent node, so that we could build out the tree. And we kept doing it until we got to the point where we only had one parent node, which again is our root. If I remove this and run it out, what we see is that we've got our various tr names and our parent tax ID. So I wanna create a test. And I'm first gonna call all these inner joins tree. Now, I wanna step back and say there's probably a more elegant way to do this using something called recursion. But this works pretty well for us.

I'm gonna create two tests. So again, I'm gonna create tree. And we saw a test in the last episode. So I'm gonna say tree, and I'm gonna do my count on parent tax ID. And what I want is for this to be a data frame that has one row, right? So I'll do nrow, all right? And so that's one. So I'm gonna create this as test A. And I'm gonna do stopifnot test A equals equals one. And run these, right? And so it doesn't complain, so it's good. So again, if the database gets updated and say I needed to go another level, then it will complain, and it'll complain here. And I'll know then that I need to go back and add another level.

The other thing I wanna do is make sure that everything in my metadata is represented in my tree, right? So we talked about anti-joins last time. What we can do is anti_join on metadata and tree. And I think what we want for tree is the genome ID. Yeah, so we wanna make sure that all the genome IDs in our metadata are also in our tree. And so we'll do by equals genome ID. And then I'm gonna do nrow again. And I'll do stopifnot test B equals equals one. If I run test B, genome ID not found. I think this needs to be in quotes. And then if I run this next line, on line 80, error: test B equal to one is not true. Okay, well, let me look at test B, and we find that it's got 32 rows. Uh-oh.
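The two tests described above can be sketched as follows. The tiny metadata and tree tables are made up for illustration, and the second check uses zero as the expected row count (an anti-join that finds nothing returns no rows). The third genome is deliberately missing from the toy tree so that the failing case is visible.

```r
library(dplyr)

# Made-up tables: genome g3 never made it into the tree.
metadata <- tibble(
  genome_id = c("g1", "g2", "g3"),
  tax_id    = c(562, 1423, 62928)
)
tree <- tibble(
  genome_id     = c("g1", "g2"),
  parent_tax_id = c(1, 1)
)

# Test A: every lineage has reached the same (root) parent.
test_a <- tree %>% count(parent_tax_id) %>% nrow()
stopifnot(test_a == 1)

# Test B: rows of metadata with no match in tree.
unmatched <- anti_join(metadata, tree, by = "genome_id")
test_b <- nrow(unmatched)
print(unmatched)

# stopifnot(test_b == 0) would halt the script here; wrap it in try()
# so this sketch keeps running and records that the check failed.
test_b_failed <- inherits(try(stopifnot(test_b == 0), silent = TRUE), "try-error")
print(test_b_failed)
```

Note that the join column is given as a quoted string in by; an unquoted genome_id is what triggers the "not found" error mentioned above.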
So if I run just that anti-join part, I find there are effectively 32 tax IDs that aren't represented in my tree. This reminds me that there's another file that we downloaded into data references from that taxdmp file, called merged.dmp. And if I look at it, merged nodes: this maps old tax IDs to new tax IDs. And so what I bet is that if I take one of the tax IDs I couldn't find, it shows up in merged.dmp. All right, so let me, where are you, RStudio? Let's take this tax ID, and I will use grep, which I think we've talked about in a previous episode. So that's my current node that it can't find in my tree data frame. And I'm gonna look for this in data references, in merged.dmp, and we find, sure enough, this 62928 has been renamed. And so what I need to do is bring in this merged file and then join that with my metadata file to update those nodes.

All right, so this is another file we're gonna want. So merged.dmp, NCBI merged lookup. All right, and I'll run this back in my terminal, so that if I look at data references, I now see that I've got, where'd you go? NCBI merged lookup. Okay, yeah, for some reason my timestamps are doing really funky things. I'm not sure what's going on there. Anyway, I guess I could always go ahead and add a touch on all these. So the NCBI star lookup, and that will take care of that problem. It's kind of brute force.

All right, so we need to come back, and let me get the column names. So it's old tax ID, new tax ID. And come back to RStudio, because we're gonna wanna read in the merged file, and we'll do merged, read_delim. And it's gonna be data references, NCBI merged lookup. delim equals the pipe. We're gonna do trim_ws equals TRUE. And then our col_names are going to be, what'd I put? Old tax ID, new tax ID. Those probably aren't the names I really wanna use because they're not gonna make merging super useful. Oops, all right, but we'll work with that.
So again, we've got this problem where it's got the blank extra column, and we can then get rid of that blank column. And now if we look at merged, we get our old tax IDs and our new tax IDs. So what I'm gonna do is actually throw this back up ahead of metadata, and I'm gonna add to my metadata data frame. To test things out, because it takes a little bit of time to read things in, what I like to do, as you've seen, is not name the data frame that I'm creating. So I will do metadata, and I will then do inner_join, and I'm gonna do what? Metadata comma merged, by equals, and then in metadata it's gonna be tax ID and in merged it's gonna be old tax ID. And I don't need a comma, I need an equals there. So if I run that: join columns must be present in data, merged. Oh, so I think this should be a period, and we're good. And we see now that we've got tax ID and new tax ID.

And what I'm realizing is that this now only has 32 rows and it's lost everything else, and that's because of how an inner join works. And so what I would rather do is a left join. So you'll recall a couple of episodes ago, I said never use left joins. Well, it turns out I guess I do. What a left join does is keep all the rows of the data frame on the left, even if they're missing from the data frame on the right. The merged data frame only has the things that have been merged, that have had their IDs changed. So if we run this, we find the 15,000, and for many of these, the new tax ID is NA because it wasn't found in the merged data frame.

What we'd like to do with this then is mutate the tax ID column. And we're gonna use a function called ifelse. So we'll use ifelse, not is.na of new tax ID. So if new tax ID is not an NA value, then after the comma, we want tax ID to be that new tax ID, okay? Otherwise, and we have two commas to create three arguments for ifelse, otherwise we wanna keep tax ID, okay?
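A minimal sketch of the left-join update described above, using made-up rows. The replacement ID 99999 is invented for illustration and is not the real successor of 62928; the sketch also swaps in dplyr's if_else for the conditional, which requires both branches to have the same type (they do here, since both are numeric).

```r
library(dplyr)

metadata <- tibble(
  genome_id = c("g1", "g2"),
  tax_id    = c(562, 62928)
)

# Hypothetical merged lookup: the new ID is a placeholder, not a real mapping.
merged <- tibble(old_tax_id = 62928, new_tax_id = 99999)

# An inner join keeps only the rows that were merged...
only_merged <- inner_join(metadata, merged, by = c("tax_id" = "old_tax_id"))

# ...whereas a left join keeps every genome and fills new_tax_id with NA
# wherever the tax ID was never merged.
updated <- metadata %>%
  left_join(merged, by = c("tax_id" = "old_tax_id")) %>%
  mutate(tax_id = if_else(!is.na(new_tax_id), new_tax_id, tax_id)) %>%
  select(-new_tax_id)

print(updated)
```

The three arguments to the conditional are the test, the value to use when the test is true, and the value to use otherwise, in that order.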
And if we run this, we're gonna see that these haven't been updated. And we'll come back and see this later, but for now, I'm gonna trust the process, trust that it worked, and we'll select to remove new tax ID. And I'm going to add this to my metadata pipeline. And so now if I look at metadata, that all is good. And I'm going to rerun creating my tree. No errors there. My first test works. Test B, ah, and that doesn't work. So I see the value of test B is zero, which tells me that I actually created a data frame not with one row, but with no rows. And so this test should be zero, okay? So my test failed the first time because I had stuff in it, and my test failed the second time because I expected to have one row and it actually had no rows. So now everything is great.

Okay, now we've got our tree, and our tree again has everything we'd want. We could get rid of this parent tax ID, but we've got all these tr values, our genome ID, and our scientific name. What I want is my genome ID, my scientific name, and then a column for each taxonomic level. And parent tax ID, I'm going to select to get rid of that, okay? Let me move all this up so we can see the data frame that gets spit out.

And again, I know I'm going fast. There's a lot of content in today's episode. If you go to the show notes, there's a link to a blog post for today that will show you how to get the file that I'm working on here. You can also always go to the GitHub repository to see what's going on, to see kind of the full package.

All right, so the next thing we're going to do is pivot longer. We're going to take all these tr columns and gather them together to make a five-column, or rather four-column, data frame with our genome ID, our scientific name, the column heading, and the row value. So we'll do pivot_longer, and what we'll do is cols.
And again, the reason I use that tr, like we saw in the last episode, is that I can use starts_with, and I can do tr underscore, okay? So this will give me all the columns that start with tr. And for names_to, I'm going to do tr, and for values_to, I'm going to do ID rank, okay? And this should work. It does. And so we've got the genome ID, the scientific name, the tr, and the ID rank. I do not care about tr, okay? So what we can do is select minus tr to get rid of that column, and now we have ID rank and everything looks good. That space in no rank really freaks me out, so I'll chill out a little bit.

So now what we're going to do is separate ID rank. We saw separate last time, and so we'll separate ID rank into, and we'll create a vector of column names. So we'll say tax underscore ID and then rank. And so now, if this spits it out, we should get, yeah, we got it. We got four names: genome ID, scientific name, tax ID, and rank. And I'm seeing that the space in no rank causes problems, and that's because I forgot to put in the sep, and the sep is the underscore. And this should work now. No warning messages, no errors, and we have genome ID, scientific name, tax ID, and rank. We're good.

I'm going to go ahead and mutate. Nope, I'm not going to mutate. I'm going to go ahead and filter, and we saw this last time. I'm going to filter rank to look for those rank values that are in, not the pipe, percent in percent. And it's going to be superkingdom. That's what they call a kingdom. Phylum, class, order, family, and genus. Before I do that, let me go ahead and do count on rank to see all the different types of ranks that are in this data frame. And so you'll see there are all sorts of things. Let me go ahead and print everything. Looks like there are about 24 different names here. You know, there's biotype, there's clade, class, forma specialis, all sorts of crazy stuff.
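The reshaping steps described above can be sketched on a one-genome toy tree. The table contents are made up for illustration; the column names tr, id_rank, tax_id, and rank follow the ones spoken in the episode.

```r
library(dplyr)
library(tidyr)

# Toy version of the wide tree: one united "taxid_rank" value per level.
tree <- tibble(
  genome_id       = "g1",
  scientific_name = "Escherichia coli",
  tr_a            = "562_species",
  tr_b            = "1224_phylum",
  tr_c            = "1_no rank"
)

long <- tree %>%
  # Stack every tr_* column into name/value pairs.
  pivot_longer(cols = starts_with("tr_"), names_to = "tr", values_to = "id_rank") %>%
  # The level letter is no longer needed.
  select(-tr) %>%
  # Split only on the underscore. The default separator matches any
  # non-alphanumeric character, so the space in "no rank" would create
  # an extra piece and trigger a warning.
  separate(id_rank, into = c("tax_id", "rank"), sep = "_")

print(long)
```

At this point tax_id is still a character column, because separate splits strings into strings unless it is told to convert them.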
But really I want to keep things simple and keep it to kingdom, phylum, class, order, family, genus, and of course, what do we do here? And species, closing parenthesis. And I can get rid of this count row. We'll run that. And I'm going to rename, or we'll wait. Let's see, let me break this up a little bit so it's a little bit cleaner.

And now what we can do is pivot wider. Again, we saw this in the last episode. So we can pivot_wider and take names from rank and values from tax ID. And that should be good, great. And so again, we see our superkingdom, phylum, class, genus, order, family. I'm not sure where my species went. So I see where my species went. In my filter, I used species, parenthesis, quote, rather than quote and then parenthesis. So let's try this again. And so we see now we've got superkingdom, phylum, class, species, genus, order, family. Not the ideal order. We'd like to rename superkingdom to be kingdom. I don't know why they use superkingdom instead of kingdom or domain. And that gets our nice names.

Now, I don't care about these numbers. That's not what I want. I want actual names to get plugged in here. So what I'm gonna do is go in before this pivot_wider. Let's see, if we run this, we remember that through that filter, we get genome ID, scientific name, tax ID, and rank. What I'd like to do here now is another inner join. We're doing like a hundred inner joins in this episode. And so I will then add in a join to our names data frame and do it by tax ID. And if I do that, I think we'll get our taxonomic names as a new column. Let's see: can't join on x tax ID with y tax ID because of incompatible types. Let's see, what kind of types are these? So this is a character. So my tax ID came out as a character, and you'll recall that when we had this problem with separate in the last episode, we could do convert equals TRUE.
And so again, if we split apart, you know, if it was 2 underscore superkingdom, when we separate it, the 2 becomes a character and superkingdom is a character. But if we use convert equals TRUE, that 2 will become a number. Let's give this another shot. And we now see that it joined. We have our tax ID, our rank, and our name, which is great. And I am going to go ahead, that's good. I'm gonna get rid of my tax ID, and I will then bring back in my pivot_wider. So names from rank, and my values from names underscore txt. And then we'll rename superkingdom to be kingdom. And I think we're gonna be in good shape. names_txt doesn't exist. What? Okay, let's try this again. name_txt, not names. Wonderful, this all worked really well.

And so what we see is we have our genome ID, our scientific name, kingdom, phylum, class, genus, order, species, family. And I'm going to write this out, but before I do so, I'm gonna write out the columns in the order that I want them. So we'll do genome ID, scientific name, kingdom, phylum, class, order, family, genus, species. Again, the order doesn't matter to the computer. The order matters to me and my sensibilities. And finally we'll then do a write_tsv, and it will go to data, and where did it go? Let's go look back at our Makefile. Data references, genome ID RDP taxonomy. And get rid of that RDP in the name. And this is going to be what we output it as. Okay, so that's all good. And let me run it. And now if I do a head on this, winning, right? We have exactly what we had back in R.

Now we need to clean some stuff up in our Makefile. So we want this. We don't want these two files. So what we want is, what did I call it? In our references, we called them NCBI nodes lookup. And we had names and we had merged lookup. So that's all good, I believe. And I'm gonna modify this get NCBI lookup script to remove references star dmp.
So we'll get rid of everything else that was a dump file, and let's go ahead and leave that readme text file. I'm not going to make it a target or anything, but I'm sure we'll need it at some point. So I'll go ahead and save that and update the outputs here in my header. And again, it was nodes and merged. So I think that should be really good. We can get rid of that, and we should be good here. I want to go ahead and remove the, what was it, my genome ID RDP taxonomy file. And that all looks good.
We see that we've modified our Makefile, we've modified our genome ID taxonomy R script, we've modified our NCBI tax lookup.sh, and we've renamed it from get SPP lookup to get NCBI tax lookup. So again, using git mv allows you to keep track of the history of the file, even going back to its former name. I think we're in good shape. Let's go ahead and git add the Makefile. The file that I renamed has already been staged, so I don't need to add it. Then code, get genome ID taxonomy.R, and code, get NCBI lookup.sh. That's all good. git status, those are great. git commit -m, and I'll say "retrieve NCBI taxonomy for each genome, closes #26", then git checkout master and git merge issue 26. Wonderful.
I now want to go ahead and make that target. So we'll do make data, references. Why am I forgetting what it's called? Genome ID taxonomy.tsv. So again, it's going to download the stuff, inflate it, and run all that. Data, references, and I see I get my merged lookup. I think it did update everything, and it says everything is up to date. So again, something funky is going on with my timestamps. I'm not totally sure what's going on. Oh, I know what the problem is. The problem is I didn't, no, I had those as dependencies. Ah, I think the problem here is that I put an extra backslash at the end of that last requirement. So now let me try this again, and it's now going to run the R code to build out everything. So I'm glad I checked it before I pushed it up.
Let me git add my Makefile and amend that commit. That looks good. I'll quit out of this. We're in good shape, and I can do git push and close out the issue.
Again, I know this episode had a lot going on. It's longer than a lot of the other episodes, but it really shows how we pulled together a lot of the concepts we've been talking about in recent episodes. We talked about separate, we talked about unite. We did a lot of inner joins. We did pivot_longer and pivot_wider, and it all came together for one problem, which was to take our tax ID for each genome sequence and recreate a taxonomic string for each of our genome sequences. It's hard to break it down into smaller chunks than that. And along the way, you saw how we did some problem solving and how we created some tests to make sure that everything was accounted for as we went. So now we have this file where, for every one of our genome sequences, we know its scientific name and its kingdom, phylum, class, order, family, genus, and species. We can now go back to our ASVs and group by any of those taxonomic levels to get information about how unique an amplicon sequence variant is to any of those taxonomic levels, as well as how many ASVs there are for any of those taxonomic levels.
So again, this was a big step. I know it was a lot. Feel free to come back here to my code on GitHub, where if you go to code, and then get genome ID taxonomy.R, you can see everything that we did in today's episode. There's just a lot going on in here, and I really encourage you to come back and look through this in greater detail. Now, I'd love to see what you're doing in your own work with the concepts we talked about: these different types of joins, pivot_longer, pivot_wider, separate, and unite. Feel free to leave a comment below in the notes telling me how you're using them. If you have any questions, again, I know this is a lot of content.
We did a lot scientifically and we did a lot with R. So please tell me if you've got any questions, or if there are things you wonder whether we could have done a different way, and perhaps we can experiment with those in a future episode. So keep practicing. Please tell your friends about these Code Club episodes. I'd love to expand the reach. I know a lot of people are already benefiting from them. So please tell your friends, and we'll see you next time for another episode of Code Club.