 Welcome back for another episode of Code Club. It's great to have you here. In the last episode, we talked about joining data frames together that have the same structure. You'll remember that for each region of the 16S RNA gene, we have a file with accounts for the number of times each Amplicon Sequence variant, or ASV, shows up in each genome. Now, we used map underscore DFR to elegantly read in each file and depend the contents of each file together. This all worked because all of the files had the same column names and the same types of data in those columns. It was elegant because it didn't force us to repeat the same chunk of code over and over, and it allows us to easily add other regions as we progress with our project and perhaps decide to add another region. But what if we had different bits of information for the same genomes? For instance, imagine that for some reason we wanted to join those data together column-wise. To make that work, we would need to make sure that our data frames all had the same genomes or ASVs in the rows. Now, more likely, we would need to use a different set of methods, which we'll learn today called joins. The de-plier package has a number of joins for adding columns to data frames which describe similar entities. For example, let's think of two data frames. One listing the number of copies of the RN operon in several genomes, and another listing the number of ASVs in another set of genomes, and there might be overlap in the genomes that are represented in the two data frames. We would like to have a single data frame that has a column for the name of the genome, the number of copies of the RN operon, and the number of ASVs in those genomes. If you look at my example here, you'll see that the same genomes are not represented in both data frames. We can use the inner join command to join the two data frames together using the genome column. If the genome is missing from either data frame, it will not appear in the join data frame. There are other variants on the inner join function available in Deplier that treat missing data differently, full join, left join, right join. But I'm hard-pressed to remember when I've ever actually used them. Now, perhaps we had a bunch of genomes like we do here in our project, and it's hard to spot if any of the genomes are missing from the two data frames. To figure this out, we'll use a different join called an anti-join. In an anti-join, the rows from the left data frame that are missing from the right are outputted to the screen, or as a variable. We can switch the order of the data frames to see which rows in the second data frame are missing from the first. That's pretty slick, huh? In today's episode, we'll see how we can use inner join and anti-join to combine the metadata and taxonomy data that we downloaded from the RNDB, along with mapping data that we can get from the NCBI's taxonomy database. The goal is to create a file that will have columns indicating the name of each genome, perhaps the GenBank accession number, or the genome database accession number, that GCF identifier, along with the columns indicating the taxonomy data for the genomes. Then in future episodes, this will help us to group our ASV data by different taxonomic levels to assess how unique each ASV is to different taxonomic groups, and to see how many ASVs can represent each taxonomic group. Even if you're only watching this video to learn more about R, and don't have a clue what a 16S RNA gene is, much less an implicant sequence variant, I'm sure you'll get a lot out of today's episode. Please take the time to follow along on your own computer. If you haven't been following along but would like to, welcome. Please be sure to check out the blog post that accompanies this video, where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's episode is below in the notes. To get us going, I've already created an issue, issue number 24, which asks us to create a mapping file to translate between the genome ID and the taxonomic information. And so what we'd like to get out is a new file that contains the columns containing the genome GenBank accession number, the RDP taxonomy hierarchy, so things from the kingdom down to genus, the species name, and the scientific name, which will likely contain the strain name if it's available. Not every species name has a strain name with it. So we'll keep this information. And again, we'll see in future episodes how we can use this to group our implicant sequence variant data by different taxonomic levels. So this is issue 24. Let's go over to the terminal and I'll move to our project root directory. I'm on the master branch and I wanna go ahead and create my branch, issue branch. So I'll do get branch issue 24, get checkout, issue 24, and we're in good shape. Now what I'd like to do is let's go ahead and use Microsoft Excel. See, Microsoft Excel isn't worthless. It's good for many things. And I will open up in my data raw, which is where I think I have the RNDB, this TSV metadata file. And this has information about all of the files, all of the sequences. And so this is a tab separated value. So I'm gonna kind of go through this dialogue and finish importing that. I could have read it into R, but it's okay. And so it got parsed a little bit funny, I think. Yeah, it's kind of weird how it got parsed. Anyway, I'm just using it as a graphical way to look at the data. And what you'll see in the first column is the data source. Some of the data in the RNDB came from empirical studies where people looked at, kind of lab based methods of getting the copy numbers and others are these GCFs. The GCF files or accession IDs are the actual genome sequence data, which we've been working with as we've gone through our project. In the second column, if I come back up to the top here, we have our NCBI taxonomy ID data source. We have a scientific name here in column F. We have the RDP taxa based again on the genus name. In this column over here, we can see that we have the RDP taxonomy string. So this is our taxonic information. And what I'm looking for in an ideal world would be a column that only had the species name, but we're not gonna be so lucky. So if I look at this column that has the scientific names or whatever, you'll see that this is Escherichia coli, which is the genus and species name, but it also has all this strain information. I'd like to think that maybe we could find a way to parse this out, but I don't think it's so simple, right? So here's a Pasturella with subspecies that. So it's got subspecies, again, Escherichia had strain. It's just kind of a hot mess. And what we're gonna do is again, we're gonna think about how we can link this to other data to get the species information into a file like this. Now, another file that we have that we've gotten from the RDP, if I can open this up, maybe I'll do it through Excel, open and is this pan taxa stats NCBI. And so this is a tab separated value file and this contains summary output for each tax ID, right? So this B column in this R&D B 5.6, which I have behind here, links to this tax ID in column A of this pan taxa stats file. And then we see that the first column is species and that we get our names, our species names, right? The problem is, however, that not everything in B shows up in A of our pan taxa stats file because there's subspecies names, right? And so we need a way to link between a subspecies name and the species that it belongs to. What I'll just briefly also point out with this pan taxa stats file is this tells us for each species, so like say, I don't know, pick this one, Acinetobacter wolfii has one representative, it has six copies in its genome. And so again, we can kind of see all these stats. I'm not so concerned about these stats, we can recreate them on our own, but what I wanna use this for is to link this tax ID from the metadata file to this tax ID in the pan taxa stats to then get us the species, right? So what you're thinking about as I was talking about the joins, we kinda wanna be able to join these two files together. The challenge is that we're gonna need a third file to link this tax ID for the subspecies name to the species name. So to do that, let's go ahead and go over to the NCBI website. So ncbi.nlm.nih.gov and they have so much data and most people only use the web interface to access that data. If I come down here on the left to taxonomy, this brings me to the taxonomy database and you can search for all sorts of things in here, all sorts of organisms and find linked data that's a linked to ashrichia coli or homo sapiens or whatever you're interested in. But you can also download all this data, right? So we can, through the FTP site, if I click on this link, this brings us to kind of behind the scenes, the back end of the website where they make their data available. And I'm looking through these and what I generally look for are the readme files. And so, let's see, what looks interesting? I'm gonna go ahead and start back at, let's see, the NCBI taxonomy genus. So looking at this, this isn't necessarily what I'm interested in. It's kind of explaining how they name things. Let me look at this next one, taxcatreadme.txt. Again, I'm looking for something that will allow me to relate the subspecies tax ID number to the species tax ID number. And so taxcat dump compressed archive contains a single file. Categories.dmp, it contains a single line for each node that is at or below the species level. We're winning. The first column is the top level category. So what kind of domain it belongs to. The third column is the tax ID itself. And the second is the corresponding species level tax ID. Winning, right? And so as an example, this is like if they knew exactly what we were looking for, it's beautiful. So we see acetylobis sacrovorans is the species name. Acetylobis sacrovorans, I think they just wanted to get me to say that, 3, 4, 5, 15. So this is the strain name. And so what you'll see is that the entry for sacrovorans is this first one where the subspecies and the species have the same tax ID. But this other one that has the strain 3, 4, 5, hyphen 15 is the 6, 6, 6, 5, 10. So that's in the third column and that maps to the species name. So this is what we want. We want this tax cat file and in it contains this category.dmp. Categories.dmp. So again, what we want is this tax cat, tax cat.zip. And so we're gonna download that and decompress it. We've seen how to do that in some of the earlier episodes. And so instead of kind of reinventing the wheel if you will, and what we can relate this to is kind of the get rrndbfiles.sh script. And we can, if you'll recall, those files came to us as zip files. This file that we're gonna download is also a zip file. And what I'm gonna do is create a new file and encode and this will be get species subspecieslookup.sh. And I'm gonna copy this from my rndb script to my new script. Go ahead and close the rndb so I'm not gonna accidentally write over it. And we will see the inputs. There's no input. The output is a file that will be data references. SPSPPlookup.tsv. We don't need this target or path processing information. And the path is going to be data references and the URL. I'm gonna come back to this taxcat.zip. Copy that link and paste that in there. That looks good. Let me just make the font smidge smaller. Hopefully you can still see that okay. We will then unzip and again, this is gonna go to data references and this is going to be taxcat.zip. And I think we need the path. Yeah, so this is gonna be data references taxcat.zip because again, this is gonna get dumped into data references. So we do need the path. We need to tell it where we wanna put it. You'll recall that this all was to be some error tracking. If the unzip didn't work, we didn't want it to silently fail. We wanted to give it a loud fail. And instead of touching the target, what we'll do is we'll move data references, categories.dmp to data references, forward slash sppspplookup.tsv. And that should be good. I'm gonna copy this so that I can create my make file rule and maybe, let's see, up here where I'm still getting, try to give it some organization, try to put kind of like the downloads up at the top for the reference materials and things like that. And then it's gonna be the code get species spplookup.sh and what we'll then do is the recipe to carry that out will be this and I need to make this executable. So do chmod plus x on that and then I'll run that. And so this downloads so far so good. Looks like everything worked. Let me look at data references and we'll see that I've got my sppspplookup.tsv. That's in there. Gonna go ahead and look at the top of that. So head, data references, sppspp, that. So again, we've got three columns and something I noticed here is that there are no column names. So when we read this in to R, we're gonna need to add column names. When we do that, it's not a big deal, but it's just one more thing to be mindful of. All right, so I'm gonna now go ahead and open up RStudio and we can do that with the Schlossproj, Rproj file. I'm gonna go ahead and create a new R script. Top of this, I'll put library, tidyverse. And what I'd first like to do is to read in the metadata and we'll do metadata, read.tsv, data, raw, rndb. Raw, rndb, yep, I got the wrong one. And from here, we want this 5.6.tsv. Generally, as I'm kind of developing my pipeline, I don't wanna assign it to a variable immediately because I wanna see what the output looks like. Ah, I forgot to load the library. Okay, as we look at this, let's give some more room for the console and we see that this stuff all read in with these different specifications. I'm cool with that. And I'm gonna rename some of these columns so they're easier to work with. So I'll do rename and let's see, NCBI tax ID. I'm gonna rename that as subspecies ID equals that. I'm also going to change RDP taxa, to be RDP, all right. And I, of course, also need my data source record ID because that's gonna be the genome ID is that. And I'd also like to get the scientific name, the NCBI scientific name. And so we'll call that scientific name is that. Now let me run all this again and for some reason it's still remove that, run all this. And what we see is we get a lot of stuff back. I don't really need all this. So I'm gonna go ahead and simplify it and do select. Let's see, genome ID, subspecies ID, RDP and scientific name. And this then gives us a four column data frame. Again, remember, at the top we have these R&DB genome IDs and these were empirically determined not from sequencing to get a sense of what it looks like at the end, I could add a tail and we see kind of what things look like here. And so we've got the genome ID, which links back to our sequence data, the subspecies ID, which we're gonna use to map with that SP, SPP map file. We've got the RDP taxonomy. Actually, I think this is the wrong one. This is only showing the genus name, not the full thing and the scientific name. So let me think. What we wanted actually was the RDP taxonomic lineage. We might like to use the NCBI taxonomic lineage, but this shows up as collogical because it's not populated in the table. So we can't use that. And I don't think it really matters for our purposes. And so here now we see that we've got domain and then if we looked at this for proteobacteria to be phylum. So we've got what we want. We'll go ahead and remove that tail part and assign this to metadata. Go ahead and load that in. So we've got our metadata loaded. The next thing I'd like to load is our SPP lookup. And that's gonna be read TSV. And it's gonna be in data references. And we call that SPP SPP lookup. And again, before we get too far ahead, I ran that and we see that we get that it parses and it gives the columns B7 and because seven was in the third column, it gives it seven one. It's not what we want. Get rid of that assignment arrow for right now. And we can do call names and we can then give it column names. So we'll then say domain and the first column is the species ID and the third column is the subspecies ID. This is kind of running off the screen. So I'll go ahead and put an enter there and run that. And we see that we have domain subspecies ID and subspecies ID. Great, run all this. And so now we have species subspecies lookup. So the third part we need is that taxonomy name to get the species name rather from that Pantaxa file, right? And so we will then do read TSV and it's in data raw R and DB and this was Pantaxa stats NCBI.tsv, running that. We see we've got tax ID, rank and name. And that all looks good. The tax ID, we might prefer it to be like species ID but we won't worry about that for right now. I think what I'll do is I'll actually go ahead and filter this to only have rank equal species. And you'll see that we now only have 4,829 rows versus up here we had like 6,819. And instead of name, I'm gonna change that to be species. Rename species equals name. And running that we now see that we've changed the name there and perhaps we'll do a select where we'll get tax ID and species. Just kind of keeps our data frames simple and I'll go ahead and call this tax. All right, so we've got our three data frames which are going to be the metadata, the species lookup and the tax taxonomy data. So we wanna join all these together, right? So if we look at metadata, we see that we've got a genome ID, a subspecies ID, the RDP and scientific name. If we then look at our species lookup, we've got species ID, subspecies ID. So let's start there and we can do an inner join and we will do metadata, comma, SPP, SPP lookup. And so what this is gonna do it's gonna join the two data frames together. It's gonna look for a column that has the same name in both data frames and what you can kind of see in this output that came to the screen briefly was that it's joining by subspecies ID. If we wanna be explicit about what we're joining by, we could add by equals subspecies ID. One of the nice things about renaming your columns is that it's pretty smart that I could figure out the columns for you. And so again, we do that inner join that brings it all together for us. We can go ahead and select some of the data frames we want. So we want the genome ID. We no longer need the subspecies ID but we need the RDP, the scientific name, domain, or we don't need domain and species ID. Again, we could have done this with negative signs and we've been less typing, but this works. And so again, we've got genome, we've got the taxonomy hierarchy, the scientific name and the species ID. Wonderful. So now we wanna do another join to our taxonomy data. What we could do is create what we have here on 21 and 22 as a single data frame and then do another join. But what I'd like to do is do another inner join directly as part of this pipeline because I don't need the output of 21 and 22 for its own sake. What we can do then is you'll notice here in inner join that we have a data frame on the left, one on the right, and the inner join is looking for a data frame on the left and the right. And so the data frame that we're gonna put on the left is a period to indicate the data flowing through the inner join. And on the right, I'm gonna go ahead and put tax. And we can run all this and you'll notice that we get an error because by must be supplied when x and y have no common variables, arg. So what columns does tax have? Well, it has tax ID, whereas our data frame here has species ID. So again, we could rename our column names like we've done earlier today, but what I wanna show you is a different way to do this. We could do buy like we had above. And the syntax we'll use is to say C, the concatenate function with parentheses, and then in quotes, equaling something in quotes. And the column, the information in quotes is the column name for the data frame on the left. And the second set of quotes is the column name for the data frame on the right. So for the data frame coming through the pipeline, again, this is going to be a species ID and on the right is going to be tax ID. And we run all that and this works, great. So we've got the genome ID, the RDP, scientific name, species ID and species. We're not gonna need the species ID, right? So we could go ahead and drop that. But something that occurs to me is we'd like to double check that everything that we have in our joined metadata and species lookup is also in our taxonomy file. And you'll recall that to do that, I mentioned we could do anti-join. Again, we run this and we get out a data frame with 26 rows actually where the information that's in our metadata doesn't map to that summary data from the taxonomy data. And so that's a bit concerning. And so what we'd like to do is perhaps figure out what these are and then add them back to our taxonomy data. And what I will do is I will add to this account function on the species ID to find out which species IDs are missing. And what we find is that there are four species IDs that are missing. And what I'm gonna do is I'm gonna create a data frame like tax. So if you call tax has a tax ID and a species name, well, I'm gonna create one that's got these four species ID names. And what we'll do is I'll come back up here and I'll also call this makeup tax. And we can create a table with the table function. I'll put this in here. And right now all I want are to get these ID numbers. Do that, put these all in quotes. Oops. And again, this is species ID. And we need the name, right? Or no, the species, sorry. And what I'm gonna do is I'm gonna go ahead and let's see, if we take that whole string and we come back to the taxonomy database, I plug those in up here. Is it gonna give them to me? No, it doesn't like that quotes. Wonderful. Now, it's probably all out of order. So I'm gonna open these up as different tabs and figure out what number corresponds to each. So let's see, Vibrio Atlanticus is 693153. 693, that's the last one. All right, so that's Vibrio Atlanticus. Good there. This Canadatus is 412965. That's the third one. And let's see, what's Elastippes? That's three to eight. Okay. And then this other one is probably Frankincela Orientalis. I love bad trail names, they're kind of cool. Makes me wish I would have paid more attention in Latin class. Anyway, we'll go ahead and close that. And so now what we've got, well, we've got an extra cop, parentheses. Let's see, let's see, extra comma 1234. Ah, I don't have a comma after my first thing. Okay, so that works. That works. And we can look at make up tax. And we've got this new smaller data frame that has four species IDs that we're missing from our tax. And what I'm gonna do is I'm gonna then do bind rows onto tax. So I'm gonna do, again, like we saw with the interjoin, the period comma. So the period represents the information, the data frame flowing through the pipeline to that point. And I'm gonna bind by row wise to the end, make up tax, run that, that looks good. Now let me redo my anti-join, this data frame. And let's see, why isn't this not working? Ah, I suspect the problem is that I've got make up tax as character rather than as an integer. So I'll remove these quotes, run that, run this, and then do that. I think the problem is that I screwed up my tax ID. That instead of species ID, this should be tax ID. And if I come back up here, this is tax ID. Bonus points to you, if you notice that before I did. Now let's do our tax ID, tax interjoin. And then we get out our interjoin having no rows. Excellent. I'd kind of like to use this as a test. And what I'll do is I'll say test equals this. And what I wanna make sure is that as output to this, I wanna say n row equals equals zero. And so my test is going to determine whether or not the anti-join is empty. Because say I come through and I use a new version of the RNDB if it's ever released later this fall, I'd like it for it to yell at me if there's a problem. And so we'll say test true. And what we can do is we can say stop if not test is true. And not, not, not, right? And so what would happen is that it would run through here and the program would keel over and die and yell if test was false. So if there are rows in here, then it would say false. All right, so let's go ahead and take this down here. And so we will then instead of the anti-join because at this point we're ready for it to be the interjoin we can then run all this and we know that we've got the same thing, the same information in both of the data frames we're joining here. The information I'd like to keep to output I will use the genome ID because we need that to join on our RTIBL file, right? We want the RDP, we want the species and we want the scientific name, right? Genome ID not is. So I'm gonna write this out as a TSV to data references and I will call it genome ID taxonomy.tsv. That should be good. And again, if we then look at data references we sort it by time modified. I see this shows up first. And again, if I do head on genome ID taxonomy we see we've got all the information and we've got the information we expect on the backend. Now you'll see this taxonomy string from the RDP is a bit of a mess. And in the next episode, if you come back so be sure you like and subscribe to the video and click on the bell so you know when the video is released. We'll talk about how we can take this column and split it into multiple columns so that we then have columns for everything from the domain or kingdom all the way down to species or even the strain, okay? So be sure you come back. What we can then do, let's go ahead and clean this up a little bit. So I'm gonna save this as into code. And I'm gonna call this, let's see, get genome ID taxonomy.r. And in my files, I'm gonna go ahead and look at previous R script I had to get my shebang line and all this other stuff. And again, the file name is get genome ID taxonomy.r. And so the input are gonna be several different files. So this will also be important to gather the files that we are using. So when we make our rule in our make file. Let's see, so there's that. There's this, the lookup file. And there's also the pan taxa stats file, right? Good, so we'll save that. The output then is TSV containing the genome ID along with taxonomic information. And so that'll be good. And I'll go ahead and put that in there. What the actual name is. So this data references file, right? Come back to my make file. Let's put this here. And the script that we used was code, get genome ID taxonomy.r. And the other prerequisites for this are all, where'd you go? Right here. So lots of copying, pasting. Really cuts down on typos. As you can see, if you've watched any of these videos in the past, you know that I am very good at typos. Bump that back. And we need to put backslashes at the end of each of these lines to say that the lines all go together. That's great. I need to make this R script executable. Make sure everything is saved. And chmod plus x that. And now I can, I'm gonna go ahead and test the make file by trying to make this. And maybe I'll double check what it's gonna try to do. It looks like I made a couple of changes along the way. And this is probably gonna trigger, may trigger generating a bunch of other stuff. That's okay. We'll go ahead and do it right. Run that. And we're good. I'll double check it's up to date. We're in good shape. And get status. And we see that we've modified the make file and made two other files. So I'll go ahead and do get add make file and then code get genome. Code get sp. Okay. Get commit. And then I will say create a genome to taxonomy mapping file. Closes number 24. Get checkout master. Get merge is 24. Our studio is freaking out because I've moved things. Close that. And merge that. We're in good shape. And then we'll do get push. And that should get up to GitHub. Close out these browser tabs. And we see that it went ahead and closed it. So again, there's a lot going on in this episode. But the thing I really wanted to focus on and emphasize today was those joins. Now the inner join is really powerful for taking different pieces of data and merging them together. And you can imagine the next step might be to take our genome ID taxonomy file and join that to our count tibble file, right? To join those together so that we could then do a group by species or do a group by any other taxonomic level. And so again, these joins are super powerful and it really makes dplyr just such a useful function. I'm sorry, a set of functions, a useful package. It's just really helpful. And also we talked about anti join which allows us to see what's missing on the what's found in the left side that's not found in the right side and vice versa if you flip the order of those columns. So go ahead and play around with those joins with your own data. I'd love to see perhaps how you're thinking about blending together different types of data. I've used this in the past when I've got say observations or say I've got species richness data. That's perhaps been generated by mother and I've got metadata and I wanna build a plot where I say plot the species richness and I wanna color it by say patients diagnosis or by some type of treatment group, right? So being able to merge our data from one source like say mother or from experimental data with metadata or attributes of those samples is really powerful for grouping things by making plots and any other thing you can think of. So play around with that. I'd love to see what you're coming up with and how you're using it. Go ahead and put it down below in the notes to this episode. If you've got any questions, also feel free to put questions down there. Be sure that you like this video, hit the thumbs up, subscribe to the channel, click on the bell so you know when that next episode comes where we'll talk about separating out that RDP column to the different taxonomic levels. Keep practicing. I believe you can do this stuff. You can. We'll see you next time for another episode of Code Club.