 Hey there welcome back for another episode of Code Club. I'm your host Pat Schloss. In the last episode We used interjoin and anti-join to get the species name for each of the genomes in our R&DB data set We did this by joining our metadata file with a mapping file from the NCBI that linked the subspecies taxonomic ID With the species ID. We then used another file from the R&DB to link the ID to a name All this information will be useful to us as we try to see how many Amplicon sequence variants Also called ASVs can represent a single species and how many ASVs are found across multiple species That information will be critical as we try to evaluate the claim that ASVs contain species level information But I'm concerned also that we might find the ASVs in multiple genera families and even possibly lower level taxonomic groupings To determine the specificity of ASVs at each taxonomic level or rank We'll need to collect the data for those lower level Taxonomic grouping in our metadata file We have a column called RDP which contains the output of classifying each of the 16s R&A gene Sequences against the ribosomal database project or RDP database For each genome the data in that column contains the name for each taxonomic rank Along with its rank the name of a rank like Phylum class genus Separated by a semicolon the name of the rank and the actual rank are separated by vertical lines also called pipes What we need is to take the strings in the RDP column and Separate them by the semicolons to separate the data for each rank We also need to separate the data for each rank by the pipe character So that we know which rank each name belongs to so this is what we're gonna do in today's episode We'll learn a new function called separate which will allow us to Separate the strings in the RDP column by semicolons and by pipes to create new columns in our metadata file Along the way though, we'll see that the data aren't formatted as nicely as we might like and we'll need to use the pivot longer Function we learned in an earlier episode as well as its complement pivot wider Ideally, I'd only cover one big function per episode, but it's also important to show how these functions fit together Besides, I think you can do it your big kids Even if you're only watching this video to learn more about R and don't know what a 16s rRNA gene is I'm sure you'll get a lot out of today's episode Please take the time to follow along on your own computer if you haven't been following along but would like to welcome Please check out the blog post that accompanies this video where you'll find instructions on catching up Reference notes and links to supplemental material the link to the blog post for today's video is below in the notes Let's go ahead and see if we can split apart that RDP column into our different taxonomic levels I'm at my home directory again. I can use my alias to come to my project root directory I see I'm on the master branch I'm gonna come over and I filed a couple of issues here in good hub The first is for today. So splitting apart this one here issue 25 splitting apart the RDP column Into different columns for each genome. I also made one that will tackle on the next episode for tracing back the NCBI taxonomy So I figured We've got the RDP. We've got the NCBI information. Let's see if we can't also trace back the NCBI taxonomy But we'll save that for the next episode. So coming into today's issue of separating the RDP sequence classification into different columns for each genome We see that we're gonna modify the get genome ID taxonomy dot r script that we wrote last time to split the RDP column from That data references genome ID taxonomy that TSV Into separate columns indicating the kingdom phylum class order family genus for each genome And then we're gonna rename the output to be genome ID underscore RDP dot underscore taxonomy and I'm wanting to do this because I know that in the next episode I'll do NCBI and so that I'll output as NCBI taxonomy dot TSV And then I want to make sure that we update our make file. So again, this is issue 25 I'm gonna go ahead and create that issue branch. So get a branch It's 25 Get check out issue 2025 Good, I'll go ahead and open up my project in our studio Here we are. Let me double-check Here we are. Let me double-check our working directory get WD and we're in the right place. Okay I'm gonna go ahead and open up my get genome ID taxonomy file and We see this here on the left side of my window To get going I'm gonna go ahead and run some of these earlier steps that we We did from metadata down through Test and testing and we come through and everything works well if test had been false. This would have failed Let's see what happens if I do test if not false It says error false is not true, right? So it would have said error true is not or test is not true and it would have stopped the script there But it did and so we're in good shape All right, so this is where we're going to be looking at this chunk of code For now, I'm gonna go ahead and take this part where we write it out to the file and I'm gonna bump it down a few rows to get it out of the way and I'm gonna run this chunk of code and What we get Is this tibble and you'll recall that the first Several hundred lines of this of the tibble start with genome IDs RNDB v3 and then a number and These were made and these were The data we're getting is coming from a database that isn't so interested in the sequences But in the number of copies of the 16 s gene in a genome the RN Operon has the 16 s 23 s and 5 s genes And so because these were determined experimentally empirically they didn't get a sequence and so they didn't run it through the RDP There's not a classification and So these aren't really that interesting to me So to get rid of these and so I I can see the RDP column for you know Something actually has data. I'm gonna go ahead and use filter and I'm gonna filter out things I'll use is not na on RDP And so if I run this I'm gonna get all of the 205 rows that have RDP as na now. I don't want those I Want where is na RDP is? True right and so filter keeps things that are true So if I put an exclamation point before the command it's going to flip the logical argument So things that are false will become true True will be false and so filter will keep things that are true And we can go ahead and run this and see that we then get back 15,571 genomes and we have their RDP taxonomy. So good. That's Not a big deal, but helps to clean up the output a little bit so The next thing that we want to do is to separate out this RDP column now because This is going to get pretty messy pretty quickly down here. We already have as much information width wise that Our studio will let us handle for the time being I'm going to go ahead and remove the species and scientific name again only for testing purposes So I'll do select and I'll do minus species minus scientific name Now don't let me forget to remove this line in the future And so again, we see that we've got genome ID and RDP great so The command that I want to really focus on today is called separate so separate allows us to take data coming through our pipeline and Tell it what column we want to separate. So we'll do call equals RDP and Then into which columns we want to export it into that we want to separate it into and so this Becomes a vector and we can give it names of those new columns. And so I'm going to say kingdom phylum class order family genus and This is kind of running off the screen. So I'm going to put in a line break to make it easier to read and then Let's run it like this and see what happens and so We get some morning messages that we'll talk about here in a minute but separate tries to be really smart and really help us out and what it's doing is it's looking for a common element across the string that it can use to separate things and You recall in my introductory comments I commented that we're going to separate by the semi colon to split apart This initial chunk from this chunk and this chunk and this chunk But what it's done because we didn't tell what to tell it what to separate on here It's separated by the pipe as well as the semi colon Okay, so we want to be explicit about what we want to separate on and at this point What I'd like to do is to sep on the semi colon And if we run that Um what we're seeing we're getting there, right? So it's split into the different taxonomic levels that looks good But what you'll notice is at the beginning there's a space And so the separator we actually wanted to use is semi colon space We run that And now what we get are kind of the nicely fairly nicely formatted Names and it's split it by the taxonomic level taxonomic ranking Now if we look at these warning messages, we see two types So the first says expected six pieces Additional pieces discarded in 1693 rows Okay, so that would mean that there were more than six levels for about 1700 sequences In the second warning it says expected six pieces missing pieces filled with na in 685 rows So that would mean that some of the sequences some of the genomes didn't have taxonomic information All the way up to the genus level or perhaps it's missing one of the middle levels So let's explore this a little bit further and see if we can clean it up One idea that we might think about would be to say well, what if we You know if we're missing levels Perhaps we could say well if you have a phylum level, but not a class level Well, we could look for phylum and then add A class that's like na or an unclassified class Let's look at this a little bit further and see what's going on One of the commands that we can use to look at a specific line in our data frame is slice So we can slice Four and that was the first one that tells us That there were additional pieces so we'll go ahead and do slice four And what we get out here is phylum class order family genus So those are discarded. So maybe we want to look up a level So let's go ahead and Move our slice up higher Before we do the separate Right And so what we get here is this But maybe what we'd like to do is get the whole thing and so to get the whole thing we can actually then do Pull rdp and that will return the rdp column as its own vector And so this is what's contained in This cell of the data frame And so we have bacterial domain ectinobacteria phylum ectinobacteria class ectinobacteria Subclass ah, so we've got a subclass and a suborder So that's interesting So we've got more than kind of the traditional linean taxonomy and so it appears that we might have These two extra levels going out to the eighth level. Okay So that's something to be aware of one thing we might do is if we knew things like str replace we could Search for this pattern and remove things that have suborders or subclasses to force everything to be six columns Okay, so that was one of the errors we were getting and so that clarifies that a bit that We had additional pieces because We have these suborders and subclasses What about these ones that were missing pieces that were filled in with nas? So let's look at say 65 And what this shows us is that in this sequence we have the domain the kingdom phylum class But no order and we have a family and a genus Let's look at another That's just because my eyes went there. Let's go to 231 run that and It's getting very similar. It's the same same type of thing Let's do 108 and so this is has kingdom phylum class order family, but no genus And so this this is giving me a little bit of the willies because we know the genus name Of this organism because it's got a genome sequence, right? So if it's got a species name, it's no doubt got a genus name and so that that's missing Eh, maybe it makes me a little bit reluctant to use the rdp So anyway, so we've got two problems We've got the problem that we have extra levels texanomic levels Of these suborders and some classes. We don't necessarily know which sequence has each We could search through all of our rd through that rdp column Find the rows that contain suborder subclass and then remove that We also have cases where we're missing levels in the middle. So like we're missing I forget what it was even now. We're missing the the order level for this cyanobacteria But we're also missing things at the end and I suspect that there are genome sequences from like candidate phyla Or yeah, where we we just don't have much information kind of further down, right? So this might be fairly common and as we saw from our error message up here That we seem to have These the suborder subclass problem in about 1700 sequences and that we're missing information in about 700 so instead of going through the process of Using string replace to remove or to add in different texanomic levels I think what I'm going to do Is I'm going to go ahead and Split this apart split apart like we did Maybe give it more generic column headers And then we're going to use pivot longer to take those Those columns and concatenate them on to the end of each other. So let's give that a shot so Instead of these named levels, I'm going to go ahead and put in eight levels because I think I think Uh, those are the only levels that we that we have so Let me go ahead and remove these and what I'm going to do is call it rdp underscore a And why I use that rdp underscore? I'll explain uh in a bit c d e F so that's three six. So we need two more So let's go ahead and run this that's eight levels that should cover everything so again It says expected eight pieces Additional pieces discarded in 199 rows So maybe we'll look at these one more time to figure out what's going on there And then expected eight pieces missing pieces filled in so those would be things that only had six levels, right? So we see a lot more there Let's look at seven nine six Kind of like we did above here and see what's going on there. Whoa. This looks huge So we have bacteria phylum class order family genus Ah, but then we have three vertical pipes And then we have the same thing again So what I think is happening here, and I'm cheating a little bit because I did some research Is that what they did was they took the 16 s sequences from these genomes? And so if this sequence had multiple 16 s genes they classify them, you know, they get different classifications for the different Operons then they reported them together separated by that vertical pipe And so you'll see that one classified down to pectobacterium and one only went to anaerobacteria ACA And so that's 199 of the 15,000 some Genomes have this problem for now what I think I'm going to do is I'm going to change that rdp column and make it An na domain and and what we'll probably end up doing is just kind of throwing away Yeah, I don't know what we'll do for now for this episode what I'm going to do is throw it away I'll think about it and maybe we'll come back and deal with it in a future episode So let's see so let's go ahead and say We're going to filter and Filter where we can do a string detect So str detect we give it a string So it's going to be rdp and the pattern that we want to find is The three vertical lines, but that vertical line means a special thing in a pattern. It actually means or So to match the actual vertical lines, we're going to Use those double back slashes and if we run that We then get back the hundred and I guess there's 208 so maybe Yeah, so there must have been a few that had that triple back But that when you looked at the total number Of columns that were generated by separate was still under eight So there's 208 genomes in here that Have kind of a nebulous classification and Again, it seems weird that you would classify it classify the sequences back to the rdp when Their sequence genomes we should know what they are So i'm going to think about what we'll do for this in a for a future episode But knowing that the next episode we're going to do with the the ncbi taxonomy that might be the route we take Anyway, um, these are the things you learn as you start digging through your data So again, we don't want to string detect So again, we can use that exclamation point to not match string detect To not match this pattern and pipe that into separate And this should work and we should no longer have that message that we have too many columns So it tells us that it it filled in missing pieces with na in those 14078 rows Excellent now um What we have at this point as you see is that we have the genome id the rdp columns out to h um for all of our samples and What we're going to do is we're going to pivot longer And the columns we want so calls Equals and what I'm going to use is a special function called starts with so I could write out rdp a b c d e f g h But instead what I'm going to do is starts with and I'll say rdp underscore So I want to get those columns that start with rdp underscore And I'm going to then say Names to and I'm going to call that rank and values to I'm going to say A name And so if we run this We again get this uh complaint that missing pieces filled with na um, what we can do Before I well, let me tell you what we see and pivot longer is that we now have instead of a Nine column data frame. We now have a three column data frame where we have the genome id We have the rank and we have the name so we're winning right so I hate warning messages And so we can add another argument to separate And that is to use fill And then write the value write And what that will do is that will automatically put na values on the right in the columns on the right If it needs to add information and if we run that we see that our warning message goes away We also now see that we have these na values and what we can do is Remove those na values with filter name Is sorry not is that na a name And so again that will get rid of these values where the name was na And I'm just kind of thinking Yeah, that that should work well And now when we run this we see that our data frame has removed Those rows from the data frame where the name was na What we'd now like to do is get a column where we have the taxonomic name as well as the taxonomic level Can you think what we might do here? Yeah, we're going to use separate again. Okay So we'll say separate And here we're going to do separate name And we're going to do it into and we'll say tax on name And we're going to do tax on rank okay, and I didn't give it a delimiter to separate things on And I'm seeing that it's kind of complaining about things And I think that's because there were extra delimiters that we saw here Um, so I think it's maybe trying to to separate like proteobacteria On that backslash or on those quotes so I'll go ahead and give it the sep And we're going to separate on the vertical pipe And that works great, right? I see a misspelled tax on rank there And I also have this rank column which I don't need so I'm going to do a select minus rank And then pipe that and Great. We have our genome ID our tax on name our tax on rank. So I don't really like these quotes in the names So what I'm going to do is I'm going to add an str mutate So mutate changes columns remember? tax on name And I'll mutate that to be str replace all And what we're going to match our In tax on name, uh, we're going to pattern I'm going to say, um I always forget what to do with back with escape characters. So I think we can do backslash double quote and then replacement equals uh A pair of quotes and so I'm using single quotes here to wrap the double quotes And and we'll see if this works Whoa, um Replacement is missing with no default I put the parentheses an extra closing parentheses Ah Something weird happened. Let me go ahead and escape out if you get that plus line in our studio hit the escape key and you'll get back to a good prompt I'll go ahead and rerun this So for some reason I see what the problem is. I'm missing a closing parentheses Okay So that should work and again, I've got this plus line. So I hit escape to get out Come back up here rerun that And great that actually worked. I didn't have to worry about all the crazy escape characters. And so you'll see that like The phylum level names no longer have those double quotes on them Maybe another thing I'll do um Is add another Tax on name another Modification of tax on name. I don't really like having these spaces in the tax on names So what I'm going to do is str replace all So str replace replaces one instance of that. So it would have only removed like the opening quote mark Um str replace all removes all occurrences. So I'm going to do tax on name pattern equals Uh a space and I'm going to replace that with an underscore And if we run that then our bacilli aca One Should have an underscore and it does so that's great. Okay So now we've got our genome ID our tax on name and our tax on rank What I want though are these tax on ranks to be the columns and the tax on names to be the values But before I do that, I'll recall that we have a bunch of um Or not a bunch but a couple uh sub levels So I'll pipe that out to count and I'm going to count the number of times that each tax on rank occurs For now, I'm going to go ahead and this is just annoying me every time it goes down there And so we see class domain family genus order phylum subclass suborder, right? So um I want to get rid of these values before I Convert those names to column headings What I'm going to do is I'm going to do another filter And I can do um tax on rank In and I think we've seen this in a previous episode In a vector of names that I want to keep so I will keep uh, um domain um phylum class order family genus And now at the end of running that if I do count tax on rank I should only see these six levels And sure enough, we only have those six levels now So we've got things in good shape. All right So we've got our our our data frame here Um, I guess commenting that out didn't stop it from jumping ahead Whatever. I'm sure there's some setting I can change so that when I run a code chunk it stays there, but that's okay um and What we want to do anyway back to back to this is we then want to Make the column headings The values from the tax on rank column and the values the The the tax on name values. So we've seen pivot longer up here The complement of pivot longer is pivot wider Okay, and so the id the We can give the columns that we want to pivot longer um, and and the key here is that we want to give it Um columns that will allow it to know what's unique or um another way to say it would be that we Um, we wanted to know what rows to clump together as we pivot longer. Okay, but if we give it names from So what column to take the names from and what columns to take the values from Then it will use everything else as kind of that Unique key for clumping together the different rows. So our names from is going to so that's the name of the columns So that's going to be tax on rank And our values from will be tax on name Okay, so when we run this we should then get a genome id column a column for domain Uh phylum class order family genus. It's not going to be in that order. It's going to be an alphabetical order, but but that's fine So we'll see if this works. I am feeling good domain phylum class order family genus Winning so it did I did it put it in the right order. That's good. So I suspect it's using um Yeah, I'm not sure why it got that right, but hey, that's awesome. Um Yeah, so uh, so that's good. We've got the genome id the domain phylum class order family genus one last thing I want to do Uh, you'll see that people use domain or kingdom I prefer kingdom or that's just I don't know that I prefer that's just why you always use So I'll do rename So rename kingdom equals domain And this will change the column heading For domain to be kingdom. Okay, great. Now remember we kind of Hit a bunch of information way up here and you were supposed to remind me to put it back in um And where did I do that? So that was right here the select command So I'm going to comment that out and we want to make sure that everything works with all the data And no error messages. We see the genome id the species scientific name and then the kingdom phylum class order family genus wonderful Now we can finish this all off By outputting this and we're going to change this to be genome id underscore rdp underscore taxonomy. Okay So the the thing that still has me a little bit worried about all this is how we we removed um Those lines and really those genome sequences that have that Triple vertical bar And what what i'm doing here is i'm really removing All the information i'm also removing the the genome id for that and these actually might be samples that Are genome sequences that kind of prove the point that asvs aren't all they're cracked up to be Because these are going to be cases where we have 16s sequences that don't classify the same way at like the genus level And if they don't classify the same way at the genus level, well Then perhaps they're not unique to a specific genus And that's something that that we'd want to keep So instead of removing those for right now i'm going to change this to be a mutate And what we can do as we saw previously So we'll mutate rdp rdp equals str replace And we're going to do it on the rdp and the pattern is going to be those three lines But i'm going to do a period star Um, and then period star. So i'm going to match everything that's on that row for the rdp And i'm going to replace Uh with I'll do na vertical line um What uh domain, okay So this at least will keep the geome sequence in our data set if I run this That all goes uh, let me Come back down here and do a count on kingdom and run that We see that we've got um na 208 bacteria 15307 If I do it on phylum, uh, let me go ahead and print all the rows so we can do print n equals inf So this will print out all the rows from our data frame We see that we have an na here of 225 And so that's good One thing i'm noticing though is that the na here is red Which tells me uh, that it's actually being used as an na value rather than the string na And so if you recall back here from count kingdom That the na was black and so the black na tells you that it's being treated as a string or as a character Rather than as an actual na value what we can do in separate Is that Sometimes you might imagine that you've got a situation like this where we had say a number Vertical bar and then another number So because it's got that vertical bar in there It's coming in as a character or string variable and then when it separates it Those two numbers will also be string variables. Well, here we have an na Vertical line and then domain and so that's treating that then as a string variable And when we separate it the na is being treated as a string and the domain is treated as a as a string We can add an argument to separate to say convert Equals true and what that'll do is that once it's separated and create the new columns It'll then convert the columns to whatever makes sense So if it's a bunch of numbers that column will go from being a character column to being a numerical And it will convert our string na to the value na And so if we run this what i'm betting is that our na column now is red. Okay, so again We've talked about a few slick features with the separate function Again, we give it the column that we want to separate the names that we want to separate it into The key that we want to use to separate We can tell it to fill values to the right with na's if there's not enough information The other thing that we saw was using convert equals true to convert these separated values into Whatever is the most logical at least according to r We've also seen how we can integrate this With pivot longer and pivot wider along with a lot of the other dplyer functions that we've been using So this is great Let me Go ahead and remove this line 67 Bring up this right tsv Save my script We'll then come back over to Our terminal and let me go ahead and open up atom so I can modify my make file So here in my make file. I had this data references genome id taxonomy. So let me rdp taxonomy and This here all looks good I'm going to go ahead and save this And to test everything I can come back to my terminal and do make on that and Everything seems to be working well It outputted and if I do head data references um And this was called genome id rdp Taxonomy if I look at the head I now see that I've got my genome id the species scientific name kingdom phylum class order family genus If I wanted to I could have Resorted things to put species at the end, but who cares great. So this looks good um Something I did see in data references was the old version of this which was genome id taxonomy I'm going to go ahead and remove data references Genome id taxonomy This is not being tracked by git remember. So I don't have to do anything special to remove it And if I do git status, I see I changed two files my make file and my get genome id taxonomy I'll do git add Makefile git um and then code get Genome id taxonomy dot r git commit dash m and I will then say Separate out rdp taxonomy Closes number 25 git checkout master git merge Issue 25 That all worked and I will then Redo my make Because again when we merge branches it changes the timestamp So we can do make data references rndb or not Um genome id Okay So this is the last test I want to do to make sure this all works before I push the changes up this all looks good and Now looking at github. I see that this is closed I'm going to write a brief comment here even though it closed the issue to say One lingering problem is that some of the records Had more than one classification We turned these into na values although We know that's not correct We'll think about what to do and so I'll comment And we're good So for the next episode what we'll do is we'll take on how to parse out the ncbi taxonomy The more I think about this problem that I just commented on the more I think we're going to want to go back to the ncbi taxonomy and not use the rdp taxonomy again The the problem that we're running into is that Because many genomes have many copies of the 16s gene. They're not all identical. They don't all classify to the same entity When the taxonomy is for an organism It's not for an actual sequence the ncbi taxonomy is organismal the rdp taxonomy As being applied by the rndb is actually For the each individual sequence you might recall That when we originally looked at the metadata file there was a column for ncbi taxonomy But it was full of na values. There was nothing in there So we'll have to go about doing this ourselves And so this will allow us to see our good friend from the last episode inner join And then the opposite of separate which we talked about today, which is unite which allows you to bring columns together So stay tuned for the next episode To get word of when that happens be sure that you like and subscribe this video Subscribe to the channel click on the bell icon so that you're notified when the next episode is released I'd love to see how you're putting these commands into use as we go further and further You'll see that i'm kind of digging back from previous episodes to work with functions in the current episode and integrate them with other tools That's really uh to me the joy of programming is the ability to use all these different functions in different contexts And in different situations to solve problems So keep at it Feel free to leave a comment below in the notes telling me what you're doing if you have any questions Or if anything's not clear i'm happy to go back over this stuff We will see things like separate pivot longer pivot wider as we go forward uh in future episodes So don't be afraid if if this is all new to you and you're a bit confused Stay with me like and subscribe. We'll see you in the next time