I believe you can answer your own data analysis questions. Do you? You should. Stick around for another episode of Code Club. I'm your host, Pat Schloss, and it's my goal to help you grow in confidence to ask and answer questions about the world around us using data.
A lot of bioinformatics is getting data from one format into another. Sometimes that conversion can be pretty sophisticated, and other times, pretty simple. For example, I have a file with a bunch of aligned sequences, and I want to know how many times each sequence occurs in each genome that it comes from. I know that the header for each sequence in my FASTA file has the information I need to know which genome the sequences come from, but I don't know which field it is. To figure this out, I'm going to use some of my favorite tools. These tools will include grep, which we saw in the last episode, as well as three new tools: cut, sort, and uniq. Much of what we'll do today, you could also do in R or Python. However, using these bash commands will allow me to get my answer in a single line of code, whereas R or Python will require a lot more effort and overhead.
The cut command cuts text out of a line based on how we define different regions of that line. For example, we've seen that the header line starts with the genus and species name of the organism that was sequenced. We previously used sed to put an underscore between the genus and species names. Well, we can use cut to extract the genus name by pulling out anything that occurs before those underscores. The sort command sorts the lines that we give the command. We can also sort lines based on specific fields within the line. For example, we could sort header lines based on the GenBank RefSeq accession number or on the genus name. Finally, the uniq command collapses a set of lines to return the unique lines.
We can also ask it to count the number of times each line occurred in the original set of lines or to return to us the lines that were duplicated. Hopefully, you can see from my description of these three commands that we could count the number of genera represented in the dataset, perhaps the number of 16S rRNA gene copies per genome, or answer numerous other questions about our data. We'll start by using these commands to help us figure out which genus is the best represented in the rrnDB, and then transition into using these skills to help us create a unique identifier for each genome. This will allow us to use mothur's count.seqs function to count the number of times each unique 16S rRNA gene sequence appears in each genome and the number of times it appears in more distantly related genomes. If we can generate this information, we can use it to quantify how specific amplicon sequence variants, or ASVs, are to a genome, species, genus, or any other taxonomic level.
Even if you're only watching this video to learn more about bash commands and don't know what a 16S rRNA gene sequence is, I'm sure you'll get a lot out of today's video. Please take the time to follow along on your own computer and attempt the exercises. Don't worry if you aren't sure how to solve the exercises. At the end of the video, I'll provide solutions. If you haven't been following along but would like to, welcome. Be sure to subscribe to the channel and click on the bell icon so you know when the next episode is released. Feel free to leave a comment, even if it's just to say hi. Please check out the blog post that accompanies this video, where you will find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's video is below in the notes.
To motivate our use of cut, sort, and uniq, I'm going to do a little bit of exploratory data analysis on the sequences that we've been using in our previous episodes of Code Club that were taken from the rrnDB.
Recall this is a database of 16S rRNA gene sequences that provides us with the sequences that occur in each of the genomes. Many bacterial genomes actually have more than one copy of the 16S rRNA gene. One of the questions that's motivating the arc over many episodes of Code Club is how unique are those sequences to each genome? And one of the first questions is how many genera are represented in the database, and which genus is the best represented?
For today's tutorial, I'm going to work with you all using our aligned v19 sequences. You'll recall that if we look at data/v19, we have this rrnDB.align file. As we talked about last episode, we can count the number of sequences using grep with the greater than sign because that's the first character in the header of each sequence. Again, if we do head data/v19/rrnDB.align, we get a lot of output. But if I scroll back up to the top here, you'll see that we have this header line that starts with the greater than sign. The next sequence that occurs here, Bacillus mycoides, also starts with that. So we can count the number of sequences by counting the number of greater than signs that start a line. We saw that we could do grep, quote, greater than sign, quote, data/v19/rrnDB.align. If we pipe that to wc -l, we can count the number of lines. And we see that there are 76,574 total sequences in our full-length v19 dataset. Excellent.
Now, as I also showed you in this header output, the line starts with the greater than sign. We then have a genus and species name. These two don't, but some of them will have a strain name appended to the end of the genus and species name, again separated with underscores. We don't need to worry about that just yet. Instead of counting it, let me output it to head so we can see the first 10 entries.
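As a rough sketch of the counting step just described, here is the same grep and wc -l idea run on a three-sequence toy file. The headers and sequences below are invented for illustration; only the field layout (name, assembly accession, RefSeq accession, element, coordinates) follows what the video describes.

```bash
#!/usr/bin/env bash
# Toy FASTA file; every header and sequence here is made up for illustration.
fasta=$(mktemp)
trap 'rm -f "$fasta"' EXIT
cat > "$fasta" <<'EOF'
>Salmonella_enterica|GCF_001|NZ_A1|chromosome|100..1600_+
ACGT-ACGT
>Bacillus_mycoides|GCF_002|NZ_B1|chromosome|50..1550_-
ACGA-ACGT
>Bacillus_mycoides|GCF_002|NZ_B1|chromosome|9000..10500_+
ACGA-ACGA
EOF

# Keep only the header lines, then count them (tr strips any padding from wc)
n_headers=$(grep ">" "$fasta" | wc -l | tr -d ' ')

# grep -c counts matching lines itself; ^ anchors the match to the line start
n_anchored=$(grep -c "^>" "$fasta")

echo "$n_headers sequences ($n_anchored lines start with >)"
```

Anchoring with ^ is optional here, but it guards against a stray greater than sign showing up later in a line.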
And yeah, sure enough, we see Salmonella enterica subspecies enterica, serovar Inverness, strain, blah, blah, blah, right? There's a lot of other information here. We also talked about how there are these vertical pipes that demarcate different fields within the header. So we have the organism name. This GCF is the assembly accession from GenBank, and the NZ number is the RefSeq accession number from GenBank. Then there's information about which chromosome or which genetic element it came from in the genome, and then the coordinates in the genome where the gene was identified.
What I want to know is this genus bit of information: how many different genera occur, and which genus is the most abundant. As I talked about in the introduction, we can use cut to identify fields and to pull a field out of all of these lines. So instead of using head, I could use cut, and I will use -f to tell cut which field I want. I'm going to take field one, and then -d to indicate the delimiter, and in quotes I'm going to put an underscore. So the delimiter that defines my fields in this example is the underscore. Wherever you see an underscore, that defines a different field for each line. And what I'm taking is field one, the first field that results from cutting our lines at the underscore. Now, there are many other options with cut. You can cut based on the number of characters, number of bytes, all sorts of other things. I personally find that using a delimiter to define fields works really well with the data that I'm commonly interacting with.
Let's go ahead and pipe that to head so we can see what type of output we get. I like to use head because, again, there are about 76,000 sequences. If it just vomits all the data out to us on the screen, it's kind of hard to figure out what's going on.
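Here is a small sketch of that cut call on a toy file, assuming the same underscore-separated organism names. The headers are invented; the flags (-f for the field, -d for the delimiter) are the ones named above.

```bash
#!/usr/bin/env bash
# Toy FASTA file; headers and sequences are invented for illustration.
fasta=$(mktemp)
trap 'rm -f "$fasta"' EXIT
cat > "$fasta" <<'EOF'
>Salmonella_enterica_subsp._enterica|GCF_001|NZ_A1|chromosome|100..1600_+
ACGT-ACGT
>Bacillus_mycoides|GCF_002|NZ_B1|chromosome|50..1550_-
ACGA-ACGT
EOF

# Field 1, with fields defined by the underscore, is the genus name
grep ">" "$fasta" | cut -f 1 -d "_" | head

# Capture just the first line to look at it more closely
first_genus=$(grep ">" "$fasta" | cut -f 1 -d "_" | head -n 1)
echo "first genus field: $first_genus"
```

Note that the greater than sign stays attached to the genus name, since cut only splits on the delimiter it is given.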
Also, I can get a sense of what I'm doing by only looking at the first 10 lines. So again, I use head to help me develop these types of pipelines. And sure enough, we now have a listing of the genus name for every sequence in our dataset.
What occurs to me, however, is that what I want to count is the number of genomes in the dataset. I want to count the number of genera represented by genomes, and if each genome contributes multiple sequences, then I'm going to have multiple names here for each genome. So what I need to do is first get the unique genome assemblies, while keeping track of who each one came from. Once I've got the unique assemblies, I can ask which genera they come from and then count those genera. So I need to back up a step. Instead of going straight at the genus, what I really want is this GCF field, okay?
So I'll modify this. Again, I want to keep this grep so that I always get the header rows. What I'm going to use as a delimiter instead of the underscore is the pipe. I will take field two, and then in quotes I'm going to put the pipe character, and then I'll ship this out to head. I see now that I get my assembly accession numbers, but I don't have my taxonomic information. So what we could do instead is say 1,2, and this will give us fields one and two as defined by the vertical pipe. If we run this, we see that we get both our taxonomic name and the assembly accession. This will be a unique combination, which I could then send to uniq and then to head. But the problem with this is that uniq only compares successive rows of data. So what we need to do instead is first sort the data and then uniq it.
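To make the "uniq only compares successive rows" point concrete, here is a hedged sketch on a toy file where two copies from the same genome are deliberately not adjacent. All headers are invented for illustration.

```bash
#!/usr/bin/env bash
# Toy FASTA file; the two K12 copies are intentionally separated by another genome.
fasta=$(mktemp)
trap 'rm -f "$fasta"' EXIT
cat > "$fasta" <<'EOF'
>Escherichia_coli_K12|GCF_001|NZ_A1|chromosome|100..1600_+
ACGT
>Bacillus_mycoides|GCF_003|NZ_C1|chromosome|50..1550_+
TTGA
>Escherichia_coli_K12|GCF_001|NZ_A1|chromosome|5000..6500_-
ACGA
>Escherichia_coli_O157|GCF_002|NZ_B1|chromosome|200..1700_+
ACGT
EOF

# Fields 1 and 2, split on the pipe: organism name plus assembly accession
grep ">" "$fasta" | cut -f 1,2 -d "|"

# uniq alone misses the repeat because the duplicate lines are not neighbors
without_sort=$(grep ">" "$fasta" | cut -f 1,2 -d "|" | uniq | wc -l | tr -d ' ')

# sorting first puts identical lines next to each other, so uniq can collapse them
with_sort=$(grep ">" "$fasta" | cut -f 1,2 -d "|" | sort | uniq | wc -l | tr -d ' ')

echo "uniq only: $without_sort lines; sort then uniq: $with_sort lines"
```

On this toy input the unsorted version keeps all four lines, while the sorted version collapses to the three distinct genomes.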
So I'll sort and send this to head, and we can see that, now that we've sorted it, there are multiple genomes that repeat themselves, right? This Nostoc azollae occurs four times with this assembly, which tells me that there are four copies of the 16S rRNA gene in that genome. For right now, I don't care how many copies there are in the genome. I only care about which genera are represented. We could do sort, uniq, head, and what we'll see now is that we only get one copy back of each of those lines, right? So now we only have one copy of Nostoc azollae in the output.
All right, so we've uniqued the genomes, and now we want to know, for the unique genomes, how many genera do we have, okay? We're getting there. Well, believe it or not, we're going to use cut again in the same pipeline. We could do cut like we saw earlier, -f 1 for field one, and then we'll delimit our line with that underscore. I'll pipe that out to head, and what you'll see now is that we only have the genus names represented. Again, we'll sort, although I suspect these are already pretty well sorted, and then we'll uniq. So instead of having all these Acetobacters, we only have one Acetobacter represented.
But I don't want only the listing of uniques. I want to count the number of times each unique name appears. So I can add an argument to uniq, which is -c, and then pipe that out to head so we can see our counts. What you'll see is that the output puts the number in the first column followed by the genus name. So we see Acetobacter is represented by 28 genomes in the rrnDB, whereas Acetobacterium only occurs in one genome. I'd like to sort this output so that I can see which 10 genera are the best represented in the dataset. Well, hopefully you're saying, I want to sort it again, right? So we're going to sort, and we can use -n to sort numerically.
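Putting those steps together, a sketch of the genomes-per-genus pipeline on toy data might look like the following. The headers are invented, and LC_ALL=C is set only so that the sort order is the same on every machine.

```bash
#!/usr/bin/env bash
export LC_ALL=C   # assumption: fix the locale so sort order is predictable

# Toy FASTA file; headers and sequences are invented for illustration.
fasta=$(mktemp)
trap 'rm -f "$fasta"' EXIT
cat > "$fasta" <<'EOF'
>Escherichia_coli_K12|GCF_001|NZ_A1|chromosome|100..1600_+
ACGT
>Escherichia_coli_K12|GCF_001|NZ_A1|chromosome|5000..6500_-
ACGA
>Escherichia_coli_O157|GCF_002|NZ_B1|chromosome|200..1700_+
ACGT
>Bacillus_mycoides|GCF_003|NZ_C1|chromosome|50..1550_+
TTGA
>Rhodobacter_sphaeroides|GCF_004|NZ_D1|chromosome_1|1..1470_+
GGCA
>Rhodobacter_sphaeroides|GCF_004|NZ_D2|chromosome_2|1..1470_+
GGCA
EOF

# 1. headers -> 2. name + assembly -> 3. one line per genome
# 4. genus only -> 5. count how many genomes carry each genus name
genus_counts=$(grep ">" "$fasta" |
  cut -f 1,2 -d "|" |
  sort | uniq |
  cut -f 1 -d "_" |
  sort | uniq -c)

echo "$genus_counts"
```

The counts come out in the first column followed by the genus name, with Escherichia counted twice because two distinct assemblies carry that genus, even though one of them contributes two gene copies.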
Although in this case, I don't think it really matters whether or not we use -n because every line is going to start with a number. I forgot to put the head on it, and what you'll see is a sorting in ascending order of the different genera. We see that Escherichia, Salmonella, Bordetella, Bacillus, Staphylococcus, Streptococcus, and Pseudomonas are the best represented genera in the database. Now, I'd prefer to do this with head, but that is only going to give me the 10 rarest genera in the dataset. So what I could do is add -r to sort -n, and this will do a reverse sort. It'll put the most abundant ones first followed by the rarest, a descending sort. This then shows us that there are 979 Escherichia genomes in the rrnDB.
Well, how many total genera are there? Instead of head, I could pipe this all to wc -l, and we'll see that there are 1,279 different genera represented in the database. So there are 1,279 genera represented, and Escherichia, Salmonella, and Bordetella are the most commonly represented genera in the database. This only goes to show that our genome sequencing is still heavily biased toward clinical isolates and pathogens.
Looking at the pipeline that we developed, we see that we used cut twice, sort three times, and uniq two times. There's no rule saying you can only use these functions once. What we're doing with the pipe is sending data from one function, say from grep, into cut, into sort, into uniq, and each step of the way we're sending that data through the pipeline to get an answer that we're interested in. As you saw me develop this pipeline, each step of the way I'm using head or outputting the data to the screen to see what's going on in the pipeline, and that's what I encourage you to do. If you wait for the exercises, you'll get a few extra opportunities to practice developing these types of pipelines on your own.
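Here is a hedged sketch of the last two steps, the descending numeric sort and the genus tally, again on invented toy headers. The final two lines are a side illustration: with bare, unpadded numbers a plain sort compares character by character, so -n is the safer habit.

```bash
#!/usr/bin/env bash
export LC_ALL=C   # assumption: fix the locale so sort order is predictable

# Toy FASTA file; headers and sequences are invented for illustration.
fasta=$(mktemp)
trap 'rm -f "$fasta"' EXIT
cat > "$fasta" <<'EOF'
>Escherichia_coli_K12|GCF_001|NZ_A1|chromosome|100..1600_+
ACGT
>Escherichia_coli_O157|GCF_002|NZ_B1|chromosome|200..1700_+
ACGT
>Bacillus_mycoides|GCF_003|NZ_C1|chromosome|50..1550_+
TTGA
>Rhodobacter_sphaeroides|GCF_004|NZ_D1|chromosome_1|1..1470_+
GGCA
EOF

# Genomes per genus, as built up in the pipeline above
counts=$(grep ">" "$fasta" | cut -f 1,2 -d "|" | sort | uniq |
  cut -f 1 -d "_" | sort | uniq -c)

# -n sorts numerically, -r reverses so the best represented genus comes first
top_genus=$(echo "$counts" | sort -nr | head -n 1 | awk '{print $1, $2}')

# One line per genus, so counting lines gives the number of genera
n_genera=$(echo "$counts" | wc -l | tr -d ' ')

echo "top: $top_genus; genera: $n_genera"

# Side note: unpadded numbers sort differently with and without -n
lexical_first=$(printf '9\n10\n' | sort | head -n 1)
numeric_first=$(printf '9\n10\n' | sort -n | head -n 1)
```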
Now, besides being interested in the total number of genera and which genera were best represented in the rrnDB, the other reason we're looking at this right now is that I'd like to make a unique identifier for each sequence and each genome in the dataset. What I'd like to do is create a table where each row is a different unique sequence, each column is a genome that the sequence was found in, and the values in the table are the number of times each sequence is found in each genome. This is a table that we can easily make in mothur using the count.seqs function, but to get there, we need a grouping file that tells us which genome, or group, each sequence belongs to. So what I'd like to do is simplify the header a little bit to identify, again, the unique identifier that we can use to represent each sequence. We'll do this with our cut, sort, uniq procedure, but first what I'd like to do is go ahead and file an issue about creating this file.
So my issue will be: create a table indicating the number of times each unique sequence appears across genomes. We will use mothur's count.seqs function to count the frequency of sequences across genomes. And remember, within a species, say among Escherichia coli, I might not expect there to be different 16S sequences across those different strains, but perhaps I would, right? Perhaps the same sequence would appear across all of the strains of E. coli, perhaps across some of the other species of Escherichia, and across some of the enterobacteria. But we don't know until we go ahead and look, and that's the purpose of this project. Some of the things that we're going to be mindful of: we need to create a grouping file to relate each sequence to each genome, we need to identify the unique sequences in the database, and then we need to count the number of times each unique sequence occurs across the genomes.
I can never spell occurs correctly. So we'll submit that issue. Now that we've created the issue in GitHub, I'm going to go ahead and double check that we're current on our master branch in our local repository and then create the branch for issue 17 in my local repository. git status: I'm on the master branch, and everything is up to date. git branch issue_17. git checkout issue_17. I will go into Atom and create a new file that I will save in code. I'm going to create the file count_unique_seqs.sh, and I'll create my shebang line.
Coming back to the terminal, again, one of the first steps that we need to take care of is finding a unique identifier that corresponds to each of the sequences. As I've said, the header row names are very long, and I'd like them to be as short as possible. Just to remind myself how many sequences there are, we have 76,574. So when we create those identifiers, that's the number that we're going to need to have. If I had done grep like this, creating an identifier based on the genus name, well, we only get out 1,279 genera. That's not going to be a unique identifier. Let's change this instead to be the GCF and the coordinates, because hopefully the assembly together with the coordinates of where the gene occurs will put us in good shape. So that's going to be fields two and five with the vertical pipe as the delimiter. If we sort and uniq those and then count how many rows we get out, ah, 76,570. It appears that that's not a perfect identifier.
I'd like to know which are duplicates, because it kind of surprises me that there are duplicates with the assembly as well as the coordinates. Adding -d to uniq will show us the duplicates. What we see is that there are four assemblies that evidently have two 16S copies that start at position one and go to 1470. I'm kind of curious what those are.
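A sketch of that uniqueness check on toy data follows. The headers are invented, including the coordinate format; the toy Rhodobacter entries mimic the situation described, with two elements of one assembly sharing the same coordinates.

```bash
#!/usr/bin/env bash
# Toy FASTA file; headers, accessions, and coordinate format are invented.
fasta=$(mktemp)
trap 'rm -f "$fasta"' EXIT
cat > "$fasta" <<'EOF'
>Escherichia_coli_K12|GCF_001|NZ_A1|chromosome|100..1600_+
ACGT
>Escherichia_coli_K12|GCF_001|NZ_A1|chromosome|5000..6500_-
ACGA
>Bacillus_mycoides|GCF_003|NZ_C1|chromosome|50..1550_+
TTGA
>Rhodobacter_sphaeroides|GCF_004|NZ_D1|chromosome_1|1..1470_+
GGCA
>Rhodobacter_sphaeroides|GCF_004|NZ_D2|chromosome_2|1..1470_+
GGCA
EOF

# How many identifiers do we need? One per sequence.
n_seqs=$(grep -c ">" "$fasta")

# Candidate identifier: assembly accession (field 2) plus coordinates (field 5)
n_ids=$(grep ">" "$fasta" | cut -f 2,5 -d "|" | sort | uniq | wc -l | tr -d ' ')

# uniq -d prints one copy of each line that occurs more than once
duplicated=$(grep ">" "$fasta" | cut -f 2,5 -d "|" | sort | uniq -d)

echo "$n_seqs sequences, $n_ids candidate identifiers"
echo "duplicated: $duplicated"
```

If the two counts differ, the candidate is not a valid identifier, and uniq -d points at the offending lines.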
So what I'll do is copy this and grep on that field in data/v19/rrnDB.align to see who those are coming from and what they are. What I see is that at least this first one is coming from Rhodobacter sphaeroides and that there are two chromosomes. And yes, Rhodobacter sphaeroides has two chromosomes in its genome. Evidently the way the group that sequenced the genome numbered the bases is that, for both chromosomes, they started with a 16S gene. So that's that. Let me double check that it's true with some of these others. Again, grep, cut, sort, and uniq are really helpful for figuring out what's going on in our data when things don't quite seem right, to diagnose problems as well as to do these basic data explorations. And we see this is another Rhodobacter sphaeroides where they've got two chromosomes and they start each chromosome's numbering with the 16S copy.
It appears then that the assembly and the coordinates are not sufficient and that we'll also have to include the RefSeq ID. So we'll need three fields. If I come back and modify this to be two, three, and five, and I leave on that uniq -d, there's no output, indicating there are no duplicates. If I replace the -d with a pipe to wc -l, I should see now, yep, that I've got 76,574 unique identifiers as well as 76,574 sequences. So we've got a one-to-one correspondence of our identifiers.
What I want to do now is use sed to extract fields two, three, and five to create a new file whose first column will be fields two, three, and five, the sequence identifier, and whose second column will be field two, the genome assembly that the different copies correspond to. I will also then modify my FASTA, or alignment, file so that the sequences have the updated names.
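The follow-up check can be sketched like this: look up the suspicious assembly, then confirm that adding the RefSeq accession (field 3) makes the identifier unique. Headers are invented; grep -F is used here as an assumption on my part so that dots and plus signs in a search string are treated literally.

```bash
#!/usr/bin/env bash
# Toy FASTA file; headers, accessions, and coordinate format are invented.
fasta=$(mktemp)
trap 'rm -f "$fasta"' EXIT
cat > "$fasta" <<'EOF'
>Escherichia_coli_K12|GCF_001|NZ_A1|chromosome|100..1600_+
ACGT
>Escherichia_coli_K12|GCF_001|NZ_A1|chromosome|5000..6500_-
ACGA
>Rhodobacter_sphaeroides|GCF_004|NZ_D1|chromosome_1|1..1470_+
GGCA
>Rhodobacter_sphaeroides|GCF_004|NZ_D2|chromosome_2|1..1470_+
GGCA
EOF

# Who do the duplicated coordinates belong to? -F matches the string literally.
grep -F "GCF_004" "$fasta"
n_hits=$(grep -c -F "GCF_004" "$fasta")

# Fields 2, 3, and 5: assembly, RefSeq accession, coordinates
n_seqs=$(grep -c ">" "$fasta")
n_ids=$(grep ">" "$fasta" | cut -f 2,3,5 -d "|" | sort | uniq | wc -l | tr -d ' ')
duplicated=$(grep ">" "$fasta" | cut -f 2,3,5 -d "|" | sort | uniq -d)

echo "$n_seqs sequences, $n_ids identifiers, duplicates: '${duplicated}'"
```

An empty result from uniq -d together with matching counts is the one-to-one correspondence the text is looking for.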
With that grouping file linking our unique sequence identifier to the genome it came from, we can then count the number of times the unique sequences show up in those genomes. Again, we'll go back over to Atom and do that all here in our count_unique_seqs.sh file. I'm going to create a variable, target, that will be coming to us from make. We'll go ahead and create the rule in a little bit. As I'm developing this, I like to put in temporary variables like target, and eventually I'll replace the hard-coded name with that $1 that we're taking from the input. So this will be data/v19/rrnDB.count_table. All right, that's the target that will be generated. We might also take as a target rrnDB.unique.align, the file that contains the unique sequences from our dataset. We'll come back and make sure that we can use those targets to make things generalizable. I'm going to have an input alignment file, which will be data/v19/rrnDB.align, which we've already seen. And I'll have a temp align file, which will be data/v19/rrnDB.temp.align. Again, we'll come back and clean these up based on the target variable.
We'll do sed -E, quotes, s, and then our three forward slashes. For each line, we're going to start with the header row, so we match the greater than sign, and then we have a field, and then the vertical line, for which I'm going to use the escape character, and then we have another field. We can repeat this so that we have five total fields. So one, two, three, four, five, five total fields. Again, what I want out are two, three, and five. Two is here, three is there, and five is there, so those get parentheses. I will replace all of that with the greater than sign followed by memory slot one, memory slot two, and memory slot three, and we will feed our align file into that.
Again, to test things out, I'm going to copy this, add head, and paste it into my terminal to make sure I get the output I'm expecting. I see that I spelled align wrong, and if the command doesn't get any input, then it just sits there. So I've got a typo here. Now when I run that I get output, and coming back up here I see that it's put all three fields together, but it doesn't have the vertical lines to separate them. So I'll go ahead and put those vertical pipes back in. I'm going to clear my screen with Command-K so it's easier to find the output. I now see that I've got the assembly accession, the RefSeq accession, and the coordinates.
One thing that occurs to me is that at the end it's got a minus for the negative strand or a plus for the positive strand. Those hyphens I know are going to cause problems for mothur, and I don't really care which strand it's on. I'm going to go ahead and remove those final two characters, and I can do that by putting an underscore and then a period at the end of my search pattern. I could even force it by putting a dollar sign to anchor the match at the end of the line, where you've got an underscore and then any character. Now if I run this, I see that the strand is removed and those names for the sequences look pretty good. So that works great.
Instead of piping to head, I can now redirect the output to $temp_align. If I run this, I get a temp file, and I'll look at the contents of data/v19. I'll go ahead and put ls -lth on that, and I see, yep, we now have rrnDB.temp.align, and it's actually a little bit smaller, three megabytes smaller than rrnDB.align, which is great. Now what we need to do is create that grouping file that tells us the sequence identifier and which genome it belongs to.
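Here is a hedged reconstruction of that sed step on toy data. The exact expression from the video is not shown in the transcript, so treat the pattern below as one way to express what is described: five pipe-separated fields, capture groups around fields two, three, and five, and a trailing underscore plus one character dropped. Variable names and headers are assumptions.

```bash
#!/usr/bin/env bash
# Toy alignment file; headers and sequences are invented for illustration.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
align="$workdir/rrnDB.align"
temp_align="$workdir/rrnDB.temp.align"
cat > "$align" <<'EOF'
>Escherichia_coli_K12|GCF_001|NZ_A1|chromosome|100..1600_+
ACGT-ACGT
>Escherichia_coli_K12|GCF_001|NZ_A1|chromosome|5000..6500_-
ACGA-ACGT
EOF

# -E turns on extended regular expressions, so ( ) capture without escaping.
# \| matches a literal pipe; a bare | would mean "or".
# _.$ matches the underscore and strand character at the end of the line.
sed -E 's/^>.*\|(.*)\|(.*)\|.*\|(.*)_.$/>\1|\2|\3/' "$align" > "$temp_align"

cat "$temp_align"
first_header=$(head -n 1 "$temp_align")
second_header=$(sed -n 3p "$temp_align")
first_sequence=$(sed -n 2p "$temp_align")
```

Because the pattern begins with ^>, the sequence lines never match and pass through unchanged.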
To do that, we will use grep to get the header rows from $temp_align, and we'll then pipe that into sed. Again, I'll use the extended regular expressions, and we will match on the greater than sign and do our same kind of thing with those fields. I need those escape characters, and what I want to preserve is that first field. Now, if I run this, let me show you what it looks like. Parentheses not balanced. What did I screw up? The problem is that I've got my escape character before the parentheses, and so that doesn't work. Okay, so we rerun that, and what am I getting out? I'm getting all three fields out rather than just the first field. Why would that be? So field one, field two, field three, and I needed the escape character before that first vertical line. If you only have a bare vertical line in there, that's effectively saying or, match this or this. So again, that escape character means actually match the vertical line.
Now when I run that, I do get the genome assembly accession, but what I want is two columns. I want one column that's the full identifier and a second column that has the assembly. Now, one of the cool things we can do with these parentheses is put parentheses around multiple things. The contents of the first parentheses, starting on the outside, go into slot one, and the contents of the second set of parentheses go into slot two. So you can have nested parentheses, but again, the parentheses that sed sees first go into slot one, and slot two will be the second set of parentheses. Now if I run that, I think we'll get what we want. And yeah, we get the sequence header, a space, and then the genome assembly accession. So this will be great for our group file, and I will output this to data/v19/rrnDB.groups, or maybe I'll say temp.groups, and I see that my temp.groups file looks good.
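The nested-parentheses trick can be sketched as follows, again with invented identifiers and variable names. The outer group opens first, so it lands in slot one; the inner group around the assembly accession opens second, so it lands in slot two.

```bash
#!/usr/bin/env bash
# Toy renamed alignment; identifiers and sequences are invented for illustration.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
temp_align="$workdir/rrnDB.temp.align"
temp_groups="$workdir/rrnDB.temp.groups"
cat > "$temp_align" <<'EOF'
>GCF_001|NZ_A1|100..1600
ACGT-ACGT
>GCF_001|NZ_A1|5000..6500
ACGA-ACGT
>GCF_004|NZ_D2|1..1470
GGCA-GGCA
EOF

# Slot 1 = the whole identifier (outer parentheses)
# Slot 2 = the assembly accession (inner parentheses, opened second)
grep ">" "$temp_align" |
  sed -E 's/>((.*)\|.*\|.*)/\1 \2/' > "$temp_groups"

cat "$temp_groups"
first_group_line=$(head -n 1 "$temp_groups")
n_group_lines=$(wc -l < "$temp_groups" | tr -d ' ')
```

Each output line is the sequence identifier, a space, and the genome it belongs to, which is the two-column shape described above.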
Again, I could do head on that file in data/v19, and that all looks great. Now what I'd like to do is run all this through mothur. We're going to do a couple of things. We're going to unique our sequences using the unique.seqs command in mothur, and then we'll also count the sequences, using our sequences as well as that grouping file. We can do this, as I think we've seen before, by calling mothur and then running unique.seqs with fasta=$temp_align, which will unique the sequences. Then we can do count.seqs. I guess we haven't defined the groups file as a variable yet, so I'll throw that in, and then group equals that variable. I also want to say compress=false. The output of count.seqs tries to compress things, but if we say compress=false, it'll come out as a data table. Maybe I'll go ahead and put in a line break there to separate things, and then I'll run these lines from my shell script. This might take a few minutes, so I'll jump ahead.
That all took about six or seven minutes to run on my computer. Let's look at the data that's been output, and we see that we've got our various temp files that we've created. This was the renamed file. This was the groups file that we created. This is the unique sequences. We'll want to save this, so we'll probably end up renaming it. And then this is the count table that we want. These other temp files we don't really need, so we'll want to do some amount of cleanup and garbage collection. We'll do a move on this to data/v19/rrnDB.unique.align, I'll call it, and then we'll also want to move the count table file to remove the temp from its name, and then we'll want to remove these other files. Again, if I put in the forward slash star, I'll get the whole path for each file, so I won't have to type anything. So we'll copy this over, and we'll copy over the groups file, which we no longer need, and the temp align file, and that should be good.
If I run these in my terminal, I now see that I've got my count table as well as my unique.align file, and we're in good shape. Now the challenge is, how do we generalize this for other regions so that we're not hard coding to data/v19? Well, I'm going to create a stub of a file name. I will do echo $target and pipe that into sed -E, and I want to take everything up to rrnDB and throw away everything else, replacing the match with \1. Let me double check this works. echo $stub gets me nothing, because my target isn't defined in the terminal. Okay, now that works. So now I've got my stub, and I can define these other variables based on that stub. My align file will be $stub.align and my temp align will be $stub.temp.align, and that all should work. We also have this groups file, and I'm actually going to create a stub temp, which will be $stub.temp. So then temp align can be built from $stub_temp. That's good. And I can make temp groups with $stub.temp.groups, and I'll put that here. That's good, and then my groups file will go here too.
Again, these variables make it easy to generalize what we're doing. I can replace the hard-coded target with $1, bring data in from a makefile for any region, and everything will update with what we're trying to do. Basically anywhere that I have rrnDB.temp, I can replace it with $stub_temp, and then over here, this is going to be $stub. So I think that's good. I'm going to go ahead and create a rule in my makefile. Again, I'll use that percent sign to indicate that we need to match whatever region it gives us.
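A minimal sketch of the stub idea follows. The variable names and their lowercase spelling are assumptions based on how they are spoken in the video, and the target is hard-coded here where the real script would take it from $1.

```bash
#!/usr/bin/env bash
# In the script described above this would be target=$1, supplied by make.
target="data/v19/rrnDB.count_table"

# Keep everything up to and including rrnDB, drop whatever follows
stub=$(echo "$target" | sed -E 's/(.*rrnDB).*/\1/')

# A period cannot be part of a variable name, so "$stub.align" expands
# $stub and then appends .align
align="$stub.align"
stub_temp="$stub.temp"
temp_align="$stub_temp.align"
temp_groups="$stub_temp.groups"

echo "stub:        $stub"
echo "align:       $align"
echo "temp align:  $temp_align"
echo "temp groups: $temp_groups"
```

Swapping the target for one in another region's directory changes every derived path at once, which is the point of building the names from the stub.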
The targets will be data/%/rrnDB.unique.align and data/%/rrnDB.count_table, and they will depend on code/count_unique_seqs.sh as well as data/%/rrnDB.align, with a backslash there to continue the line. We're also using mothur, so we need to make code/mothur/mothur a dependency. The recipe will then be code/count_unique_seqs.sh and then $@. This reminds me that I need to make code/count_unique_seqs.sh executable. Why is it complaining? I see that for some reason, Atom stored my shell script in my project root directory. So I need to move count_unique_seqs.sh to code. Now if I do my chmod, we should be in good shape, and if I do make data/v19/rrnDB.unique.align, it should run, and we'll be back in a brief moment.
All right, so I identified a small bug in my code. At the end here, I have $stub_temp.temp.groups and $stub_temp.temp.align, with an extra temp in each. I'll go ahead and remove those, and that should make everything good to go. After that finishes running on my computer, I'll go ahead and close the issue, merge the issue branch back into the master branch, then build those files for the first time for my other regions, make sure everything works, and finally push my changes to the code up to GitHub.
In the meantime, I've got three exercises for you to work on using that v19 rrnDB.align file. The first asks you how many genomes in that file are from the genus Thermus. So how many genomes, not how many sequences? Second, which organism has the most copies of the 16S rRNA gene per genome? We know that there are some with only one, and there are some like E. coli with seven, but what's the upper limit on the number of copies per genome? The third question asks, after the species coli, which Escherichia species has the most genomes represented in the rrnDB? We only think of E. coli, but there are other Escherichia species. What are those, and what is their abundance?
Go ahead, pause the video, work through these on your own, and once you've gotten through them, press the play button and I'll show you how I work through them to get a solution.
The first question asks how many genomes in the rrnDB are from the genus Thermus. We'll again start with grep, and I could make this a little bit easier by, instead of grepping only on the greater than sign, grepping on Thermus. To start, let me do head. All right, so this shows us a variety, and I can do wc -l to see how many total sequences there are: 31. Much like what we did when we were counting the different genera, I'm going to get both the genus and species name and the genome assembly accession by doing cut -f 1,2 with the vertical bar delimiter. This returns the correct information, of course. I can then sort and uniq. This shows us the unique combinations of the genus and species name and the genome assembly, and this actually is the answer. So I can then do wc -l, and it shows us that there are 16 Thermus genomes represented in the rrnDB. So if you got that, well done. If you did it by a different approach and you still got 16, well done, that's the goal. If you didn't get 16, maybe compare what I did and what you did and see if you can find those differences.
Next is which organism has the most copies of the 16S gene per genome. It's going to be very similar to what we've been doing. Again, I'm going to do a cut on fields one and two with the vertical bar delimiter and remind myself of where we are with that. What I can then do is sort, uniq, and count, because this will tell me how many sequences there are for each genome assembly and will retain the genus and species name. Then we'll sort, and we can do a reverse sort so the most abundant things are first and the rarest things are last. Let me head it. And I see that there's a Photobacterium damselae, never heard of that before, that has 21 copies of the 16S rRNA gene in it.
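As a sketch, the first two solutions look like this on a toy file. The headers and copy numbers are invented, so the answers here (2 genomes, 3 copies) are for the toy data only, not the real rrnDB.

```bash
#!/usr/bin/env bash
export LC_ALL=C   # assumption: fix the locale so sort order is predictable

# Toy FASTA file; headers, accessions, and copy numbers are invented.
fasta=$(mktemp)
trap 'rm -f "$fasta"' EXIT
cat > "$fasta" <<'EOF'
>Thermus_thermophilus_HB8|GCF_010|NZ_T1|chromosome|100..1600_+
ACGT
>Thermus_thermophilus_HB8|GCF_010|NZ_T1|chromosome|7000..8500_-
ACGT
>Thermus_aquaticus|GCF_011|NZ_T2|chromosome|300..1800_+
ACGA
>Photobacterium_damselae|GCF_012|NZ_P1|chromosome_1|10..1510_+
TTGA
>Photobacterium_damselae|GCF_012|NZ_P1|chromosome_1|4000..5500_+
TTGC
>Photobacterium_damselae|GCF_012|NZ_P2|chromosome_2|20..1520_-
TTGA
EOF

# Exercise 1: genomes (not sequences) from the genus Thermus
n_thermus=$(grep "Thermus" "$fasta" | cut -f 1,2 -d "|" |
  sort | uniq | wc -l | tr -d ' ')

# Exercise 2: copies per genome, most copies first
most_copies=$(grep ">" "$fasta" | cut -f 1,2 -d "|" |
  sort | uniq -c | sort -nr | head -n 1 | awk '{print $1, $2}')

echo "Thermus genomes: $n_thermus"
echo "most copies: $most_copies"
```

The only difference between the two pipelines is whether uniq collapses the genome lines (exercise 1) or counts them with -c (exercise 2).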
I know that a lot of Bacillus and Streptomyces have a lot of copies, but I didn't know that Photobacterium damselae had 21 copies. That's pretty remarkable.
Finally, after the species coli, which Escherichia species has the most genomes represented in the rrnDB? I think what I'll do is that Thermus trick again: we'll grep for Escherichia, and I will then again cut on fields one and two and check those out. So of course, there are lots of E. coli in there, and I will then sort, uniq, and head that. All right. This starts with Escherichia albertii, and we can then count on those. After the uniq, we'll actually need to do another cut, because each line is both the genus and species name and the genome assembly accession. The lines are unique, but we need to cut field one on the vertical bar delimiter to get back the genus and species name, then sort, uniq -c, and then a reverse sort. I forgot to put in the delimiter here, typing too fast, and what I see is that I actually have a lot of strains of E. coli. So I need to modify my cut. Let me delete this to start over a little bit. Cutting on the vertical bar works for albertii, but if I use tail, we'll see the others, right? So I'm going to do another cut to keep the first two fields when cutting on the underscore. So cut fields one and two on the underscore, and let me look at tail, because this is where I noticed a problem. And sure enough, that removes the strain-level information, and I can do sort, uniq -c, and another reverse sort. We see that Escherichia coli has 957 genomes, albertii has 15, fergusonii has five, an unnamed species has one, and marmotae has one. So again, hopefully going through these exercises helps you see how you can use cut, sort, uniq, and grep to answer different questions about the data you have, and that you can reuse them within the same pipeline to get to your result.
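The final Escherichia pipeline uses cut three times, so it may help to see it written out in one piece. This is a minimal sketch on a toy set of headers; the strain names and accessions are invented, and the only assumptions about the real file are the ones stated in the episode (vertical bars between fields, underscores between genus, species, and strain).

```bash
#!/usr/bin/env bash
# Toy headers only; none of these accessions or strain labels are real.
align=$(mktemp)
cat > "$align" <<'EOF'
>Escherichia_coli_K-12|GCF_000000010.1|NC_X00001.1|1..1500
ACGT
>Escherichia_coli_K-12|GCF_000000010.1|NC_X00001.1|4000..5500
ACGT
>Escherichia_coli_BL21|GCF_000000011.1|NC_X00002.1|1..1500
ACGT
>Escherichia_coli_W3110|GCF_000000012.1|NC_X00003.1|1..1500
ACGT
>Escherichia_albertii_A|GCF_000000013.1|NC_X00004.1|1..1500
ACGT
>Escherichia_albertii_B|GCF_000000014.1|NC_X00005.1|1..1500
ACGT
>Escherichia_albertii_B|GCF_000000014.1|NC_X00005.1|9000..10500
ACGT
>Escherichia_fergusonii_C|GCF_000000015.1|NC_X00006.1|1..1500
ACGT
EOF

# 1. keep name|assembly and collapse to one line per genome
# 2. keep only the name (field 1 on "|")
# 3. drop the strain by keeping fields 1-2 on "_" (genus and species)
# 4. count genomes per species and list the largest count first
species_counts=$(grep "^>Escherichia" "$align" \
  | cut -f 1,2 -d "|" \
  | sort \
  | uniq \
  | cut -f 1 -d "|" \
  | cut -f 1,2 -d "_" \
  | sort \
  | uniq -c \
  | sort -nr)
echo "$species_counts"

rm -f "$align"
```

With the toy data this lists three coli genomes, two albertii, and one fergusonii. The first sort and uniq have to come before the name-only cuts; otherwise the counts would be sequences per species rather than genomes per species.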
I'll also point out, like I showed you here, that head and tail are very useful for getting a sense of the output as you go, without having to see all 70-some thousand lines being output to the screen. So if you got the hang of these, well done. If not, keep practicing with these three exercises, and if you want further practice, start asking yourself other questions, like which organisms have the most chromosomes? That would be another interesting question that you could take on, and perhaps you could tell us in the comments down below what you found for the organism with the most chromosomes in the rrnDB.
Thanks again for joining me for this week's episode of Code Club. Be sure that you spend time going through the exercises on your own to help reinforce your new skills using cut, sort, and uniq. As we've seen today, these tools work well with grep and wc. We can also use what we learn to create sed statements to modify our text files. It would be great if you could take the ideas we've worked through today and think about how they relate to your current projects. We often get data presented to us in unexpected places, like the headers of sequence files. Being able to parse that information to extract what you want is a valuable skill. I'd love to see how you are adapting what I have covered in this and other Code Club episodes into your own work. Also, feel free to ask any questions you have in the comments below, and I'll do my best to answer them in a future episode. Please be sure to tell your friends about Code Club and to like this video, subscribe to the channel, and click on the bell so you know when the next Code Club video drops. Keep practicing, and we'll see you next time for another episode of Code Club.