I believe you can answer your own data analysis questions. Do you? You should. Thanks for coming for another episode of Code Club. I'm your host, Pat Schloss, and it's my goal to help you grow in confidence to ask and answer questions about the world around us using data. In last Monday's episode of Code Club, I did something that I do regularly. I screwed up. Rasmus Kierkegaard noticed the problem and was kind enough to point it out in the comments to that episode. Part of my motivation for making these videos, and my general approach to teaching, is to help normalize mistakes. There are a lot of YouTube tutorials on how to use grep, sed, and every other bash command you can think of. The problem is that the commands are presented in a way that's far too abstract. Those tutorials don't show how those commands interact with the output from other commands. They're highly edited, and they don't show the typos, the goofs, or how to identify and solve problems. Not mine. If you've watched along, you've seen numerous typos and redos along the way. These are not part of the act. They're the reality of how I and every other person analyze data. What I have found is that experience brings the ability to diagnose and solve those problems. This process is also missing from most of those tutorials. In today's episode of Code Club, I'm going to show you how I investigated the problem Rasmus pointed out and resolved it. I'm sure I would have found the problem eventually. But by doing these tutorials publicly, we're able to figure it out much faster and to do it together. Thanks, Rasmus. As we're doing steps in a data analysis, we can get a little cocky. We think things are going to work the way we expect. You may recall in last Monday's episode that there were a few sequences that started and ended with periods to indicate missing data. In one of my bash scripts, I wrote a sed command to convert those periods into hyphens to represent gaps in the alignment.
As we were going through my solutions to the exercises, I found a bug in that sed command and thought I'd fixed it. I even checked the output. Unfortunately, I was flustered and rushing at that point in the episode. Instead of checking the output for a region that had those leading periods, I accidentally checked the output of full-length sequences. Of course, those sequences did not have the problem with the periods. Today, I'm going to present the approach that I should have taken in last Monday's episode. It's related to a concept that is commonly used in programming called test-driven development, or TDD. It isn't as widely used in data analysis, but there are ideas in test-driven development that we can draw from to make our analyses more robust. The idea is that we start with a set of tests that fail. We then write code to generate output that passes the tests. If we later find a case that produces the wrong output, we add that case to our set of tests and modify the code so the test passes. Because modifying the code can cause other tests to fail, every time we update the code, we rerun the tests. This sounds a bit like make, right? Programming languages, including R and Python, have frameworks that make test-driven development much easier to execute. I'm not aware of any such framework for Bash. In today's episode, we're going to find the problem that Rasmus identified, create a set of test sequences that trigger that problem, and then modify our code to resolve the problem. Along the way, we'll learn more about sed and grep. Even if you're only watching this video to learn more about Bash commands and don't have a clue what 16S rRNA gene sequences are, I'm sure you'll get a lot out of today's video. Please take the time to follow along on your own computer and attempt the exercises. Don't worry if you aren't sure how to solve the exercises; at the end of the video, I'll provide my solutions. If you haven't been following along, welcome. Be sure to subscribe to the channel and click on the bell icon so you know when the next episode is released. Feel free to leave a comment, even if it's just to say hi, or if you would like to be like Rasmus and ask a question. Please check out the blog post that accompanies this video, where you'll find instructions on catching up, reference notes, and links to supplemental materials. The link to the blog post for today's video is below in the notes. As I mentioned in my introductory comments, I received a comment on last Monday's episode from Rasmus Kierkegaard, pointing out that he had concerns about how I was using my sed command to remove those leading periods from my sequence alignments. He was concerned that those leading periods, if there were, say, six of them, would be replaced by a single hyphen. And certainly, looking at the way I wrote it in the bash script, that is kind of what you would think would happen. And I could have sworn that I had tested this and that it worked.
So let me figure out what actually happened and see if I can recreate what Rasmus is concerned about. If there's a problem, we'll go in and solve it. I'll go ahead and move to my Documents, into the Schloss rrn analysis directory. Everything is good. Let me start by doing the test that I thought I had done in those exercises. That was to do grep, look at the beginning of each line, and ask: does anything start with a period? Again, we need the backslash because otherwise a period represents any character. And then I could do data/v19/rrnDB.align. And sure enough, nothing comes back, right? So at least for v19, it seems that all of our sequences start with a letter like A, T, G, or C, or with a gap character. But what about v4? I have this sinking suspicion that when I did it in the exercises, I tested it with v19, not with v4. So if I run this, ah, this is where the problem is. Sure enough, these three sequences all start with a series of periods, and those periods were not removed from the aligned sequences. So that's the problem we need to address. Rasmus was concerned that those periods were gonna get removed and replaced with a single hyphen. It actually looks like nothing has changed in these sequences. I'm gonna go into GitHub and file an issue. I will say that for the v4 region, leading periods were not replaced, and so we need to replace leading periods with hyphens. And we could perhaps say see number 16. Yeah, and so this will create the issue, and this will now create a link between issue 16 and issue 17, sorry, issue 18, and you can see here that there's now a link back to issue 18. And so we have kind of an integration of our problems, if you will. I'll go ahead and create a branch for issue 18. We're on the issue 18 branch, everything is clean on the branch, and we're good to proceed. In Atom, I'll go ahead and open up the extract_region.sh script.
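The grep check described above can be sketched on a tiny stand-in file. This is only an illustration: the file name toy.align and both records are made up, not taken from the real rrnDB alignment. It shows why the backslash matters when asking whether a line starts with a period.

```shell
# Hypothetical two-record alignment; names and sequences are invented.
tmp=$(mktemp -d)
printf '%s\n' '>seq_full' 'ATGC--ATGC' '>seq_partial' '....--ATGC' > "$tmp/toy.align"

# Escaped period: counts only lines that begin with a literal period.
escaped=$(grep -c '^\.' "$tmp/toy.align")

# Unescaped period: "." means any character, so every non-empty line counts.
unescaped=$(grep -c '^.' "$tmp/toy.align")

echo "escaped: $escaped, unescaped: $unescaped"
```

With the backslash, only the one sequence that really starts with periods is reported; without it, all four lines match and the check tells us nothing.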
This was the script where we were pulling out that region and then cleaning up those leading periods. And it appears that line 49 would be the offending line, where I was trying to convert those leading periods into hyphens. What I'm gonna do is rerun everything up to that spot to have the file that exists at that point, and then play with different options to understand what's going on. So I'll create target equals data/v19/rrnDB.align, and I'm gonna start with v19 because it occurs to me that when I was testing it up here, I was testing it using the output of the script. So perhaps the script did exactly what it was supposed to do, but only worked for v19 and not v4. Now in my script, I'm gonna go ahead and run everything from region down through mothur, and recall this takes a couple of minutes. So I'll be back in a second. So that ran successfully, and where we are in our data/v19 directory, let me do ls -lth to get the most recent file, is this pcr.filter.fasta file. I'm gonna go ahead and run a test on this to see if there are sequences that contain periods at this point in the script. Again, this is after running mothur, but before the sed line that converts those periods to hyphens. I'll type data/v19 and then copy in this file name. And what we're doing is that the caret followed by the greater than sign will match lines that start with that greater than sign. What I actually want is grep -v, which will return the lines that don't start with that greater than sign. So I'm getting the sequences rather than the headers. I will then pipe that to another grep to find anything that contains a period, okay? And if I run that, nothing comes back. So as I suspected, with v19 we got lucky in that none of the sequences started with a period. Everything started at the exact position we thought it would and ended at the exact position we thought it would. What we need to do now is repeat this, but for v4. And so I don't have to recreate this.
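The two-step test described above can be sketched as follows. This is a minimal sketch on made-up data: toy.fasta, the accessions, and the coordinates are placeholders. The headers deliberately contain periods, which is why they have to be removed before searching the sequences.

```shell
# Hypothetical FASTA; headers contain periods in the version and coordinates.
tmp=$(mktemp -d)
printf '%s\n' \
    '>seq1|ACC_1.1|10..20' 'ATGC--ATGC' \
    '>seq2|ACC_2.1|30..40' '....--ATGC' > "$tmp/toy.fasta"

# Step 1: grep -v '^>' drops the header lines and keeps the sequences.
# Step 2: grep '\.' keeps only the sequence lines that contain a period.
failures=$(grep -v '^>' "$tmp/toy.fasta" | grep '\.')

echo "$failures"
```

An empty result means the test passes; anything printed is a sequence that still carries periods.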
I'm gonna copy this test here into my bash script. And this whole thing is the path, so dollar sign path, and that looks good. So I'm gonna repeat this now with a new target. I'll do target equals data/v4/rrnDB.align, and I will now run all this other stuff from region down through running mothur. And I'll go ahead and run my test here. We'll be back in a moment. So we ran all the way through mothur and then we ran that test statement. And sure enough, we see that the data coming into the sed has those leading periods. What I'll then do is go ahead and run the sed. I'll copy and paste that in. And I'm gonna modify my test here a little bit and put this in. We don't want the filter.fasta, we want the filter.test.fasta. And we'll copy and paste that in. So we ran sed and now we're testing it. And sure enough, those sequences come back with those leading periods. Rasmus made a suggestion in his comment that perhaps we should just change all of the periods in the file into hyphens, and then in the headers we could convert them back. Before we check that out, and perhaps some other options, what I'm gonna do is something I should have done originally, which is to create a test file. It's just too hard to be looking at 75,000 sequences to see if they all worked. So what I'd like to do is create a test file that contains examples of things that I would wanna make sure get converted from periods to hyphens, and to make sure that nothing screwy is going on. We could do this manually by making up sequences or by cherry-picking sequences to subset from the overall data set. But as I mentioned with test-driven development, one of the things you do is take the cases that fail and make those part of your test suite. Well, I would like to make a FASTA file that contains these sequences. But the problem is I don't have the sequence headers with them.
What we'll learn then is to use grep with a special argument: we'll use -B, with a capital B, to get the lines that occur before the matching sequence. Another way to get these without that long test pipeline might be to say grep and then starts with a period. And we will give it this path, to the filter.test.fasta file. I guess I didn't need that final grep. So this is another way to get those three sequences that started with the period. In that other grep search that I have as part of the test, I like having the initial search against the greater than sign that starts the header row because, if you remember back to last Monday's episode, one of the bugs I introduced put a character before the greater than sign. So that is actually part of the test, but as I'm developing my data set I don't need that full pipeline. So again, as written, this current grep search will return these lines that all start with periods. Well, if I want the line that occurs before each of these lines that match, I can use the capital B and a one. This then will return the sequences as well as the line above them, which is their sequence header. One of the other things it will do is separate each of these matches with a pair of hyphens. What I'm gonna do for now is pipe this, or redirect it, more appropriately said, into test.fasta. And again, now we have three sequences that failed our test because they still start with periods. And I'm gonna go ahead and do a search for this, why can't I do this? I'll search for this sequence by its header, so any header that contains that name, and I want the sequences that occur in the line after it. So what I can do is grep -A. You can think of -B as before and -A as after. So the line after, and I will plug that Blastochloris viridis name and then the genome assembly accession in there. And then I will search over that filter.test.fasta file, and you'll see that we get back three sets of sequences.
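The before and after context options described above can be sketched on a small made-up file. The record names a through d and the file toy.fasta are placeholders. The sketch relies on grep's -B and -A options, which GNU and BSD grep both provide.

```shell
# Hypothetical FASTA with two sequences that start with periods (b and d).
tmp=$(mktemp -d)
printf '%s\n' '>a' 'ATGC' '>b' '..GC' '>c' 'TTGC' '>d' '.TGC' > "$tmp/toy.fasta"

# -B 1: each matching sequence plus the one line before it (its header).
# Non-adjacent groups of output are separated by a line holding "--".
with_headers=$(grep -B 1 '^\.' "$tmp/toy.fasta")

# -A 1: the matching header plus the one line after it (its sequence).
after=$(grep -A 1 '^>c' "$tmp/toy.fasta")

printf '%s\n' "$with_headers" "$after"
```

The "--" separator lines are why the test file needs a quick cleanup before it is a valid FASTA file.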
This middle one is the one that we've seen before because it started with a period. These others start with a T. I'm gonna go ahead and rerun that and append this output to test.fasta. And remember, we can append by using two greater than signs. And now if I look in Atom at test.fasta, I can nicely clean this up by removing those pairs of hyphens. And something that occurs to me is that I've got three cases in here of sequences that start with periods. I don't have any cases of sequences that end with periods. So on this last one, I'm gonna replace these last four characters with four periods. I'm also gonna take this sequence that starts with periods and create a case of a sequence that both starts with periods and ends with periods. And so I've got in here a good sequence that starts and ends with a base. I've got a few sequences that start with periods. I've got a sequence that ends with periods, and I've got a sequence that both starts and ends with periods. So this is a pretty good test file that we can use to test our sed command. Coming back to our sed command here, I'm gonna copy this down because we're gonna make a few examples. And I'm gonna say test.fasta, and I'm not gonna redirect it. And so let's give this a shot. This should give us the output that we've already been seeing, where it doesn't change the leading periods. And sure enough, you'll see that there are sequences in here that still have those leading periods as well as the trailing periods. You'll note that we didn't even try to deal with the case where there were trailing periods in Monday's episode, I think because I couldn't find any sequences that had them. So that might not be such a critical concern, but let's think about how we could change this. So Rasmus's suggestion was to change every period to a hyphen, okay? And we run that.
And what we find is that, well, the first character turns from a period into a hyphen, but that's because I forgot the g flag at the end of my sed substitution. Now, if I use the g, that replaces every case in the line, not just the first. And so what you'll see is that I now have, I guess it's this case here, the leading characters are all hyphens and the trailing characters are also all hyphens. But one of the things that it does, which Rasmus actually commented on in his note, is that the periods in this range for the genome coordinates become hyphens. Also, in the version numbers for the accessions, those periods became hyphens as well. And I think, yeah, up here, this is a subspecies, subtilis; there was a period here after the P, and that was turned into a hyphen. So as Rasmus suggested, we could perhaps go back through and convert all those hyphens into periods, but I'm sure there are cases where there's supposed to be a hyphen, say, in a name, and converting that to a period would get messy. So is there a more elegant way to do this? What I'd really like to be able to do is to tell sed to only modify the lines that match a pattern. So match a line that perhaps contains a period, or starts with a period, or ends with a period, and then only run the sed substitution on that. So we're not running sed on the header. How do we do that? Well, Google is our friend: sed on lines that match pattern. And what I find is that I've looked at these pages before, because these are web pages that tell you a lot about sed, and it comes back to this Grymoire tutorial that I always seem to come back to when I'm doing a search about how to do something in sed. I'm still learning a lot about sed, and what I'm sharing with you today is something that I've only recently, just in the last day, learned about sed.
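To make the effect of the g flag concrete, here is a small sketch on two made-up lines, one header and one sequence. The organism label, accession, and coordinates are invented for illustration.

```shell
# Hypothetical header and sequence lines.
line_hdr='>B.subtilis|ACC_1.1|10..20'
line_seq='....AT--GC....'

# Without g: only the first period on the line is replaced.
first_only=$(printf '%s\n' "$line_seq" | sed 's/\./-/')

# With g: every period on the line is replaced.
every=$(printf '%s\n' "$line_seq" | sed 's/\./-/g')

# The same global substitution also rewrites periods in a header line.
header=$(printf '%s\n' "$line_hdr" | sed 's/\./-/g')

printf '%s\n' "$first_only" "$every" "$header"
```

The sequence comes out right with g, but the header's name, version number, and coordinate range are all altered, which is the side effect Rasmus warned about.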
So looking at the table of contents, what I'd like to do is look through this and see if there's anything that pops out to me about matching a pattern and then changing that line. And what I'm seeing here is this section on addresses and ranges of text: restricting to a line number, patterns, ranges by line numbers, ranges by patterns. And you can see I've already been here. If you go to the pattern section, or to this general section, it says that we can limit sed's operation to specific line numbers, ranges of lines, or lines that have a specific pattern. And that's what I wanna do. In fact, the first example that they have in this tutorial is beautiful because it looks a lot like what we're trying to do. In this sed command, they're looking for lines that start with a pound sign. And then in each line with a pound sign, they're replacing any numbers, okay? Well, we kind of wanna do the same thing. We wanna match on a line that starts with a period and then replace, say, periods with hyphens. Let's give this a shot. I'm gonna come back and I will do a pattern, and we wanna start with a period. So for lines that start with a period, we then wanna replace those periods with hyphens. Let's give this a shot. And what I find in doing that is that again, like Rasmus's suggestion, that turned everything from a period to a hyphen. What I wanted to do was not just match lines that contain a period, but lines that start with a period. Now, if I put that caret in my pattern, I find that these sequences that started with periods now start with hyphens. I still have my periods here for my version numbers and in my coordinates. But one thing I see is that in this last sequence, which ended with periods but did not start with periods, those are still periods. And I want those to also be converted from periods into hyphens. And we had an example somewhere.
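The address idea described above can be sketched like this. It is a minimal illustration on an invented test.fasta with two records: a /pattern/ in front of the s command restricts the substitution to lines that match the pattern.

```shell
# Hypothetical test file: s1 starts and ends with periods, s2 only ends with them.
tmp=$(mktemp -d)
printf '%s\n' '>s1|ACC_1.1' '..AT..' '>s2|ACC_2.1' 'ATGC..' > "$tmp/test.fasta"

# Address /^\./ : run the substitution only on lines that START with a period.
out=$(sed '/^\./ s/\./-/g' "$tmp/test.fasta")

printf '%s\n' "$out"
```

The headers are left alone and the first sequence is fully converted, but the second sequence keeps its trailing periods because its line never matched the address.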
Yes, we had an example here of a sequence that ended in periods being converted into hyphens, but that's because that sequence also started with periods, so it replaced every period in the line. Now, we could write another sed to look at the end of the sequence, but that's not the easy way to do it. The easier way might be, instead of looking for sequences that start with a period, to say, let's avoid lines that start with the greater than sign. So for every line that doesn't start with a greater than sign, we're gonna replace periods with hyphens. Well, how do we say don't use lines that start with a greater than sign? If we do this, then we still have all our periods in our sequences, but our header was changed. Well, something that we could do is put an exclamation point before the s. What this means is to reverse the search, to reverse the pattern that we're looking for on that line. It will say, don't match lines that start with a greater than sign, and then run this find and replace. So if we run that, we find, let me double check what's going on, because when I looked further down here, there was a way to reverse the restriction with an exclamation point. And so you could do sed and then exclamation point p. So that should work. I wonder, they've got single quotes, and I wonder if this is some funny thing with double quotes. So let me change it to single quotes as well. That actually worked. So wouldn't you know it? It's the quotes that seemed to matter, although I'm noticing nothing's changing, except the things in my header are changing. Ah, I told you, all these mistakes are honest. I forgot the exclamation point before the s. That did the trick, and that seems to work. So again, the pattern we used within sed says we want to look for lines that start with a greater than sign, but on every line that doesn't match the greater than sign, we want to run this find and replace.
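The negated address can be sketched as follows, again on an invented two-record test.fasta. One plausible explanation for the quoting trouble, offered here as an assumption rather than something the episode confirms: in an interactive bash session, an exclamation point inside double quotes triggers history expansion, so the shell rewrites or rejects the command before sed ever sees it, while single quotes pass it through untouched.

```shell
# Hypothetical test file with periods in both headers and sequences.
tmp=$(mktemp -d)
printf '%s\n' '>s1|ACC_1.1|1..6' '..AT..' '>s2|ACC_2.1|1..6' 'ATGC..' > "$tmp/test.fasta"

# /^>/ selects header lines; the ! applies the substitution to every line
# the address does NOT select, i.e. the sequence lines. Single quotes keep
# an interactive shell from interpreting the ! itself.
negated=$(sed '/^>/!s/\./-/g' "$tmp/test.fasta")

# Without the !, the same command edits only the headers (the mistake made above).
headers_only=$(sed '/^>/s/\./-/g' "$tmp/test.fasta")

printf '%s\n' "$negated"
```

Inside a script, history expansion is normally off, so the quoting difference would mostly show up when typing at the prompt.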
So perhaps it's better to only use single quotes when using these sed commands. I'm kind of surprised that that matters, but you live and learn, right? So this is one way, again, to get the desired output. Another way, one that is a little bit more portable to other tools, comes from grep. As we've talked about, we can match lines that start with a greater than sign in test.fasta, and this returns the headers, right? And if we don't want the lines that start with that, we can use -v and we get back our sequences. Well, if we return to this, another way to denote a character that we want to match is to put it into square brackets. So what would this do? Perhaps we want to match lines that start with a greater than sign, and maybe we also want to match sequences that start with a period. Putting those two characters into the square brackets tells grep or sed to match any one of them in that position, right? So if we run this, you'll notice that we get all of the header rows as well as any sequence that starts with a period, but that isn't a search that we're really interested in doing. What we want to do is match the lines that don't start with the greater than sign. So again, start with that, and to move things along, I'll put the greater than sign in the square brackets and I will put test.fasta. Right now it's going to match any line that starts with a greater than sign. If we don't want to match something in those square brackets, we can put a caret inside. I know we have two carets here, but what that caret means inside the square brackets is don't match what's in the square brackets, okay? So if we did don't match the greater than sign at the beginning of the line on test.fasta, what we get back are all of our sequences. We don't get the headers.
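The square-bracket patterns described above can be compared side by side. This is a small sketch on a made-up file; the record names and toy.fasta are placeholders. Counting matches with grep -c keeps the output compact.

```shell
# Hypothetical FASTA: two headers, one sequence starting with a period,
# one sequence starting with a base.
tmp=$(mktemp -d)
printf '%s\n' '>s1' '..AT' '>s2' 'ATGC' > "$tmp/toy.fasta"

# [>.] : first character is a greater than sign OR a period.
either=$(grep -c '^[>.]' "$tmp/toy.fasta")

# [^>] : first character is anything EXCEPT a greater than sign (sequences).
not_header=$(grep -c '^[^>]' "$tmp/toy.fasta")

# [^.] : first character is anything EXCEPT a period (headers plus ATGC).
not_period=$(grep -c '^[^.]' "$tmp/toy.fasta")

echo "$either $not_header $not_period"
```

The first caret anchors the match to the start of the line; the caret inside the brackets negates the set.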
If I did don't match the period, I'm going to get back my headers because they don't start with periods, and I'll get back my sequences that start with a base because they don't start with a period. But again, what we want is don't start with the greater than sign. Now, coming back to our sed, what we could do would be sed, and we could put in as our pattern what we had in that grep, which was at the beginning of the line don't match the greater than sign. And then we can run s, and then we will do the period and replace that with a hyphen, and we're going to do it globally across the entire line, and I'll point that at test.fasta. So again, we're looking for the lines that don't start with the greater than sign, and on those lines we are replacing a period with a hyphen. And again, I put in double quotes, maybe we should be using single quotes, and sure enough, that solves the problem. So all of our sequences end in hyphens, all of our sequences start with hyphens if they don't start with a base, and we're in good shape. So again, this was one version of sed that we used; another version of sed that we used is here. I kind of prefer the way with the carets, mainly because it's more portable across grep, sed, and other places that you might be working with regular expressions. This exclamation point s is another thing for me to keep track of in my head, and as I showed, it seems to be sensitive to those single or double quotes. Let me come back down and see if this one actually is sensitive to double and single quotes. It works, so it does not seem to be sensitive to double quotes unless I'm using that exclamation point. So I'm not sure what's going on. I'm gonna stick with the caret. And so what we can do then is modify our sed in the script to do a pattern, right? So the pattern will be the start of the line followed by not a greater than sign, and we will then replace any period, and we need that backslash.
Otherwise we're gonna convert everything to a hyphen, and we need to put a g at the end because this is the global replacement across the line. I can then run this on my full dataset, and that runs. Now let me run my test, which is this grep, and everything works. Let me double check that this is v4. Yep, it's v4. So that worked very well. What I'll do is comment out these greps and run the full thing now to make sure it works at the very end. And I can do that with make data/v4/rrnDB.align, and I'll be back in a moment. This completed with no error messages. Let me go ahead and run my test grep on the output of this to make sure everything looks good. So I'll do grep -v, starts with a greater than sign, on data/v4/rrnDB.align, and then I'm gonna grep for any line that contains a period. Woo, awesome, it passed. Since the v4 region data worked well, I'm gonna go ahead and run this also on v19, v34, and v45, and double check that everything works well there. I'm sure it'll be great. If there's a problem, you'll hear from me about it. So I'll do make data/v19/rrnDB.align and I will add data/v34. So something I'm pointing out here is that you can actually build multiple targets at the same time. Because we also had the episode on Thursday, after running these and making sure they work, we'll have to go ahead and regenerate those targets. I could have gone directly to building those targets, but this is fine. So what have we learned from this experience? Well, I think it's important to recognize that we have these really big data files, and it's easy to think that we can debug by working with those big files, but the reality is that they're too big. And they may not always have the problem cases that we wanna be on the lookout for, right? So I did a great job of testing using that v19 dataset, but it didn't have the problem that I should have been concerned about. The v4 did, however.
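Putting the pieces together, the final sed command and the period check can be sketched against a small test suite. This is a hedged sketch, not the actual contents of the episode's test.fasta: the four records, their names, and the file names are invented to cover the same four cases (good, leading periods, trailing periods, both).

```shell
# Hypothetical test suite covering the four cases discussed above.
tmp=$(mktemp -d)
printf '%s\n' \
    '>good|ACC_1.1|1..8'     'ATGC--AT' \
    '>leading|ACC_2.1|1..8'  '....GCAT' \
    '>trailing|ACC_3.1|1..8' 'ATGC....' \
    '>both|ACC_4.1|1..8'     '..GCAT..' > "$tmp/test.fasta"

# On lines that do not start with ">", replace every literal period with a hyphen.
sed '/^[^>]/ s/\./-/g' "$tmp/test.fasta" > "$tmp/fixed.fasta"

# The test: count sequence lines that still contain a period.
# grep -c exits non-zero when the count is 0, hence the "|| true".
before=$(grep -v '^>' "$tmp/test.fasta"  | grep -c '\.' || true)
after=$(grep -v '^>' "$tmp/fixed.fasta" | grep -c '\.' || true)

echo "failing sequences before: $before, after: $after"
```

Because the suite is only eight lines long, it is also easy to eyeball fixed.fasta and confirm that the headers kept their periods.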
So it helps to make that test suite of sequences that we anticipate there being problems with. We can then have a test, which for me was that grep statement that pulled out sequence lines that contained a period, and we can then iterate by trying different options with sed to make sure that we get the desired output. It takes more time to develop that test, but as I see it, in the long run it saves time, because I could have detected that problem early on and kept iterating until I got it right, like I ultimately did today. The other thing I wanna point out is that there is no shame in Googling to find answers. Again, part of developing experience is learning how to Google in a way that gets you the answers you want, and that's gonna take practice. And then once you get those results, you need to know how to sift through them. You're gonna get 100,000 results or more, so how do you know which one to look at? Well, start with the first one and look for cases where people are describing what you're trying to do. If you're Googling for things like sed frequently enough, you'll see similar links pop up, like I have for this tutorial, the one with the ugly yellow background that always seems to come up. So return to those resources and look for help there frequently. And again, the value of keeping everything under make means that I don't have to worry about keeping track of my dependencies and prerequisites, and that I can easily regenerate these with one simple command from the command line. Going back to that episode on make, I really agree with Karl Broman, who said that make is central to everything he does for reproducible data analysis. It has just been so helpful. We think we're only gonna run the function or the script once, but I've probably run this extract script a dozen times now. And it's just been really invaluable to have that under make.
As always, I have a set of exercises for you to engage with, to help develop your skills further and to practice on your own. In the first one, I want you to create a pipeline to generate a FASTA file that contains the 21 copies of the v4 region of the 16S rRNA gene from Photobacterium damselae. This was the organism that we found in the previous episode that had the most copies of the 16S gene in its genome. The second is to write a sed statement to unalign the sequences in our v4 rrnDB.align file. So how would we remove the gaps, and how would we know that it actually works? Finally, I'd like you to create a test statement that you can add to extract_region.sh that makes sure that none of the sequences have periods in them. And if they do, I'd like the script to fail. Go ahead, pause the video, work through these exercises on your own, and when you're done, go ahead and press the play button and I'll show you how I work through the exercises. As I thought, everything went through. Let me just double check with my greps to make sure these work. So I will do grep -v, because I don't want to match lines that start with a greater than sign, and I'm going to run all of these through. So I'll do data/*/rrnDB.align. That star will match any of my directories in the data directory that contain an rrnDB.align file. And then I will run all that through grep to then look for lines that contain periods, right? So I'm going to hit control C, because this is out of control. And what I forgot to do, hmm. So I wonder why this is upset. Ah, it's upset because I got too fancy. The output of this initial grep contains the file name and then the sequence. So maybe I'll come back and use one of my fancy cuts from last time. So I'll do cut, field two, and I'll delimit on the colon. How about that? Looks good. Looks like nothing's coming out as output. There are a lot of sequences and a lot of files to go through in these four different regions. And it's not turning anything up.
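The multi-file check described above can be sketched as follows. This is a minimal sketch with an invented directory layout and two made-up one-record files. When grep searches more than one file it prefixes each output line with the file name, and since that name itself contains a period, a naive second grep matches every line. The -h option shown as an alternative is supported by GNU and BSD grep, though that is an assumption about the reader's system.

```shell
# Hypothetical layout mimicking data/<region>/rrnDB.align; contents invented.
tmp=$(mktemp -d)
mkdir -p "$tmp/data/v19" "$tmp/data/v4"
printf '%s\n' '>s1|ACC_1.1' 'ATGC--AT' > "$tmp/data/v19/rrnDB.align"
printf '%s\n' '>s2|ACC_2.1' '--GCAT--' > "$tmp/data/v4/rrnDB.align"
cd "$tmp" || exit 1

# With several files, each line comes back as "file:line".
prefixed=$(grep -v '^>' data/*/rrnDB.align | head -n 1)

# Naive check: the "." in "rrnDB.align" makes every prefixed line match.
naive=$(grep -v '^>' data/*/rrnDB.align | grep -c '\.' || true)

# Fix 1: keep only the part after the colon before searching for periods.
with_cut=$(grep -v '^>' data/*/rrnDB.align | cut -d : -f 2 | grep -c '\.' || true)

# Fix 2: ask grep not to print file names in the first place.
with_h=$(grep -hv '^>' data/*/rrnDB.align | grep -c '\.' || true)

echo "$prefixed | naive=$naive cut=$with_cut h=$with_h"
```

Both fixes report zero offending sequences here, while the naive version flags every sequence line.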
Excellent, so that worked well. Again, I got a little fancy here. We could have run four different greps. And this worked well. All right, to the exercises. In the first exercise, I asked you to write a pipeline to create a FASTA file that contains the v4 region from this Photobacterium damselae genome that had 21 copies. So what I'll do is grep, and I'm going to grep on this genome accession, and I will do data/v4/rrnDB.align, and I'll count the lines to make sure we get 21. 21, great. Now, this is going to return the header. If I want the sequence as well, I could use -B or -A; here it's -A 1, because A is after. So one line after will get us the sequence. And sure enough, this is returning the header along with the sequence for the Photobacterium damselae sequences. Now between each pattern that it matched, it's putting in a line with two hyphens. I can get rid of those with another grep. So I could say grep -v, quote. If I just put in hyphen hyphen, that's going to match all of my sequences as well as those hyphen hyphen lines. What I want is lines that start with a hyphen and end with a hyphen and only contain those two hyphens. If I run that, I get back the clean output, and then I can redirect this into photobacterium.fasta and we're in good shape. We could count the headers with grep on photobacterium.fasta, and we have 21 sequences, which is right. And we could do wc -l on photobacterium.fasta and get 42, so we're in good shape. For the second exercise, I asked you to remove all the gap characters from our sequences. And how would we do that with sed? Well, to test it, I'm going to go back to using our test.fasta file, and I can do sed. And again, I'm going to use that pattern match, right? So I'll do a pattern match on things that don't start with the greater than sign, to not look at the headers, but look at the sequences. And then I'm going to do the substitution.
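Going back to the first exercise for a moment, the pipeline can be sketched on a toy file. This is an illustration under stated assumptions: ACC_1 stands in for the real genome accession, and toy.align and subset.fasta are made-up file names.

```shell
# Hypothetical alignment: two copies from accession ACC_1, one from ACC_2.
tmp=$(mktemp -d)
printf '%s\n' \
    '>orgA|ACC_1|1..4'  'AT-C' \
    '>orgB|ACC_2|1..4'  '--GC' \
    '>orgA|ACC_1|9..12' 'A--C' > "$tmp/toy.align"

# -A 1 returns each matching header plus its sequence; grep -v '^--$' then
# drops the separator lines, which consist of exactly two hyphens.
grep -A 1 'ACC_1' "$tmp/toy.align" | grep -v '^--$' > "$tmp/subset.fasta"

# Sanity checks: number of headers, and total lines (two per record).
n_headers=$(grep -c '^>' "$tmp/subset.fasta")
n_lines=$(wc -l < "$tmp/subset.fasta" | tr -d ' ')

echo "$n_headers headers, $n_lines lines"
```

Anchoring the pattern with ^ and $ matters: an unanchored "--" would also throw away any aligned sequence that contains a run of gap characters.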
I'm going to search for a hyphen and replace it with a period. We're going to do a global search, close the quote, and I'll give it test.fasta. We'll run this and it seems to have turned, ah, I have periods on the brain. It turned all the hyphens in my sequences into periods, and what I want to replace them with is nothing, okay? And sure enough, that removed the hyphens, but I'm still seeing the periods at the beginning of my sequences. Again, that's because test.fasta still contains the periods. What I could do is make that a square bracket and put hyphen and period in it, and that will remove both the periods and the gap characters to give me my unaligned sequences.
Then I could run this on data/v4/rrnDB.align. I'm going to put the output in my home directory, because I don't really want to save this and I'll end up deleting it. I'll call it rrnDB.fasta, and once it's done, I'll run head on rrnDB.fasta to see what the first 10 lines of that file look like. And that looks great. I could do a grep to double check. So grep, again, we don't want to match things that start with a greater than sign in rrnDB.fasta, and we'll grep for anything that contains a hyphen or a period. And I forgot a closing quote. So I'll hit Control-C, because I've got that prompt there, and I'll go ahead and put in the closing quote. This runs and, for some reason, I'm anti-matching my anti-match. So I don't want that -v, and we're good. Nothing matched. Again, what I had previously was a -v. So I was saying, don't match things that don't start with the greater than sign, which means match things that start with a greater than sign, and then return the ones that contain a hyphen or a period, which is all the headers. So we didn't want the -v in that case. So this was the solution and how to test it.
Finally, I want to add a test statement to my extract region bash script to make sure that my output files pass the test.
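Going back to the second exercise for a moment, the unaligning step and its double check can be sketched roughly like this. The temporary file and the two sequences are made up for illustration, and the exact sed expression used in the episode may have been written differently. The headers deliberately contain a period and a hyphen to show that they are left alone.

```shell
# Hypothetical aligned FASTA: leading/trailing periods and internal gaps
aligned=$(mktemp)
printf '>seq.1\n..AC--GT\n>seq-2\nACG-T...\n' > "$aligned"

# On lines that do not start with ">", delete every hyphen and period.
# Inside the brackets the hyphen comes first so it is taken literally.
unaligned=$(sed '/^[^>]/ s/[-.]//g' "$aligned")
printf '%s\n' "$unaligned"

# Double check: count sequence lines that still contain "-" or ".".
# grep -c exits non-zero when the count is 0, hence the "|| true".
leftover=$(printf '%s\n' "$unaligned" | grep -v "^>" | grep -c "[-.]") || true
echo "$leftover"   # prints 0: no gap characters remain in the sequences
```

Note that the double check uses grep -v "^>" to drop the headers first. Without that step, the period and hyphen in the header names would count as matches.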
What I will do is make another if statement. I'll do if, and I basically want to say that if the number of matches is not equal to zero, then I want to fail. So then I'll do my echo, FAIL: sequences contained periods, and I'll do an exit 1. And then I'll close that off with a fi. But I need to come up with that number of matches. What I'll do is take this grep, and again, this will output the number of matches in this filter test.fasta, and I can capture that and store it as my test. So I'll say test equals that, and then in the if statement I can put dollar sign test not equal to zero.
I think this should work, right? But I don't want to get cocky, so I'll test it. First, I'll check my path. I don't have that set. Okay, so what I'll do is make my target data/v4/rrnDB.align, and I'm going to run the top of this script again, all the way through mothur but before the sed statement. To test it, once it's done running, I'm going to copy the output file from mothur to filter test.fasta. I'll basically copy this file to this file and see what happens. Maybe I'll just go ahead and do that here in the script. So I'll do this: remove the greater than sign for the redirect, do cp, and comment out the sed line. So basically, instead of running sed, I'm copying everything over, and this should fail. So I'll go ahead and run make on this. I'll do make data/v4/rrnDB.align and this should fail.
So this ran through, but the output looks kind of weird. I'm getting some syntax errors, I see here, and I'm getting all the aligned sequences kind of vomited out to the screen. And it's pointing at the extract region script, line 64. Let's go to line 64 here. Yeah, so if test not equal to zero. What occurs to me is that the value of test is all of those matching lines of sequence, and what I want is a count. So grep -c will give me that count. Now if test is zero, then the script won't go through this.
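Here is a minimal sketch of that kind of test, written as a small function so it can be tried outside the real script. The function name check_no_periods, the variable names, and the two demo files are hypothetical. In the script itself the failing branch would call exit 1 rather than return 1.

```shell
check_no_periods() {
    # Count sequence lines (headers excluded) that contain a literal period.
    # grep -c prints a number instead of the matching lines, which is what
    # the numeric comparison below needs. It exits non-zero when the count
    # is 0, hence the "|| true".
    n_bad=$(grep -v "^>" "$1" | grep -c "\.") || true

    if [ "$n_bad" -ne 0 ]; then
        echo "FAIL: sequences contained periods" >&2
        return 1   # in a standalone script this would be: exit 1
    fi
    return 0
}

# Hypothetical demo files: one clean, one with leftover periods
good=$(mktemp)
bad=$(mktemp)
printf '>seq.1\n--ACGT--\n' > "$good"
printf '>seq.2\n..ACGT..\n' > "$bad"

good_status=0
check_no_periods "$good" || good_status=$?
bad_status=0
check_no_periods "$bad" 2>/dev/null || bad_status=$?

echo "good: $good_status, bad: $bad_status"   # prints: good: 0, bad: 1
```

Feeding the test a file that is known to be bad, as the copy trick in the episode does, is what shows that the test can actually fail.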
If it's not equal to zero, then it will go through this. Save that. Bear with me. Give this one more shot. Should work. And sure enough, that did in fact fail, showing that our test works. And again, if we remove that copy line and put the sed line back in so that we don't have the error, then everything should work swimmingly. So I'll go ahead and rebuild all of my targets from both last Monday's and Thursday's episodes. Then I will commit my changes, close out the issue, merge that with the master branch, and push it up to GitHub.
Thanks again for joining me for today's episode of Code Club. Be sure that you spend time going through the exercises on your own to help reinforce your new skills using test-driven development, grep, and sed. As my own experience has shown, adopting ideas from test-driven development will help us make our testing more systematic and robust. It would be great if you could take the ideas we've worked through today and think about how they relate to your current projects. I'd love to see how you're adapting what I've covered in this and other Code Club episodes into your own work. Also, feel free to ask any questions you have in the comments below and I'll do my best to answer them in a future episode. Please be sure to tell your friends about Code Club and to like this video, subscribe to the Riffomonas channel, and click on the bell icon so you know when the next episode drops. Keep practicing and we'll see you next time for another episode of Code Club.