Hey folks, welcome back for another episode of Code Club. We've been doing a lot of build-up to get going on our package, which we're tentatively calling phylotypr. If you don't like that name or have another idea, please let me know down below in the comments and I'll take it under consideration. What phylotypr is mainly going to do is classify sequences, most likely 16S ribosomal RNA gene sequences, using the naive Bayesian classifier. Along the way we might introduce other algorithms for doing classification, but the RDP classifier is really where my heart is at in this project. So again, if you don't care about this specific application or what this package is going to do, please keep watching anyway, because I think you'll still learn a lot. I'm doing this because I want to learn how to make packages, and this is a convenient example: some friends asked me if I would do this because the Ribosomal Database Project website, the RDP website, sadly went down last summer. And so we want to build an R package to do that classification.
In the last episode, I developed this outline of the different steps we need to do to get a classification. The items on the left side correspond to our unknown sequences. The things on the right are what we need to do with our reference sequences. As you can see, both columns have a common feature, which is to collect eight-nucleotide words, and possibly other sizes as well. So in today's episode, we're going to use a process called test-driven development to build that feature, where we will collect all possible eight-nucleotide words from our sequences. All right, so to get going, we are going to start in my Finder window.
And again, because we created our package as an R project in RStudio, the most convenient way to get to the right working directory within RStudio is to double-click on this phylotypr.Rproj file. This will launch RStudio in the right place. We can see up here in the upper left corner of our console that it gives us the path. I'm working in my phylotypr directory, which is off my desktop, which is off the home directory. It's not critical that you have it on the desktop if you're following along, but we do need to be in this phylotypr directory. We'll also see that across the upper right corner I have a Build tab and a Git tab. Git is synced up to GitHub. I could go ahead and pull things down, but everything's already up to date, so I won't worry about that. Under the Build tab, I can go ahead and do Test. Everything passes there. We have one test for the script that came in by default. Actually, it doesn't really test that script; the only thing it checks is that two plus two is four. What we got in our R directory within the package is a kind of demo script, a hello world script, that has some other useful information, including how to install the package, check the package, and test the package.
People have been asking me if I will be working on developing this package in something other than RStudio along the way, like maybe VS Code. I'm not really planning on it. Maybe we'll get there eventually. The only difference would be the use of these buttons to test things. So like I did Test, we could do Check to make sure everything's looking good. Hopefully these things all pass. Alternatively, we could also do this down at the prompt. So we could do test(), and this runs the tests and passes everything. I could do check(), with open and close parentheses, and that will run and make sure everything looks good too.
So the first thing I'm going to do is, sadly, get rid of this hello.R script. I'm actually going to do it in the Terminal tab using git. Sometimes I find the terminal easier to use than the GUI. This will also perhaps placate people who want to learn how to develop a package in some system other than RStudio. So what I'll do is git rm, and then in the R directory we have hello.R. I'm also going to remove the test script for hello.R. So I'll do git rm again, this time in our tests directory, where we have a testthat directory, and in there we have test-hello.R. If I do git status, I see that these two are ready to be deleted. So I'll go ahead and commit that with the message remove hello world scripts. All right, so that's all good. And this script should be deleted. Yeah, we get a message alerting us that it's been deleted, so I'll go ahead and close that out.
Now I want to create a new R script that I'm also going to be testing. You'll recall that a couple of episodes ago we set up the testthat system, so it's all set and ready to go; we just need to write some tests. So I'll go back to my console and do use_r, and here I'm going to give it the name of my R script. Now this might change along the way, but we'll roll with this for now. I'll go ahead and say use_r with kmers, and that opens up my kmers.R script. I can then do use_test with kmers, and this will create a test script for me that is basically what I had for the hello test. You'll notice also that this file's name starts with test hyphen, and the rest is the same as my kmers.R script.
One thing I want to point out is that these use_test and use_r functions are available because I have the devtools package loaded (they're defined in usethis, which devtools attaches for you). I have devtools getting loaded automatically when I start RStudio. You could do the same thing by doing library(devtools), or you could call each function with its package prefix and the double colon, but all that gets a little bit, I don't know, tedious. And so I went ahead and put a require(devtools) into my .Rprofile file; you can see that in the video I made about two episodes ago.
So the approach that I'm going to try to take to writing my code is what's called test-driven development. The idea of test-driven development is that you write a test, and it will fail. And then you go write the code to get the test to pass. It's a little bit counter, perhaps, to what you did in school, where you would go do the work, and then you would take the test and see whether or not you pass. Well, this is kind of like giving a student the test, telling them what they need to do to pass, and then wanting them to pass. It's not like a novel idea for education. Maybe we should think about that.
Anyway, I'm going to write a test. The first function that I want to develop is what I talked about earlier, which is extracting all of the k-mers from my sequences. This is a fundamental part of the naive Bayesian classifier. So if I have a sequence, say I call that x, and I'm going to, you know, randomly put in some As, Ts, Gs, and Cs. That looks like a pretty decent length. I want to be able to extract all the possible, say, eight-nucleotide words from that sequence. So I'm going to start with a test, where I will say test_that, and then in quotes I'll say, can extract all possible 8-mers from a sequence. So that's the name of the test, and it goes in quotes. After the closing quote and a comma, but before the closing parenthesis, I'm going to put in an open curly brace and enter some space.
So this is going to be a little bit of an odd format. Again, we have test_that, then the name of the test, which is kind of a description of the test, followed by a comma, and then curly braces to define the body where I'm going to put the test. I'm going to take that x, and I'm going to write some tests based on it. What I could imagine doing is having a vector of 8-mers, so I could say expected_kmers, and this will be a vector. I'm going to take this sequence and get all the possible k-mers out of it, all the 8-mers. And so I'll... it looks like I accidentally hit some key that restarted R. Hopefully that doesn't matter. I'm going to quickly copy this down, offsetting the sequence by one base with each line. Finally, all right.
So this is my input, x, and my output is what I expect, which will be the k-mers. Now I'm going to write a function. The function I'm going to call get_all_kmers, and I'm going to give it x, and then I'm going to have kmer_size equals eight. Then I'm going to assign the output of this to kmers. Maybe I'll call it all_kmers. So now I'm going to have a test, and I'll say expect_equal, and I'm going to give it all_kmers, and the output I expect to get is expected_kmers. I'm going to save this and run it, and it should fail. So if I do test(), it runs and it fails on me. And it says, this is the name of the test, can extract all possible 8-mers from a sequence, and there's an error: could not find the function get_all_kmers. Cool. We're gaining on it. Believe it or not, we're gaining on it.
So what I can do up here in kmers.R is, again, I have this function call, which I'll go ahead and copy so I know what I'm calling things. This will then be a function, and we'll give it an x and a kmer_size. I could leave it this way, and that would mean that the default k-mer size would be eight.
I'm going to go ahead and take that out, because we might want to test different k-mer sizes. And then we'll have a body. So we've now defined get_all_kmers. It doesn't do anything, but we've got a body. Now I can run test(), and it runs. It no longer gives me that error that it can't find the function, but it's saying the actual is NULL. So this is outputting nothing, and the expected is this character vector. So we have to do the hard work now of getting all the k-mers.
As I think about this, I know there's a variety of ways in R that I can get a substring out of a bigger string. There's a function I know I could use, which would be substr. But that's going to get me one chunk, one k-mer, for a start position and an end position. Then I would need to loop that over all possible start positions of my k-mers. So I could have something like n_kmers, and this would be nchar on x, that's the length of x, minus kmer_size plus one. That's the total number of k-mers. And then I could loop over all my possible k-mers. So I could do an sapply. This is really nitty-gritty base R syntax. You could do this with a for loop. For loops tend to be a little bit controversial in R, sometimes for good reasons, sometimes not so much. But anyway, I could do sapply over all possible starting positions of my k-mers, and then send each starting position to another function, which would be, say, get_kmer. I could think about giving it my sequence, the start position, so that would be like a start, and then a kmer_size. We obviously don't have a get_kmer function, so I need to go ahead and write that function. And I'm going to go ahead and copy this.
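The counting step described above can be sketched in a few lines of base R. The 17-base sequence here is a made-up stand-in for the one typed in the video, and the variable names are illustrative rather than taken from the package.

```r
# Hypothetical 17-base sequence; the one in the video is not shown in the transcript
x <- "ATGCGCTAGTAGCATGC"
kmer_size <- 8

# A sequence of length L has L - k + 1 overlapping k-mers
n_kmers <- nchar(x) - kmer_size + 1

# Candidate start positions to hand to sapply(), one per k-mer
starts <- seq_len(n_kmers)
```

With 17 bases and a word size of 8, that works out to 10 start positions, which matches the number of lines one would have to type by hand for the expected vector.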
And for the time being, I'm going to comment out this code, because I know it's going to fail, and I need to do something before I can run it. And that is to develop and test this get_kmer function. So I'm going to create another test. I'll do test_that, and I'll say, can extract specific k-mer from a starting position and size. Again, same type of syntax, where we've got test_that, the name of the test, and then the curly braces, and in those curly braces we're going to define our test. So again, I'll go ahead and grab this x, and I'm going to run get_kmer on x as my sequence. I'm going to give it a starting position and a k-mer size. So maybe what I'll do is kmer_size equals eight, and I'll assign that to kmer. If I put in the number one for the starting position, I know that I can expect_equal kmer and this first string. So I'll do that. And I could do a couple of tests like this. Maybe I'll do three, so I'll do one, five, and ten. One is that. Two, three, four, five is here. Six, seven, eight, nine, ten is here. Cool. Good.
So again, we're going to run this and it's going to fail. I'm going to save my R script there and run the test. Again, it fails because it can't find the function get_kmer. So we'll come back here and create get_kmer as a function. We'll have x, then the start position, then the k-mer size, and then we'll define the body. I can run this test again, and it's going to fail three times now instead of just once, because it could find the function, but it failed all three of these expectations. So now we need to go ahead and write that function. Again, there's a couple of different ways we could do this. The approach I'm going to use is substr. It comes to us from base R.
So one of the considerations we should have when writing a package is to try to minimize the number of dependencies. We don't want to remove all dependencies and make writing the code hard, but we don't want to expect our users to have to bring in a bunch of big packages. There's the stringr package that could probably do this too. I might hold off on that, or we might bring it in later, but for now I'm going to use the substr function. With substr, if I give it x and I give it, say, 1 and 4, let's see, it doesn't like my x. Let me load an x variable here. All right, let's try that again. What this does is start at position 1 and give me four nucleotides. It goes from the start position to the end position. So if I do 2 comma 4, it's only going to return the values at 2, 3, and 4. The syntax for substr is the string, the start position, and the end position. Our string is going to be x, our start position is going to be start, and the end position we could think of as being start plus kmer_size. I'll save that.
Now, you might be thinking, you're basically renaming the substr function. Fair. But I feel like giving it this name, get_kmer, and then using it up in my sapply makes the code easier to read and easier to understand. Let me go ahead and test it and see what we get. So we got two fails. If I look back at my tests here, I find that the actual does not equal the expected. My actual is this and my expected is that. We also see that for this other one. For some reason it passed one of the tests. So which test did it pass? I guess it passed the last one. If I had to guess, that is because this is the end of the string and it couldn't get a nucleotide off the end of the string.
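To see why two expectations failed and the last one passed, it helps to look at how many characters substr returns. This is a small base R sketch using a made-up 17-base sequence (not the one from the video); both endpoints of substr are inclusive, and an end position past the end of the string is silently truncated.

```r
# Hypothetical 17-base sequence standing in for the one in the video
x <- "ATGCGCTAGTAGCATGC"

# Both endpoints are inclusive
first_four <- substr(x, 1, 4)   # positions 1 through 4
two_to_four <- substr(x, 2, 4)  # positions 2, 3, and 4

# Using start + kmer_size as the end spans kmer_size + 1 positions
from_one <- substr(x, 1, 1 + 8)

# At start = 10 the requested end (18) is past the last base (17),
# so substr() quietly returns what is there: eight bases
from_ten <- substr(x, 10, 10 + 8)

widths <- nchar(c(from_one, from_ten))
```

Here from_one comes back nine bases long while from_ten is eight bases long, which is consistent with the guess above that the last expectation only passed because the string ran out.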
Maybe that gives us an idea for another test: if I had given it 11, it should have thrown an error. Really, it should have thrown an error in this case too, but it didn't. What we have here is a case of what we might call a fence post error. If you have a distance between here and there of 100 feet, and you want to put up a fence post every 10 feet, then you might say, well, I need 10 fence posts. No, you actually need 11, because you need the one for where you are and then one every 10 feet. We're getting that fence post error because I am adding an extra post. I need to subtract a post, so to speak. So if I subtract one, this now should pass. And that all passes. Everything comes out rosy.
I'm going to add another test, as I mentioned, because I think that case should have failed. So I'm going to make this 11 and expect an error. The syntax here is expect_error. And actually, I want to put the function call as the argument to expect_error. So I'll do expect_error on that with a closed parenthesis, save that, and then test again. And again, we get a failure, because the code did not throw the expected error. To create the error, I'm going to add a stopifnot call. I'm going to put this at the top of my function so that if it throws an error, it doesn't bother with the rest of the function. So what I'll do is this. The end should be at or short of the sequence length. So I'll stop if that is bigger than the length of x. If the end position of my k-mer is larger than the length of my sequence, it should throw an error, and then our test will catch that. We'll go ahead and test that again. All right, so that's not throwing the error that we expected. So again, if my start is 11, and my k-mer size is eight, and we have our sequence, then let's see, what is this value?
This should be 18. And my length of x is one. So this is a common mistake that I frequently make. length corresponds to the length of the vector, the number of elements in the vector. Instead of length, what I want is nchar on x, which I actually used up here but forgot. So instead of length, I want to do nchar. I'll save that, and let's test again. So this throws an error, as we can see, which we'd expect it to catch because we had expect_error here in our test for kmers. I think the problem is the stopifnot, that it doesn't like that I'm using stopifnot. I think it would prefer that I use stop. I frequently use stopifnot to test a condition like this. Instead, I think what I need to do is say, if this is true, then stop and send out an error message. So we can do that. We'll do if this is true, then stop with the message cannot extract k-mer beyond end of sequence. And maybe we'll put this on a separate line. All right, let's save that and test again. That passed. Very good.
So now we have our tests passing for the get_kmer function, and we've got a little bit of bounds checking thrown in to make sure that we're not going off beyond the end of the sequence. If we wanted to be a little bit more careful, we could also add a condition to make sure that we're not getting a start position less than one. But maybe we'll hold off on that until it's actually a problem. All right, now we need to return to get_all_kmers. If we come back to our test, I'm going to uncomment this code and save it. Let's test to remind ourselves what the problem was. Again, it fails in our can extract all possible 8-mers from a sequence test: cannot extract k-mer beyond end of sequence. Wow, it got us our error message. So let's see what it was doing. We're not really giving it a start position.
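The pieces worked out above for get_kmer can be pulled together in one sketch. This is a reconstruction under assumptions, not the code from the video: the sequence is made up, and the exact error text is a guess at what was typed. One likely reading of the stopifnot trouble is that stopifnot raises an error when its condition is not TRUE, so it needs the valid condition (end at or before the last base), whereas the if-then-stop form takes the failing condition.

```r
# Hypothetical 17-base sequence standing in for the one in the video
x <- "ATGCGCTAGTAGCATGC"

# length() counts vector elements; nchar() counts characters in the string
n_elements <- length(x)
n_bases <- nchar(x)

get_kmer <- function(x, start, kmer_size) {
  # Subtract one so the span covers exactly kmer_size positions (fence post fix)
  end <- start + kmer_size - 1

  # Failing condition goes in if (); the stopifnot() spelling would be
  # stopifnot(end <= nchar(x)), i.e. the condition that must hold
  if (end > nchar(x)) {
    stop("Cannot extract k-mer beyond end of sequence")
  }

  substr(x, start, end)
}

last_kmer <- get_kmer(x, 10, 8)

# Capture the error message instead of halting, to show that start = 11 is rejected
error_message <- tryCatch(
  get_kmer(x, 11, 8),
  error = function(e) conditionMessage(e)
)
```

In a testthat file, the last call would instead be wrapped as expect_error(get_kmer(x, 11, 8)).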
I'm not sure how it's actually running this. That's okay. What we'd like to do is run sapply. sapply is a mapping type of function, where we have this vector of values, we're going to send each one to get_kmer, and we then give it the other arguments that we're going to send to get_kmer. So we're going to give it x as the x value. Perhaps we could give these better names. Maybe we should call this sequence. So instead of x here, we'll put sequence, and here too. So I've changed sequence for x. Now I can save this and test it and make sure that part works. I'm kind of changing a few things at the same time, but I see that those four tests passed. So changing x to sequence didn't break anything here.
So here I could say sequence equals x. My start position is going to come from here, so I'm going to leave out start. And I'm going to put kmer_size to be eight. Again, what this is going to do is call get_kmer, what we have down here. For the sequence argument, it's going to use the value x. For kmer_size, it's going to use the value eight. And for the start position, it's going to use this. Now I'm getting a red x here, and I think it's telling me I have an extra parenthesis. So I'll save that, and now let's test it, and everything passed. I get this very nice message saying your tests look perfect. And boy, if that doesn't just make me feel great.
So again, what we're doing is test-driven development, where we write the test first, and then we develop the code to pass the test. What we're doing here is kind of simulating how we would like to run the function, along with specifying the output that we want to get. If I wanted to keep this a little bit more orderly, I might say that our input is going to be this sequence, and these are going to be all the possible k-mers.
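The way sapply fills in the arguments can be sketched outside the package. In this hypothetical example (made-up sequence, reconstructed get_kmer), each element of the first argument is passed to get_kmer as its first unmatched positional argument; because sequence and kmer_size are supplied by name, that leftover slot is start.

```r
# Reconstruction of get_kmer with the renamed first argument; not the package source
get_kmer <- function(sequence, start, kmer_size) {
  end <- start + kmer_size - 1
  if (end > nchar(sequence)) {
    stop("Cannot extract k-mer beyond end of sequence")
  }
  substr(sequence, start, end)
}

# Hypothetical 17-base sequence standing in for the one in the video
x <- "ATGCGCTAGTAGCATGC"
n_kmers <- nchar(x) - 8 + 1

# Each value of 1..n_kmers lands in `start`; the named arguments are
# forwarded unchanged to every call of get_kmer()
all_kmers <- sapply(seq_len(n_kmers), get_kmer, sequence = x, kmer_size = 8)
```

The result is a plain character vector with one 8-mer per start position, which is the shape that expect_equal compares against the hand-typed expected vector.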
And with this k-mer size, I'm going to expect that all the values are going to be equal, and so forth. Same thing here, where we specify the input, run the function, and then expect_equal on the output to make sure that passes. And in this case, we're expecting an error. We could add more tests to this. So I could also do expect_length on all_kmers, and I could expect that that length will be the length of expected_kmers. What this is going to do is take all_kmers, the output of this up on line five, and expect its length to be the same as this. So again, if I save that and run the test, everything passes, and we're in good shape.
So whenever I want to add a feature to any of my functions, I need to first write the test, make sure that I fail the test, and then write code to pass the test. This is very different from how we typically write our code. Normally we write our code, run it, and see if it looks right. So I'm going to write one more test up here, and I'm going to do k-mers of nine. Let's go ahead and do kmer_size nine. I'm going to speed this up a little bit, because it's going to take a little bit of time for me to get all of the possible 9-mers here. You saw me do it before with the eight, so I'll quickly go through this and be back with you in a second.
All right, so I have all my possible 9-mers. I'll go ahead and copy down the same tests we had from before. I hope this test passes. But it's not, it's failing. I get one fail on the get_all_kmers test, can extract all possible 8-mers from a sequence. What it's saying is that all_kmers is not equal to expected. And what I'm finding is that where I expected 9-mers, I'm actually getting out the 8-mers that I had before. So what I'd like to do now is go back here and see where I might have a problem.
And immediately, it jumps out at me that I have kmer_size equals eight instead of kmer_size equals kmer_size. I had hard-coded kmer_size equals eight rather than using the variable. I'll save that and run the test again, and everything passes. Everything runs swimmingly.
So I'd like to add another set of tests, and that is to make eight the default. I'm going to copy some of this test code down. What I'd like is to not have to give the k-mer size. So if I call get_all_kmers on x, then I should get these expected k-mer values. I'll go ahead and run the test, and this should fail. So this fails: kmer_size is missing. To solve that, up here I need to put in eight as my default kmer_size. I can then test it, and that passes. I would also like to make the kmer_size in get_kmer default to eight as well. We can do that by grabbing, let's grab this one, like five, and we'll remove that kmer_size equals eight, save this, and test it. Again, it fails, because the argument kmer_size is missing. I come back here and put equals eight as my default. And now if I test again, it passes.
Again, hopefully you can see that I had a feature I wanted to add. I wanted to make the default k-mer size eight. I went back, wrote the test, and then wrote the code to pass the test. And so this is pretty good. We could always, like I said, add more tests and more conditions. But at some point, we're kind of over-testing, and we're adding extra features that perhaps we don't really need. It's always best to write the test when you have the need for the new feature. There's a problem called feature creep, where you keep adding and adding features when you don't really need them. So you might argue that at this point, I don't even need a kmer_size argument.
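Here is a rough sketch of where the two functions and their tests end up after this step, written so it can run on its own outside the package. It assumes the testthat package is installed; the sequence, the test descriptions, and the function bodies are reconstructions for illustration, not the code from the video.

```r
library(testthat)

# Reconstructed helpers with eight as the default word size (an assumption
# about the final form; the package source may differ)
get_kmer <- function(sequence, start, kmer_size = 8) {
  end <- start + kmer_size - 1
  if (end > nchar(sequence)) {
    stop("Cannot extract k-mer beyond end of sequence")
  }
  substr(sequence, start, end)
}

get_all_kmers <- function(sequence, kmer_size = 8) {
  n_kmers <- nchar(sequence) - kmer_size + 1
  # Pass the variable through rather than hard-coding 8
  sapply(seq_len(n_kmers), get_kmer, sequence = sequence, kmer_size = kmer_size)
}

# Hypothetical 17-base sequence standing in for the one in the video
x <- "ATGCGCTAGTAGCATGC"

test_that("k-mer size defaults to eight", {
  expect_equal(get_all_kmers(x), get_all_kmers(x, kmer_size = 8))
  expect_equal(get_kmer(x, 5), "GCTAGTAG")
})

test_that("a non-default k-mer size is honored", {
  nine_mers <- get_all_kmers(x, kmer_size = 9)
  expect_length(nine_mers, 9)
  expect_equal(nine_mers[1], "ATGCGCTAG")
  expect_error(get_kmer(x, 10, kmer_size = 9))
})
```

The second test is the kind that would have caught the hard-coded eight: with the literal in place, asking for 9-mers would return ten 8-mers and expect_length would fail.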
All I really need is eight as my default k-mer size. But I'm kind of thinking ahead, knowing that I might want to try seven or eight, or try some of these different sizes, to see how sensitive the accuracy of the algorithm is to that parameter. So I'm going to give myself a pass on adding this feature before we're quite ready for it. But again, the idea with test-driven development is that you only write a test for the feature that you need to add. So you're adding things in a very stepwise, very slow way. Whereas what I find when I write code without tests is that I write all sorts of code to do all the things, and I don't have tests for all the things. And many of those features I've added, I don't really need. So now I'm writing extra code that's perhaps going to be more buggy than it would have been if I had gone in a much more incremental fashion.
Before we end for today, I want to make sure our package is in good shape, so I'll run through the various checks. We'll do document(). That looks good. We haven't added any documentation to our functions yet, because I don't know that I want either of these functions to be outward-facing to the user. I think I'm going to want them to be encapsulated within other functions, so we'll worry about that later. We can also do load_all(). Again, I think everything is loaded, and that's going to look fine. And then we can also do check() to make sure everything builds A-OK. So I see it's throwing an error with two warnings, and I believe that it has to do with the documentation for hello. I'm going to head back to my terminal and do git rm. I think that's in my man directory, where there's a hello.Rd. I'll go ahead and remove that. Then let me do document() again. I think that's A-OK. And then back in my terminal, git status. I'll go ahead and do a git commit hyphen hyphen amend.
Because I'm adding this as part of the commit to remove the hello world scripts, I haven't added my other new scripts. I'll go ahead and save this. And now if I do git status, I see those are there. Let me come back to my console and retry building it with check(). All right, so I'm getting two warnings: missing documentation entries for get_all_kmers and get_kmer. I know about that, and I'm not going to worry about it for now. I'm going to go ahead and commit my changes, and I'll use the GUI this time. Clicking these boxes, for whatever reason, I find the boxes are lagging a little bit. That's just really annoying, because the GUI is supposed to speed things up, not slow me down. All right, so my message will be create functions to extract k-mers from sequences. Then we'll go ahead and commit that. And now everything that we have done today is up on GitHub.
So hopefully this makes sense. Turning back to our pseudocode, I think we've done a good job here of looking at that step of parsing out the k-mers from our sequences. In the next episode, we'll move on to another part of our pseudocode to start building out more and more of a mature package. So that you don't miss that episode, please make sure that you've subscribed to the channel. And I'll see you next time for another episode of Code Club.