 Hey folks, welcome back for another episode of Code Club! If you have been following along in recent episodes, you know that I am building an R package in order to help me to learn how to make R packages. This is something that my lab is starting to do more of, trying to make packages in R so that people can use all the great things we've been talking about over the history of Code Club to better analyze microbial ecology data and other types of data. Anyway, I had to pick something to make a package for, and so what I picked was a algorithm called a naive Bayesian classifier. And so, again, even if you don't care about what the actual application is, like that doesn't just make you excited, like it makes me excited, I think you'll get a lot out of this series of episodes, specifically how to build a package, and kind of some of the differences between taking a data analysis approach to writing R versus doing a programmatic type of approach like we are when we're making a package. As I go through, I try to articulate some of the different design considerations and coding considerations that come to mind as I'm trying to build this out. In the last episode, we talked about taking a DNA sequence of ATs, GCs, and sometimes Ns, and converting that to base 3, so 0, 1, 2, 3, and that what that allows us to do is that when we pull out a k-mer, say a 8 number sequence, or a string that's got 8 numbers in it, that being an 8-mer, which is the default k-mer size for this algorithm, we can take that base 4 8-mer and actually convert it to an integer that's base 10. What that allows us to do is that allows us then to directly index into a vector or a matrix to the exact spot we want very efficiently. Alternatively, if we had kept things in kind of the character space, ATs, GCs, and Cs, then we know that there'd be like 65,000 plus possible 8-mers. And so if I had an 8-mer that I needed to look up, well, every time I go to look it up, it's like going through a phone book trying to find someone's name, right? That can be pretty tedious and pretty time consuming versus if I told you to jump ahead, 253 pages, bam, you'd get there instantly, right? So that's what we're trying to go for. Well, in today's episode, that's a long-winded way of saying for today's episode, what we're going to do is build upon that to then build a vector for each sequence that tells us whether or not that specific k-mer was present in our sequence. Then with that vector for an individual sequence, we'll then build it out for all of the sequences in our database. And from that then, we will then go and calculate the total number of sequences in each genus that have a specific k-mer. That will allow us to calculate the word-specific priors and the genus-specific conditional probabilities. The word-specific priors were described in this paragraph within the method section of this wonderful paper. It's here on page 2 in the first column if you like the analog version of the paper, like I do. Anyway, the expected likelihood estimate for determining the Jeffrey Perks law of succession, whatever that is, determined by the formula, pi equals nwi plus 0.5 divided by big n plus 1. Basically, what this means is that if we take all of the sequences in our dataset, we can count how many sequences have each k-mer. So that wi we can think of as being all possible eight-character sub-sequences or those words, nwi is the number of sequences that have that. We're only counting occurrence one time per sequence. So if a sequence has, say, a stretch of nine a's in a row, well then it's going to have that eight-mer of all a's twice, right? So we're only going to count that once though. So we're going to count the number of sequences that have a given k-mer, add half, and then divide by the total number of sequences plus one. That half and that one, what it's doing is say we have a k-mer that is really rare and say it doesn't actually show up in any of our sequences. Well, we don't want the probability to be zero. Instead, we want it to be a really small number. So in that case, the nwi would be zero and we'd get 0.5 divided by a big number, which is again a small probability. With this word specific prior, we'll then be able to calculate the genus specific conditional probabilities. And this is the probability that a member of a genus contains a specific word. And so in the algorithm that was calculated or estimated by this formula, again, it's the same idea except m, instead of looking across all sequences, is looking within a specific genus to see in that genus, say like Escherichiae or Pseudomonas, how many sequences have a specific k-mer. And then we're going to add the pi, which we saw in the previous slide for those priors, divided by the number of sequences in the genus plus one. So this pwi given g then will allow us to then calculate the probability of seeing a sequence from a specific genus as we see down here in this last sentence. Again, in the last couple of episodes, we've talked about collecting those eight nucleotide words. What we're going to do now is to move on in this reference column to calculate the pi, the prior for each word, and then to calculate that conditional probability for each word given a genus for each genus and word. So that's what we have to do today. And we'll head over to our studio and get going on that now. I already have my k-mers.r and test k-mers.r scripts open up. You'll notice I'm putting everything into these two files at some point, but not today. We're going to have to kind of take a step back and perhaps do a better job of organizing this because we don't want to have a package with one r script. That's probably not ideal. And of course, we are still at kind of the early stages of developing this package. Anyway, we'll worry about that later. So I'm going to come down and create a new test. So I'll do a test that. So the first thing I'm going to work on today is accurately detecting each k-mer for a given sequence. So I'll go ahead and do accurately detect k-mers from a sequence. All right. And then, again, hopefully we're getting used to this syntax for our test of that statements. So I'll go ahead and grab this sequence because it's already converted to base four. And so I'll call this sequence. So from this sequence, I'm going to get all possible k-mers. I'm going to use the function that I've already written and tested. So I'll have to get all k-mers on the sequence. And again, if I load everything here, and I can load things quickly by doing shift command L, and I can do shift command T to test, these are two keystrokes that I have started to use to allow me to do the loading and testing all the easier. So this will give us our sequence and then all possible k-mers. I'll go ahead and call this my k-mers. And then I'll have my indices, which will be base four to index on my k-mers. And that's not happy. Oh, because I haven't run this line. All right. So now I have my indices. And those are those indices. Cool. And so again, if I think about what I want to do here, is I want to get a vector for this sequence. So I'll call this detected. And my function is going to be detect k-mers on sequence. Right. And so this detected will be a vector that is four to the eight. So that's four to the eight units long. So 65,536 elements that detected will have. Right. And so what I expect then is that if I sum up the values of detected, again, those should be zeros and ones that I should get the sum of those ones should equal the length of indices. Right. And especially if I take detected an index in indices. Right. So if I take detected and then put in indices, this will return a vector with only those values. Right. And so I could then say something like the length of this, right. And I can do expect equal length of that. And then the expected would be length of indices. And another test I might do would be expect equal. Like I said, we could also kind of invert that. So we could do length. Actually, let me, let me actually copy this down. And I'm going to invert it. So everything that's not an index will show up. And then I'll have a length of that. So I could also then say like four to the eight minus the length of indices, which would tell me everything that wasn't detected. Right. So this is looking at what's detected. This is looking at what's not detected. I think I'll add another test based on this one where instead of length, I'll do some right. And so if these values are all ones, then summing them up should be the same as the length of the indices. Okay. So these are kind of three tests that I want to check out and go ahead and run the tests on that and find sure enough that it fails. It runs the first test and then bails out when it fails. So it's having a problem because it can't detect k-mers. I'll go ahead now and create that function. So over here, I'll do detect k-mers. And I'll do sequence. And I also want the default k-mer size. So I'll do k-mer size. By default will be eight. So again, I'm giving it the sequence. I'm not giving it the k-mers. Right. And so basically what I want then is this stuff. Right. And so I'm going to go ahead and copy these two lines. And so this is maybe cheating a little bit because I'm writing my code in my test. But I understand that there's some people that do test-driven development just that way. They write the test and get it to pass. And then they copy that code over into their function. So maybe along the way we'll make another test that's kind of independent of this code that I've written. All right. So again, we have k-mers and indices. And now I need a vector that is the total length of the number of k-mers. Right. And k-mers, I'll say four to the k-mer size. Right. And again, in the future, we might do seven, six, nine, 10. Who knows, we'd like to have that flexibility in the k-mer size. And again, I need for my development purposes here, I need to go ahead and load that in. And again, what we've seen before is that n-k-mers should be about 65,000. And so that's good. Right. So to create a vector, I could do numeric on n-k-mers. So this then gives me that vector with 65,536 entries. It only shows me the first thousand and kind of quickly scanning through these. We see that they're all zero. Right. And so what I'd like to do then is I will call this my k-mers detected. And I can then do k-mers detected. And I can then index into that my indices and say that equals one. So I'll go ahead and save that and let's test. So this is failing. And I think I know what is wrong. It's returning, it seems to be passing the first test for some reason. Right. So it seems to be passing this. I'm not really sure why. So, but it's failing the second and third test. I think the problem is that I'm not returning a number of k-mers detected. So I'll do return k-mers detected. And so our functions return the last calculated value. And so I think here it's returning one rather than k-mers detected. So I think it's good to always be explicit. You'll see up here I've got a function or a couple functions where I don't return, have a return statement. And I think in these cases it's more obvious because the output of the pipeline is the value. So let's go ahead save this and test and see what happens. So that all passed. So that is a good sign. So I want to go ahead and rerun these tests. But what I'm going to do is add a n to the end of the sequence. Then everything else here should produce the same result, right? Because that n, that last 8-mer will generate an na value and then it will, what? It should kick it out when we do the get all k-mers, right? And so all these values then should be the same because we're adding a nucleotide basically to everything. Sure enough that works. Something that occurs to me though is that this is all kind of hardwired for 8-mers, right? So I do get all k-mers but I'm not giving it the k-mer size. The default there is 8 as well. So I think I'm going to go ahead and add another test. And here then I will do, let's get all k-mers and let's do k-mer size equals 7. And we'll have to add that elsewhere. So we'll do that here for detected. And then here this should be a 7. So let's go ahead save and test that fails. Good. And so here I need to go ahead and add in k-mer size as, so that's good. So I think that's the only place. So let's go ahead and test that and it all passes. Wonderful. So now we've taken a sequence. We've converted to base 4. We've extracted all of the n-mers, how our k-mers, however big that k is. And then we've converted that from base 4 to base 10. And now we have returned a vector that with each cell representing the individual k-mer and indicated whether or not that k-mer exists. Okay. So now what we want to do is do that across all of our sequences. So now I'm going to do another test. So we'll do test that and then we'll say accurately detect k-mers across multiple sequences. All right. So I'm going to go ahead and copy the sequence that I have been working with here. But to keep things simple, I'm going to go ahead and shorten it, maybe cut it about in half there. And I'm going to make this a vector of values. And instead of putting a zero or an a at the end of that sequence, I'm going to make it a c, right? And so it's kind of hard for me to think in this base 3 instead of nucleotides. Anyway, so we'll have that. And again, to keep things simple, I'll do k-mer size of three. Because with eight, we're going to get a matrix or a vector that's got these like 65,536 rows versus if we do three, then we'll have 64 rows. So I'm expecting to get out of this function that I'm going to write as a matrix. So do matrix. And we're going to have I'm going to seed it with zeros. And I'm going to expect it to have n row equals 4 to the k-mer size. And n call equals two, because I have two sequences, right? So I'll call this expected. And so then my expected the rows, I'm going to use my functions to build out those indices values, right? So this is going to get a little bit funky in here, but roll with me here. So we'll do get all k-mers on sequence one. And we'll we'll do size k-mer size, right? So that will give us what? Well, I guess I need to load this and this. And then here, this will give us those values. And so then I need to shift these to base 10. So then we'll do base four to index on that. And so let's double check that this all works. Cool. And then those are going to be equal to one. And then I'll copy this. And the only thing we'll have to change is sequence two, right? And so now if we look at expected, we're going to expect 64 rows and two columns. Sure enough, that's what we get. And we can see that something didn't work quite right. That this column two is empty. So that's not a good sign. Ah, and I see the problem here is that I'm putting it into the row. But I'm not telling it what column. So this needs to be in column one. And this needs to be column two. And so now if we look at expected, I think this should look better. Sure enough, everything looks good. There should be one camer, right? These are off by one, right? Because I changed that last camer value. So that all looks good. Now we'll do detect camers across all sequences, right? And then we'll give that sequence, right? Because I could have called that sequences, whatever. And so then I'll call this my detect matrix. Then I expect equal expected and detect matrix. And I misspelled equal here. And again, if we test this, it's going to fail because it doesn't know this function, this function name is is kind of long. Maybe I'll remove the all to because the sequences will say kind of the all, right? So we'll go ahead and call this the function. And then we'll come back over here. And we'll do that function sequences. And then camer size, by default, again, we'll make eight, create our body here. So I could imagine doing something like detect camers sequences one with camer size. Right? So if I go ahead and load everything. And then I run this sequences not found. So let's see, so I call the sequence sequences, I'll update all these sequences. All right. Okay, I think that'll work. So again, if we load sequences, now it's loaded. And now we do detect camers on that. This is one value, right? And so basically what we did back out in the testing script was something like this, right? And so we could imagine looping over all possible sequences. And so you could do this with a for loop, but there's a better way. And so we'll do that with supply. So do supply. So do sequences. So for each value of sequences, we're going to put that through detect camers. And so we run that. And we get what looks like a matrix, right? And so we get a whole bunch of those rows back. And actually, it's by default using eight, which is why there's so many values here, right? So again, to test this, I'll do camer size of three. And so it's returning a matrix like it's got 65,000 rows, right? But all of my camers are in the first 64, I believe, right? And so the problem there, actually, I don't know where they're going. Anyway, is that my supply isn't using the camer size. So I'll go ahead and do camer size equals camer size. And that will then pass the camer size argument into detect camers. Now when I run this, yes, I get 64 rows. And I get something that looks kind of like I would expect it to look. So let's go ahead and save. And I'm going to go ahead and comment this out for now, so that it doesn't get run with my test, I'll go ahead and test it. I'm getting an error, but I think I'm getting confused and trying to understand this error, because I have my values flipped here in the expect equal. It wants the expected in the second spot, not the first spot. So let's go ahead and rerun this and see if the errors make any more sense. So come back up here again, we're failing. And so the actual has 65,000 rows. The expected has 64 rows. And so again, why is that happening? And so the output of my function here is 65,000 rows, right? And that's because I'm not giving it the camer size. So I'll do camer size. Let's save that and test it again. So it's failing now because the dimension names is a list. And again, if we look at the output from this, that we find that we've got column names, right? Whereas there were no column names on our matrix. And so there's an argument we can give a supply, which is use dot names equals false. And again, if we run that now, we see that those column names are gone. Let's go ahead and save and test and hopefully it'll pass. Sure enough, that passes that did the trick. So now we have the ability to generate a matrix where our rows are our index values for our camers, and our columns are the different sequences. The next thing we need to get are those prior probabilities for each word. And so to do that, we're gonna have to sum across each of our camers, and then add 0.5 and divided by one plus the total number of sequences in our collection. So we'll go ahead and create another test. So I'll do test that. And then we'll do calculate word specific priors. All right. And so we'll go ahead and take, let's go ahead and take these again. And I'm going to add a third sequence, right. And it's going to be the same as the second sequence, because I want to make sure that there's some redundancy and some difference in the different camers that are found. And so now I'm going to again do detect matrix. And we'll go ahead and just copy this down, right. And again, if I do camera size three, and then detect matrix. So I'm going to pick a few rows from this that I'm going to be most interested in. And so I'll do 26, 29 and 30. Right. So 26, all three, 29, only one. And then 30, only two and three. Okay. I don't know why I wrote out three. Anyway, so kind of the math that we can think about with detect matrix is that we want to sum across the columns, right. And so an easy way to do that with base R is apply. So this is like s apply but apply. And so it takes a matrix or data frame, and then a dimension that you want to operate across. So one is rows and two is columns. So if I do one comma sum, it will then run a sum across each row of the data matrix. And so here we go. We see all that if I had made this two, then I should get three values, six, six and six. That's ominous. Let's go ahead and stick with one, where we again have 64 values. And so then I can go ahead. That is the camer count. And again, I will add then 0.5 that adds 0.5 to each value. And then I will make that the numerator and divide by one plus the number of sequences, so I'll do length sequences. And then this gives me my word specific priors, right. And so this is my expected, right. I'll then create a value that I'll call priors. And this will be then calc word specific priors. And I'm thinking, do I want to give it the sequences or the detect matrix? I think I want to give it the detect matrix, because I think with a large number of reference sequences, detect matrix is actually going to take a bit of time to generate. And so I don't want to have to regenerate detect matrix multiple times for the priors as well as for the conditional probabilities. So instead, I'm going to give it the detect matrix. I don't need to give it the camer size, because that's already baked into the number of rows in the data frame, right? I could manually calculate the values for 26, 29 and 30, right? So it's an all three, that's going to be three plus 0.5 divided by one plus three, right? And so that is going to be something, maybe I'll wrap this in parentheses so I can copy and paste it into my console, right? And so this is 0.875, right? And then if I repeat this, but for one of the values, it's going to be 0.375. And then for two and three, so two, it's going to be 0.625. And then if I also do like, I think 64 was all zeros. So if I do 64, none, then that's going to be like zero plus 0.5, right? And that's going to be 0.125. All right. And so again, we can then do expect equal. And again, I could do expect equal, expected versus priors. But again, I manually did these. And in a way, I feel like doing this in the test is a little bit of a cheap way around the test. And so manually calculating it like this, I think is a little bit added benefit. So we'll go ahead and do priors expected, right? So that would be if I use the code, I could also do expect equal priors. And then I'll do 26. And I'm going to copy this a few times, four times for 26, 29, 30, and 64. Again, this is probably a bit redundant. But you can see a couple different ways of generating the test. Maybe people that might have more experience with test driven development can tell me what they think of these two different approaches to testing, whether it's kind of manually calculating things by hand versus writing the code. So we'll go ahead and test this and see what we get. And I introduced a typo here. Let me try that again. So again, it's fails because I don't have the function. So now we need to go ahead and generate the function calc word specific priors. Alright, and then we're going to give that our kmer matrix, right, which I was calling the detection matrix. Alright, and I'll go ahead and copy this code that I wrote back here. So the one thing that I am using here that I don't have access to is sequences. So this should be end call on detect matrix. So now this should pass. And I guess I, yep, my parentheses all look good. Go ahead and test that and should pass. Wonderful. Okay, so this gets us our word specific priors. And now we need to get our conditional probabilities. I'm going to go ahead and load everything. And so then we'll generate a new test, test that calculate specific conditional probabilities. And I'm going to borrow some of this from up here. Right, so I've got my kmer size, my sequences, and my detection matrix. The other thing I'm going to need to give it are my genera. And so I'm going to just call this a B and B. And I think that should work well. So again, these two sequences are different from this sequence. And so they go into a different genus. Again, this is all kind of test code. And something simple, so that we can kind of manually work through the, the calculations to make sure that what we're doing with the code actually generates the right values. So I'll go ahead and run these lines. So I have it all loaded. So again, the calculation, so we'll have mwi, which is the count within the genus plus pi, which is the prior, right? So that added together divided by m plus one, where m is the total number of sequences in the genus. So I'm going to get the priors. And so again, my function here was, or I'll go ahead and just copy this line. And I get my priors, right? Good. And the interesting values are up here. And maybe I'll go ahead and copy this whole thing. All right, copy that down. I don't need this just yet. And so again, if I have two genera, and I'm using mwi, then let's see. So 26 is an all three. And so the, I'm going to have a two and a one, right? So I'll have genus A has, has the camer in one sequence and genus B has it in two sequences, right? And so then I also have the m, which is going to be one comma two, right? And so one of the nice things about R being what's called vectorized is that I can do something like this, right? So that's going to add a half to one and two. And then it's going to add, well, I don't want it to add three, I want it to add one, right? And so then this is going to add one to one and two. And then I can get the ratio, which will be 0.75 and 0.833. Okay. Cool. And for only one having it, so only A has camer 29, then again, I can think about this formula the same way. And so only one has it. And so here this is going to be zero, right? And so that should work, right? And that gives us this. And then only two and three having it will be, again, copying this code down. It's going to be only B having it. So it's going to be zero and two. And again, if we run that, we get 0.25 and 0.833. Very good. So again, these are going to be my expected values. And so this last one that doesn't have it at all in either of the genera, we'll do this, so it'll be zero and zero. Again, these values and the denominator, the m values are staying at one and two because that's how many sequences we have in the genus. So that doesn't change. And so that gets us 0.25 and 0.16666, which makes sense compared to the two previous cases, right? So this was what we got when it was absent from a and this is what we got when it was absent from B. Cool. So these are test values. And so I can go ahead and I think copy this code directly in here for my expected values. And I'll do that here real quick. All right. And then I'm going to have, I'll remove that. And I'll do my conditional prob. So I'll call this calc genus conditional prob. And I will then give that detect matrix. Okay. And again, if we test this, it's going to fail. Oh, it doesn't fail or it does fail because I can't find this function, but I didn't put this in here. And so I need to modify this actually too. So I'll go ahead and put a comma here. And so this will be row 26, both columns. And so again, it's going to fail as it says here because we couldn't find the function. So we'll go ahead and grab that and come over here, function detect matrix. And I think the other thing I need to give it is the word specific prior, right? And so go ahead and give it that. And here we'll also give it the word specific priors. Maybe I'll put this down a line so we can see everything without having to score left and right. All right. And do the same thing here. Cool. All right. And so again, what we're thinking about. So another piece of information I realize we need are the genera, right? And because we're going to need to know how to aggregate the columns of detect matrix. So there's a few pieces of information that I need to get from these three values. I'm going to need to get, well, really one main thing, I'm going to need to get the total number of sequences in each genus. And I'm going to also need to aggregate the values, the columns and detect matrix by their genus, right? And so I can start with, like, table genera. And this gives me A and B. But it's a weird data structure that I don't quite understand why it does this, that it's a table format. And it's got dimension names. I'd rather it to be a vector. And so what I could perhaps do is send this to as dot vector. And that returns it as a vector. And so I will then do genus counts. And so now I know I have two genera, right? And so the output of my conditional probabilities. So I'll do genus conditional prob, I'll do matrix. And I'll give everything a value of zero initially. And my n row will be the number of row on the detect matrix. And the number of columns will be genus counts, actually not genus counts, length of genus counts. I could have for n row, also used length of, well, word specific priors, not Calc word specific priors. Let's see, I'm probably trying to go a little too fast here. So this is priors, right? Okay. And so this is initializing my matrix. And again, if we look at this, we see that we've got two columns, right, for each of the two genera. And we then have our 64 rows for the different possible k-mers. And so what we want to do next is to aggregate the values in detect matrix to aggregate and calculate the number of sequences in each genus that had a k-mer. And so I'm going to use a for loop. So for loops get a bad rap in R. Generally, for loops are a problem if you are growing a vector or growing a matrix. They're also a problem if you have nested for loops, if you have a for loop within a for loop, those are bad. So what we're going to do here instead is that we've already created the matrix, right? And so we're not going to have to expand the matrix. And so when you expand the matrix, it goes really slow because it's constantly looking for new places to put the new matrix that you've created. And so I'll do four i in one colon length genus counts. I'm going to go ahead and call this n genera on length, yeah, length genus counts. Because I'm using this a couple times. So I'll go ahead and plop that in there. And then here. All right. And so we're going to we're going to index over our detect matrix, right? So we'll do detect matrix. So I'm sitting here realizing that my genera is coming in as a string rather than as an integer. And so this again is the same type of problem that we had with our sequences. So I'm going to go ahead and recast these as integers, right? And we'll have to figure out something later where we when we're reading in the genera, we can convert it to a number. And so this needs to be an integer, not a character. Okay. And so then genus counts, if I look at genus counts, one and two are the values in one and two, right? The number of times we see each of those. Okay. And so then let's see, that all works. Okay. So we're going to then use detect matrix, all the rows and the ith column. I'm going to add that to let's see. Maybe before I call it genus conditional prob, I'll call it genus count. And so then we'll add that to genus count on all rows. And then I is going to be indexed into genera to figure out what genus it is, right? So we'll do genera, I. So if I is two, then genera, I will be two, right? If I is three, then genera, I is going to be two, right? And so we're going to then be adding this to the existing value of this. But we need to assign it back to this, right? Okay. So that then should get us our genus counts. And so let's clean this up a little bit. And now if we look at genus count, let's see what we've got here. So we've got our 64 rows. And it's not quite doing what I had anticipated. We're basically getting out the detection, not the genus count. And so I see the problem that I'm indexing over the genera, not the number of sequences, right? And so n sequences is going to be length gena genera. All right. So sometimes these things get a little bit confusing. And so I'll go ahead and plop that in there. And let's go ahead and start this again. And now if I do genus count, again, I have 64 rows. And I see sure enough, I've got these with the correct number of sequences that had that camer in that genus. Cool. So now we have genus count. And now we need to go ahead and do that conversion that I had here, right, to get the conditional probability. And maybe I'll go ahead and throw this up here. Maybe I'll put it up there. At some point, I need to come through and do a better job of documenting my code. Because at this point, there's really no documentation. And that's a really bad example to be setting. So again, we have genus count, we're going to go ahead and then add the word specific priors. Okay. And so again, that's going to be in the numerator. And it doesn't like my word specific priors because word specific priors is the same as prior. It doesn't like prior. Why don't you like priors? Oh, priors. I didn't hear you correct me on that one. All right. So now if we do this, sure enough, we get the two added to each other, right? All these were like zeros. And so point one, two, five was the prior for a camer where it wasn't found in any sequence, right? And here we see those others. And so good. And now we need to divide it by m plus one. And so m plus one is going to be the genus counts plus one, right? And so if we run that, then we get some kind of funny results, right? So this 64, I would not expect to be 0.41 and 0.41, right? I would expect it to perhaps be 0.41 and 0.6, or something like that. And so one of the challenges with vectorized code in R is getting the dimensions right. So this is a 64 row by two column matrix, whereas this is a one row by two column matrix, right? And so we need to go ahead and transpose with the t function the numerator, right? So now we have two rows and all these columns. And then we can then divide by the vector. And so now we get something that makes a bit more sense, right? And so that's good. But we need to like retranspose it, right? And so then we need to take the whole thing and transpose it. So let's go ahead and save that and let's test it and see if it passes our test. And so it's failing. And I noticed that I did something silly when I was calculating the expected, right, that I have this 0.5 here. And that was part of the formula for the word specific prior. And so I need to modify that to be 0.875 here. And then what did I have up here 0.375. And then 0.625. And then what 0.125. Test again. And so I'm still getting a fail on my last test. And it's liking the first value, the first actual unexpected, but not the second. And I believe that is because I left in a two, rather than a zero, because 64 kmer 64 isn't found in either of the genera. Now it should pass. It passes. Wonderful. So I have found using test driven development to be just really fascinating. What ends up happening are I guess two things worth commenting on, I think, is that I get code that is a lot more concise and in my mind better designed. And that I would probably have made a bunch of these functions all one function. And instead, I feel like the code is a lot better self documenting and understanding what's going on. The other thing that I find is that as I kind of showed in some of these typos that I introduced by copying and pasting on whatever, is that first of all, I'm prone to making errors myself. And second of all, by kind of going through this, I better understand what the code is trying to do. Okay. So again, I'd be curious what your experience has been with test driven development. So far, I really like it. I'm a little bit worried as things are going to get even more complicated, how we're going to go about doing test driven development. But for now, I'm pretty happy with it. And it's, it's comforting to me that we have all these tests. And they run fairly quickly. So again, in this episode, we did these last two points in the reference column of calculating the prior word specific prior, as well as the conditional probability for each genus for each word. And so we'll see where we go next. But hopefully you're enjoying this process of building out this phyletyper r package. So please make sure that you've subscribed to the channel, and we'll see you next time for another episode of Code Club.