 Hey folks, welcome back for another episode of Code Club. In this series of episodes, I am working with you to learn how to make an R package. Believe it or not, I have never made an R package, and so I am going through the great book by Hadley Wickham and Jenny Bryan called R packages. I'm not going through it chapter by chapter with you all, but I'm going through it and thinking about it in terms of a package that I'm trying to make, which is re-implementing a naive Bayesian classifier for putting taxonomy assignments onto 16s rRNA gene sequences. Even if that doesn't mean anything to you and you do not care about the example that I'm using to build this project, to build this package, I think you'll still get a lot out of these episodes. Namely, a kind of a difference in how we approach R if we're doing data analysis, like say doing statistics or data manipulation or data visualization, to what we're doing more so in this set of videos, which is more of programming within R. And so, as we go through, there's all sorts of different thoughts that come to mind about how to write good code, how to write efficient code, but most importantly, as we saw in the last episode, how to write code that actually works and passes our tests. So one of the things that came to mind to me as I'm thinking about this tool is that what we're doing is keeping track of camers. And by default, the ribosomal database project use camers of eight. So it should be eight nucleotide words that it's keeping track of and trying to see, you know, what sequences in our database, what sequences are more similar to my sequence based on those eight nucleotide words. One way that we could keep track of that as as what you might think of like a named vector. And so in R, I could make a vector with like, you know, a vector of ages, and I could name each of those ages, right. And so then like, I could say x bracket pat and get Pat's age, right. And so what I could do with the camers is kind of the same idea. But we would have four to the eight camers, and most of those would be zeros. And then we would say, well, this camer showed up one time in this sequence, one time in this sequence, right. And so that works, right. And that's again, kind of my initial approach. And that's really kind of how we started thinking about things in the last episode, when we started breaking our sequences down into camers. But what occurred to me is that that's going to be really slow. Because say I'm looking for, you know, the the eight nucleotide word that has like AT AT AT AT AT, that eight, I don't know, whatever, say we're looking for that word, it's going to be kind of slow to do that. It's going to be kind of like looking up a word in a dictionary or a phone book. These are all analog examples that many of you perhaps have never seen before in your lives. I'm too old. Anyway, you can imagine if you've got a dictionary or a phone book, and you need to look up something within an alphabetical list, that that can be kind of slow, right. Alternatively, a process that's much faster is to know exactly the page I want to go to, right. So if I know that this word or this name is on page 354, I can easily jump to page 354, I say, okay, this is page one, I go forward 354, and bam, I'm on the right page, right. That is done in a single step. There's not a lot of kind of, well, I've gone too far, now I need to go back, I need to go forward, I need to go back, which is basically how a search works when you're looking up a string. But again, if you have a number, an index value, you can jump directly to that, and it's much faster. And so that's what I'd like to do. But I have this problem that I have these, these camers are strings. And so what this gets me thinking about actually is an old joke. And it's not really not a funny joke. But the joke is that there's 10 types of people in the world, those who understand binary, and those who don't. Right, it's not very funny. But again, 10 or 10 is binary for the number two, right. And so I begin to think about a DNA sequence. Well, I could think about a DNA sequence as being in base four, right? I've got an A, a C, a G and a T. I could imagine converting those letters to numbers say zero, one, two and three. And that will give me a string in base four that could then be converted to a integer in decimal base decimal, right, base 10. So in this episode, we are going to talk about converting between different bases, specifically for our example, base four, because we're going to have a T, G and C. If you go to Wikipedia for the Quartinary Numeral System, this is base four, quarter, court four, right, that it uses the digits zero, one and two and three to represent any real number. And so you can kind of look at this table that it has the first 64 digits or 65 digits from zero to 64 in Quartinary, which is the green shaded line through here, right. And so we could imagine if we have Quartinary of 10, that this is actually four, right. And so that makes sense because 10 is two in base two, 10 in base four is four and 10 is 10 in base 10. Got it? Cool. So anyway, when I look at this, I see a whole bunch of values that we can use to build tests to convert from Quartinary to base 10. So before we do that, though, we need to go ahead and convert our sequence into Quartinary. So let's go ahead and go do that now. Again, back in my, our studio session, I'm working within my Filotyper package. I will go to tests first, test that and test Kmers, right. And so again, what I want to do is I want to be able to convert a sequence into Quartinary format. Okay. And so we'll come down here and do test that conversion works between DNA sequence and Quartinary. Okay. And then we'll go ahead and open up some space here. And then we need to put into test. I'm going to start by grabbing this says my sequence that I'll test, right. And so we'll have that all do seek to base four. We could imagine thinking about other bases as we're building out these functions. But again, I'm working with DNA sequences. So I'm going to worry mostly about base four. Maybe if you want to develop a base eight converter or a converter even, that would work for any base. Go for it. But again, with seek to base four with X, I'm going to then call it base four seek. And then I'm going to have expect equal base four seek, and then actual type. So I need to make an actual here. So do actual. And this is going to be a character and to make it easier to do the conversion between the letters and the numbers, I'm going to go ahead and line these up for now. And again, a is going to be zero, C is going to be one, G is going to be two, and T is going to be four, or three. And so we're going to start with zero, because that plays well with kind of the properties of the Quartinary or these other bases. And I'm going to put them in alphabetical order. So I don't have to remember, right? I think naturally I do by the pairing like ATG and C. But to kind of remove that ambiguity, I'm going to put the bases in order. And so it's going to go zero, three, two, one, three, one, three, zero, two, three, this is hard, zero, two, two, one, zero, three, two, one. Okay, so let me double check zero, three, two, one, three, one. Nope, so it should be two. I heard you yell that time. The T should be a three. That T is a three. That T is a three. And then A should be zero. So that works, that works, that works. That works. The G's are twos. So there's a G, there's a G. There's a G, G, good G. And then C's are ones. And I think we're good. Okay. So we now have all this. And again, if we test this, the test is going to fail, right, because we don't have the function seek to base for. So step one in our test development, right, we wrote the test, it failed. So we'll do seek to base for function with the sequence. Okay. And now we want to convert our nucleotides for our sequence into digits, right? And so one of my favorite functions for doing this is char tr. And so char tr takes the old and replaces it with new for any sequence, right? So old equals something. New equals something. And then we have x, which is in our case, sequence. So the old is going to be ATGC, or I did it by my normal thinking, maybe I should just do it that way. We'll do AC GT, right? And then we're going to replace that with zero 123, right, and then sequence. And so if we ran that, I don't I don't have a sequence loaded yet. So let me go ahead as a test and make this a sequence. Now I run that I then get my my string, right? So this seems to be a pretty simple function. At this point, it's basically a wrapper on char tr, which I'm fine with it just being that simple. Because in some ways, this is documentation, right, the name tells you what's going on. Whereas this, it might not be so clear what's going on. So if we test this, that passes. Cool. Alright, so I have a couple other cases that I want to think about. And that would be what if what if we have an ambiguous base in here, then what happens? So I'm going to grab these and paste them down. And so if I put a N in the middle here, then I want that zero to also return as an N. Okay. And so we can again test this. And this passes, which is a little surprising, right? And let's let's let's see what happens again, if we take our sequence and put an N in here, and then run that through char tr, sure enough that N comes up. Okay, so that works. But what if we instead put in a R? Because so I think a GC ambiguity comes back as an R. So let's go ahead and try that. And that fails, right? So we want that R to become an N. So I'm going to come back here. And I think what I'll do is I'll take my sequence. And I'm going to again use char tr, I think. So let's see if this will work if we use char tr again. So if we do char tr, old equals and here we're going to put in a regular expression for anything that's not ACG or T, we're going to replace that with a N. Yeah, so that doesn't work. So the old is longer than the new, I thought that would work. So I think what I'll do instead is G sub. So G sub is a lot like str replace all, but G sub comes from base R and it'll substitute will find any character across the whole string and then replace it with some other string. So if we try this, and I've got the wrong arguments here, right? So this is the pattern. And this is the replacement. And so that did the trick, right? And so what we could do is we can use the base R pipe to pipe this into char tr. And then right here, we're gonna have x equals underscore. So the underscore is the placeholder argument with the base R pipe that basically the output from G sub will come into here. So let's go ahead and test and run that that passes wonderfully. So I have one other test that I want to try. And that's because I know that sometimes DNA sequences might be all lowercase, right? So I'm going to go ahead and try this again. But where I take two lower on my sequence, right? And the actual should be the same, right? So I still want that capital N in there. It's probably not that big of a deal. But I think it's just going to be more convenient for everything. All the characters like all the numbers and the N to be the same height. I don't know. It doesn't really matter. Test that. And so this then fails, because my actual comes back as all ends, because the two lower on sequence, as you might imagine, makes everything lowercase. And so this is matching anything that's not a capital A, C, G or T, right? So here then I can add to this A, C, G, T. But I also need to put an A, C, G, T here as well. Although you know what something I'm thinking about here is that maybe it would actually be better to add to this something that uppercases all of the sequences. So I don't have to think about it, right? And so maybe what I'll do is to upper on sequence, and then type this through here, x equals here underscore again. And I'm kind of missing my keys. So I'll go ahead and test that. So that worked well, and that passed all the tests. And so we're in good shape. We now have a tested function that converts a DNA sequence into a base for string. Now what we'd like to do, of course, is feed this into our function that we had, I guess over here in kmers are to get all kmers, right? So if I, if I go ahead and load this, so again, I have sequence, that's ATs, Gs and Cs and Ns, and then I do my function is seek to base for on sequence, I get that. And of course, I could, I could just pipe that directly into get all kmers. And I guess I put in the there. And it can place, placeholder can only be used for named argument. I haven't used the base are pipe a whole lot. So there we go. So these then are all of the four, the eight nucleotide words from my sequence, that's pretty nifty, right? So now what we want to do, of course, is convert these into base 10 numbers. All right, so I'm going to then come back and then say that test that, generate base 10 values from kmer kmers in base four, right? So I don't know that I want to convert all what 8, 9, 10, 11 of these kmers into the base 10 to decimal numbers. And so maybe what I'll do instead is get the numbers from Wikipedia and plug those in here, right? So I could imagine doing like x equals zero, maybe four zeros, and that'll do expected. And I'm going to say one, because I'm trying to generate a index value. And so are actually starts at one with indexing vectors, other languages like C use zero, which actually works much better with kind of these base conversions. And so I'm going to basically add one to everything so that I can use it as an index into my vectors or matrices or whatever I'm using to store these kmer counts. Okay. And so then we'll do base four to index on x, and that's going to be actual, right? And then I'll do expect equal, expected, actual, right? And maybe I'll do a few more of these. So maybe for this one, that I would expect to be 65, right? And then here, let's do zero, one, two, three. And then here, let's see. So it'll be like 123, right? Which would be this number, which in decimal is 27. So I'm going to save these. So of course, this fails because I couldn't find the function base four to index, winning, believe it or not. And so we'll do base four to index function on base for str. So it's going to come in as a string, right? And so then we will have base four string, right? And so I haven't really talked about how we do this conversion between, say, base four and a decimal number, okay? So we know from over in Wikipedia that if we have zero, one, two, three, actually, it's 27, but plus the one gets us 28. Okay. So that if we have zero, one, two, three, um, let's let's write that down here, zero, 123, as my base four string. So this will be my example, right? Then what I'm really doing when I do the conversion to decimal would be zero times four to the three plus one times four to the two, plus two times four to the one plus three times four to the zero. Okay. So if you haven't thought about exponents in a while, four to the zero is one, right? So basically, what we're saying is well, this is zero, plus four to the two is 16 times one, right? Plus two times four to the one, four to the one is four times two is eight, plus four to the zero, as I said, is one times three is three, right? And so that's 27 plus one. Okay. So this three is in the ones place, right? So four to the zero is one. The two is in the tens place, right? So that's 10 to the first power, or in our case, the fourth power. This third number then is in the hundreds place, again, in base 10. But in our case, it's four to the two, right? Instead of 10 to the two, it's four to the two. And then this is in the thirds place, the thousands place, which is here, right? And so again, you can imagine that if we had eight digits that we would have things going up to four to the seventh, right? For that very first value of an eight mer, okay? So this is what we want to use to generate our index value from our base four. This is this general idea. So one way that we kind of saw in the very first video of this series, that we could approach this would be to do str or str split on base four str. And I'm confusing base r and string r, so str split. I got to load all this stuff, right? And I'll go ahead and comment this out for now. And it wants a split, right? So we can do double quotes on that. And of course, as we saw in that episode, we can give it one for the first value of the list. And so then these are our numbers. And I could go ahead and pipe this to as dot numeric. And then I get these four values, right? We could make a vector of powers, which would be like n char of base four str, right? So n char base four str should get us four, right? But I don't want four, I want three, right? So it's one minus that. And so I could wrap that in parentheses. And then I could sequence down to zero, right? And so now if I do that with powers, I get these values, right? And so then I could do like four to the powers, gets me those, right? And then maybe I could do, let's see, let's call these prefixes. I'm sure there's a real mathematical word for these. I'll have to ask a friend. But let's go roll with this. So we'll do prefixes times that. And that works, right? And then we can wrap this in a sum on that, right? And that gets us 27, which is what we expected, right? So I'm going to go ahead and comment out this test code. And it fails because the actual is one expected to zero. And that's because I forgot to add one to my sum, because again, I don't want things to be zero based, I want them to be one based. So I'll go ahead and save and test. So one thing that occurs to me is that this is going to be a function that we're going to use a ton of, right? So if we have a, say we have a 250 nucleotide sequence that we want to classify, well, that sequence is going to have 250 minus eight plus one. So it's 243 k-mers, right? And so 243 k-mers for one sequence, I'm going to have to convert from base four to base 10. And I'm probably going to want to do that thousands or tens of thousands of times to generate the value. And I think my code here is pretty efficient. But probably not as efficient as writing it, say in C or C plus plus. And so thankfully, there is a built in tool in R that will do this for us. And we'll actually do it for any base that we want. And so that's str to I. So it's string to integer. And we can give it a string. So that's our base four str. And then we can give it our base. And so by default, it's base 10. And so or zero, which is effectively 10. So I could go ahead and put base four in here. And then if I test this, what we'll find is that it fails the tests again, because I forgot to add one. I'm going to have to comment that just so that I remind myself, but that sure enough passes, right? And so again, if I did str to I on any number, right? So if I gave it like an eight mer. And so say I did zero, one, two, three, zero, one, two, three, I guess that needs to be a string, right? And put that into base four, then that's going to be 6939. And if we do str to I on eight threes, so that would be like TT TT TT TT, right? And base four, then this should be the same as four to the eight plus four to the eight, right? Which is that, but it's going to be four to the eight minus one, right? So if I did four to the eight minus one, I get the same value, right? And so of course, we want it plus one, or just four to the eight, which is is that number, right? The other benefit of using str to I is that it's vectorized, right? And so what I could do is I could take x, and we could take these values. So I could take zero, zero, one, two, three, 1000, and four zeros. And then I could do expected 2865 and one. And yep, and then I could do expect or actual base four, to index on x. And I could say expect equal expected actual, and very good, that passes, right? We're getting out the vector of values. And that that's pretty cool, because then we could do it in a complete sequence for those k-mers, right? And so, you know, what we saw previously, right? We did a sequence. Well, let's let's do a different sequence. Let's do sequence without the end, because that's going to fail. We should fix that here before we're all set and done. I'll put a A in here, right? We'll do seek to base four on sequence, right? That gets us that we could then pipe that to get all k-mers. And we can then pipe that to base four to index. It's going to complain because I need to load everything. If I do load all and then rerun that, I then get all the indices for the k-mers in that sequence. All right. So that is awesome. I'm going to basically trust that those are good and that those are right. And so what I want to do, though, is build in a case where we have a sequence that has or that, yeah, something like this, right, that has an N in it, right? And so for this one, I'm going to add maybe three zeros and an N, okay? So that should then, I still want it to return this expected value. If it's got, if it's a k-mer with an N, then I want to drop it out. I don't want to have to worry about it. So we'll go ahead and test this. And again, that throws an error because it's putting a N A in here. And so what happens with str2i is that if you give str2i, say a four, and then we say base four, it returns an NA value. So N is not a value in base four, right? And so it'll chuck it out. I'm then going to wrap this string to I str2i na.omit function, which will remove any NA values from the vector. And so let's test that. So this is failing because expected actual not equal to actual expected. I think maybe I have the values flipped. But anyway, it's saying that there's an attribute of NA action on it. And so, and so I'll go ahead and then let's see if we pipe this to az.numeric. I wonder if that will remove the attributes. And so that all passes, I'm going to go ahead and remove this. And I'll go ahead and add a comment here that I want output to be indexed to start at position one rather than zero. So we're adding one to all base 10 values. Okay, so that'll be a reminder to me. And we'll be good. All right. So I think this is pretty good. As a kind of incremental step, again, this will make us more efficient in the long run. When we're going to update a camer value, or the number of times we've seen a camera that we've actually detected it in a vector or a matrix, if we can directly say what row, what column, what value of that vector we want to change the value of versus having a string like eight a's in a row, and then having to look for that, right? Like we could say, well, that's the very first value, but the computer doesn't know that it's going to have to go look for it. And so that looking for it is going to be slow. Versus if we say look in position one, then it's gonna say, Oh, position one, that's that's right there. That's much easier for the computer to find. So again, some of this again is how we think about things from a programmatic lens versus from a data analytic lens, right? And so I'm concerned a little bit at this point about speed, because we're going to be doing a lot of these operations thousands of times. And while they might take a millions, millions of a second, we do that many times, it's going to be seconds or minutes or hours that it could take us to run the code. And so I'm not super concerned with speed at this point. But it always is kind of in the back of my mind, thinking about how long things are going to take to run. And so again, this conversion from DNA sequence to base four, and then to base 10 is something that I think will really help us in the long run with the efficiency of our code. Okay, so play around with this, see what happens if you try different bases or, you know, see if you can kind of predict how you would go between different bases. Actually, a fun exercise for you would be to go from our base 10 numbers back to our base four numbers. See if you can do that. Anyway, while you're practicing on that, please make sure that you subscribe to the channel so you don't miss the next episode. And we'll see you next time for another episode of Code Club.