 Hey folks, welcome back for another episode of Code Club. If you have been following along, and if you haven't, why not? I have been building out an R package that will allow us or allow people who do microbial ecology to take 16s rRNA gene sequences and then classify them to figure out what type of bacterium that sequence came from. Now, many of you might be saying, who cares? What does that even mean? That's all gibberish. That's fine. That's fine. I kind of agree. But what we're really doing is we're learning how to make a package in R and we're learning to think with more of a mind of a programmer in R rather than a data analyst in R. And so as we've been kind of going along, I've been trying to kind of point out different places where we might take a different approach, again, between a programmer versus a data analyst. I think most of what I do is more of that kind of analytics mode where, you know, I'm summarizing data, making different types of figures, I'm writing reports, right? Whereas, you know, trying to write code that is going to have to be performant, right? So we're going to be doing some of these calculations, perhaps thousands or millions of times. And so it's going to need to be fast. And that we might, you know, we want it to be robust, we want to make sure that we're getting the exact right answer. Even if we get kind of weird input. So to achieve that, we've been using a technique called test driven development. I've got maybe three or so episodes under my belt now of using test driven development or TDD as it's called, and I'm feeling pretty good about it. And so what we're going to do in today's episode, as we return back to my pseudocode or just kind of this outline of the general project, is that I'm continuing to work on the right side of the screen, looking at the reference sequences. So again, we're going to have this reference database that is looking at k-mer frequencies that will allow us to take an unknown sequence and then say, well, within my reference database, what genus has the k-mers that are most like the k-mers in my unknown sequence, right? And so then once we know what sequence, what general in the database are most like my sequence, then we can begin to give a classification. So we have been working on various parts of this outline. And so what I want to do in today's episode is put it all together in one function that it will be a function that is kind of facing outwards towards our user that our user could use where they could give it their sequences, their taxonomy, and the k-mer size they want to use, and then generate that database that they can then use with their unknown. And then we need to turn to the other side of this pseudocode outline. Over here in our studio, I'm going to go ahead and run a test. Again, you can click on the button or you can do the command shift T, and it will automatically run for you. I kind of like the shortcut keystrokes. The problem is there's so many shortcut keystrokes that it's hard to keep them all track. But I find again, the command shift T is something that I use a lot to rerun these tests. Also, if I want to be able to get the functions that I have in my k-mers.r script to work, I can do command shift L, and that then loads my package so that I can kind of develop out the code as I'm using these functions. So let's go back to our tests. And what I want to do is go ahead and create a test that will say create k-mer database from sequences, taxonomy, and k-mer size. And I know I have probably more than just that typo. And again, we're using test that to do this. And so this is kind of the structure of an individual test. And as you kind of look back through here, you can find all the other tests. And incidentally, if you want to get caught up and follow along with what I am doing here, down below in the show notes, there's links to what this repository looks like at the beginning of the episode. So like right now, as well as I'll put a link for what it'll look like at the end of the episode. And so you can kind of follow along and see everything that I'm doing. Okay. So again, we're going to create a function that creates the k-mer database from sequences, taxonomy, and k-mer size. And so if you'll recall, we've been generating sequences based on base three notation. So where an A is a zero, a C is a one, a G is a two, and a T is a three. And so what I want to give it though are the actual sequences, right? And so here is an example of one of those sequences. But we, in the last episode, were calculating the various probabilities. And I think this has kind of the makings of what I want. But these data are still a bit more processed than than maybe I would like. So I'll go ahead and copy these down. And again, these are base three strings rather than the A, T's, G's and C's. And so over top of them, I'm going to translate one of them into the A, T's, G's and C's so that I can then give it, give this function the actual sequences, not the base three sequences. So this will be like A, T, G, C, G, C, T, A. And so just double checking that that's the G's are twos, the T's are threes, the C's are ones, and the A's are zeros. Okay, so that's good. I'll pop that in there. Go ahead and remove that. And then the only difference between this sequence and this sequence is the one at the end instead of a zero. And so again, that'll be a C. And then this is the same sequence, right? And so the other thing I want to do differently here is that I put the genera in terms of numbers as indexes, right? So here I'll go ahead and do A, B, and another B. Okay. So then I want to have a function that will be build kmer database. And the arguments that I want to give it are my sequences, my genera, and my kmer size, right? And this should then output a database. And I want that database to be a list, right? So I need to return back the, what do I need to return back, I need to return back the conditional probabilities for each genus, as well as the names of the genera. So again, what we're getting back is a table. And we saw this in the previous episode, right, where we're calculating these conditional probabilities, where we get back a matrix, actually, I shouldn't say table, a matrix, where each row is a different kmer. And so if we're using kammers of three, we're going to have 64 rows. More realistically, we'll use kammers of eight, so we'll get like 65,536 rows. And then the columns are the different genera. And so I need to know, I don't care really what those kammers are, because again, we kind of run this through a base four filter. But what I do need to know are the names of those columns, because I want to know what each column is related to or what each column represents, what genus it represents. And so I need to get back the genera in terms of the letters, not in terms of the numbers, because these seem like very, these are very simple genus names. But in reality, they're going to be far more complicated, and perhaps a little bit less predictable than like letters, right? So anyway, that's what I want the input to be. The output then I want to be basically like this, right? And so expect equal conditional prom. And so db, again, like I said, is started to say, is going to have the conditional probability, as well as the genus, the genus designations, right? And so this is going to be a list. And perhaps you haven't seen lists before. Actually, I can guarantee you have seen lists before, but perhaps you don't know it. So we'll go ahead and output the database as a list. And so this will be db. And then in here, in double square braces, I think we'll put in quotes, conditional prom. And so the db is going to have two slots, one for the conditional prom, and one for the taxonomy. And so again, I'll copy this down. And so how do I know that you have seen a list before? Well, a data frame is actually a list. So now you know, right? All right. And so then we'll also do expect equal db. And then I'll do genus. And then the first genus is going to be an a. And the second is going to be a b. This is what I expect the output to look like. I'll go ahead and run the test proved myself that it fails. Yes, I can't find this. So I'll go ahead and grab that definition. And we'll come on over to the kmers.r script, where again, we'll do build kmers database. And we'll then put in function, right? And and now we need to go about kind of putting together a lot of this stuff, right? So the first thing we need to do with our sequences is convert that to base for notation, right? And so the function that we had was where seek to base for. So we'll do seek to base for on sequences. So this as input, then we have a function further up here that was detect kmers across sequences. And so what we'll then do with this function is output that matrix that has, again, in our rows will be the kmers and the columns are sequences. And then from there, we'll get the word specific priors as well as the conditional probabilities. So okay, we'll go ahead and take this. And I think I'll go ahead and delete this commented out code. And so we can then pipe that. Yeah, we can, can we pipe that in there? Yeah, we should be able to pipe that in there. And let's see, then our kmer size will equal kmer size. Right. And so let me go ahead and come back here and load our sequences with our kmer size. And let's make sure that this looks right. Okay, we're off to a good start. We've got our 64 rows and our three columns for our different sequences. And sure enough, our favorite kmers 29 and 30 are the cameras that are different between the two genera of sequences. Okay, cool. And so from here, we now have our detected kmers. Right. All right, so now we're ready to take that detected kmers and generate the the word specific priors. And I think I can actually add that to my pipe, right. And so maybe for now, I'll hold off on calling it detected cameras. But I do need the parentheses here. So we run that. And this then gets us our word specific priors. And so again, this is reminding me that sure enough, I need to pull this out. And so what I will do is I'll go ahead and call this detected cameras. All right, and I'll end the pipe there and I'll then have priors, equaling this. And I'm going to feed it my detected cameras. Right. And so now priors equals that, which is good. And so then my priors are going to go with my detected matrix into this calc genus conditional probe. Right. And then I'll do detected cameras. And then priors. And what's the other arguments? The genera needs to go in there before that. Right. So in here, I'll do, I'll do genera indices. Right. And so I don't have genera indices yet. Right. And so I need a function that's going to convert genera, the names, the characters to a vector of indices. So we need to pause this and and we'll come back to it. But first, we need to go ahead and do that translation between this character form and what we used previously, which was the numerical form to an index. Right. And so what we can think about would be, let's see, let's do test that. And then we'll convert back and forth between genus names and indices. Okay. Cool. All right. So again, we'll keep things simple. So we'll do genera. I'll call these str. And then we'll have genera index. And that'll be C one, two and two. So I'm thinking about genera str to index. Right. So I basically want to know A is one, B is two for when we're back here, kind of aggregating columns that are from the same genus, right. And so that's where the genus index is needed. And so again, we can give that genera str. And I expect the output to be genera index, right. So we'll do expect equal on that. And then going in the other direction, I'll worry about in a minute. So let's go ahead and test that it's going to fail. Right. And we'll come back into our camera script. And again, this is going to be a function. And the input to that is going to be our genera string and index or general string, not the index. Right. And so we'll say string. Okay. And so again, like string, well, yeah, I'll say string is genera string str. So I want again, one, two and two. And so we can do a little trick here. And so there's a useful structure within our call the factor. And so we could do factor on string. And so what a factor is, it's an it's an ordinal variable. It's really a vector of values with late levels mapped to it, right. So I can take this factor string and pipe that to as dot numeric, this then returns the index for each of the levels, right. So we have one and two. And to get the levels, or the A and B, I could again do factor string levels, and get back to A and B, right. And so one would be A and B would be two. So I'm going to pull this down because that's not quite ready for that. So I'll comment that out. But I want to hold on to that for now. And I will go ahead and save that. And let's test. So I'm getting one failed test. And that's online 208, which is up here, going into the build camer database. So let's come back up to there. And let's see. I had genera indices. And so again, up here, I can do genera indices equals the genera string to index on genera. Right. So now if we save this and test, it should pass. Nope. It's not passing. Oh, yeah, of course, it's not passing. Because it doesn't know about this, right? So it's getting past that line. So again, we'll do genera indices. I need to if I'm going to run the code that's in the actual functions, I need to go ahead and load it. So I'll go ahead and do genera indices, detected camers and priors, and then calc the the genus conditional probabilities, right. So that is, yeah, so that's those two genera, right. And so I'll then do the cond prob as that. And so now I need to get back the names for the two columns. And so now what we want to do is go in this other direction. And so now what we'll do is expect equal genera s index to str and we'll do that with the genera index. And we'll expect to get back a and b. Okay. And of course, this will fail, we get two fails, right. And so the first one, of course, after that out of bounds error is this one. And so we'll come back and we kind of already have the code for that, right. And so we'll go ahead and here and put function. And then we'll put string. So let me think about this again. So what I want for the output from build camer database is a list that contains the conditional probabilities, as well as the genera that correspond to the two columns or however many columns there are in this in this matrix. And so I don't want to go from index to name, well, I do, but index of this, right, so I want a vector that has the two levels, right. So what I'm going to want to get our genera str to unique. Okay. And so here then we'll take genera str, and we'll go to a and b. So I'm going to go ahead and change the name of this function, right. And so that should work. Then we can do genera names equals and we had genera str to unique. And that was on genera. Maybe I'll give us a better name. So maybe I'll call get unique genera. And I'll spell that right. Get unique genera. So kind of as we're designing things on the fly, decisions get made decisions get changed. This is all real, right. So let me go ahead and test this. Okay, so we're failing still. So get unique genera could not find function get unique genera. Okay, why can't you get unique genera it's there. Oh, because I didn't change the name of the actual function. So right there. So go ahead and test that. And that is still failing right inside of at the output from my JIT build camber database, right. So now what we're going to return is a list of conditional prob equals cond prob. And then the other slot was what genus, right. And maybe I'll call this genera. Although that's a little confusing, but that's cool. genera equals genera names. And we'll test. Hopefully it'll pass. It fails. Why do you fail argument expected is missing with no default. 215. Okay, right here. Ah, and that's because I put in an assignment arrow and a comma. So now let's try it. It passed. Yes. Awesome. So now we have a function that will build our camber database for our sequences, the genera and the camber size. That is wonderful. So I've been testing it on kind of what we might think of as these toy tests, right, with kind of simple, simple sequences, three sequences, right, two genera. And they're really short. So what I'd like to do instead is test it using real data. And so that is going to take a little bit of effort. So on the mother wiki, I have the version 19 of the RDP training set that was released back in July of 23. I'm going to go ahead and grab this collection of, I think it's 23,000 bacterial sequences, 788 or keel and one eukaryotic sequence. So I'm going to go ahead and download this to my computer. And there's a good way to do this. And then there's the way I'm going to do it. So I'm going to do the way I do it. And we'll come back and do it the right way in a future episode. I went ahead and put that into my phyletyper directory. I'll go ahead and decompress it. And so it's got a directory with the faster sequences and the taxonomy. So I'm going to go ahead and do library tidyverse. This is something I feel like might be kind of the makings of a vignette. And so I go ahead and load the library. And then I'll do read TSV on train set, nine that and then train set that. This is going to get me my taxonomy, I think, except I don't have column names, right? So I'll go ahead and do call underscore names equals sequence. And I guess I need to make that a vector sequence. And then taxonomy. All right. And it doesn't seem to like what I did. I need to closing parentheses there. All right. So now we've got a sequence and taxonomy. That's good. We'll go ahead and call this genera. Right. And we'll go ahead and clean this up a bit. And then we're going to have to read in the fast day files. And that's going to take a little bit of effort. So what I will do is scan on the train set RDP transit again on the faster. And I'll do set equals backslash n. So this will read in nothing. Oh, and that's because scan expects the data to be numeric. And so I'll do I think it's type equals character. No, all right, let's look at the help scan file what. And so it needs what. And I think I can do character as the function. And let's so that reads in a whole bunch of stuff. Some of it I don't really care about. And so what we can do is to first. So it's got it's a fast day file where it has the name along with taxonomic information. I don't care about the taxonomic information because I've already got that. And then the sequence, right? And so I will call this fast day data. And let's go ahead and put this on separate lines. I like to have everything line up and not scroll off the right side of the screen. Your mileage may vary, but I think most people agree with that. And if you don't like having that read thing, you can do quiet equals true. The values of fast day data start with the header and followed by the sequence. And so I want to get basically all of the odd values out of fast day data. And so to do that, I can do seek one comma. And then I can do length on fast day data. And I can then do by equals two. And so then this outputs all of the sequence names, right? So I'll pipe that down to str replace all says like g sub, which I think we used in the package. And so the the string coming into it. So is going to go into that string spot. And then the pattern to get only the sequence name, I'm going to start with that greater than sign. So that carrot means match at the beginning of the string. And then I'm going to use parentheses to save the pattern. And then I'm going to do backslash t period star. And I'm going to replace it with what's in the parentheses. So I need to define what's in the parentheses by saying that this is going to be anything but a space, right? So I can do not so entitled square braces, I can define my own pattern, I can then do backslash t. So anything but a tab, and we'll repeat a bunch of that, we get back then our names, right? So these are our sequence names. And so I'll call this sequence names. And then my sequences, I will then do the same type of thing. But we're going to start at two, right? And so now my sequences are like so. And maybe I could go ahead and do a tail on sequences to make sure that it's doing what I think it did. And sure enough, we do. And so we can confirm that the length on sequences, and the length on sequence names are the same. And we're good. Okay. So now we have our sequences and all that. And so I'm going to make a table with a name, or I'll say accession equals sequence names, and then sequence equals sequences. So that gets us a table, right? So I want to make sure that the sequences and the taxonomy are in the same order. So I'll go ahead and do a inner join with this thing up here, the genera, right? And let's see, we'll do the stuff coming through the pipeline and genera. I guess I'm kind of mixing pipes. So I'll go back to this pipe. And this placeholder don't actually need the placeholder. And then we'll do buy equals. And you know what? I call it sequence here. Maybe I should call this accession. Because sequence means something different with the fast day data. All right. And so then I will join by accession. It's got to be a named argument. Okay, so I'm going to leave this out. I don't actually need it. It's going to join it together. And so now we've got our accession, our sequence, and our taxonomy. I think now we're good to go on running our camer function. So build camer database. And we'll do build camer database, Oculus seek table. Right. And then I'm going to give this seek table dollar sign sequence and seek table dollar sign taxonomy. And then camera size equals eight. And let's see what this does. So it's complaining because I think I forgot to load this. All right. Well, that's pretty depressing. After about two or three minutes, it crashed. And we ran out of memory. The problem I think is coming from this detect camers across sequences function, where we're using a supply to build a matrix that has the 64,566 or whatever is a large number of camers rows by the 24,000 columns for our number of sequences. I think it's just gagging because that's way too big for it to store in RAM. And something that I'm thinking about is most of the values in that matrix are zeros. So we probably don't really need to store them. Right. And so I think I have some ideas for how we can refactor this, but we're gonna have to wait till the next episode to figure out how to do that. But thankfully, we have all these great tests. And so we can feel confident that we can modify our code to refactor it to make it actually work and run without breaking anything and having faith that we're going to get the right answer. All right. So that you don't miss the exciting conclusion, please make sure you subscribe to the channel so you know when the next episode drops. Okay, take care and we'll see you next time for another episode of Code Club.