Hey folks, welcome back for another episode of Code Club. In the last couple of episodes I've been talking about a package that I want to build so that we can have the naive Bayesian classifier, the RDP classifier, available to us as an R package. This classifier has been really powerful within the microbial ecology field because it allows us to very quickly and accurately classify a 16S ribosomal RNA gene sequence from an organism whose identity we don't know. Now, I know many of you do not care about microbial ecology. Why don't you? Anyway, I encourage you to stick around, because we're going to be building out this package, and regardless of whether you care about microbial ecology or classification or bioinformatics or whatever, I think you'll still learn a lot as we go through this process. In today's episode, what I'm going to do is talk about the algorithm that the RDP classifier, the naive Bayesian classifier (I'll just call it the RDP classifier), uses under the hood, because this is the algorithm that we want to implement in our own R package. I'm going to go through their paper in some detail, maybe fleshing out their methods a little bit to put them in more understandable language. This paper was published in 2007, so about 17 years ago. I would say that our standards for how we publish benchmarking papers have changed a little bit over time, and so maybe I'll comment on that. The other thing is that the sequencing technology being used at the time was a lot different than it is today, and so some of the choices we might make now would be a bit different than what they had back then. Something we might think about doing with our R package is going back and regenerating some of their figures with more modern sensibilities in mind.
The first thing that I want you to know is that there's a difference between phylogenetics and classification. Phylogenetics allows us to propose a new taxonomic name or lineage, or to see how similar our sequences are to existing sequences. So typically what we do is build a phylogenetic tree with a bunch of sequences from known organisms along with our unknown sequences, to see where they cluster within those existing sequences. If they cluster away from existing sequences, then maybe we have a new lineage of bacteria that we want to consider. We do not propose new taxonomic lineages when we do classification. Classification is akin to using phylogenetics to insert a sequence into a tree and seeing who it clusters near, right? So that's the phylogenetic approach that we're approximating with classification. Again, we have an unknown sequence, and we want to know who it is, or what kind of organism it came from. And again, this cannot be used to propose new lineages. There are a few different methods of classification that I want to share with you. The first is what we might consider a closest neighbor approach. Typically what you may have done in the past is take your unknown sequence, upload it to the NCBI website to run BLAST, and get back a report. I'll share one of those with you here in a moment. An alternative approach would be a k-mer search, and the k-mer search is largely what we're going to be talking about as we build out this package. So this is the output I got from running BLASTn with an E. coli sequence against the ribosomal RNA reference database that is available through the BLAST server at NCBI's GenBank. And I want to point out a few issues with this approach. Typically what we might do is say, oh, that's a Shigella boydii, right? But we need to look a little bit deeper. And I think by looking deeper, we'll learn a little bit more about some of the challenges with classification.
So the first thing I want you to focus on is this query coverage. The "LA" in BLAST stands for local alignment. A local alignment algorithm is only going to report the alignment for those positions in the sequence that are most highly conserved. So as you scroll down here, you'll see things like this 97% or this 99%. That means it's pretty similar to this sequence, whatever that is, this E. coli strain, but it's only comparing 97% of the bases. The sequence I uploaded was, I think, about 1,450 nucleotides long, so that match is quite a bit different from the match below it. In this case, sure, the top six or so matches have 100% coverage; they're using all of the nucleotides I provided. We can then look at the percent identity, and we see that the first two sequences are 99.73% identical over the full length of the gene. So you can hopefully see this starting to get complicated, especially since these two sequences have different names. Now, Shigella and Escherichia are basically the same thing. But if you're not thinking deeply, you would say, wow, I have Shigella boydii, or, if you're looking a bit further down, I have Escherichia fergusonii. And so a question that comes to mind is: how many of these matches should I be accounting for when I determine my final classification? Because if you go down far enough, we see that there is some Citrobacter in here as well. And there's some Salmonella and Raoultella. Sounds like Ratatouille. Anyway. These are all very similar matches over very significant chunks of our sequence. So what do we do? How do we deal with this ambiguity? One approach is to use the top match, which we might think of as one nearest neighbor: we're looking at the single most similar match.
Another approach would be to use the N nearest neighbors, N being the number of neighbors, right? So the top match would be N equals one. We might say, well, maybe we should use five nearest neighbors. Then we would take the top five matches here and return the consensus. Here we would see that three of the five are Shigella. We can't really agree at the species level, so we'll call it Shigella, because the majority of the sequences are Shigella. In the RDP classifier database, Shigella and Escherichia are actually lumped together. Again, this is an illustration, but you can see the same type of thing across the phylogenetic tree. As we increase the number of neighbors, though, we might be cutting ourselves off from identifying sequences that belong to particular lineages. We know that many of the genera within our reference databases have only one representative. So if I require three nearest neighbors, I would never see any of those genera, because they don't have three representatives, and this can cause problems. So a third approach, which is what we're going to talk about, is a model-based approach using the naive Bayesian classifier. And I'll give you some foreshadowing: we're going to use k-mer searching, and we're going to basically build a k-mer model of what each genus looks like. So this is the paper that we're going to be talking about: "Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy." And the new bacterial taxonomy that they're talking about is Bergey's reference taxonomy. I believe at the time George Garrity, the second author here, was leading up Bergey's Manual, which is the official naming scheme of bacteria. James Tiedje has been a pioneer and leader in the field of microbial ecology.
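That N-nearest-neighbor consensus idea is simple enough to sketch in code. Here's what it might look like in Python (the package itself will be in R, but the logic is the same; the hit list and function name here are made up for illustration):

```python
from collections import Counter

def consensus_classification(hits, n=5):
    """Return the majority genus among the top-n hits, or 'unclassified'
    if no genus accounts for more than half of the n neighbors."""
    top_genera = [genus for genus, species in hits[:n]]
    genus, count = Counter(top_genera).most_common(1)[0]
    return genus if count > n / 2 else "unclassified"

# Hypothetical BLAST-style hits, ordered by decreasing percent identity
hits = [("Shigella", "boydii"), ("Escherichia", "fergusonii"),
        ("Shigella", "flexneri"), ("Shigella", "sonnei"),
        ("Escherichia", "coli")]
print(consensus_classification(hits, n=5))  # Shigella (3 of 5 agree)
```

Notice that with N = 2 the two top hits disagree, so this toy rule would already fall back to "unclassified"; the choice of N matters a lot.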
James Cole and Qiong Wang, and I'm sorry if I'm mispronouncing that, have been at the RDP for many, many years, and have been really instrumental in making this resource available to us. So this is figure one from their paper. This figure always makes me go a bit cross-eyed. Across the x-axis, if we call it that, we have the taxonomic ranks from phylum down to genus. They're looking at bacterial sequences. And the z-axis going into the screen, if you will, is the sequence length, going from 50 nucleotides up to full length. They're a bit vague here in terms of whether this was averaged across the full length of the gene, or whether these were random 50-nucleotide windows. I'd guess that they took a 50-nucleotide window and stepped it across the full-length gene, as they do elsewhere in the paper, and then averaged. The height of the bar corresponds to the accuracy of the classification. So basically, if you have 50-nucleotide reads and you're trying to classify at the genus level, then 51.5% of the time you would get the right classification, whereas with full-length sequences at the phylum level you'd be 99.8% accurate. I think this is interesting. It shows what we might expect: as you increase the length of the sequence, and as you go to a coarser taxonomic rank, the classification gets better. Although I'll emphasize that it's not perfect, and at some point you stop getting amazing returns. Going from 50 to 100 nucleotides is a big leap in accuracy at the genus level; 100 to 200 is a little less dramatic; and from 200 to 400 the gain isn't huge. Certainly going from 400 to full length isn't going to solve all the world's classification problems. This next figure is figure two. Panel A is the classification accuracy for the Bergey corpus. So this is using the Bergey-based taxonomy with sequence segments of 100 nucleotides, 100 bases moving 25 bases at a time across the 16S gene.
And so here, these v1 through v9 are the nine variable regions of the 16S gene that people spend a lot of time talking about. I sequenced the v4 region for a lot of my work. Other people use v3-v4 or v4-v5. At the time this came out, I think v6 was actually pretty popular because it was short, only about 60 nucleotides. Each of these lines represents a different taxonomic rank. And what we see is that at the v4 region, we actually do a pretty good job of classifying at the genus level: 80% of the sequences gave a correct classification, using only 100 nucleotides from the v4 region. If we came back to figure one, we would see that that's actually higher than the 71.1% that we saw averaged across the full gene. And of course, today we're using longer reads. I'm seen as kind of a throwback for using the v4 region, which is only about 250 nucleotides, and I suspect at that length the accuracy is actually a bit higher still. Across the gene, different regions classify to varying degrees of accuracy, and I think this data is one of the reasons that the v4 region received so much favor in the microbial ecology field. Panel B on the bottom then shows us the average confidence estimate, which is our degree of confidence in a classification. This might not mean a whole lot right now, except to say that the v4 region, and I would say also the v1-v2 region, which did well up top for average accuracy, give us pretty good confidence in our classifications at that taxonomic rank. Overall, I don't think panel B is super helpful; I would focus more on panel A. What would actually be more interesting would be to do a join between these two figures and ask: if we require, say, 80% confidence, what does that do to our classification accuracy?
Anyway, that's something we might think about long term, as we think about perhaps redoing some of this work. Going back to those full-length sequences, I mentioned that I think only 91% of the full-length sequences gave an accurate classification at the genus level. We see that that actually varies across the different phyla. Here we see that for the Firmicutes, about 12.7% of the sequences were misclassified, Proteobacteria about 9.5%, Actinobacteria only about 2.5%. So the amount of classification error varies by lineage. And again, this idea that getting full-length sequences will solve all our problems? Eh, not so much. I think we need to be mindful that to get really accurate classification, you probably need a full genome. Although there's still a decent number of misclassified sequences in here, I think the accuracy is still pretty good. The other thing they point out is that some of these misclassifications are due to taxonomic anomalies, where it's really not the classifier's fault, because the sequence might be coming from a lineage that's polyphyletic, that isn't a single coherent entity. It's like E. coli and Shigella, right? That's a case where you can't accurately distinguish between the two, and so that's one kind of misclassification. This table actually gets me thinking about two types of misclassification. The first would be calling something Citrobacter when it's an Escherichia. Those are different genera; we should not be confusing the two. Another type of misclassification is something that you've perhaps seen in analyzing your own data, where a sequence comes back as an Enterobacteriaceae rather than an Escherichia. An Escherichia is an Enterobacteriaceae, so that's not wrong; it's just not complete.
And so we can have a misclassification where the classifier calls the wrong taxon, but we can also have a "misclassification" because it doesn't go as deep as it could. I would prefer the latter type over the former: I'd rather say something's an Enterobacteriaceae than call it a Citrobacter when it's an Escherichia and get those mistaken. These are things to be thinking about as we consider the types of experiments that, in the end, we might want to do with our version of the classifier. So, a couple of things to consider in terms of the paper and how we process what they did in their experiments. The database that they used, again, used the Bergey's classification. They also did some work with an NCBI-based database and got very similar results. But the Bergey's-based database that was available through the RDP contained 453 genera that had only a single sequence in the database. They had a total of 988 genera in the complete collection, with about 5,000 sequences. So almost half of the genera in the database had only one representative. Again, if we did N nearest neighbors with N equal to two, then we wouldn't see nearly half of these genera. That's something to be aware of. We might say, well, why can't we just get multiple representatives of everything? And the reality is that that's hard. There are going to be some genera that we see a lot, like Escherichia, and others that we just don't see very often. The current database, released last summer, in July 2023 I think, contains 1,720 genera represented by a single sequence, so considerable growth here, out of 3,883 total genera. Again, a very similar ratio, with nearly 25,000 sequences in the database.
So again, we've seen growth in the database, but as these databases get bigger, the number of genera goes up, and the number of singleton genera goes up as well. Next, remember that the RDP uses Bergey's taxonomy. This is the official taxonomy used by bacterial systematists, so these are the taxa that are correctly named, appropriately proposed, and accepted by the community. There are other taxonomies out there. We can take the same sequence and give it different names depending on what taxonomy we're using: we could look at, say, the NCBI, SILVA, or greengenes taxonomies, among others. You might even have your own taxonomic scheme that you want to apply to sequences. The advantage of a standalone tool, like what we previously put into mothur and what we'll hopefully be able to develop here, is that we can use the same sequences with multiple different taxonomies, so that if you want to know what the classification is according to NCBI, SILVA, greengenes, or the RDP, you can get them all. We'll see about doing that as we go forward in developing this package. All right, so let's dig into the methods. It's called a naive Bayesian classifier, so what is Bayes' theorem? This is Bayes' theorem, okay? It was proposed by the Reverend Thomas Bayes, of all people, who was a statistician and, I think, a Presbyterian minister. Breaking down this conditional probability: on the left side of the equals sign is the probability of a genus given a sequence, and I'm going to put that in the terms we're using for our application. So if you give me a sequence, what's the probability that it belongs to, say, Escherichia or Pseudomonas? On top here, this P of B given A is the probability of seeing a sequence if it came from a given genus. So hopefully you can see that we've flipped the situation.
What's the probability of seeing that type of sequence if it came from a specific genus, versus what's the probability of seeing a genus given that we have this specific sequence? And the way to make that conversion is by knowing the probability of the genus, P of A, and the probability of the sequence, P of B. This, then, is the materials and methods section that we're going to be talking about, and it's a description, basically, of how we go from this theorem to a classification. The first things to talk about, again, are Bergey's taxonomy and the database that they used. These are small-subunit rRNA gene sequences. I should add that I know people have used mothur to classify other sequences, not just 16S rRNA gene sequences: people have used large-subunit rRNA genes, and I think 12S and other small-subunit genes as well. And I think that's really the power of the method: it's generic enough that we can apply it to a bunch of different cases. Something to note is that each sequence was labeled with a set of taxa from domain to genus. So we go from domain to genus; we don't have species information, and that's the reason the classifier does not output a species-level classification. There is strong consensus in the field that we cannot use 16S to classify to the species level; if you want the species level, you need to sequence the full genome. We can't ask more of the gene than it can actually give us, and that's why it reports these ranks. There are, in their database, some subclasses and suborders and maybe subfamilies, but we're only going to be working with the more traditional taxonomic ranks. Thinking about what we need to do for our package: we need to get reference sequences and their lineages, their taxonomy from phylum to genus.
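Written out, the theorem being described looks like this, with G standing in for the genus (the A above) and S for the sequence (the B above):

```latex
P(G \mid S) \;=\; \frac{P(S \mid G)\,P(G)}{P(S)}
```

The left side is what we want to know (which genus, given this sequence), while the first term in the numerator is what we can actually estimate from a reference database (how likely is this sequence, given each genus). The whole method is about computing that flip.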
And so I have this, because we have these types of files from the RDP that they post online, but that I've also reformatted in an easier-to-digest manner for the mothur software package. So we'll need to be able to get this loaded into our package. These are things to think about as we go along. The algorithm description is summed up in this paragraph, so I'm going to step through it. The classifier uses a feature space consisting of all possible eight-base subsequences, or words. So what does that mean? If we think of this sequence (this is the same sequence repeated many times), we're going to look at eight-base subsequences: eight-nucleotide words, or 8-mers. A k-mer is the general term; an 8-mer is a k-mer where k equals eight. What I've highlighted here are the first ten or so of the possible 8-mers. We're going to need to step through the sequence, extracting all possible 8-mers. And this shows, for the part of the sequence I highlighted, all of those k-mers. If we think about it, what I showed here is 36 nucleotides, and with eight-nucleotide k-mers there will be 29 possible 8-mers. In general, the total number of k-mers is the sequence length minus the k-mer length, plus one. Now, there's a trade-off with the k-mer size. If we use a very small k-mer, say two, then there are only 16 possible k-mers, and we're likely to see the same k-mer many times within our sequence, so that's not going to give us a lot of resolution. At the other end, if we use too large a k-mer, say 10, then there are going to be four to the tenth power possible k-mers, and if we have to keep track of all of those, that's going to use a ton of RAM.
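To make the sliding-window idea concrete, here's a minimal sketch (Python here for illustration, since the eventual package will be in R; the sequence and function name are made up):

```python
def extract_kmers(sequence, k=8):
    """Slide a k-wide window across the sequence, one base at a time,
    returning every overlapping k-mer: length - k + 1 of them."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

seq = "ATGCGTACGTTAGCCTGAATCGGCATTACGGATCAA"  # 36 nt, made up
kmers = extract_kmers(seq, k=8)
print(len(kmers))  # 36 - 8 + 1 = 29 overlapping 8-mers
print(kmers[0])    # ATGCGTAC, the first window
```

Note that the classifier only cares about presence or absence of each word, so downstream we'd typically collapse this list to a set.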
And so I've got a plot over here on the right, showing the number of possible k-mers as a function of the k-mer size. This red dot is at a k-mer size of eight, which is what they used in the program. With 8-mers, there are 65,536 possible words. If we think about a v4 sequence of about 250 nucleotides, it's going to have 243 8-mers, so it's very unlikely that we're going to see the same 8-mer multiple times in our sequence. That should give us pretty good resolution. Similarly, a 1,500-nucleotide sequence is going to have 1,493 8-mers, and even at that length we're not likely to see the same 8-mer duplicated. Again, they said that a word size of eight was chosen for all further work to reduce the memory requirements: they found that eight and nine performed very similarly, but six and seven didn't produce such good accuracy. They didn't show the results of that, but they said eight and nine were very similar, and nine required a lot more RAM, so they ran with eight. Next, the position of the word, the k-mer, within the sequence is ignored. This is the "naive" part: the naive Bayesian classifier doesn't care about the context a word comes from. And only those words occurring in the query contribute to the score. So if a k-mer isn't found in the query sequence, the sequence isn't penalized; it's only credited, or not, for the words it does contain. This will hopefully make more sense as we go through. Okay, so now we have some more things that we know we need to do: we're going to have to be able to get an unknown sequence into our functions.
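The memory trade-off is easy to see with a quick back-of-the-envelope calculation (Python here just for illustration):

```python
# The number of distinct possible k-mers grows as 4^k,
# since each of the k positions can be one of four bases.
for k in (2, 6, 8, 10):
    print(k, 4 ** k)
# k=2  ->        16 : the same word recurs constantly, poor resolution
# k=8  ->    65,536 : a 250-nt read holds only 243 words, few repeats
# k=10 -> 1,048,576 : finer still, but a much larger table to store

# Words in a read of length L: L - k + 1
print(250 - 8 + 1)   # 243 words in a v4-length read
print(1500 - 8 + 1)  # 1493 words in a full-length 16S read
```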
And we're going to need to collect all of the eight-nucleotide words, and perhaps other sizes as well, from our unknown sequence as well as from our reference database. One of the first things we'll need to do is calculate the word-specific prior: the probability of seeing each word across our entire database. We'll do that using this formula, where the probability of seeing a word, regardless of which genus it comes from, is the number of sequences containing that word across the whole database, plus 0.5, divided by the total number of sequences, plus one. This plus 0.5 and plus one ensure that the probability stays strictly between zero and one; it can never be exactly zero. So then, thinking about the formula I just described and applying it at the genus level to generate P(S given G): like I just said, we can think about this with G being, say, Escherichia or Pseudomonas or whatever. This m(w) is the number of sequences in that genus that contain a specific 8-mer, and the probability of the word given the genus is m(w), plus the prior probability of seeing that word across the entire corpus, divided by the total number of sequences from the genus plus one. I should also add that even if m(w) is zero, the probability won't be zero, because we still have the prior in the numerator over that denominator. And that will come in handy when we calculate the joint probability of observing genus G from a partial sequence S. So if we take our unknown sequence and look at its 8-mers, we can compare those 8-mers' probabilities across each of the different genera. We then get the probability of the sequence given the genus with this pi sign, which indicates multiplication: we multiply together the probability values for all of the words that we see in that sequence.
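Those two formulas might be sketched like this (Python for illustration; the variable names are mine, and I'm representing each reference sequence by the set of 8-mers it contains):

```python
from collections import Counter

def word_priors(db_kmer_sets):
    """P(w) = (n(w) + 0.5) / (N + 1): n(w) counts the reference
    sequences containing word w, and N is the total sequence count."""
    N = len(db_kmer_sets)
    counts = Counter(w for kmers in db_kmer_sets for w in kmers)
    return {w: (n + 0.5) / (N + 1) for w, n in counts.items()}

def word_given_genus(word, genus_kmer_sets, priors):
    """P(w|G) = (m(w) + P(w)) / (M + 1): m(w) counts the sequences
    within genus G containing w, and M is the sequence count for G.
    The prior in the numerator keeps P(w|G) > 0 even when m(w) = 0."""
    M = len(genus_kmer_sets)
    m = sum(word in kmers for kmers in genus_kmer_sets)
    return (m + priors[word]) / (M + 1)

# Tiny made-up corpus: three reference sequences reduced to word sets
db = [{"AAAA", "CCCC"}, {"AAAA", "GGGG"}, {"TTTT"}]
priors = word_priors(db)
print(priors["AAAA"])                           # (2 + 0.5) / (3 + 1) = 0.625
print(word_given_genus("AAAA", db[:2], priors)) # (2 + 0.625) / (2 + 1) = 0.875
```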
And so then we can imagine we're going to get a bunch of P(S given G) values, because we'll compute one for the same sequence against each of the different genera. So now we know that we need to get P(w given G) for each genus and word. The naive Bayesian assignment is then determined by this formula, which is Bayes' theorem. Here they assume that all genera are equally probable, so we have equal priors, and the constant terms P(G) and P(S) can then be ignored. Basically, what we end up with is that the sequence is assigned to the genus giving the highest probability score: we take P(S given G) for the same sequence across all genera, and the genus with the largest value becomes our classification. So the next thing to add to our to-do list is, for each unknown, to calculate P(S given G) for every genus. Now we need to get the confidence in our classification. Recall that we have eight-character words. What we're going to do is conduct what's called a bootstrap trial. Bootstrapping is what I consider part of the bioinformatic black magic: I don't know that we totally understand why it works, but it works pretty well for giving us confidence in a classification, or in a phylogeny. What the algorithm does here, because they want the subsamplings to be reasonably independent of each other, is not resample all of the words, but take a subset; because each k-mer is eight nucleotides long, they take a one-eighth-sized subset of the query's possible words. So say we have a sequence and we found 800 k-mers in it. We would then randomly grab 100, one eighth of the 800, sampling with replacement: grab one out, record it, put it back in, grab another one out, record it, put it back in. That's sampling with replacement.
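Here's a sketch of the scoring and bootstrap steps (again Python for illustration; I use log-probabilities rather than the raw product, which keeps the same ranking while avoiding numeric underflow, and I assume each genus model already holds a precomputed P(w|G) for every word in the query):

```python
import math
import random
from collections import Counter

def classify(query_words, genus_models):
    """Assign the genus maximizing log P(S|G), the sum of
    log P(w|G) over the query's words."""
    def score(genus):
        return sum(math.log(genus_models[genus][w]) for w in query_words)
    return max(genus_models, key=score)

def bootstrap(query_words, genus_models, trials=100):
    """Resample 1/8 of the query's words with replacement, reclassify,
    and count how often each genus wins across the trials."""
    words = list(query_words)
    n = max(1, len(words) // 8)
    return Counter(classify(random.choices(words, k=n), genus_models)
                   for _ in range(trials))

# Made-up two-genus model: P(w|G) for each word in a toy vocabulary
genus_models = {
    "Escherichia": {"AAAA": 0.9, "CCCC": 0.2},
    "Citrobacter": {"AAAA": 0.3, "CCCC": 0.6},
}
query = ["AAAA", "AAAA", "CCCC"]
print(classify(query, genus_models))  # Escherichia
wins = bootstrap(query, genus_models, trials=100)
print(sum(wins.values()))  # 100
```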
And we'll then take that one-eighth subsample, those 100 words, and classify it using our algorithm, and we're going to repeat that process 100 times. Based on those 100 randomizations, we record the genus that is the most frequent classification across our pool of bootstrap subsamples. Then, for the higher-rank assignments, say the family or the class or the order or whatever, we sum the results for all genera under that taxon. That gives us our bootstrap confidence estimate. In general application, we're typically happy with 80% confidence on our bootstrap. I don't know where this 80% comes from, but it's a frequently used threshold with bootstrapping. So I've added steps three through eight here, where again we randomly select one eighth of the total words from the sequence, we calculate the probability of the sequence given each genus, and thus the genus given the sequence, and we repeat those steps 100 or maybe 1,000 times, depending on the speed. We calculate the number of times each genus is returned, as well as each taxon from kingdom down through family, and then we return the taxonomy with more than 80% confidence. In practice, what happens is that if our genus is at, say, 60% confidence, we typically report the genus as unclassified and go up to the next rank. If family is over 80%, we return the family; if family is below 80%, we go up another rank. We can also think of this outline as pseudocode: these are the things we need to do in our march toward a classifier. The items on the right correspond to the database, whereas the items on the left correspond to our unknown sequences. My thinking is that the database items on the right can be precalculated, and then we can use them with a large number of unknowns.
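The roll-up to higher ranks might look like this sketch (Python for illustration; the lineage table and the bootstrap counts below are made up):

```python
def confident_taxonomy(genus_wins, lineages, trials=100, cutoff=0.80):
    """Walk from kingdom down toward genus; at each rank, sum the
    bootstrap wins across all genera sharing that taxon, and stop
    reporting once support falls below the cutoff."""
    ranks = ("kingdom", "phylum", "class", "order", "family", "genus")
    best = max(genus_wins, key=genus_wins.get)
    result = []
    for depth, rank in enumerate(ranks):
        taxon = lineages[best][depth]
        support = sum(n for g, n in genus_wins.items()
                      if lineages[g][depth] == taxon) / trials
        if support < cutoff:
            break  # this rank and everything deeper stays "unclassified"
        result.append((rank, taxon, support))
    return result

lineages = {
    "Escherichia": ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                    "Enterobacterales", "Enterobacteriaceae", "Escherichia"],
    "Citrobacter": ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                    "Enterobacterales", "Enterobacteriaceae", "Citrobacter"],
}
# 60 of 100 trials chose Escherichia: below the 80% genus cutoff,
# but every trial agrees down through the family
res = confident_taxonomy({"Escherichia": 60, "Citrobacter": 40}, lineages)
print(res[-1])  # ('family', 'Enterobacteriaceae', 1.0)
```

This is exactly the behavior described above: the genus gets only 60% support, so the classification is reported to the family, where the two candidate genera agree.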
So we shouldn't have to do the stuff on the right for every sequence; we can precalculate the stuff on the right once, and then run through everything on the left. The other thing that I noticed is that step two on both sides is calculating the eight-nucleotide words, and so I think that's going to be where we start in our next episode of trying to build out this package. We're going to start by extracting k-mers from our sequences, either our unknowns or our reference sequences. So I hope this description of the paper has been helpful in informing you how this classifier works. Perhaps you've used the RDP classifier in the past on their website and always wondered how it worked, or perhaps you've used BLAST and didn't realize why it can be such a problematic tool for classification. I hope this has shed some light on the strengths and weaknesses of these approaches, and why we get the results we get from these different classifiers. Again, in the next episode, I'm going to start coding up the package by building out functions to extract the k-mers, in our case perhaps starting with eight-nucleotide words, from our unknowns and from our references. So that you don't miss that episode, please, please, please be sure to subscribe to the channel, and we'll see you next time for another episode of Code Club.