This one, I think, is kind of trying to bring multiple things together. We're going to be bringing in neural nets. We're going to be bringing in a little bit of the hidden Markov models. We're trying to bring elements of the concepts that we introduced yesterday and today into this near, well, epic effort to try and do gene identification for prokaryotic systems. So this is module six of the machine learning course. As before, we have our Creative Commons license that tells you to share and share alike; slides can be shared with others and distributed freely. The title for this module is Finding Genes with Artificial Neural Networks and Hidden Markov Models. The reason why this is a fairly long module, and it's set for almost two hours here, is that it includes both the lecture and the lab, but it also tries to cover some of the problems or mistakes that can happen when people try and apply one concept too far. As I say, if you've got a hammer, everything looks like a nail, and not everything is. Sometimes you need a screwdriver, sometimes you need a saw. And this is partly to illustrate the importance of feature engineering or feature selection in trying to develop good models. I've mentioned this a few times before, and I'll emphasize it today and again in other lectures and modules as well. You've got a toolbox of very powerful equipment with machine learning: neural networks, Markov models, random forests, decision trees. These are all very powerful computational tools. But if you don't have the right data, or if you haven't formatted the data, or if you haven't set up the right topology or thought about it in a comprehensive way, you have the situation of garbage in, garbage out. And that's one of the things that we're going to show here today. So to understand how you can use machine learning for gene prediction, we're going to focus on prokaryotic genes.
These are simpler genes, with simpler gene structure than eukaryotic genes. The best advances and some of the best examples of the utility of machine learning have been in eukaryotic gene prediction. But since this is an introductory course and because prokaryotic genes are a little simpler, we're going to focus on prokaryotic gene prediction. This is not going to win anyone the Nobel Prize or anything; we're just going to use it as an illustrative example of how you can use machine learning to predict genes in bacteria. And it's a general concept that could also apply to many other things. You just have to associate the relationships with your own problems. We're going to talk about how gene prediction has been traditionally done, how it's assessed, and then we're going to look at some code with different ways of predicting prokaryotic genes. We're not really going to involve too much about the HMM, but then we're going to dive into using Colab to look at the gene prediction code. So genes are segments of DNA that typically code for proteins via messenger RNA. There are also non-coding genes that produce tRNA and rRNA, and other non-coding RNAs that are also relevant for function. These are found in eukaryotes, prokaryotes, all organisms. This is the genetic code, which essentially allows you to translate the different codons, triplets of bases, to the different amino acids. And you can see this is the RNA code, so we have U's instead of T's. There are some codons that are very abundant. Things like the valine and leucine codons are quite abundant. Arginine, curiously, has lots of codons in the table, but it's a relatively rare amino acid. Then there are the termination codons; there are three termination codons. And then there's the essentially universal start codon, ATG or AUG, which is methionine. I won't belabor it, but this, of course, is an important table when you're trying to identify genes.
When we look at DNA, we typically have a forward strand and a reverse strand; DNA is double-stranded. When we translate genes, we read through codons and we talk about reading frames. On the forward strand there's reading frame 1, 2, and 3; on the reverse strand there's reading frame minus 1, minus 2, and minus 3. So I'm showing a double-stranded DNA segment here, and I'm showing the translation for reading frames 1, 2, and 3. You can see that ATG is methionine and that CGT is arginine. We can see that TGC in frame 2 codes for cysteine and that GCG in frame 3 codes for alanine. We can see the same on the reverse strand, and I've got the arrows showing how these are read out. The stars represent termination codons. Again, this is just to refresh people's memories. In terms of prokaryotes, gene structures are usually referred to as open reading frames. There's a start codon, and then there's any one of three stop codons, and that stretch from start to stop is called an open reading frame. Then there's typically some kind of promoter upstream, anywhere from 50 to 200 base pairs up, and that includes the Shine-Dalgarno sequence, the Pribnow box, and a variety of other sequences. I've marked off the different frames, reading frames 1, 2, and 3, assuming that we're on the forward strand, just to remind people about frame structure. So you can write a really simple gene-finding algorithm, and some of you may have even done this yourselves, in different languages. Typically, if you've got a 4 million base pair prokaryotic genome sequence, you can start from base 1 and scan forward, looking for ATG, the start codon. Once you've found that ATG, you scan in groups of three in that same reading frame until you find one of the three stop codons.
Then you can use some kind of rule and say, based on the number of bases or the number of codons, am I far enough away from the start to say that this is actually a gene? If you find your stop codon five or six bases later, that's probably too short. So some people will use a cutoff: you have to have at least 50 codons. Some people use a cutoff of just 50 bases to count as a gene. If you have found a gene, then you can go back and start looking for another start codon and then scan forward to look for stop codons. If it's less than 50, then you go back and start looking for another gene, starting over again. You do this repeatedly for the forward strand, and then you take the reverse complement and find the genes on the reverse or backward strand. Both strands can code for genes. And by looking at both strands, you can identify several thousand genes; a prokaryotic genome of about 4 million bases will typically have about 4,000 genes. There are web servers for this. NCBI has one called ORFfinder; it's been around for many years. You can paste in an entire genome and it will find all of your open reading frames. You can see these in the six different reading frames. It shows the direction of the genes and their length. You can click on them. It tells you whether each one is on the positive or negative (forward or reverse) strand, which reading frame it's in, the start and stop, the length in bases, and the length in amino acids. You can get the sequences, and you can run BLAST to see if these things match anything. What you can see from this diagram is that the coding density is pretty high and that you have a number of cases where genes seem to be overlapping or nested on top of each other. This is not reality. This is not how dense genes are in prokaryotes. They are modestly dense, but not to the point where you've got genes upon genes upon genes.
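The scanning procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the module's actual code; the function names are my own, and it uses the 50-codon cutoff discussed above.

```python
# Minimal sketch of the naive start-to-stop ORF scan described above.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
MIN_CODONS = 50  # the cutoff rule discussed above

def naive_orf_scan(seq, min_codons=MIN_CODONS):
    """Scan one strand for start..stop ORFs of at least min_codons codons."""
    orfs = []
    i = 0
    while i < len(seq) - 2:
        if seq[i:i+3] == START:
            # walk forward in-frame until a stop codon appears
            j = i + 3
            while j < len(seq) - 2:
                if seq[j:j+3] in STOPS:
                    if (j + 3 - i) // 3 >= min_codons:
                        orfs.append((i, j + 3))  # half-open base range
                    break
                j += 3
            i += 3  # resume scanning for the next start codon
        else:
            i += 1
    return orfs

def revcomp(seq):
    """Reverse complement, so the same scan covers the backward strand."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

# Both strands can code for genes, so we scan forward and reverse.
genome = "ATG" + "GCT" * 60 + "TAA" + "ACGTACGT"
print(naive_orf_scan(genome))            # → [(0, 186)]
print(naive_orf_scan(revcomp(genome)))   # → []
```

On a real 4-million-base genome this scan returns thousands of candidate ORFs per strand, which is exactly the overprediction problem discussed below.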
So this is what a simple ORF finder will do: it finds lots of possible genes and it has many, many false positives. When we talk about gene prediction in prokaryotic systems, we can talk about it either at the gene level or at the base level. This slide illustrates the performance of a gene predictor or gene identifier at the base level: how many bases off are you from the right start or the right stop codon? In light blue is the actual gene position: gene, then non-coding, then gene, then non-coding. And in red, we have our prediction. The red marks the predicted coding region; the black lines mark the predicted non-coding regions. We can see that for the first gene, our prediction is a little too long. It's right at the first codon, but it's wrong at the last codon, and so for whatever reason our predicted gene is too long. For the second gene, we see that the prediction is too short. It starts at the wrong place and it ends at the wrong place, but it's in the right reading frame and at least it covers part of the gene. In this way, we can think of gene prediction as a combination of true positives (TP), where the blue matches the red; false positives, where the red is over-predicting and actually lines up with non-coding sequence; true negatives, where we say it's non-coding and it really is non-coding; and false negatives, where we say it's non-coding but in fact it really is coding. When you have true positives, false positives, true negatives, and false negatives, you can calculate sensitivity and specificity. This is the same idea as the confusion matrix. We can also calculate things like precision, and we can calculate a correlation that combines sensitivity and specificity. These measures can all be used to report the performance of a gene predictor. Remember, sensitivity is the true positive rate and specificity is the true negative rate. So here are the formulas for sensitivity and specificity.
There's also the formula for precision. I'm not going to give the formula for the correlation coefficient because it's really quite complicated, but these are all used when people report performance of gene prediction, both for eukaryotes and prokaryotes. And this is done, as I say, at the base level. We could also do this at the gene level. In that case, I wouldn't be drawing lines here; what would technically happen is that the predictions in this case were all wrong. So even though, let's say, 500 out of 1,000 bases are correctly predicted, because we never got the exact correct start and stop codons, the red predictions at the gene level are all wrong. At the base level, we're probably 60% or 70% right, but at the gene level, we're 0% correct. So depending on the level at which you're evaluating, whether it's the gene or base level, you can sort of artificially inflate your performance. Most people in the world of molecular biology report at the base level because it gives them better numbers. If you reported at the gene level, we actually do pretty miserably when it comes to gene prediction. So how do you find open reading frames? Typically there are start codons; the most common is ATG. Many bacteria have alternate start codons that are found in maybe 1% or 2% of cases; two of the most common are GTG and TTG. And then there are some pretty universal stop codons: TAA, TAG, and TGA. So if we're going to try and find open reading frames, we can use the same algorithm: we look for the starts and stops, and we make sure that things are in the same reading frame. This is a bit of an animation where we've got a start codon, and then as we move along we're finding combinations of starts and stops, and we're recording the indices of all of these start and stop codons.
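The base-level measures described above can be computed directly from the four confusion-matrix counts. Here's a small sketch; the example counts are made up, and for the correlation I've used the Matthews correlation coefficient, which is one common choice for the "complicated" combined measure alluded to above.

```python
from math import sqrt

def gene_metrics(tp, fp, tn, fn):
    """Base-level performance measures for a gene predictor."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    # Matthews correlation coefficient: combines all four counts into one number.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sensitivity, specificity, precision, mcc

# Hypothetical counts for a 4,000-base stretch, purely for illustration.
sn, sp, pr, cc = gene_metrics(tp=900, fp=100, tn=2800, fn=200)
print(f"Sn={sn:.2f} Sp={sp:.3f} Pr={pr:.2f} CC={cc:.2f}")
```

A predictor that calls everything "coding" would score a perfect sensitivity but a terrible specificity, which is why both numbers (or the single correlation) are reported together.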
So this is a slightly different algorithm than what I described before, but it's one that's computationally a little faster. As we scan through this sequence of about 40 bases, we've identified the stop and start codons. There are, I think, three stops and four start codons in this sequence. And based on their positions, we can identify whether they're in-frame or out-of-frame and which ones might pair up. By finding all of the starts and stops, we can essentially identify possible genes. And this is looking at the forward strand and then also the reverse strand, the reverse complement. This is an example from, actually, I guess, E. coli, where we're identifying the starts and stops. We found one that went from 337 to 2799; that actually turned out to be a real gene. But we also found a sort of nested gene from 567 to 780; that's not a real gene. We found another one between 798 and 915; that's not a real gene. But then we found another one immediately after 2799: it started at 2801 and went to 3733, and that's a real gene. That's on the forward strand. Then we looked at the reverse complement, the reverse strand, and again you can see where ORFs were identified. Some overlapped, some didn't. Some were long, some were short. But on the whole, you're seeing that at the gene level, we're finding a lot of false positives. So we're just going to take an example where we use Python to program an ORF finder. This is not machine learning; this is just Python. And we do have the Python code for it. Again, it's not machine learning. It's in module six and it's just called ORF finder. The general algorithm is simple. It reads the data. It verifies the data, just checking to see if there's any missing data. Then we create a codons function that essentially identifies start and stop codons. And then we look at all the start and stop codon lists in the three different reading frames.
We identify start/stop codon pairs that are in the same reading frame. We then create an ORF function, which analyzes those lists and looks for the separation between start and stop. In this case, we're requiring that the ORFs be at least 40 bases. This is quite short, but it turns out there are about a dozen genes in the average bacterium with very short open reading frames. We also encode another check that considers up to three stop codons in the same reading frame when matching a start codon, just to avoid truncating some of the genes. Again, this is learning by experience. So we combine the codons and ORF functions together to create the ORF finder function. Then we just run it and calculate the correct and incorrect ORFs, where the correct ORFs are the ones tabulated by NCBI. In this case, we're analyzing the E. coli genome. To run the program, we import numpy and pandas, as we normally do when we're reading a dataset. We're looking at both the forward and reverse complement sequences. These are in FASTA format, and we're stripping out the newline characters just to keep the data simple. We do the missing data check to see if something's not there; we usually have that in these programs. Here's the codons function. We're identifying the start and stop codons here. We're going to create a list of the starts, which in this case is ATG, and the stops, which are TAA, TGA, and TAG. And we keep track of our counters. We're going to look at three different reading frames. We start from zero: zero, one, two, which gives us three frames. So we don't start at one, two, three; we start at zero. We look at the first codon, see if it's a start codon, add it to a start codon list; see if it's a stop codon, add it to a stop codon list; and then keep track of the reading frame. I've talked about the alternate start codons; besides ATG, GTG and TTG are the more common ones.
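A minimal sketch of what a codons-style function like the one just described might look like (the names are illustrative, and only ATG starts are considered here):

```python
# Scan one strand and record the indices of start and stop codons,
# bucketed by reading frame (0, 1, 2), as described above.
START_CODONS = {"ATG"}           # GTG/TTG could be added to this set
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons(seq):
    starts = {0: [], 1: [], 2: []}
    stops = {0: [], 1: [], 2: []}
    for i in range(len(seq) - 2):
        frame = i % 3            # frames are numbered from zero
        triplet = seq[i:i+3]
        if triplet in START_CODONS:
            starts[frame].append(i)
        elif triplet in STOP_CODONS:
            stops[frame].append(i)
    return starts, stops

starts, stops = codons("ATGAAATAGATGCCC")
print(starts)  # ATG at 0 and 9, both in frame 0
print(stops)   # TAG at 6 in frame 0; TGA at 1 in frame 1
```

Scanning every offset once and bucketing by `i % 3` covers all three frames in a single pass, which is the speed-up over restarting the scan for each frame.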
The other alternate start codons are really rare, so we're not going to consider them. Once we've found start codons, we pair them with stop codons, and that allows us to form an ORF. This is showing frame one: here's the list of start codons in green, and then for frame one, the list of stop codons in red. You can see that in frame one, we found stop codons at positions 13, 46, 61, 64, and 112. All of those are less than 190, where the start codon is, so the only viable stop codon is the one at 253, because that's greater than 190. So we can pair 190 with 253 and say that's a viable or potential ORF. So we have the ORF function that creates the valid ORF start/stop pairs. We check for a minimum length, and then we determine the frame from the start codons. That continues, and it also deals with situations where we have additional stop codons, perhaps getting the right start codon but the wrong stop codon. This makes sure that if the gene's too short, we've at least found one or two or three possible alternatives in case there's a skipping step that goes on. Then we put all of these components together to create the full ORF-finding function, with its starts and stops, which also does all the comparisons between the correct and incorrect ORFs in this particular calculation. As I say, this is not machine learning. This is just a Python program for identifying ORFs, and it's not too different from the NCBI ORFfinder that I showed earlier; almost identical in terms of its approach, it just doesn't have the nice graphical interface. So what we did was analyze E. coli. E. coli is probably the best-studied bacterial genome. There are about 4,400 known genes, give or take about 10, and the overall length of the E. coli genome is about 4.6 million bases.
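The start/stop pairing step can be sketched as follows. This is a simplified illustration (my own names; it pairs each start with only the first downstream in-frame stop), using the frame-one numbers from the slide, where the start at 190 pairs with the stop at 253:

```python
# Pair each start codon with the first in-frame stop codon downstream of it,
# discarding pairs shorter than the minimum ORF length.
def pair_orfs(starts, stops, min_len=40):
    """starts/stops: sorted codon index lists for one reading frame."""
    orfs = []
    for s in starts:
        downstream = [e for e in stops if e > s]  # stops past this start
        if downstream:
            e = downstream[0]                     # nearest in-frame stop
            if e + 3 - s >= min_len:              # length in bases
                orfs.append((s, e + 3))           # include the stop codon
    return orfs

# Frame-one example from the slide: stops at 13..112 are upstream of the
# start at 190, so only the stop at 253 can pair with it.
print(pair_orfs([190], [13, 46, 61, 64, 112, 253]))  # → [(190, 256)]
```

The module's actual function also keeps up to three candidate stops per start, as described above; this sketch keeps just the nearest one.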
So we took the data, all 4.6 million bases, ran it through the program, and identified the start and stop codons, both the correct ones and the incorrect ones. Using this algorithm, we were able to correctly predict 4,282 out of the 4,407 genes, which is about 97% correct. But we also predicted about 1.63 million other genes, which is insane, but that's an example of how a naive ORF finder can seriously overpredict. So we maximized our correct predictions, the true positives, but at the expense of predicting roughly 400 times more other genes. This is essentially the problem with ORF finders: they produce lots of false positives, so many false positives that they're almost useless. This is shown in the truth table or confusion matrix. If you want a baseline, this is about as bad as you can get in terms of overprediction, but it's also about as good as you can get in terms of finding ORFs using just the concept that there's a start codon and a stop codon, with three possible start codons and three possible stop codons. So 97% is okay, but we obviously want to get rid of the 1.6 million overpredictions. This is where neural nets can come to the rescue, or at least naively that's what we thought. We said, look, ORF prediction is kind of pattern analysis. Machine learning is a good tool, and neural nets are probably the best ones. We should be able to use neural nets, just like we did for secondary structure, to identify genes. So we're going to define our problem and follow through the standard six steps. And the problem here is: how do we find prokaryotic genes in a large genome sequence? The rationale, as I said, was to use artificial neural nets. As you saw yesterday, we can take a sequence, in that case amino acids, and predict the occurrence of helices, coils, and beta strands. The program we produced was able to get around 61 or 62% correct.
With genes, the thinking was that it should be simpler. With DNA, there are just four bases, not 20 amino acids. And instead of helix, coil, and beta sheet, there's just coding and non-coding. The coding region is the gene; the non-coding region is essentially empty sequence where there's nothing there. So we can structure our problem almost identically to the secondary structure prediction one. Then we find our training and testing data. In this case, as we found from the ORF finder I talked about earlier, we can use the 4.6 million bases and the correct gene locations that have been tabulated through many years of study, on both the forward and reverse strands. And we can use the same concepts that we talked about last time with neural nets. We can take segments of 17 bases at a time instead of 17 amino acids, a window. We can do the standard one-hot encoding, where we convert the A's, C's, G's, and T's to these numbers. We can convert coding and non-coding to simple outputs, again through one-hot encoding. And we can take our sliding window and move it along. In this example I'm just using a window of 3, but we could use a window of 17, 50, whatever we want. Our input vector is one, our output vector is another. Again, this is the example where we've got the sequence that we've put in and one-hot encoded: T is 0-1-0, G is 1-0-0, and A is 0-0-1. We have our weight matrices that are initially randomized. We do our matrix calculation and we calculate our final output. We wanted an output of zero-one; our initial one is actually 0.24 and 0.76. So we do back-propagation, modify things as we talked about last time, and we back-propagate through layer one and layer two. Then we input either another vector or retrain in another epoch. And now, instead of 0.24 and 0.76, we get 0.16 and 0.91.
If, after many trials, that's about as low as we can go, we can say we've converged. We can train on a second input, compare, see how well that does, iterate, back-propagate, go through more training rounds. And eventually we have our generalized trained weight matrices, which are then useful for predicting any other gene. Conceptually, this is the same thing as we did for secondary structure, just with a different alphabet and different one-hot encoding. So we construct our dataset. Now we're going to do transformation and feature selection. In this case, this is our encoding. The one I gave before had just three bases; now we have four. You can see how A's, C's, G's, and T's are encoded, and how we've encoded the coding (C) and non-coding (N) output labels. In this case, we didn't need a two-bit output; we could just use a single zero or one. So this is the encoding that we're doing. We're not going to use the null amino acid concept, because we're dealing with a genome that's 4.6 million bases, so I really don't care if we're off by 10 or 15 positions at the ends. Whereas if you're looking at a protein that maybe has 200 amino acids, being off by even just a few residues is going to mess you up. So this is just a simplification. It means we're going to slightly mess up our prediction for about 0.001% of the whole genome, which is pretty small. So, just like with our secondary structure, we're going to have a window. This is a small window, I think 11 bases. We predict the middle base, the sixth base, so the bold G in the middle maps to a bold C, a coding label. As we slide the window along, we say coding, coding, coding, coding until we see a transition to non-coding. As I say, we're not using the null window, so we're just sliding along. No predictions are going to be made for the first few nucleotides, but out of 4 million bases, missing 10 is trivial. So it's just a simplification.
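The encoding and sliding-window setup just described can be sketched like this. This is an illustrative reading, not the module's code: the base-to-bit mapping, the all-zeros choice for N, and the toy labels are my own assumptions.

```python
import numpy as np

# One-hot encoding for DNA; N as all-zeros is an illustrative choice.
ENCODING = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0],
            "T": [0, 0, 0, 1], "N": [0, 0, 0, 0]}

def encode_window(window):
    """Flatten a window of bases into one input vector (4 bits per base)."""
    return np.array([bit for base in window for bit in ENCODING[base]])

def sliding_windows(seq, labels, size=17):
    """Yield (input vector, coding label of the centre base) pairs."""
    half = size // 2
    for i in range(half, len(seq) - half):   # ends are skipped (no null window)
        window = seq[i - half:i + half + 1]
        yield encode_window(window), labels[i]

seq = "ATGCGTACGTTGACCAGGT"
labels = [1] * 12 + [0] * 7        # 1 = coding, 0 = non-coding (toy labels)
X, y = zip(*sliding_windows(seq, labels, size=17))
print(X[0].shape)  # a window of 17 bases flattens to 68 bits → (68,)
```

Note that the first and last `size // 2` positions get no prediction, which is the 10-or-so missed bases per genome mentioned above.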
Here's an example where we've got a window of 17, and now these are not amino acids but bases. You can see how they're encoded with the encoding function that we had. We're going to do the same flattening: if we have four bits instead of the 21 that we had for amino acids, and we've got a window of 17, then we have a flattened vector of 68 bits for each position along the genome. And we can shift this window and move it down. Again, this is almost an exact analogy to what we did for secondary structure, and you could use essentially the same code. We also converted the genome data that we got for the prokaryote from NCBI to mark coding and non-coding. So we took the 4.6 million bases in E. coli and put in ones and zeros to match the positions of all the genes, because that's our output. We're going to use a training set of 70% and a test set of 30% of the 4.6 million bases. That translates to 3.2 million training samples and 1.3 million test samples. Lots of training data, which is great. Then we have the associated coding bits and our window sizes, so the inputs could be 68 bits or larger or smaller, and the output, which is written on the right side of this table. Again, if you understood a little bit about the secondary structure prediction ANN, this is almost identical; we're just using a slightly different encoding. So for this program, we'll call it a simple gene-sequence ANN, GS ANN. As I say, you can pretty much copy the secondary structure ANN and start coding just as we did. We put in the same functions. We import the gene position data that we've got so that we know which positions are right and which are wrong. We read both the forward and reverse strands, with the same sequence-reading code that we had for the ORF finder. And then, as I said, the rest of it is basically the secondary structure program that we developed in module four, with some minor tweaks. So how did we do?
We looked at different window sizes. We went up to a window size of 17, then we started shrinking the window size down to seven bases and then nine bases. These are the confusion matrices. Certainly, the window size of nine does better than the window size of seven. Going up to 11 or 17, or down to five, didn't improve things, so a window size of nine turned out to be the best. And it's not great. What you want to see in the confusion matrix is everything along the diagonal: predicted coding versus actual coding being 0.99 or 1, and predicted non-coding versus actual non-coding also around 0.99. Here we're seeing that we're only about, on average, 50 or 60% correct, and that we mess up with false positives and false negatives fairly badly, probably around 40 to 45% of the time. So, kind of like the hidden Markov model, which we presented in module five, this one is also disappointing. It's basically saying that you can't simply translate what you thought was a good model, or what worked for secondary structure, to gene prediction. What we found is that because the window sizes were relatively modest and because there wasn't enough information in these windows, we didn't get the expected output. If this is the sequence down at the bottom here, we would have expected to see clusters of N's for non-coding followed by clusters of C's for coding. What we actually saw was just this alternation of N, C, N, N, N, C, N, C, so there was no coherence. That kind of pattern might look plausible for secondary structure prediction, but it's not what you want to see for genes. It caused flip-flopping at the output, so it didn't really give very useful gene predictions. At the gene level, it was even worse than what I show here; this is at the base level. At the gene level, we're probably at about 0.1, and the rest is pretty abysmal. So overall, this was a disappointment.
And I think the lesson is that naively applying a neural net, or a concept that worked for one problem, to another doesn't work. What we needed to do was rethink how we could use the neural net and rethink which features we should be looking for. And this is what I've been emphasizing not only in the last lecture but, I think, throughout yesterday as well: to really use machine learning well, you have to know a little bit about the problem and be aware of the things that help or enhance the performance. You need to understand the problem. You need to read through the literature, and you need to be aware of the things that other people have suggested or thought of. This is the essence of supervised learning. You are the supervisor. You are the one deciding what you're going to expose your learner to and which pieces of information are relevant and which ones aren't. You also have to make decisions about how to train it. And it's a similar situation with the hidden Markov model story: how do you design the topology? How do you choose how many hidden nodes you're going to have and how they're connected? That's up to you, and it often requires a fair bit of thinking and intelligence. So we looked at the problem and said, okay, we've done really badly; how do we improve it? This is where we actually started reading the literature about gene prediction. And we realized that we needed to include more than just start and stop codons. We also wanted to include alternate start codons, and we wanted to include codon frequency information. Some of you might remember or know that there's a preference bias for codons in coding regions versus non-coding regions. There's a first-base preference and a second-base preference, and if you're designing genes, if you've ever had to do that, you're told about base preferences and codon preferences to enhance expression.
People also noticed, even back in the late 1970s and 1980s, that there is a tendency for pairs of codons, hexamers of six bases, to be biased between coding and non-coding regions and between promoter and terminator regions. We also wanted to include information on the Pribnow box and the Shine-Dalgarno sequence. Ideally, those are sequences that we could identify through hidden Markov models, although in this example we actually just used position-specific scoring matrices, partly because it was a little faster and because at the time we hadn't finished the hidden Markov model when we were putting this together. So this is the additional data that we felt would actually help improve things. This is, if you want, feature engineering. It's looking through the literature and asking: what else have other people found to be useful features for making the prediction? Naively, we could have said all we need to know is the sequence and the start and stop codons, and we're done. But that's obviously not sufficient; we saw pretty poor performance. And so that's what we started doing. This is just a review of the Pribnow box, the minus 35 and minus 10 regions that we talked about in module five, and the Shine-Dalgarno sequence, which is the ribosome binding site and has a characteristic pattern of bases and positions. This is something that can be encoded through either a hidden Markov model or a position-specific scoring matrix. There are also certain features of termination loops found in bacteria: a base-pairing stem-loop structure, where you scan for a stretch of six or seven complementary bases. This is another way of terminating a gene. So that led us, as I say, to think about how we do feature engineering and feature extraction.
It's essentially a way of using standard statistical methods to describe the data, to intelligently select what data we want, and to use the literature, our own knowledge and experience, or the experience of others. What it also does is that, instead of just dealing with raw sequence, which is just a long collection of bases, we can convert hundreds of bases into a few features. It collapses the sequence, instead of being hundreds, thousands, or millions of bases long, down to half a dozen features. So there's almost a rescaling that's happened. We've got our raw sequence that represents thousands to millions of bases, and now we've converted those thousands of bases into five or six features or whatever we have. This actually helps a lot. It reduces the data size and it speeds things up. I should comment that the simple gene predictor with a neural network took forever, because it was dealing with so many inputs and training for so long, since we were working with millions of bases. But by extracting features from the sequence, we've collapsed it into a very small little vector, and that's great. So one of the things we did was to start looking at codon usage frequency. This is just a plot of the 64 known codons and how frequently they're used, how frequently they're seen, for example for serine and arginine. Only two arginine codons are really used; only one leucine codon is frequently used; and only two of the three stop codons are frequently used. These are all things that have been noted and detected in bacteria. It may vary between certain classes of bacteria, more at the class or order level, but within the same species or genus this codon preference seems to hold pretty strongly. So we can code and identify the codon usage frequency. We can look at the percentages: how frequently they're used and which ones are most frequently used.
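Tabulating codon usage for an ORF is straightforward; here's a minimal sketch (illustrative names, and the 15-base example sequence is made up):

```python
from collections import Counter

# Sketch: fraction of each codon among the in-frame codons of an ORF.
def codon_usage(orf_seq):
    codons = [orf_seq[i:i+3] for i in range(0, len(orf_seq) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

usage = codon_usage("ATGGCTGCTCGTTAA")
print(usage)  # → {'ATG': 0.2, 'GCT': 0.4, 'CGT': 0.2, 'TAA': 0.2}
```

Fractions like these, computed over a whole candidate ORF and compared against known coding versus non-coding usage, are exactly the kind of compact feature the lecture describes, replacing the raw sequence with a short vector.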
We can calculate the fraction of A's in the first codon position, and likewise for each base at each position. So in the end, codon usage frequency can be converted to just a vector of 12 numbers, and that's what was done. We were able to take all of that information, all those plots for the 64 known codons, and convert it into a codon usage frequency set which is not thousands or millions of bases long; it's just 12 pieces of information that we can calculate. And this is computed for every potential gene that we see. So if an ORF has been identified as 1,500 bases, it's now 12 numbers. We can also go a little further, and this is not our idea, but something that other smart people had figured out, so again this is using the literature. We can convert some of this information into an entropy. This goes back to the Shannon entropy formula that we introduced yesterday, with p log p; in this case we sum over the 12 instances and come up with a total. So instead of the 12 numbers, we get a single number for that entropy. Then we can convert that entropy into what's called the gamma value; the log of 12 there just refers to the 12 units. So now we can go from the original values, 0.24 and so on, all the way down to this single number, 0.12. The next thing we can do beyond the codon frequency statistics is to get hexamer statistics. It turns out there are 4,096 hexamers; that's 64 squared, since a hexamer is a pair of codons and there are 64 codons. We can look at the information we already have for other prokaryotic genomes, looking at promoters, terminators, and non-coding regions, and calculate things as a log-odds or log-ratio term. Again, this keeps the numbers manageable; it's a scaling trick. And remember, I talked about scaling and normalization.
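As a minimal sketch of the 12-number codon-usage feature and the entropy/gamma conversion just described, assuming the 12 numbers are the four base frequencies at each of the three codon positions (the function names are illustrative, not the course's actual code):

```python
import math

BASES = "ACGT"

def positional_base_freqs(orf):
    """12-number feature: frequency of each base (A, C, G, T) at each of
    the three codon positions of an ORF sequence."""
    counts = [[0] * 4 for _ in range(3)]
    n_codons = len(orf) // 3
    for c in range(n_codons):
        for pos, base in enumerate(orf[3 * c:3 * c + 3]):
            counts[pos][BASES.index(base)] += 1
    return [counts[pos][b] / n_codons for pos in range(3) for b in range(4)]

def entropy_and_gamma(freqs):
    """Shannon entropy H = -sum(p log p) over the 12 values (renormalized
    to sum to 1), and gamma = 1 - H/H0 with H0 = log(12), collapsing the
    whole vector to a single number."""
    total = sum(freqs)
    probs = [f / total for f in freqs if f > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h, 1.0 - h / math.log(12)
```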
So this is one of the things we're doing here; the entropy value was another scaling trick. In this case, high values indicate higher usage in coding regions. A-T-T-A-G-C, that hexamer, has mostly negative or small numbers, so it's not very widely used in promoters or terminators. On the other hand, T-A-A-A-A, that particular hexamer, is widely used in promoters and terminators. This can be calculated for all 4,096 hexamers from existing data on prokaryotic genes. So we can calculate hexamer scores for ORFs, for promoter regions, and for terminator regions; again, this is data we can get from other sources. So we're creating feature sets. The other thing we mentioned is trying to find promoters. We could have used a hidden Markov model, but at the time we put this together we didn't have a good working one, so we just used a position-specific scoring matrix. Because promoters have different gaps between the minus 35 and minus 10 regions, some of them 15, 16, 17, 18, or 19 bases, and because PSSMs don't handle gaps, we had to create five different PSSMs to deal with the different gaps and find these promoter regions. We can convert them to log-odds ratios and then generate the score; with these PSSMs, the lowest score wins. The one with the 18-base spacer has the lowest score at 4.97, so that's the winning promoter. We can also identify the location as we scan along certain regions, mapping out where the lowest score occurs and at what position that lowest-scoring box sits. So again, this is a set of features that we can calculate throughout the entire genome. We can do the same thing with a PSSM for the Shine-Dalgarno sequence: we created a PSSM for that, and we can scan along, looking ahead to see if there's a strong Shine-Dalgarno sequence in the region.
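A hedged sketch of that variable-spacer scan, using the "lowest score wins" convention from the lecture (so each matrix entry is treated here as a mismatch-style cost); the toy `consensus_pssm` helper and all values are my own illustration, not the course's matrices:

```python
BASES = "ACGT"

def consensus_pssm(consensus):
    """Toy PSSM: cost 0 for the consensus base, 1 for any other base."""
    return [[0 if base == c else 1 for base in BASES] for c in consensus]

def pssm_score(pssm, seq, start):
    """Sum of per-position, per-base costs for seq aligned at start."""
    return sum(pssm[i][BASES.index(seq[start + i])] for i in range(len(pssm)))

def scan_promoter(seq, pssm35, pssm10, spacers=range(15, 20)):
    """Score the -35 and -10 matrices at every position, trying spacer
    lengths 15-19 between them; return (best_score, position, spacer),
    where the lowest total score wins."""
    best = None
    w35, w10 = len(pssm35), len(pssm10)
    for pos in range(len(seq) - w35 - max(spacers) - w10 + 1):
        for spacer in spacers:
            score = (pssm_score(pssm35, seq, pos)
                     + pssm_score(pssm10, seq, pos + w35 + spacer))
            if best is None or score < best[0]:
                best = (score, pos, spacer)
    return best
```

Running five fixed-width matrices over the same window is the workaround for PSSMs not handling gaps: each spacer length gets its own placement, and the best of the five wins.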
So the codon usage gives 12 features, the hexamer scoring is converted to three features, the Pribnow box PSSM to 10 features, and the Shine-Dalgarno PSSM to seven features. Recall that we have this ORF finder, which identifies 1.6 million ORFs, of which 4,200 are true ORFs and the rest are not. We use the old code I described earlier to find every possible ORF, take all 1.6 million of them, and convert each ORF into these features: 12, three, 10, and seven. That's 32 features. Those 32 features for each ORF are the inputs to the artificial neural net, and the neural net makes the call: is this entire ORF, of say 1,000 bases, coding or non-coding? I just want to make sure people are clear on that. We're using the old code I introduced, the ORF finder; it's not machine learning, it's naive, it just finds start and stop codons, and it over-predicted the number of ORFs by 1.6 million. So we're now invoking our artificial neural net to clean that up. To do this, we take ORF features, not the raw sequence: we extract features from each ORF, including the codon frequency, the hexamer statistics, Pribnow box presence around the ORF, and Shine-Dalgarno presence around the ORF, and we run them through the neural net. So now, instead of GSAN, the simple neural net, we're using the feature-based neural net, which we call GPAN. This one actually does exist, whereas GSAN we deleted because it was so bad, so you can find the code in the repository. What you'll see for this code is that the first half has the same structure, which is just the ORF finder, up to about this line here. This identifies all the ORFs; it gets the 1.6 million.
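That naive, non-machine-learning first stage can be sketched roughly like this (forward strand only, illustrative thresholds; the course's actual code does more checking):

```python
# Naive ORF finder sketch in the spirit of the first half described above:
# scan each of the three forward reading frames for ATG...stop stretches
# at least min_len bases long. Illustrative only.

STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=40):
    """Return (start, end, frame) for every ATG-to-stop ORF >= min_len."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i  # open a candidate ORF at the first start codon
            elif codon in STOPS and start is not None:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # reset whether or not it was long enough
    return orfs
```

Because it accepts every start-to-stop stretch, a finder like this massively over-predicts, which is exactly why the neural net is needed as a second stage.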
Then we call the next part, which calculates the codon usage frequency and the entropy parameter gamma. We do the hexamer scoring and counting with these functions here. Then we run the PSSM scoring functions on the promoter region, the Pribnow box, and the Shine-Dalgarno scoring function is also called. We take all of those calculated parameters and feed them into a neural net with 32 inputs, five hidden units, and one output: yes or no, coding or non-coding. If you look at the code, you'll see that we import numpy and pandas as we normally do. We import all the gene position data because we want to evaluate against it; this is the truth table, if you want. We import the sequence just as we did with the genome ORF finder I described earlier. This is stuff you've seen before: the checking, the missing data, the label values. This is the same code again from the ORF finder, where we're using start and stop codons, making sure that we're in the right reading frames, and checking all three reading frames. Same code as before: two lists are created, we have a minimum length of 40 bases, and we look for possible stop codons. This generates all of the necessary ORF data. Now we can start from here, the codon frequency determination, all the way through to the neural net. So, the codon frequency calculation: we have to identify how many of each codon are found, break them out, look at the range, and make sure we've got all of them marked up. From there, we count the frequencies of A's, C's, G's, and T's in those codons. We calculate the entropy and the gamma; you can see H is the entropy here, and gamma is one minus H over H naught. Then we produce the vector that's used for the calculation. Next, after we've done the entropy and codon frequency, we have to do the hexamer calculations.
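The hexamer scoring works on a log-odds idea; here is a hedged sketch, assuming we already have hexamer counts from known coding and non-coding (promoter/terminator) DNA. The pseudocount of 1 per hexamer is my addition to avoid log(0), not necessarily what the course code does:

```python
import math

def hexamer_log_odds(coding_counts, noncoding_counts):
    """Log-odds score per hexamer: log of its frequency in coding DNA over
    its frequency in non-coding DNA. Positive values indicate higher usage
    in coding regions. A pseudocount of 1 over the 4,096 possible hexamers
    avoids taking log(0)."""
    total_c = sum(coding_counts.values())
    total_n = sum(noncoding_counts.values())
    scores = {}
    for hexamer in set(coding_counts) | set(noncoding_counts):
        p_c = (coding_counts.get(hexamer, 0) + 1) / (total_c + 4096)
        p_n = (noncoding_counts.get(hexamer, 0) + 1) / (total_n + 4096)
        scores[hexamer] = math.log(p_c / p_n)
    return scores

def score_orf(orf, scores):
    """Average hexamer log-odds over a sliding window of an ORF sequence;
    unseen hexamers contribute 0."""
    hexes = [orf[i:i + 6] for i in range(len(orf) - 5)]
    return sum(scores.get(h, 0.0) for h in hexes) / len(hexes)
```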
So we're counting each of the 4,096 possible hexamers, not only over the coding regions but also in the promoter and terminator regions. This is a function to count hexamers, and this is the function to assign a score based on each hexamer's frequency in both coding and non-coding DNA. So it's looking up the codon frequency information, and we're also tracking some non-coding prior information from the tables that we collected. After the codon frequency and hexamer frequency scores are calculated, we look at the promoters: the Pribnow box and the Shine-Dalgarno sequence. These are the PSSMs that were previously created, and they're used to score each of the sequences in and around the promoter regions we'd identified; this is simply calculating PSSM scores. The same sort of thing is done for the Shine-Dalgarno: we're looking for this consensus sequence, but also searching through possible variations among the 4,096 hexamers. The Shine-Dalgarno score is then calculated for the sequence region around the promoter region; we make sure we find it, store those scores, and note how good they are. So these are Python routines intended to extract the features, evaluate them, and produce numbers, each producing a certain number of features: the codon frequency gives us 12, the hexamer scoring gives us three, and the Shine-Dalgarno and Pribnow promoters bring the total to 32. So with those 32 inputs, we decided to use our neural net to do the discrimination and make the decisions. So: 32 features for each of the 1.6 million ORFs, those go into a hidden layer with five units, and the output is a single yes or no, zero for coding, one for non-coding. So that's the architecture.
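That 32-5-1 architecture can be sketched as a tiny feedforward pass. This is illustrative only, with random untrained weights; the real GPAN network learns its weights from the labelled ORFs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal 32-5-1 feedforward sketch: 32 extracted features in, one hidden
# layer of 5 sigmoid units, a single sigmoid output for the coding vs
# non-coding call. Weights here are random placeholders, not trained.
W1 = rng.normal(size=(32, 5))
b1 = np.zeros(5)
W2 = rng.normal(size=(5, 1))
b2 = np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(features):
    """Forward pass: 32-feature vector -> score in (0, 1) for the ORF."""
    hidden = sigmoid(features @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)[0]
```

With only 32 inputs per ORF instead of thousands of bases, even a pure-Python network of this size is cheap to evaluate over all 1.6 million candidates.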
Why did we choose five hidden units? It was just optimization: we tried different numbers of hidden units and five seemed to give the best answer. This is a big program, almost a thousand lines. It takes a while to extract the features, about 70 seconds to run, so a little over a minute. As I say, it's two parts: one is the ORF identification, which is not machine learning; the second part is the feature extraction; and then really a third part, the neural net. All three combined produce this 800- to 900-line program. How did we divide up the data? We took the 1.6 million ORFs and used a 70/30 split: trained on 70%, tested on the other 30%, and then assessed it. Using just information about start and stop codons, the simple ORF finder got up to 91% for finding the genes, 4,200 out of the 4,400, but with the 1.6 million over-prediction in terms of false positives. So good performance in one direction, poor performance in the other. GSAN, where we used a nine-base window, gave us 40% prediction for correct genes and 73% for non-genes, still not great performance in terms of false positives and false negatives. If we instead use just the codon frequency statistics with the feature-based GPAN method, rather than GSAN or the simple ORF finder, you get a really dramatic change: 92% correct genes, 91% non-coding genes, and only a seven to nine percent error. So let's just go back: before, we were seeing numbers like 60% and 27%, or only 40% and 73%. And with the really naive one, it was 91% here, but 99.999% here and 0.9% there, or 9% there.
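The percentages quoted here are essentially sensitivity (the fraction of true genes recovered) and specificity (the fraction of non-genes correctly rejected). A quick sketch with made-up counts, not the course's actual numbers:

```python
def assess(tp, fp, tn, fn):
    """Gene-prediction metrics from a confusion matrix: sensitivity is the
    fraction of true genes recovered, specificity the fraction of non-genes
    rejected, and fdr the fraction of predicted genes that are wrong."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fdr = fp / (tp + fp)
    return sensitivity, specificity, fdr
```

A finder that calls nearly everything a gene can score well on sensitivity while its specificity collapses, which is exactly the naive ORF finder's failure mode above.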
So this alone should show you that changing your formulation and your model, not simply treating it like a secondary-structure prediction tool but taking features and combining them intelligently, using feature selection and feature engineering, gives a profound improvement. In fact, this alone is about as good as the best gene predictors get, and it's just using codon frequency statistics. We then ran the program with the hexamer statistics included and things improved slightly: 92.0 to 92.3, 91.1 to 91.7, and the number of false positives was also slightly reduced. So codon frequency and hexamer statistics help. If we then include promoter information, again we're almost up to 93%, the other figure is above 92%, and the errors have now dropped to around 7%. We tried a bit with the HMM model, turning it on and off; it made slight improvements, but nothing to shout about. So although we had hoped the HMM would help more than it did, it didn't make that much difference in terms of promoter prediction. But overall, I just want to emphasize that we went from really awful performance, over-predicting by millions of ORFs, to one where we're hovering around 92 to 93% prediction of the correct ORFs with only around a 7% error of incorrect ORFs. And as I say, that actually makes this particular algorithm, at least version three, about as good as any prokaryotic ORF predictor there is. So again, the point is that doing proper reading and proper feature engineering makes a huge difference. Codon frequency information: there have been papers written about it, and long discussions of the idea of using entropy to summarize it. These were not our ideas; this is what was in the literature. Using hexamer statistics: again, someone smart came up with the idea, noticed it was useful, and we incorporated it.
The Shine-Dalgarno and Pribnow box were known; we just said, well, that's going to make a difference, and indeed it helped. So the lesson here is smart encoding, smart feature selection, and knowing more about your problem; using those can make a real difference to performance. The naive approach I showed you before, just scanning over a nine-base window, gives you pretty miserable results. If we had simply used the simple hammer of neural nets for gene prediction, that's about as far as we could go; we could have tried deep neural nets and all the latest tools anyone has described, and we would probably still get the same pretty bad performance. So again, this just highlights that reframing your problem can lead to dramatic improvements in performance. What we have now is a neural net program written in pure Python; we've also done this in R. And what you can do now is use the remaining time, which looks like about 15 minutes, to run through the lab. So I'll briefly give you some guidance about the lab. You can go to the GPAN code repository in module six, which describes all of the tools; we have both the Python and the R versions. It's a long program, almost a thousand lines in Python, so it might take you a little time if you want to understand the code, but hopefully the slides have given you a bit of an outline. This one is complicated: uploading the data is not a single file upload, there are six files, and these have to be uploaded in order to be able to run the program. Once you've uploaded all of the data, you can run from the runtime menu, and you can also Run All. You can enter a length anywhere between 10,000 and 50,000 nucleotides, you can choose a minimum ORF length, and then you can collect and view the results. You can go to different cells.
So you can change how long your ORF needs to be, and you can enter a value for the limit on how much of the genome to use; in this case we're not going to try to use the entire genome, so we suggest something between 1% and 30%. We also have other exercises where you can play around with different cells, removing, in this case, the codon usage frequency features and seeing how the prediction accuracy increases or decreases. You can select and deselect other features to see what happens, and try to find out which ones make the most difference. You can also change the SD start and end locations, that is, for the Shine-Dalgarno sequence, rerun it, and see how that improves or degrades overall performance.