Okay, I think, can everyone see this now? This is module five today, under the usual Creative Commons license. What we're gonna be talking about is gene finding with artificial neural nets. And there is a change in the schedule. For those of you on Eastern Time, we're starting at about 10 after 10 and we're gonna go until noon. So instead of an hour and a half, this is gonna be almost two hours, with a break. Then we're gonna go into modules six and seven, which deal with machine learning with Keras and scikit-learn. Hopefully those will be a little shorter and give you more time to do things. And then we'll wrap up today with a module on information extraction and coding with ChatGPT. So a minor change to the schedule, just take note. I'll also try to have a little bit of a break, because sitting for two hours is a long time. So this is about gene prediction. We're gonna talk about prokaryotic gene prediction, partly because it's easier than eukaryotic gene prediction, but I think it's a really useful example. We're gonna talk about how it can be done and how it can be assessed, and we're gonna look at some Python code for different methods to predict prokaryotic genes. We're not gonna use an HMM here; that's something I should have fixed up in the slides last night. The key thing here, and I'll emphasize this at the start: this is not like secondary structure, which we did last night. It's an illustration of how feature selection, and smart analysis of your problem, can make a huge difference to the performance of a machine learning algorithm. So it's an illustrative example, not just "let's do the same old thing we've been doing before." It's partly to show you a real process: naively you'd try it one way, you get a result, and you find it's not as good as you would hope. You try another way, still not as good.
And then you try, in this case, smart feature selection, and you get a radical improvement in your performance. That's the whole point of this particular lesson. Okay, so background. I think everyone knows this stuff, but there is a genetic code. There are 64 different types of codons. Codons are these three-base words, if you wanna call them that, that code for different amino acids. There are start codons — well, there's one in particular, ATG, or AUG, the start codon. And there are three stop codons: TAA, TAG, TGA. And of course this is using uracil, which is the RNA version of things, which is probably the correct way of thinking about it. The genetic code has been known since the 1960s, and we can use simple programs to do translation. So if we find a gene, we can then figure out what that gene codes for, at least for protein-coding genes. And we can talk about forward strand and reverse strand, because DNA is a double-stranded helix. So the forward strand, which is traditionally at the top, we're showing how it can be translated: there's a frame one, a frame two, a frame three — we can shift things. And then the reverse strand we read in the other direction. We can create a reverse complement, which allows us to read from left to right, but I'm just showing what it would be like if we were reading from right to left. Again, we can see frame minus one, frame minus two and frame minus three. So technically any given stretch of DNA can code in six different reading frames. Again, I think everyone knows that. Just a quick review of prokaryotic gene structure: prokaryotes have a really simple gene structure. Typically they'll have some upstream signalling components — the most common one is called a TATA box — maybe 100 bases upstream. They have a start codon, they have a stop codon. Genes in prokaryotes are typically just called open reading frames, and they don't have interrupted genes like eukaryotes.
So it makes it a lot easier to find things. I'm just showing a segment where we've got a gene, or a pretend gene, with the start codon in green and the stop codon in red. ATG is the start codon for more than 90-95% of genes, and the stop codon is any of the three. And again, there are different reading frames, but once a gene is being read, it sticks with one reading frame. So for some of you, if you've done some coding, maybe the very first real program you wrote in biology was gene finding by open reading frames. You start with a long piece of DNA and you scan until you find a start codon, and that defines your reading frame. Then you stay in that reading frame and scan in groups of three until you find a stop codon. Now, if that open reading frame is long enough — usually you might say 40 or 50 codons, though the cutoff is your choice — you say, okay, that's a gene. If it's shorter than that, it's not a gene. And once you've found the stop codon, you go back to the top of the algorithm and scan forward again. Now, if you're avoiding nested genes, you might jump past the complete gene, or you might start scanning within the open reading frame you just identified and look for another start codon. Anyway, it's a fairly simple algorithm. It's just looking for start and stop points, in both the forward reading frames and the reverse reading frames. There are web servers — NCBI offers one, which is just a simple ORF finder. You can paste in a few thousand or tens of thousands of bases and it'll identify the open reading frames. It shows you the direction, numbers them, and tells you the start, stop, the length, the strand, the frame type and so on. And then you can translate and do BLAST searches on them. Again, for people who've done prokaryotic genetics, this is pretty simple stuff.
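The scanning loop I just described can be sketched in a few lines of Python. This is a minimal illustration, not the module's actual code — the function name, the ATG-only start, and the 40-base default minimum are simplifying assumptions of my own:

```python
def find_orfs(seq, min_len=40):
    """Scan one strand for ORFs: an ATG start followed, in the same
    reading frame, by the first in-frame stop codon (TAA/TAG/TGA)."""
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in range(3):                  # the three forward reading frames
        i = frame
        while i < len(seq) - 2:
            if seq[i:i + 3] == "ATG":       # found an in-frame start codon
                j = i + 3
                while j < len(seq) - 2:
                    if seq[j:j + 3] in stops:   # first in-frame stop
                        if j + 3 - i >= min_len:
                            orfs.append((i, j + 3, frame))
                        break
                    j += 3
                i = j + 3                   # resume scanning past the stop
            else:
                i += 3                      # step one codon within this frame
    return orfs
```

Jumping to `j + 3` after a hit is the "skip past the complete gene" option; you could instead keep scanning inside the same ORF to pick up nested start codons.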
It's not the most sophisticated approach, but it's the way prokaryotic genes often are found. Now, for prokaryotic or even eukaryotic gene prediction, there are formal ways of evaluating how well you're doing. So, in light blue are, say, the genes that have been identified in this DNA segment, and then you have some ORF finder's predictions — this could also be exons and introns, where blue is an exon and the blank line is an intron. Either way, whether it's prokaryotic or eukaryotic, you can use this concept of comparing actual to predicted. When blue matches with red, that's a true positive. When red matches with the blank line, that's a false positive. When the blank line matches in both actual and predicted, that's a true negative. And when the predictor doesn't make any prediction where there's an actual gene — so blue matches to a straight line — that's called a false negative. So, this is the same concept as the confusion matrix we talked about before, and it gives us the same way of calculating things like sensitivity and specificity. There are other calculations that people use — some people use precision and correlation, which have different formulas — to measure how well things are being done. And you can measure performance at the gene level or at the base level. Here we're measuring basically at the base level, where we're counting which bases are false positives, which are true negatives and which are true positives. If we were counting at the gene level, every one of these predictions would be considered completely wrong, because none of them match completely all the way through. Sensitivity has a formal calculation formula, which I think I gave last time, as does specificity, and there's also a formula for precision. These are all things you can look up, but this is how gene prediction evaluation is done overall. We'll use this later on when we talk about the performance of this gene predictor.
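Those base-level metrics can be computed directly from the four confusion-matrix counts. A small sketch (the function name is mine; the formulas are the standard ones you can look up):

```python
def evaluate(tp, fp, tn, fn):
    """Base-level performance metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of real coding bases found
    specificity = tn / (tn + fp)   # fraction of non-coding bases correctly left alone
    precision   = tp / (tp + fp)   # fraction of predicted coding bases that are real
    return sensitivity, specificity, precision
```

For example, 90 true-positive bases, 10 false positives, 80 true negatives and 10 false negatives give a sensitivity and precision of 0.9 each.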
So, we have a set of start codons. ATG is the most common, around 95%. There are a couple of rarer start codons, GTG and TTG. And then there are the universal stop codons: TAA, TAG, TGA. So, for a large prokaryotic genome, three or four million bases, you can find all the start and stop codons on both the forward and reverse strands. Now, not all of those ORFs will be genes, and you'll see some numbers here about how many ORFs you can actually find in a prokaryotic genome — it's a lot. So, here's a little animation where we're going through a DNA segment of about, I don't know, 40 bases, and we're sliding along one base at a time with a codon search, determining whether something is a start codon. So, if there's an ATG, that's a start. Then we slide along a little bit further, and this is a stop. And each time we find one, we add it to our list of starts and stops. And what you have to do is track which reading frame you were in. You can choose an arbitrary start: if the T here is reading frame one, the G beside it is reading frame two, and the G after that is reading frame three. So, we're just keeping track of the starts and stops and the frame each one falls in. And again, that's a trivial program to write. And with that, as I said, if you have three million bases, you scan through it on the forward and reverse strands and identify the starts and stops. You can see a whole bunch of starts and stops found on the forward strand, and then we look at the reverse complement and there are others there too. What's marked in red are the ones that were wrong, and the green ones were correct. And that's the typical thing that happens with simple ORF finding, because there are literally millions of ORFs in a genome of three to four million bases.
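The frame-tracking scan and the reverse complement are both one-liners in spirit. A minimal sketch (function names and the ATG-only default are my own; the frame is just the position modulo three):

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string so the minus strand reads left to right."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def starts_and_stops(seq, starts=("ATG",), stops=("TAA", "TAG", "TGA")):
    """Slide along the sequence one base at a time, recording each start
    and stop codon together with its reading frame (0, 1 or 2)."""
    found = {"start": [], "stop": []}
    for i in range(len(seq) - 2):
        codon = seq[i:i + 3]
        if codon in starts:
            found["start"].append((i, i % 3))
        elif codon in stops:
            found["stop"].append((i, i % 3))
    return found
```

Running the same scan on `reverse_complement(seq)` gives the minus-strand starts and stops.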
Okay, so the first thing we're gonna do is a really simple open reading frame finder, using the algorithm I've just described, pictured and animated. We're not gonna force you to write it. We're gonna go to module five, open the Python code for that, find the ORF finder notebook, select that, and go to Colab. The general algorithm is just like everything else: we read the data, then we verify the data set to look for missing data or anything else. We're gonna create a function called codons, and we're gonna have lists of the start and stop points. We're gonna identify all the start and stop codons in all three reading frames and then match them within the same reading frame. So, if we've got a start and a stop in reading frame one, that's an ORF. We're gonna apply a minimum ORF length of 40 bases. We could have set 40 codons, which would have been 120 bases, but it turns out there are some really short genes in prokaryotes, so this cutoff is useful. There's also something that's been noted over time: in some cases, genes read through stop codons. So, we can also identify pairs of start and stop codons where there are up to three stop codons in the same reading frame. And then we're gonna create another function just called ORF finder, which is different from ORF and different from codons, and run it. And of course we can do this on the reverse strand as well. Then we're gonna determine how many of these ORFs are correct, because we have a reference set — in this case the E. coli genome, where the ORFs have been correctly identified through years of hard work. So, if you were writing this on your own, you would import NumPy and Pandas, as we've always done. We're aliasing them just to keep things a little simpler, so NumPy is np and Pandas is pd. We're reading the genome sequence here.
So, we're gonna read the forward strand and produce a reverse complement. We're cleaning up some of the gene tags or descriptors here, just to make sure that we've got the sequence as we wish — getting rid of the first line. This is the missing values and data check; we've done this before, and there's a verify-data-set function. We're looking to see if there's any missing data in our columns, and since we're dealing with sequence data, you almost never have missing data. Of course, there could be all kinds of Xs and other characters, and that check probably should have been added to this particular code. And then the codons: one set for the start codons and the other for the stop codons. So, we're creating list one for start codons and list two for stop codons, and we're counting as we go through. So again, this is just a small function. And this is where we're now scanning for ORFs. We're looping over three different frames — we start at zero, one and two, as opposed to one, two and three. Once we're in a given frame, we look at the first codon. If it's a start, we add it to the start codon list; if it's a stop codon, we add it to the stop codon list. And once we've gone through all the codons in the current frame — frame zero, in this case — we move to the next frame, from zero to one, or one to two. So, you're seeing list one and list two, which are capturing those start and stop codons as we plug away. Essentially, given frame one, the green codons are the starts and the red ones are the stops. We're building lists of frame one stop codons, frame one start codons, frame two start codons, frame two stop codons. And we have to recognize that for a given start codon, say at position 190, a stop codon at less than 190 is not possible. But if it's greater than 190, and greater than 190 plus 40, it's potentially a valid stop codon.
So, this is where we've got a minimum length that says we can keep that ORF. And then, knowing which frame we've been working in, we can say this is the start and stop codon. So again, this is what the ORF finder is doing. Now, the other point, at least for genes — and this is something people have observed — is that you can have up to three stop codons in some genes, or at least cases where two are ignored and the last one is kept. So, this is a trick to avoid finding the correct start codon but the wrong stop codon. And again, this is knowledge that people who have studied prokaryotic genes have learned. This is another point where we're using feature information, or feature selection, to say, hey, we can't use only the simple rule that as soon as you find a stop codon, that's the end of it. Likewise, if you recall from the beginning, there are also a couple of alternate start codons. So again, this is knowledge, or feature selection, that we're technically embedding into this program. It's not a machine learning program, it's just wisdom, and nothing beats having some experience or wisdom when you're coding. Then we have the ORF function — the official ORF finder function, which is the third function I talked about. It runs the previous functions on the lists of start codons and returns the list of all ORF predictions. In this case, it's gonna make a comparison between the correct ORFs and the incorrect ORFs. And this is where we're able to look up the reference set of known open reading frames for the genome we're analyzing. So, we can run this on the E. coli genome. We're choosing E. coli K-12 because it's probably the best-studied prokaryotic organism. It is the reference genome, and it's been widely used in prokaryotic genetics. We know the exact number of protein genes, 4,407. We know the exact length of the genome, 4.6 million bases.
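The read-through trick amounts to keeping more than one candidate stop for each start. A sketch of how one might generate those candidates (the function name and the assumption that the stop positions are already filtered to the same reading frame and sorted are mine):

```python
def pair_with_readthrough(start, stop_positions, max_readthrough=2):
    """Given a start position and the sorted, same-frame stop positions,
    return candidate ORF end points: the first in-frame stop downstream,
    plus the next one or two in case the earlier stops are read through."""
    downstream = [s for s in stop_positions if s > start]
    return downstream[:max_readthrough + 1]
```

With `max_readthrough=2` this allows up to three stop codons in one gene, where the first two may be ignored, matching the rule described above.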
You can go to the NCBI identifier NC_000913.3; this is the reference genome for K-12, and it has the list of the correct gene locations. The first gene starts at position 337 and goes on, and we also know there are complement ones. So all this data is available — this is our gold standard. I actually spent a good part of my early academic career working on annotating E. coli genomes, which is another reason we've chosen it. And if we run our simple ORF finder on the 4.6 million bases — accounting for extra stop codons, using the minimum gene lengths we know about, using multiple start codons including ATG and GTG — we get 4,282 correct genes identified. That's a little over 91%. However, we overpredict 1.6 million genes. So if we fill in our confusion matrix, we get a nice score of 0.91 for predicted genes against actual genes. But for non-predicted genes against actual genes, we get basically a score of zero, which is terrible. And in terms of overprediction, we get 0.9999, which rounds up to one — absolutely terrible. So even an ORF finder as sophisticated as this one, which accounts for alternate start and stop codons and various other refinements, overpredicts by a huge margin. This would be a total disaster if we were trying to do gene prediction, and it's also the reason people don't generally use ORF finding for prokaryotic gene finding anymore. But this is what happens when you attempt to write a rational, rule-based program for gene finding. Any comments or questions at this point? Okay, so that was an example of, let's say, trying to do it the rational way. Now let's try to do it the machine learning way. We know that the ORF finding approach doesn't work, so let's try an artificial neural net. We're gonna use the standard workflow.
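The gene-level comparison against the gold standard is just set arithmetic over (start, stop, strand) triples. A minimal sketch (the function name and the triple representation are my own; at the gene level a prediction only counts if it matches exactly):

```python
def gene_level_score(predicted, reference):
    """Gene-level evaluation: a prediction counts as correct only if its
    (start, stop, strand) triple exactly matches a reference gene."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # correctly predicted genes
    fp = len(predicted - reference)   # over-predictions
    fn = len(reference - predicted)   # missed genes
    return tp, fp, fn
```

Run on E. coli, the fp count is what explodes into the millions for a plain ORF finder.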
We've seen this diagram before, and we've pitched a proposal: how do I find genes in a prokaryotic DNA sequence? Most of us would probably say this problem is actually pretty similar to what we were doing yesterday, where we took a protein sequence and tried to find locations of strings of Cs for coils, Hs for helices and Bs for beta strands. In gene finding, we could do the same thing, except the alphabet is not an amino acid alphabet, it's a DNA alphabet, so it's simpler. And instead of three states, there are just two: non-coding, which are Ns, and coding, which are Cs. So this is the way we could draw it out and imagine coding it. If we're gonna do the ANN, then we have to find our training set, and the training set is the same set we were testing our ORF finder on: the NC_000913 E. coli set with 4.6 million bases, where all the gene start and end locations have been given, on both the forward strand and the reverse complement. It's the same data that's been compiled since 2005 about the ORFs, where they're located and what they are. So we have a really good gold data set to train on, and we can do the same thing where we choose 70% to train and 30% to test. We've already decided we're gonna do an artificial neural net. We also know we have to do some feature selection and data set transformations. We can do the same thing we did with protein secondary structure prediction, which is one-hot encoding, just like we did with amino acids. So we can encode nucleotides using a four-bit structure: A is 1-0-0-0, C is 0-1-0-0, G is 0-0-1-0 and T is 0-0-0-1. You could even do it with fewer bits if you wanted to save space or time.
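That four-bit encoding takes two lines of Python. A minimal sketch (the names are mine, not the notebook's):

```python
# One bit per nucleotide: A, C, G, T.
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def encode(seq):
    """Flatten a DNA string into a single one-hot bit vector."""
    return [bit for base in seq for bit in ONE_HOT[base]]
```

So `encode("AG")` yields an eight-bit vector, and a 17-base window yields 68 bits.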
The output will just be a zero or a one, with zero for coding sequence and one for non-coding. Now, you might remember that for protein structure we had to insert dummy variables, or null amino acids, because the secondary structure can start at the first or second amino acid and can carry on to the very last one. With genes, that's not typically an issue, so we can ignore the null characters. It's also the case that we're looking at sequences of millions of bases, as opposed to hundreds of amino acids, so this is a slight approximation just to simplify the code. So in concept, although we're using a slightly different alphabet — here's this training set where, I guess I could have put in a C too, but we've got the A's, the T's and the G's — we have coding and non-coding marked as zeros and ones. We can imagine a sliding window where we're moving along one codon at a time. That means if we've got three bases, with a three-bit encoding we'd have an input vector of nine bits, or 12 bits with four bits per base. Our output is either a one or a zero, and the window we're looking at here would give us a non-coding prediction. So, as we illustrated with protein secondary structure, we'd have a codon one-hot encoded, a weight matrix, we'd calculate the intermediate values, then another weight matrix, and calculate the output. In this case, maybe the output isn't exactly one, but it's close. We go through some backpropagation to adjust the weights between the hidden layer and the output and input layers, as we talked about before, so things get changed. Then we do another feedforward pass, and we do this over and over and over again. So, thousands of steps later, we find it's climbed up from 0.74 to 0.91.
And maybe at that point we feel it's converged. We might then train on a second input, and we'll do this for all the codons. So we're training through many thousands of runs, dozens of epochs, and again modifying the weight matrices until things, in this case also, converge to one. And finally, after all of that, you have your generalized weight matrices, which can be used to identify coding regions with an artificial neural net. So it's very similar in concept — just a different one-hot encoding because of the alphabet size, and output values that are binary Ns or Cs. So again, this is the windowing we're gonna use, where rather than just a three-base window we're using more than three bases. We're not using a null input, so if we're sliding along one residue at a time with a window of about 12 bases, that means we're gonna miss the first five or so and last five or so bases in our prediction. Sorry, I think there's a question from Nazia. Yeah, Nazia, go ahead. Thanks, I also typed it in the chat. My question was: do you randomly generate weight matrix one, the hidden layer, and weight matrix two, and how do you decide on the dimensions of these matrices? Thank you. So it's very similar to what we did yesterday for the protein one. Yes, the weight matrices are initialized with random numbers, usually between zero and one. And the dimensionality of the matrices is partly dictated by your one-hot encoding rule. If we're using an output of just a zero or a one, that changes the dimensions of what your hidden or middle layers might be — they're probably gonna be pretty small. For your input, again, how we're encoding things and the window size we're using dictate how big that input matrix is. So how you encode, or how you embed, dictates how big those weight matrices will be, at least their dimensions.
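To make that answer concrete: here's a sketch of how the dimensions fall out of the encoding. The hidden-layer size of 10 is my own illustrative assumption, not the module's actual value:

```python
import numpy as np

window, bits_per_base, hidden, outputs = 17, 4, 10, 1
rng = np.random.default_rng(0)

# Input size is fixed by the encoding: 17 bases x 4 bits = 68 inputs.
n_inputs = window * bits_per_base

# Weight matrices initialised with random values in [0, 1).
W1 = rng.random((n_inputs, hidden))   # input  -> hidden layer
W2 = rng.random((hidden, outputs))    # hidden -> single 0/1 output unit
```

Change the window or the encoding and the shapes of W1 and W2 change with them; that's all "dictated by your one-hot encoding rule" means in code.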
Okay, so for our input we decided to use a window of 17, and for our base annotation, or one-hot encoding, we used a four-bit string. So 17 by 4, or 68 bits, is the input encoding we'll be using. And as I said before, we're sliding the window along, very much like what we did for protein secondary structure — in fact, for protein secondary structure I think we used the same window length of 17. We're also replacing the Ns and Cs with zeros and ones, and this also allows us to identify the positions of our gene start and end points. So this gives us a simple input, a simple output, a simple way of reading things. Everything is basically a bunch of zeros and ones, which is what neural nets like. In this case, we took the 4.6 million bases of E. coli; we had a training set of about 3.2 million and a test set of 1.3 million. And we had everything structured so that we had those bases written out as needed, and their coding/non-coding labels written out as needed. So we can write out what we've visualized or imagined in designing the program. We can go to Colab and do this. You're all familiar with how to select the notebooks, create a new program and start writing, and we can do exactly the same thing we did before. We won't be using Seaborn or Matplotlib for data visualization, but we can take all of our gene position data from the E. coli NCBI genome set, import the entire sequence, and we've got both the forward and the reverse complement data, just as we did for the ORF finder.
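The window construction can be sketched like this; the one-hot encoder is repeated so the sketch is self-contained, and the function name is my own. Note how the first and last half-window of bases get skipped, since we're not using null padding:

```python
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def encode(seq):
    """Flatten a DNA string into a single one-hot bit vector."""
    return [bit for base in seq for bit in ONE_HOT[base]]

def windows(seq, labels, size=17):
    """Slide a window of `size` bases along the sequence; each window's
    target is the coding (0) / non-coding (1) label of its centre base.
    The first and last size//2 bases are skipped (no null padding)."""
    half = size // 2
    X, y = [], []
    for centre in range(half, len(seq) - half):
        X.append(encode(seq[centre - half:centre + half + 1]))
        y.append(labels[centre])
    return X, y
```

With `size=17` every row of X is 68 bits long, matching the 17-by-4 input encoding.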
And then, rather than going through everything we did with the secondary structure algorithm — because we're gonna use a lot of the same things — we did the same training, the same calculation of the differences, the same sigmoidal functions, the same approaches for flattening vectors and everything else. So you can basically reuse a large portion of the secondary structure code. And then we can look at the gene prediction neural net performance. We can also look at the window sizes, because we tried 17, we tried nine, we tried seven, and I'm just showing the ones that performed better. The initial window of 17 didn't do so well, so we started dialing back, and this is the confusion matrix. Remember that in the confusion matrix, the perfect result is ones on the diagonal and zeros off the diagonal. Instead of ones, we're getting 0.33 and 0.75. So we're doing pretty well at predicting non-coding regions but not so well at the actual coding regions — nowhere near the 90 or 95% we'd want to be. And we have lots of errors, around 60%, for false positives and false negatives. So even though we spent quite a bit of time modifying code for this, thinking it should work just like secondary structure and give us something on the order of 80 or 90%, the neural net was a total fail. And when we actually looked at the output, instead of seeing clusters of non-coding regions and clusters of coding regions like you would expect, the actual output is what you're seeing at the very bottom, where you've got Ns and Cs interleaved — N, N, N, C, N, N, N — which is not the way genes are structured. They have to have some continuity: the shortest gene is about 40 bases, the average gene is about 300 or 400 bases. So this was not what we wanted to see. And it wasn't that we were flailing around — this is something we wanted to demonstrate for you.
So we tried different approaches to show how you can fail with machine learning, by taking a naive approach and saying, okay, it worked for this one problem, I'll just use the same thing — or, in the case of gene prediction, let's just treat one base at a time and see what we can do. And this is a tendency, as I said, especially with sequencing data, where it's easy to get terabytes of data and the thinking is, well, it's all just sequence data, let's just analyze the sequence. We've shown how you can apply a heuristic algorithm for ORF finding and do a terrible job. This shows how you can apply a standard neural net and also get a terrible result. So it turns out — and this is where, if you read the literature about how people do gene finding for both prokaryotes and eukaryotes — it's not done on a residue-by-residue or base-by-base level. Instead, it's essentially about smart feature selection: taking what we've learned, or what we know, about genes and gene features and trying to encode that information into the gene finding algorithm. As I said, the most important lesson from this particular module is not additional coding with neural nets; it's how to do smart feature selection, or smart feature decision-making, as the input for a better algorithm. Some of the things we can do in this case, by reading the literature: you should include more information than the standard start and stop codons. We have some of that already in the ORF finder we built at the beginning, with three different start codons; there are also alternate stop codons. We can include codon frequency information — this is something people have noted for a long time, preferred types of codons in coding regions. There are also preferred bases at the start of certain codons. Another thing people found is that hexamers, which are two codons, or paired codons, together, tell you a lot.
Hexamer frequencies are quite different between coding regions and non-coding regions, and between promoter and terminator regions. This has been known for quite a while. Likewise, genes are not just simple start-and-stop ORFs. They have what's called the RNA polymerase promoter site, the Pribnow box. And there's also a Shine-Dalgarno sequence, which is the ribosome binding site. So these are other features that can be identified along a genome and are used to flag the start and end of real genes, and to clean up and get rid of the non-genes or non-coding ORFs. So again, just to remind you about some of these alternate start codons: there's class one — ATG, GTG and TTG — and there are a couple of rarer ones, CTG, ATT, ATA and ACG. So one thing to do is to include these class two ones as well. The RNA polymerase promoter site, or Pribnow box, has a characteristic conserved sequence: there's a TTGACA and then there's a TATAAT, the TATA box. These are located usually within about 100 base pairs upstream of the gene start, covering a region of about 40 bases or so. Just before the ATG start codon there's another stretch of about six bases called the Shine-Dalgarno sequence. It has a general sequence of AGGAGG, sort of a short repeat, and this is where the ribosome binds the messenger RNA strand. A lot of genes have this, and it was not part of our gene search — it certainly wasn't part of the neural net we were coding. There are also things like terminators and stem loops, where the RNA transcript folds back on itself and produces a hairpin loop. This is a way of making sure that the ribosome falls off and that termination really happens for that gene, and that there isn't read-through into whatever comes next. So these again are other features, although in this particular algorithm we didn't include them.
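Signals like the Shine-Dalgarno sequence are usually scored with position weight matrices, but even a crude consensus match illustrates the idea. A sketch (the function name is mine, and counting matching bases is a stand-in for a proper PWM score):

```python
def consensus_score(region, consensus="AGGAGG"):
    """Best match of the Shine-Dalgarno consensus anywhere in the region
    upstream of a start codon: the maximum number of agreeing bases at
    any offset (a crude stand-in for a position-weight-matrix score)."""
    best = 0
    for i in range(len(region) - len(consensus) + 1):
        window = region[i:i + len(consensus)]
        matches = sum(a == b for a, b in zip(window, consensus))
        best = max(best, matches)
    return best
```

A candidate ORF with a strong score just upstream of its ATG is much more likely to be a real gene than one without.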
So as I said, what this module is really about is not so much gene finding; it's about feature extraction, and choosing features that can make a difference. Simple, naive applications of machine learning, as we showed with the neural net, can produce abysmal results if you haven't been smart about your feature selection. Feature selection, feature extraction, whatever you want to call it, is using statistical data, or additional information, knowledge or wisdom you have, to describe the data, so we can intelligently select the data that will be used in our machine learning model. What we're really doing is trying to contract or collapse the sequences. I'm showing sequences one, two and three. Let's say they're genes, and they all have different lengths — some genes are 250 bases, some are 1,500. But I'm trying to convert those sequences and codons into essentially a feature table with the same number of features for each. What's their hexamer frequency quality score? What's their Pribnow box score? What's their TATA box score? What's their Shine-Dalgarno score? What's their alternate codon score? And now from those scores, shown in the bottom table — if I've only got five scores for each sequence — I can determine whether this is a good gene or not. Sequence one has relatively low scores, so I might say it's not a gene. Sequence two has relatively high scores, so it is a gene. Sequence three also has low scores, so it's not a gene. And I've simplified it so that I'm not looking at thousands of bases, but instead at a set of five features with specific scores. We talked about this yesterday, where we said: you get some metagenomic sequence and you say, yeah, I've got billions of sequences. Well, why not convert those sequences to features?
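The collapse from a variable-length sequence to a fixed-length row can be sketched like this. The specific features here (length, GC content, hexamer diversity, ATG start) are my own crude stand-ins for the real scores just described — hexamer quality, Pribnow box, Shine-Dalgarno and so on — but the shape of the idea is the same:

```python
from collections import Counter

def feature_vector(seq):
    """Collapse a variable-length candidate ORF into a fixed-length row of
    scores, so every sequence gets the same number of features regardless
    of its length."""
    # In-frame hexamers: two codons at a time, stepping one codon along.
    hexamers = Counter(seq[i:i + 6] for i in range(0, len(seq) - 5, 3))
    return {
        "length": len(seq),
        "gc_content": (seq.count("G") + seq.count("C")) / len(seq),
        "distinct_hexamers": len(hexamers),
        "starts_with_atg": int(seq.startswith("ATG")),
    }
```

Whether a candidate is 250 or 1,500 bases long, it becomes one row of the same width, which is exactly what a downstream classifier wants.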
Those features could be class, order, phylum, genus, species that you've derived from maybe the 16S rRNA or some other analysis. They could focus not only on highly conserved portions, but on certain functional genes that are really important, like antibiotic resistance genes. So you're pulling out, or teasing out, the relevant information from this mess of noise. That's the essence; if you want, maybe the most important slide is just this point about taking a lot of data and trying to extract intelligent, useful features out of it. It also produces a simpler way of handling the data: instead of dealing with highly variable-length sequences, the inputs are always of the same length, the same scale and the same range. And that makes it easier for machine learning to learn. So I'm gonna stop here for a few minutes, partly because technically we've been going for an hour, and the next stretch will be another hour. So if people need to stretch, or if people have some questions, we can do that for the next five minutes. There is a question in the Slack from Shoba: would a set of metabolites that are part of a pathway be such a feature extraction in metabolomics? Yeah, it's a legitimate one. Sometimes it works, sometimes it doesn't, but it is the basis for a couple of techniques used in metabolomics. Mummichog is a technique supported by MetaboAnalyst; it takes lists of metabolites and converts them to a list of pathways. Sometimes it's used to identify potentially missing or unidentified metabolites; it's also sometimes used to interpret the metabolomic data in a more complete way. Same sort of thing when you annotate genes with gene ontologies: instead of a whole bunch of genes, 1,000 genes, maybe those 1,000 genes fall into 20 gene ontology classes. So instead of 1,000 inputs, it's now 20 inputs indicating how many of each are present.
That simplifies your problem and also lets you tease potentially more information out of the data. Great, thank you. There's another question in the chat: how many hidden layers will be used in the model? Well, if you're writing it in pure Python, doing what we're doing, usually you can only manage about one, or a maximum of two, hidden layers, because it's just really hard to code. As we'll see later on, when you use scikit-learn and Keras, you just put in a number and it'll handle however many hidden layers you want. With one or two hidden layers, you're using a simple neural net; as you add more hidden layers, you're moving more and more towards a deep neural net, which arguably means the model becomes more powerful, but it's gonna cost in terms of training time. And sometimes you'll find that you keep adding more hidden layers and get no improvement in performance; that's an experiment you can try. So there's no hard and fast rule about the number of layers. Typically, if the problem's hard, you may need more layers; if the problem's simple, you may not need as many. It's one where you can do trial and error to find the optimum, but also be aware that choosing 100 layers means you'll probably wait until you retire to get an answer from the program. Yeah, that makes sense, thank you. I think Stephen had a question. Stephen, go ahead, please. Thank you. So we haven't discussed this explicitly here, but on the topic of feature extraction, for more highly dimensional datasets, what are your thoughts on using some sort of dimensionality reduction, like principal component analysis, as a form of feature extraction before some other type of machine learning model is applied to the data? Yeah, that's a really good point. It is an approach you can use. Dimensionality reduction is a mathematical method.
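To make the "just put in a number" point concrete, here's a minimal scikit-learn sketch on a tiny synthetic dataset; everything here is illustrative, and `hidden_layer_sizes` is the only thing that changes between a shallow and a deeper net.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Tiny synthetic dataset: the label depends on the first feature only.
rng = np.random.default_rng(0)
X = rng.random((40, 4))
y = (X[:, 0] > 0.5).astype(int)

# One hidden layer of 5 units vs. three hidden layers of 16 units each:
shallow = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
deep = MLPClassifier(hidden_layer_sizes=(16, 16, 16), max_iter=2000, random_state=0)
shallow.fit(X, y)
deep.fit(X, y)
```

The deeper net is (arguably) more powerful but slower to train; on a problem this simple it buys you nothing, which is exactly the trial-and-error point made above.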
And so you can get some bizarre features. You look at your first, second and third principal components and you might ask, why did it put zodiac sign, hair color and body mass index into the first principal component? What do they have to do with each other? But that's just what the math said. You could try to rationalize it after the fact, whereas a person knowing more biology might say, well, the most important things, let's say for, I don't know, gigantism, trying to understand what genes affect gigantism, are certain clinical factors: height more than seven feet, length of jaw and nose, hand size versus foot size. Those are things you would know are critically important, as opposed to hair color, eye color and zodiac sign. So sometimes principal component analysis pulls out features that are not biologically sensible. But sure, you can use principal component analysis to do that feature reduction, and in essence do some feature selection. Thank you. There's another question from Kylie: what strategies can you use for feature extraction when you don't know very much about the event or disease, for example the biological mechanisms? Yeah, this is a really good question. This is where, if you're naive to the problem and naive to the data, machine learning is not gonna help you a lot. You can kind of blindly play along and try different things, and maybe you'll be lucky. It's kind of like digging for gold. If you know nothing about where gold is found and you just say, well, I'm gonna start digging in my own backyard, you're probably not gonna find it.
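As a sketch of PCA-based feature reduction (random data, purely illustrative), 20 original features can be collapsed to 3 components before any downstream model sees them:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.random((100, 20))   # 100 samples, 20 original features

# Keep only the first 3 principal components as the new features.
pca = PCA(n_components=3)
Z = pca.fit_transform(X)
```

The caveat from the discussion applies: each component is a weighted mix of the originals, so the new features may not map onto anything biologically sensible.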
But if you know about certain minerals that are commonly associated with gold, and you know where there have been hotspots, where other mines have found gold, or about the geology, and that it has to be from a certain geological era, then that tells you that maybe this is a good place to start digging. Same thing for drilling for oil; they called it wildcatting. People would just randomly drill and hope they'd hit something, and most of them failed. But when people started using knowledge about geology, and drilling near where others had also struck oil, they started finding more. So I think when you start a machine learning exercise, it's really important to either get really familiar with the data, read about the studies that have been done, or understand your problem more. Obviously, if it's a brand new field, then usually there's not much data on it, and machine learning needs data, so it's probably not quite ready for it. If I can add, Dr. Vashal, maybe your comments here too: we can also try simple mechanisms such as clustering analysis and so on, so we can see which instances group together and which features group together, along with some statistical analysis as well. That may help in the first place. Yeah, I'd say that's often a way of just getting familiar with your data and seeing if there are trends. Remember, the best machine learning device in the world is still your brain. And this is how a lot of these things, particularly the features and feature extraction I'm talking about, came about: smart people noticed some interesting patterns and published on them. And if you collect those interesting patterns together, you get a fairly powerful machine learning algorithm. These are sort of independent discoveries published 20, 30 years ago.
And by putting those things together into one algorithm using machine learning, you get something that's pretty useful. I think we have one more question in the chat, from Desmond: if you don't know much about your data and wish to find relevant features, would something like lasso or ridge regression help? Yeah, those are feature selection techniques; lasso in particular shrinks the coefficients of uninformative features toward zero. There's also bagging and boosting, and other feature selection approaches; we talked about the Gini index analysis. So those, along with principal component analysis, are approaches for doing feature extraction. All of them can work. Some are better than others, and some depend on the data type. But yeah, that's certainly possible. Okay, I think our time is up, I'm gonna move on; hope we can finish up on time here. So this is another feature extraction we were talking about: codon usage frequency. There's a difference in codon usage between coding DNA and non-coding DNA, and some codons are rare. This slide illustrates, though the color scheme isn't exactly clear, the frequency with which certain codons are used. Look at arginine, for example: CGC and CGU are very common, but AGA and AGG are almost never used, at least in coding regions, and the reverse is true for non-coding. So we can look at codon frequencies. You count the occurrence of each base at each codon position, and in this case divide by the total number of codons in the sequence. Because there are three positions in a codon and four possible bases, we get a vector of 12 numbers. This vector of 12 has been calculated at the bottom here: 12 numbers giving the association of A, T, G and C with the different codon positions. And so we have a frequency association.
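The 12-number vector just described can be sketched like this: count each base at each of the three codon positions, divided by the number of codons. (The exact normalization in the original paper may differ slightly; this is a minimal reading of the description above.)

```python
from collections import Counter

def position_base_frequencies(orf):
    """Frequency of each base (A, C, G, T) at each of the three codon
    positions, giving a 4 x 3 = 12-element feature vector."""
    bases = "ACGT"
    n_codons = len(orf) // 3
    counts = [Counter(), Counter(), Counter()]  # one counter per codon position
    for i in range(0, n_codons * 3, 3):
        for pos in range(3):
            counts[pos][orf[i + pos]] += 1
    return [counts[pos][b] / n_codons for b in bases for pos in range(3)]

vec = position_base_frequencies("ATGAAACCCGGGTTTTAA")
```

Since the frequencies at each of the three positions sum to one, the whole 12-element vector sums to three; that same-length, same-scale property is what makes it a usable machine learning input.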
This approach, which was published (we probably should have provided the reference), was discovered in probably the 80s. The authors also noticed that you didn't have to use all the information, all 12 numbers: you could use an information measure like the Shannon entropy concept. They used this quantity H, which is sort of an entropy over the 12 values, and it allows you to replace the first element, the 0.24, with an entropy parameter; in a minor way, it changed the 0.24 to 2.12. But it is an algorithm that's been consistently shown to work for assessing codon frequencies in gene coding regions. As I say, it's something that was published in the 80s, just extracted from the paper, and they showed you could get pretty good identification of coding and non-coding regions. So that is one feature. The other important feature is these hexamers. There are 4,096 different hexamers, 64 times 64, and you can calculate hexamer scores in promoter regions, terminator regions, coding DNA and non-coding DNA: essentially four regions. Again, we did this for E. coli; we could have done it for other bacterial genome sequences, because they've been characterized quite deeply. We calculate their log ratios, just to normalize or scale them. And these are some examples, just two hexamers. There's ATTAGC: it's not used in promoters, not used in terminators, not used in coding regions. On the other hand, TAAAAA is strongly used in promoters, strongly used in terminators, and not found very much in coding regions. So these are two of the 4,096 hexamers you can calculate. You go through your data, which could be E. coli or other genomes, calculating and, if you want, training, collecting your features. And so we've got various scores.
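A minimal sketch of hexamer log-ratio scoring, assuming simple overlapping hexamer counts and a pseudocount so that unseen hexamers don't blow up the log (the actual counting and normalization in the course code may differ):

```python
import math
from collections import Counter

def hexamer_counts(seq):
    """Count every overlapping 6-base word in a sequence."""
    return Counter(seq[i:i + 6] for i in range(len(seq) - 5))

def hexamer_log_ratio(hexamer, coding_counts, noncoding_counts, pseudo=1.0):
    """Log ratio of a hexamer's count in coding vs. non-coding DNA,
    with a pseudocount for hexamers never seen in one of the sets."""
    c = coding_counts[hexamer] + pseudo
    n = noncoding_counts[hexamer] + pseudo
    return math.log(c / n)

# Toy training sequences standing in for real coding/non-coding DNA:
coding = hexamer_counts("ATGGCGCTGGCGATGGCGCTGGCG")
noncoding = hexamer_counts("TAAAAATAAAAATTTTTAAATAAA")
score = hexamer_log_ratio("ATGGCG", coding, noncoding)
```

A positive score means the hexamer looks coding-like, a negative one non-coding-like, which is exactly the kind of per-region feature value being described.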
We can also calculate them on the fly when we're analyzing the genome to be predicted. Then there's position-specific scoring matrices. If you've had some bioinformatics you've probably heard of them; these are older methods, developed again in the 80s, for identifying motifs in proteins and motifs in DNA. You can use position-specific scoring matrices, or PSSMs, to find, say, the Pribnow box or the promoter regions. And recall that with the Pribnow box there's this minus 35, minus 10 arrangement, with a separation between the two elements; sometimes the separation is 15 bases, sometimes 16, 17 or 18. So promoter PSSMs were prepared for these different regions, for the different widths; I think there are five different PSSMs, made for the different separations between the minus 35 and the minus 10 region. From those different PSSMs you can then take a sequence and evaluate it. So you slide a window along, perform the PSSM evaluation, and say, yeah, I've got a strong Pribnow box here. Sorry, I should correct myself: because this PSSM uses a log score, it's the lowest score that wins, not the highest. In this case, we found a Pribnow box with a score of 4.97, which is the lowest one, and that gives us our location for the Pribnow box. The other one is the ribosome binding site, the Shine-Dalgarno sequence. Again, you can have a PSSM, and this one is actually shown here: we're just scoring where the A's and the G's are and how much they contribute. We can slide this along and identify our best match.
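The sliding-window PSSM scan can be sketched as below. Note that the lecture's PSSM uses a log score where the lowest value wins; this toy uses the more common higher-is-better log-odds convention, and the matrix values are invented for illustration (a real PSSM is trained from aligned known sites). The window mechanics are the same either way.

```python
import math

# Toy PSSM for a 6-base Shine-Dalgarno-like motif, AGGAGG:
# one row of made-up log-odds scores per position, one entry per base.
PSSM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},  # position 1: A favoured
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},  # position 2: G
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},  # position 3: G
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},  # position 4: A
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},  # position 5: G
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},  # position 6: G
]

def best_pssm_hit(seq, pssm):
    """Slide the PSSM along seq; return (best_score, start_position)."""
    width = len(pssm)
    best = (-math.inf, -1)
    for i in range(len(seq) - width + 1):
        window_score = sum(pssm[j][seq[i + j]] for j in range(width))
        best = max(best, (window_score, i))
    return best

score, pos = best_pssm_hit("TTTTAGGAGGTTTT", PSSM)
```

Here the perfect AGGAGG match at position 4 scores 6.0, locating the motif just as the lecture's Pribnow and Shine-Dalgarno scans do.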
So again, we're using things that were mostly developed or identified in the 1980s, features people had found in prokaryotic genes, and using kind of old technology, position-specific scoring matrices. We could have used more advanced things like hidden Markov models, which allow for variable gap penalties, but we just wanted to keep it simple: look for these locations and build up the appropriate training data so that we can find these things consistently. And a lot of these features were actually described for other genomes, so they weren't trained on our training data or our testing data. What we now have, instead of open reading frames that can range from 40 bases to 4,000 bases, is a structure where there's basically the trivial ORF finder, which says, I've got 1.6 million ORFs, and then each of those ORFs is converted to an input of 32 features. So again, 1.6 million ORFs. We look at the codon usage features in that ORF, the hexamer scoring features in that ORF, the Pribnow box features outside that ORF, and the Shine-Dalgarno features outside that ORF. So 32 features, instead of thousands of bases; now every open reading frame, no matter how long or short it is, is essentially a vector of 32 inputs. So this is important. Can I ask, would that just mean that each individual ORF is an observation, like a subject in this case? Am I thinking about it the right way? I mean, you have 32 features for every ORF. That's right. So instead of thinking of the ORFs as sequences, as I said, we're just using the very first program, the one that found the 1.3 or 1.6 million ORFs, whatever it was. It's just finding start, stop, start, stop. And now we're just trying to clean it up and say, okay, which ones are real ORFs and which ones aren't. And so now we're applying this neural net to do the cleanup. So we've got the trivial ORF finder.
And it's generating a lot of extraneous data: 1.6 million wild guesses, of which only a tiny fraction are correct. But by converting every ORF to this feature vector of 32, we can train on our training data to learn which features say this is a real ORF and which say it's not. So this is a filter, but it's a neural net filter. And that, as it turns out, is how most prokaryotic gene-finding algorithms work. Okay, so we've got this G pan, the new algorithm we're gonna try to identify bacterial genes. We're using a neural net, but underneath we're still using the primitive ORF finder. So we do exactly what we've done before with the Python: open things up. The algorithm's a little more complicated, but I'll take you through it. The first thing is to read your data; we're taking in the E. coli genome data. We have to find at least our gene position data, to check the quality of our predictions, and then we take the forward and reverse strands that we're gonna be analyzing. We do some verification of the data, checking for missing data. We have the codon function; we identify the start and stop just as we've done; we create the ORF function with the minimum length. This is the old ORF function: it finds the same things in the reading frames and identifies the start and stop. That's the one we introduced at the beginning of the morning. Then next we add these other features: the codon usage frequency and the entropy parameter; the hexamer counting and hexamer scoring, on both the coding and the non-coding data; and then our PSSM scoring, the Shine-Dalgarno and the Pribnow box calculations.
And those pieces, from roughly notebook cells 10 through 14, feed the ANN with 32 input units and five hidden units to produce a zero/one output: whether it's a coding or non-coding ORF. So we import the Python packages, numpy and pandas. I'm not sure seaborn or matplotlib are used. Mark, are they used? Well, let me check. I don't think so; we'll get back to you. Yeah, because I thought I'd removed that. Anyway, just as before, we import the gene position data; this is the same code that we used before in writing our first version of the ORF finder. We're reading the reverse complement and the forward strand. We do some cleanup of the first lines, again, same as we did. We have the codons function, same as in the original ORF finder. Same sort of thing with the different reading frames, zero, one, two, three, all code from our original ORF finder. Same description. We're using the 40-base minimum length, making sure that at least that much is there, and we're including the three stop codons. So all this is code from that first ORF finder, the one that found the 1.6 million ORFs. Mark, did you find out if it used the seaborn stuff? Yeah, at the very end; it is used, in the gene rate plot. Okay, all right. Okay, so the real meat of this algorithm, after this ORF calculator that generates millions of false positives, is this part on determining the codon usage frequency, the hexamer counting, the hexamer scoring, the promoter-sequence Pribnow box PSSM, and the Shine-Dalgarno scoring, and then the neural net. So this is the code that's used to help with the calculation of the codon frequency and the number of nucleotide occurrences at each frame position. And then, of course, we have to calculate this over the length of the total open reading frame. We have this entropy parameter gamma, which we showed you how to calculate: that's one minus H over H naught.
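The gamma parameter can be sketched as below: H is the Shannon entropy of a set of frequencies and H naught is the maximum possible entropy for that many categories, so gamma is 0 for perfectly uniform (non-coding-like) usage and approaches 1 for strongly biased usage. Exactly which frequencies the original paper feeds into H is glossed over here; the base frequencies at one codon position are used purely for illustration.

```python
import math

def shannon_entropy(freqs):
    """Shannon entropy of a list of frequencies (0 log 0 taken as 0)."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)

def gamma(freqs):
    """Entropy parameter gamma = 1 - H/H0, where H0 = log2(number of
    categories) is the maximum possible entropy."""
    h0 = math.log2(len(freqs))
    return 1 - shannon_entropy(freqs) / h0

g_uniform = gamma([0.25, 0.25, 0.25, 0.25])   # uniform base usage
g_biased = gamma([0.97, 0.01, 0.01, 0.01])    # strongly biased usage
```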
So this, as I said, is an algorithm that was developed in the 80s; we're simply implementing it, but it is a feature. Hexamers, again, something picked up in the 80s; we're just doing what others have done. So this returns the number of each hexamer present. There are 4,096 possible in any given data set, and obviously there'll be lots of zeros in most ORFs. It's run on the coding and the non-coding data, and then we're looking at the promoter, terminator and ORF regions. Each hexamer is given a score based on what we already know from other genome data sets about what's typically found in coding and non-coding DNA. So, again, we're producing feature values, and every open reading frame is getting these, making up its 32 input values. Then we do the position-specific scoring matrix. We're looking at the Pribnow box with spacings of 15, 16, 17, 18 or 19 bases. Again, we could have been a little smarter and used a hidden Markov model, but this is easier to code. So we calculate the PSSMs for all of these Pribnow boxes, and we determine which promoters are strongest. And again, we're looking in regions 100 base pairs upstream of the start position of every ORF. Then there's the Shine-Dalgarno one. Again, it's a PSSM looking for this AGGAGG and assessing how well it matches. And also, I'm not sure quite where the hexamer part fits in there; it's been a while since I looked at it. Again, we look upstream of the open reading frame, over a range of possible positions; that hunt is done and assesses how good each candidate is. So this neural net basically boils down to a filter. We have a dumb algorithm that finds 1.6 million open reading frames. Each of those is converted to a 32-feature vector. So the input is 32, the hidden layer is five units, and the output is a zero or a one: zero for coding, one for non-coding.
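The 32-5-1 architecture can be sketched in numpy. The weights below are random placeholders (the real program learns them via backpropagation on labelled E. coli ORFs), but the shapes match the description: 32 feature inputs, five hidden units, one sigmoid output interpreted as coding versus non-coding.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random, untrained weights with the described shapes: 32 -> 5 -> 1.
W1 = rng.normal(size=(32, 5))
b1 = np.zeros(5)
W2 = rng.normal(size=(5, 1))
b2 = np.zeros(1)

def predict(features):
    """Forward pass: 32 features in, one probability-like value out."""
    hidden = sigmoid(features @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

out = predict(rng.random(32))   # one 32-feature ORF vector in
```

After training, thresholding the output at 0.5 is what turns this net into the ORF filter being described.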
So it's a fairly simple neural net. The total number of lines for this gene finding program, though, is huge: almost a thousand lines. And doing all the feature extraction, finding the PSSM hits, calculating hexamers, takes a while. When we trained it on the three million bases, it was fairly fast; the overall runtime is about 70 seconds. It will train every time you run it, so it will be a little slow, but you can also choose to run only certain cells, and that can be faster. There's a question, sorry, before we go on, from Lance: would we be able to identify redundant or non-useful features after the neural net training or testing, or is that something you'd want to have done beforehand? Yeah. Well, essentially we've already chosen the really good features: we went through the literature and found what people all agreed worked or was helpful. We could have also calculated the zodiac sign or something else, and then evaluated which of those features were really good. Some of the features are better than others. I think the hexamers tend to be better than the codon frequency, and I think the Shine-Dalgarno one wasn't as useful as we'd hoped. This is from memory, but they all help. And you can do some experiments where you knock out one of the features and see what your performance is. But in this one, because we were using, I'll call it, intelligent feature selection, all of the features are useful. Okay. So we can test on it and then we can also validate it. There were, as I said, several iterations, and I'm just trying to make people remember these numbers. So our very first predictor, the simple ORF finder, which we're still using inside this modified one, did terribly. Yes, it was able to identify genes 91% of the time, about 4,200 out of the 4,400 known.
But it had millions of false positives; it over-predicted by 1.63 million. So that made it a totally useless predictor, but there was some value in it, because we realized we could create a filter that would get rid of those false positives. Then we used the simple neural net with a nine-base window, where we just sort of said, oh, this should be just like secondary structure prediction, and should be even simpler: train it and run it. It was not as bad as the naive ORF finder, in the sense that you have an average of 1.00, or 0.999, false positives. But it still was not at the 90% level that we needed on the diagonal; it had an average performance of, I don't know, 51 to 55%, which isn't great. Now, if we just put in the codon frequency statistics, this is the version where we filtered: we go from the numbers we had before, 0.4 and 0.73, to 0.92 and 0.91. That's 92% finding genes and 91% finding the non-coding regions, with about a seven or eight percent error in the off-diagonal elements. And this is just from incorporating the codon frequency, the 12-point vector that we talked about. So we could almost have stopped there and said we're done. But if we use the hexamer statistics, we actually get a slightly better improvement. And then if we include the PSSMs for the promoter region, we get better still. And then we had a more sophisticated version where for some of the promoters we used a hidden Markov model; we'll talk about that. In fact, the performance got up to more than 93, 94% correctly predicted genes, and the false positives dropped below 7%. So I'll just go through this again, because I think it's really important: we have a simple ORF finder that finds a lot of genes, but we have millions of false positives. We try a naive neural net.
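The diagonal and off-diagonal numbers being quoted are just sensitivity and specificity from a 2x2 confusion matrix. The counts below are invented to reproduce the roughly 92%/91% figures, not the actual counts from the run:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Sensitivity (genes found) and specificity (non-coding correctly
    rejected) from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical counts chosen to match the ~92% / ~91% diagonal:
sens, spec = confusion_metrics(tp=4048, fp=396, fn=352, tn=4004)
```

Tracking both numbers at once is the point of the iteration story: the naive ORF finder had high sensitivity and terrible specificity, while the naive neural net fixed specificity but not sensitivity.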
We don't have millions of false positives anymore, but we're still pretty bad at predicting genes. So we redo it, take the ORF finder, and just include the codon frequency statistics on the ORFs. As I said, this gets us up to 92 and 91%, with sort of an 8 to 9% error. And then the false positives and false negatives start dropping as we add the hexamer statistics and the PSSMs for promoters, getting down to 7% false positives; and afterwards, with the more sophisticated version, well below 7%. So we basically used a neural net to act as a filter, and the improvement is enormous. This gene predictor, this type of performance, would be close to some of the better gene predictors that have been published, and this is not terribly sophisticated. So we've got a gene predictor, written in all Python, that predicts prokaryotic genes. We've trained and also tested on E. coli genes. It would be able to work on other prokaryotic genes; it might not do quite as well. But if we wanted to do something for eukaryotic gene locations, it would be a lot more complicated; there are many more features. But you could still use a naive sort of gene finder that would say, oh, there's a whole bunch of candidate genes here, and then do this filtering step where you look for the features and convert these potential exon regions into feature vectors of 30 or 100 inputs, using some intelligent feature selection. Now we're going to use the last, I don't know, 20 minutes here for people to just play around with this. We had been working on the R code, trying to fix it up, because something happened to it. Unfortunately, the R code still hasn't been finalized, so those of you who were hoping to have R code won't have it this time. But when we were running it a couple of years ago, it was taking 15 minutes, to the point that it was impractical for people to use.
And this highlights the issue that there are situations where Python is much, much better than R, especially for speed. R, for some people, if you're familiar with it, is easy to code; it's maybe a little more intuitive in general, and for simple statistics it's a lot faster to write than Python. But for some of the more complicated things in machine learning, it is sometimes not the best choice.