So hopefully everyone can hear me all right. Welcome to day two. We're gonna be starting with module five, machine learning. And as before, this presentation and all the material we're providing today is licensed under the Creative Commons license, so it's the share-alike concept.

What we're gonna be looking at today is hidden Markov models. This is a subject of interest more for historic reasons. It's frankly a very difficult subject, and I'm kind of looking forward to maybe next year, when we won't really present on it. But it's been used for more than 20 years and has played a key role in the development of a lot of bioinformatics software. So we'll talk about it, we'll see how it can be used in a limited example, and then we'll reintroduce it later on in the course, I think in module eight. The challenge with hidden Markov models is that they're mathematically really, really, really difficult to understand. There are probably about two people on the planet who understand them, and the rest of us are still struggling to figure out how they really work.

So this is just an outline of what we'll be doing today. We've already done our tech check, I think; everyone's got their cameras working. We'll have about an hour and a quarter where we'll talk about module five, then we'll have a break. Then we'll get into a very long lecture-slash-lab, which is module six. We've scheduled two hours for it. Hopefully it won't be that long, but it covers a lot of things, and maybe that's, I'll call it, the highlight for the day. For others, perhaps the highlights will come around three o'clock and 4:45, when we talk about scikit-learn, Keras, hmmlearn and a few other tools which make a lot of the coding and development of machine learning so much easier. So that's a quick outline of what we'll do today.

So with module five, we're gonna talk about sequence motifs, because this is primarily where hidden Markov models shone. We'll talk about text-based regular expression comparisons as well as something called position-specific scoring matrices, or PSSMs, to identify sequence motifs. Then we'll introduce the concept of Markov chains and hidden Markov models, and how we can use HMMs to identify sequence motifs. Then we'll go through the Python code for a real HMM, and I think most of you will soon appreciate just how difficult they are.

So, sequence motifs are identified through multiple sequence alignments. There are literally thousands of sequence motifs that have been identified, not only in protein sequences but also in DNA sequences, and they often correspond to something critical: a binding site or some functional aspect of a gene or protein. This is a sequence motif for the DEAD box, a conserved set of four amino acids, Asp-Glu-Ala-Asp (D-E-A-D). The DEAD box is also a key signature for ATP-dependent helicases, so many helicases have this. What I've indicated in blue below the sequence alignment is the annotation we would use in terms of a regular expression. The first residue in this 10-residue motif can be leucine, isoleucine, valine or methionine, so sort of the branched-chain amino acids, and there are two of them. Then there's absolute conservation of DEAD. The next residue can be just about any residue, so we put a star in there. And then the next two residues also seem to be very hydrophobic.
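Just to make that concrete, here's a minimal sketch of scanning for that kind of pattern with Python's re module. The character classes, especially the hydrophobic set at the end, are illustrative guesses built from the description above, not the official pattern from a motif database, and the test sequence is made up:

```python
import re

# A sketch of the DEAD-box motif as a regular expression:
# two branched-chain residues, the invariant DEAD, any residue,
# then two hydrophobic residues (this hydrophobic class is an
# assumption for illustration; the slide's 10-residue pattern may differ).
DEAD_BOX = re.compile(r"[LIVM]{2}DEAD.[LIVMF]{2}")

# A made-up protein fragment, for demonstration purposes only.
seq = "MKTAYIVLDEADRMLDMGF"

for match in DEAD_BOX.finditer(seq):
    print(f"DEAD-box-like motif at position {match.start()}: {match.group()}")
```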
So this is a motif which, if you do the math, would match a large number of possible single sequences, and it could be used, with regular expressions, to scan any protein sequence for this particular conserved pattern. As I said, motifs like these are found in genes and proteins: binding sites, recognition sites, secondary structure, tertiary structure, targeting sites. And it is through multiple sequence alignment that most sequence motifs have been identified, whether in RNA, DNA or protein.

Here are some examples of signaling sites that have been identified. These can be ER-directing sequences, nuclear transport signals, signal peptidase cleavage sites. And again, you can see where different characters are used: a dollar sign indicates the end of the sequence, stars indicate wildcard characters or any residue, and curly brackets help define the length or span of certain residues between certain motifs. All of those types of regular expressions are allowed, depending on the library and the annotation that's used. Here's an example of more consistent, standard regular expressions: on the left is how the regular expression appears, and the text on the right explains how it's interpreted. Things in square brackets give you the different possible amino acids or bases, question marks indicate something that may or may not be present, pluses indicate how many times something is repeated, and there are all kinds of variations on that. Certainly in the 1980s and early 1990s, regular expressions were how most people coded or identified sequence motifs, and how they talked about them in papers and journals.

But there are problems. The fact is that a regular expression performs an exact match to a sequence. So if we had isoleucine, leucine, valine and methionine, but occasionally threonine is allowed, we'd have to either add a residue or use some kind of scoring scheme to say, okay, this is close enough but not precise. And if you're trying to build out sequence motifs from a large sequence alignment, you spend a long time tabulating the different residues or bases, counting them up, and deciding whether to include them or not, or how many spaces or gaps should be allowed. It's a pain; I actually wrote and prepared lots of sequence motif databases as a student, and it was not fun. So fundamentally, these exact-match motifs really don't capture how proteins or enzymes recognize binding sites, whether on DNA, RNA or other proteins. Biology is fundamentally a fuzzy process.

So what developed in the late 80s and early 90s was a concept called a scoring matrix, or a position-specific scoring matrix, PSSM, which we pronounce as "possum"; hence the little picture of a baby possum on the left. It's similar to most scoring matrices in bioinformatics, like the BLOSUM matrix used in BLAST or the Dayhoff matrices used in sequence alignments. What you do with a PSSM is construct it from multiple sequence alignments: by comparing the aligned sequences, you can calculate the frequency of conserved bases or conserved amino acids. The conservation score in each position of the PSSM is then converted to a log-odds score, which converts very small numbers into numbers roughly in the range of 0.1 to one or two.
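Before we walk through the slide's worked example, here's a minimal sketch of what scoring a sequence against a PSSM looks like in code. The matrix values are invented stand-ins, not the actual numbers from the slide; the idea is just that each base gets one score per position, and a candidate sequence's score is the sum down its length:

```python
# A minimal PSSM-scoring sketch with illustrative (made-up) values.
pssm = {
    "A": [0.2, 0.8, 0.1, 0.9, 0.3, 0.2, 0.1, 0.7, 0.4, 0.2],
    "C": [0.5, 0.1, 0.6, 0.2, 0.9, 0.4, 0.3, 0.1, 0.2, 0.6],
    "G": [0.9, 0.3, 0.2, 0.4, 0.1, 0.8, 0.5, 0.3, 0.9, 0.1],
    "T": [0.1, 0.5, 0.7, 0.1, 0.2, 0.1, 0.8, 0.5, 0.1, 0.4],
}

def pssm_score(seq):
    """Sum the per-position scores for a sequence the same length as the PSSM."""
    return sum(pssm[base][i] for i, base in enumerate(seq))

# A 10-base candidate, echoing the slide's example sequence.
print(pssm_score("ATTTAGTATC"))
```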
And so what you see below the picture of the possum is a real position-specific scoring matrix spanning about 10 bases. On the left side you're seeing A, C, G, T. These aren't frequencies; originally they were, but they've been converted to log-odds scores, so you can see that things are between about 0.1 and 1.2. If you're given a new sequence, the one shown in the middle, A-T-T-T-A-G-T-A-T-C, you can go along the length of the PSSM, and every number that's circled represents, if you want, the level of conservation of that sequence, or its agreement with the ideal sequence. By adding all the numbers up, you get a total score: 0.2 plus 0.2 plus 0.1 plus 0.2 and so on, all the way up to a score of 2.5. Now it turns out that, rather than bowling, where the high score wins, this one is more like golf, where the low score wins: the low score with this PSSM gives you your best performance.

The nice thing about PSSMs is that they allow you to do fuzzy matches. You don't have to have exact matches, so they're more powerful than regular expressions, and they give you a score to assess the quality of the match; just like bowling or golf, you can say whether you've done well or badly. But there is a limitation, and this has been a chronic issue for PSSMs: they don't handle insertions or deletions. Look at the PSSM I've indicated there; it's a defined length. It's 10 bases long, not 11, not nine. And in fact, there isn't a really good way to describe a PSSM that handles insertions or deletions. So it was, I think, the realization in the early 90s of the limitations of PSSMs that led people to come up with other approaches to sequence motif analysis.

This led, in particular, to the introduction of hidden Markov models. As I said at the beginning, I hate them because they're really hard to understand, and I think you can appreciate, even just looking at this picture, how difficult a hidden Markov model is. There are nodes and connections, and you might say it looks like a neural net, but it's not. In fact, the way it's diagrammed, and how those connections go forward and backward, is critical to the performance of a hidden Markov model. It's technically called a probabilistic graphical model, which means you have to have a pretty good understanding of both probability and graph theory, and of some fairly advanced and sophisticated mathematics. More recently, thanks to developments in deep neural nets, most of what can be done, or is being done, by hidden Markov models can generally be done better, and more understandably, with deep neural nets. This is why I suspect this may be the last year we'll actually teach HMMs; but from a historical perspective, HMMs are important.

I've mentioned that HMMs have been used for a long time, particularly in speech recognition. We've talked about Alexa, Siri and Google Home; if you've used them or tried them out, they do a pretty good job with speech recognition. They initially used HMMs; I think now most of them use deep neural nets. Language translation was another application, stock price prediction has been another, musical composition too: anything where there's a sequence of events. Music is a sequence of events; a sentence is a sequence of words. Time series data as well, such as analyzing cardiac monitoring or seismic data to look for patterns.
And in gene finding and motif finding for bioinformatics, again viewing a sequence essentially as a sequential collection of letters occurring one after the other. This is what HMMs were initially designed for, and I'll explain that a little bit more. HMMs in the world of bioinformatics were initially used a lot in prokaryotic and eukaryotic gene prediction. They have been used to create sequence profiles, and there are now large collections of HMM models used both to capture sequence motifs and to classify sequences into families. HMMs have been used in multiple sequence alignment, and they can be used in three-dimensional structure analysis and protein topology analysis. So the heyday of HMMs was probably from the mid-90s until maybe five or ten years ago, but they are ubiquitous in the field of bioinformatics.

Their strengths are that they're statistical, or probabilistic, so there's lots of robust math associated with them. As you'll find out, that gives you a pretty systematic, mathematically robust approach to estimating parameters and calculating probabilities, and an iterative process for improving those probabilities. When we calculate probabilities by hand, we typically measure a whole bunch of events, maybe including conditional probabilities, collect all of that information, run the numbers, and come up with estimates. Hidden Markov models use things like expectation-maximization methods to actually improve those probability estimates. Their main strength, at least in the example we're gonna look at today, is with sequence motifs: dealing with insertions and gaps. As I said, for regular expressions, and more specifically for PSSMs, that's a weakness; they can't handle insertions, whereas hidden Markov models can. My point at the bottom is that hidden Markov models are really, really difficult to understand. So I'm not expecting any of you, unless you've been studying hidden Markov models for the last two or three years, to understand all of the material I'm gonna present today. But I'd like you to at least get a general appreciation of what's done, in part because some of the concepts from hidden Markov models have made their way into the theory for neural nets.

Markov models are different from hidden Markov models. A Markov model is essentially a set of sequential events, where each event depends only on the previous one. If you're going down a pathway with multiple forks, where you end up is a function of what you did at each fork, choosing one direction or the other. So it's a little like the path decision-making in a decision tree.

There's an interesting bit of historical trivia here, which is that Markov models were actually used to create the very first sequence scoring matrices. That was the point accepted mutation matrix, or PAM matrix, that Margaret Dayhoff, who was really the founder of bioinformatics, developed in the 1960s. What she did was calculate a scoring matrix from very, very similar sequences. That scoring matrix was called the PAM1 matrix; it was built from closely aligned sequences, under the assumption of just small numbers of mutations, and it roughly corresponded to about one million years of evolution, at least that was her estimate. To span longer evolutionary times, she multiplied the PAM1 matrix by itself, 250 times over; here's a quick sketch of that idea.
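This is only the concept, not Dayhoff's actual matrix: a toy 4x4 transition matrix with invented values stands in for the real 20x20 PAM1, just to show the repeated matrix multiplication that makes this a Markov process:

```python
import numpy as np

# A toy stand-in for PAM1: each row sums to one, and the strong
# diagonal says "most residues don't mutate in one unit of time".
# These values are invented for illustration.
pam1 = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
])

# Multiplying the matrix by itself applies one more unit of evolution;
# doing it 250 times gives the analog of the PAM250 matrix.
pam250 = np.linalg.matrix_power(pam1, 250)
print(pam250.round(3))
```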
So, multiplying the matrix by itself, one step after the other, after the other: this is matrix multiplication. She produced the PAM250 matrix, which corresponded to the expected evolution of sequences over 250 million years. Again, to produce it you had to do the multiplications in sequence, and so this was essentially a Markov process, where one million years of mutation times another million years of mutation produces the PAM2 matrix, and so on. It was all theoretical, but what she ended up producing was a scoring matrix that was amazingly useful. It was used for more than 20 years in sequence comparison and alignment, and it was only improved on slightly by the development of the BLOSUM matrix.

So a Markov model, as I say, is a set of states, typically connected by arrows. An arrow represents a transition probability: the probability of going from A to B, or from B to C. So the probability of B coming from A is the probability of the state being B given that the previous state was A. In this case, B can only come from A, and so the current state depends on the previous state's transition probability. If that transition probability is zero, then obviously B cannot come from A, and therefore C cannot come from B. So that, as I say, is the Markov model.

There are different types of Markov chains. The simplest, in which only the preceding event, the preceding base or preceding amino acid, determines what's gonna follow, is a first-order Markov chain. A second-order Markov chain depends on two preceding events. So in the sequence that's given in the middle, you need A and T to contribute to the occurrence of G, and you need T and G to contribute to the occurrence of C. Now, a Markov chain becomes a hidden Markov model when we have the observable sequence, the ATG CATAA there, but also these other boxes, which are hidden states or hidden effects that influence things. It's sort of the puppeteer behind the curtain, or the sound effects from behind the hidden black curtain, that influence those sequences. We know that amino acid sequences, or even base sequences, are not entirely determined by the preceding base; they're influenced by other effects. There's the invisible hand of evolution guiding things, and to some extent that's what these hidden effects have to account for.

When we're talking about Markov models or hidden Markov models, we're working in the field of conditional probability. In the simplest, zeroth-order case, the probability of seeing the sequence ACTGTC is just the probability of A, times the probability of C, times the probability of T, and so on, with every base independent. A first-order Markov model of the same sequence says the probability of ACTGTC is the probability of A, times the probability of C given A, times the probability of T given C, and so on. And a second-order model is what's shown there: the probability of A, times the probability of C given A, times the probability of T given A and C, and so on. So if you know a little bit of probability, you can see where it can get complicated; it starts looking like Bayesian models.
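Here's a minimal sketch of that first-order arithmetic in Python; the initial and transition probabilities are invented for illustration:

```python
# First-order Markov chain: P(seq) = P(s1) * P(s2|s1) * P(s3|s2) * ...
# All probabilities below are made up, just to show the calculation.
initial = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
transition = {
    "A": {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
}

def chain_probability(seq):
    """Multiply the initial probability by each conditional step in turn."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

print(chain_probability("ACTGTC"))
```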
If we look at a fairly simple architecture for a hidden Markov model, we talk about topologies, and this is where the term "graphical model" comes from: hidden Markov models always begin with a graph, a picture, and there are certain conventions for how these pictures are drawn. In the case of sequences, circles are used to indicate deletions, diamonds are used to indicate insertions, and squares are used to indicate matches, the sequence itself, if you want. It's structured like a network, so there are arrows with directions; from any given node there might be two or three or four different arrows coming in and going out. There can also be self-arrows, something leaving a node and coming back in, those rounded arrows. Within the arrows are what we would call, in neural nets, the weightings; here they're called transition probabilities, and they sit on the edges of this network. The squares, diamonds and circles have certain probabilities as well, the numbers written inside them; those are called emission probabilities. So the arrows have numbers, assigned values between zero and one, and the squares, diamonds and circles have probabilities too. This particular topology is a typical one for handling sequence motifs, because it allows for matching sequence (that's the match state) and it allows for insertions and deletions.

So we're gonna take an example of a hidden Markov model where we begin from a sequence alignment. We have five sequences here, and it's a very short, kind of meaningless motif, but it's at least simple to calculate. What we've got is a modestly conserved set of residues, and a set of A's around where the gap is, but also gaps, or insertions. What I've written below is a regular expression that would describe this motif, and if you do the math, this motif would have about 3,600 possible valid sequences, which is a huge number and probably not terribly useful for defining a sequence motif. The inclusion of gaps, and the other challenges with the regular expression, make the regular expression kind of useless here, and what that suggests is that a hidden Markov model would be better.

So we're gonna use this multiple sequence alignment to help build one. We start by calculating frequencies from the alignment, the number of A's and T's and so on, going column by column. In the first column, we can see there are four A's and one T, and we can convert that to probabilities: four out of five is 0.8, one out of five is 0.2. We can do that for the second column, where we get probabilities for G and C, and for the third column, probabilities for A and C. Then it gets messy, because now we're dealing with insertions. We go column by column, identifying the occurrence of insertions and non-insertions. Where we have inserts, in columns four, five and six, the calculation done there is: the fraction of sequences that have inserts is three out of five, or 0.6, and the fraction that don't is two out of five, or 0.4. And then, column by column within the insert region, the probabilities of A's, C's, T's and G's are also given. So that's the messy part, dealing with the insertion. Then we get to the seventh column, and we can see it's all A's, so the probability of A is one, and so on with the eighth and ninth columns. So this is how we get our initial statistics; here's a sketch of that counting.
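The toy alignment below is a reconstruction for illustration; it's built to behave like the one on the slide (four A's and one T in the first column, inserts in the middle columns, all A's in column seven), not copied from it:

```python
from collections import Counter

# A made-up five-sequence alignment; '-' marks a gap (no insert).
alignment = [
    "ACA---ATG",
    "AGAAT-ATC",
    "TCAA--ATC",
    "AGA---ATC",
    "ACAATCATC",
]

n = len(alignment)
for col in range(len(alignment[0])):
    # Count the characters in this column across all sequences;
    # the '-' fraction is the insert/no-insert statistic.
    counts = Counter(seq[col] for seq in alignment)
    freqs = {base: round(c / n, 2) for base, c in counts.items()}
    print(f"column {col + 1}: {freqs}")
```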
Now we have to generate a model. Recall that diamonds are used for insertions, circles for deletions and squares for matches. So we have to draw out a graph. This is the graphical model for that sequence alignment, which is shown in the corner there in the little screenshot, and this is the graph that would best explain the alignment through a hidden Markov approach. For the A and T in the first column, we can write the emission probabilities into the square: 0.8 and 0.2. We have a transition probability of one going to the second column, and a transition probability of one going to the third column. But then, as we get to the fourth, fifth and sixth columns, we have the probabilities for the insertions. This is where we fill in the diamond with the probabilities for A's, T's, G's and C's. We also fill in the transitions into and out of the insert state, and for the occurrence of the first base after it. So you can see the transition and emission probabilities: 0.6, 0.4, 0.6, another 0.4. Those are all taken from the calculations in that little diagram in the corner. Then we get back to the squares, which now correspond to column seven, and we can see A has an emission probability of one, a transition probability of one, and so on.

So with that model, there's a path we can follow. If we're given a sequence, let's say ACAC--ATC, we can calculate the probability of its occurrence by taking the transition and emission probabilities and multiplying them all the way through. So it's an exercise in multiplication: 0.8 times one times 0.8 times one, and so on; we look at the sequence and use the emission probabilities in each of the squares or diamonds, along with the transition probabilities, to come up with a value. We do all the multiplication and end up with a score of 0.047. We can convert that rather small number to a log-odds score, or LOD, using the formula here, which converts those probabilities while correcting for the background frequency of bases: there are four bases, so 0.25 is used for DNA, and there are 20 amino acids, so one out of 20, or 0.05, is the background probability for any given amino acid. So we can convert everything to log-odds values, and I've put those in. Once we've converted to log-odds, instead of doing multiplications we can do additions. The same graphical model that we saw here has been converted to log-odds; you can see the numbers are different, and we simply add them all up. So instead of 0.047, we now get a score of 6.67 when we do our trace through with this particular sequence.

The point is that with this hidden Markov model, this probabilistic graphical model, we can calculate scores for sequences and ask: how well does this sequence match my sequence motif? Is it a high-scoring match, is it a low-scoring match, and should it be considered a real one? It's better than the PSSM, because we can handle gaps, both insertions and deletions, as well as the matching bases or amino acids. So we can take a whole bunch of sequences, here's seven or eight of them, and see whether any of them approximate the motif we're postulating.
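Here's a minimal sketch of that scoring arithmetic. The emission and transition values below are illustrative, not the exact numbers from the slide, and as a simplification the log-odds conversion is applied to the emissions only:

```python
import math

# Probabilities collected along one path through the model (made up):
emissions   = [0.8, 0.8, 1.0, 0.6, 0.4, 1.0, 0.8]  # one per emitted base
transitions = [1.0, 1.0, 0.6, 1.0, 0.4, 1.0]       # one per arrow followed

# Raw probability score: multiply everything together.
score = 1.0
for p in emissions + transitions:
    score *= p
print(f"raw probability score: {score:.4f}")

# Log-odds version: sum log2(p / background) over the emissions,
# so multiplications become additions.
background = 0.25  # 1/4 for DNA; for proteins you'd use 1/20 = 0.05
lod = sum(math.log2(p / background) for p in emissions)
print(f"log-odds score: {lod:.2f}")
```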
So I've put in a bunch of sequences and calculated scores, both through multiplication and through the log-odds calculation. What we find is that there's one with a very high log-odds score, and that one's a very good match to our motif; there's one in red with a very low log-odds score, and that's one that obviously shouldn't be included. And then there are others hovering around 4.6 to 5.3, and depending on what you've chosen as your threshold, or cutoff, you could say these are also probably legitimate matches to the motif.

Now, what we've built here is a very simple hidden Markov model. We had a very small number of sequences, so at some level we could evaluate everything by hand. In a real example, if you were trying to create a legitimate motif out of just five sequences, something you shouldn't really do, there's a trick called pseudocounting, first developed for PSSMs, that lets you fill in extra information with what are essentially dummy counts. That's another thing we could have done with this HMM.

So there's an evaluation problem in hidden Markov models; that's what I was doing with all the additions and multiplications, where I had to follow a path. Then there's determining which path, which sequence of states, to follow. It was simple for us: one, two, three, four, five, six, seven, that's a path. But we could have gone one, two, three, then up here, gone around a few times, and then down here; those are paths too. When you're determining a path through a sequence, you use dynamic programming. Some of you may have heard of dynamic programming; it's always used in sequence alignment, and it's a path-finding approach. There's the Needleman-Wunsch path-finding algorithm, and there's also the Viterbi path-finding algorithm. In hidden Markov models, we use the Viterbi algorithm instead of Needleman-Wunsch for the dynamic programming. That's called the decoding problem. So there's an evaluation problem and a decoding problem. And in situations where you don't have an alignment, which is more common, not only for sequence motifs but for most hidden Markov models in general, you have to learn the transition and emission probabilities. In our case, because we had an alignment, we could calculate the emission and transition probabilities directly, but if you don't have an alignment, there's a learning algorithm, called the Baum-Welch algorithm, to estimate them. That's the learning problem. So in HMMs, you have to solve three problems: the evaluation problem, the decoding problem, and, if you don't have alignment information, the learning problem.

This is summarized here. For evaluation: given the model parameters, calculate the probability of a particular sequence. You saw how I did that for some simple sequences back there; I just did some multiplications and log-odds additions. But while it seemed simple, I still had to follow a path and determine what those numbers were. Then there's decoding, which is, in fact: how did I find that path, and how do I include the hidden states, the circles and the diamonds, in my calculation?
And the learning problem: I didn't have to solve that in the example I gave you, but if I'd had only the raw observations, I would have had to estimate the emission and transition probabilities. So we're gonna take an example that's more complicated than the simple one I just gave, where we're gonna try to address all three problems, which is pretty standard for most hidden Markov models.

To do evaluation, we typically use a technique called the forward algorithm. It's a path-tracing algorithm, a form of dynamic programming. In the example I gave you, it wasn't very challenging dynamic programming; we just followed the obvious path. You'll see the forward algorithm a little later. Then for the decoding problem, there's the Viterbi algorithm, another dynamic programming algorithm. It finds the most likely sequence of hidden states that results in the observed set of events. In other words: how do I include the insertions and deletions, the diamonds and the circles? Which path do I follow? It's essential for HMMs involving sequence profiles, so you'll hear the word Viterbi a lot when people talk about sequence motifs. The one that's probably most interesting, at least as it pertains to machine learning, is the forward-backward algorithm, also called Baum-Welch. It uses a technique called expectation-maximization, so it's an optimization algorithm at some level, and it allows you to estimate the transition and emission probabilities without having, in this case, sequence alignments. It iterates: it estimates probabilities and then refines them. In that regard, you could say the Baum-Welch algorithm is an analog of the back-propagation element we saw in neural networks. It optimizes things: it calculates error, goes back, adjusts things, and calculates the error again. And just like in neural nets, you have this idea of going forward, backward, forward, backward; that's why Baum-Welch is called a forward-backward algorithm.

Okay, so hopefully I haven't lost too many of you. As I say, hidden Markov models are really complicated, and I have no expectation of anyone really knowing the innards of hidden Markov models after this. But we're gonna show how one would be used on bacterial DNA sequence. We've chosen bacterial DNA because bacteria have simple gene structures and simple promoter structures. Eukaryotic gene structure is really, really complicated; there have certainly been very good successes in eukaryotic gene analysis using hidden Markov models, but those models get incredibly sophisticated, and we're not going to try to go from zero to 60 here.

Just for people who don't know, or as a reminder for those who've forgotten: promoters are sections of DNA located on the five-prime side, in front of genes, that promote the binding of other proteins to enhance transcription. Typically you'll find promoters anywhere from 30 to 200 base pairs in front of the translation start codon, so they're not immediately in front; they're a little further upstream. They're typically associated with operons, so you don't find a promoter for every single gene; you'll find promoters for gene clusters. I've illustrated that there: the sequence ATG is the typical start codon, and the promoter is some sequence further upstream.
In particular, there's one site called the Pribnow box, which is used for transcription initiation. Sequence alignments and sequence comparisons have identified this RNA polymerase site: a modestly conserved TTGAC at the minus-35 site, and then a TATAAT sequence about 20 bases or so downstream of that. This isn't a fixed 35 bases upstream of the ATG site; it could be anywhere up to 200 bases away. But this is the characteristic called the Pribnow box. So we're gonna be looking for promoter-like sequences, similar to the Pribnow box. What you saw with the sequence there is that it's sort of a sequence motif; the conservation is indicated by case, with lower-case meaning not well conserved and upper-case meaning well conserved. But it's not a great candidate for a regular expression: a regular expression doesn't allow fuzzy matches, and it doesn't really handle the gaps, because the gaps can be variable in length. So this is a sequence that's just ripe for hidden Markov models.

So this is the workflow we've talked about in the past, stated as: how do I generate a motif from unaligned sequence data? We define our problem: we're looking at promoter sequences that have been collected, trying to identify this Pribnow box. Here's a mess of sequences, and we wanna run our hidden Markov model over them and be able to pick out that TTGAC and TATAAT, with the variable gaps in between. So we wanna see if our hidden Markov model can pick out those promoters and the conserved elements within the promoter sequence. What we're gonna do is take a set of about 50 bases, develop a hidden Markov model with a total of 50 hidden states, and train it on a set of sequences. We're gonna determine the emission and transition probabilities from this set of unaligned sequences using the Baum-Welch algorithm. That's our plan; we'll see if we can do it.

From our plan, we have to get our dataset. There is a large collection of E. coli promoter sequences, a total of about 1800 promoter sequences with a known transcription start site, and these have all been collected to span about 50 bases before the transcription start site. Now, these are not aligned, so we aren't attempting to find the Pribnow box through alignments. We're simply saying: here's a mess of sequences, here's 1800 of them, let's use the Baum-Welch algorithm to figure out where the patterns are. The other point is that E. coli has about 4,500 genes, so on average there's roughly one promoter for every three genes. That's characteristic of many bacteria, because they have operons, where one promoter serves maybe three or four genes that are read off together in the operon.

Okay, so we're gonna try programming our hidden Markov model. If we were to do this, we'd go to our Colab file. In our case, obviously, we're not expecting you to write this; we've already written it for you, both in Python and in R. It's given the title HMM motif, and so you can open it up. The general algorithm we're gonna follow is outlined here. Just like with all the other programs, we're gonna read our data. We're gonna do some encoding; it's not one-hot encoding, but we are encoding the sequence data. We're gonna initialize our HMM. We're gonna write the forward algorithm for Baum-Welch, and the backward algorithm for Baum-Welch; remember, forward-backward is another name for Baum-Welch.
And then we're gonna combine the two to create the Baum-Welch function. Then we're gonna write the Viterbi function; this is the dynamic programming function I talked about. We're gonna initialize and train the hidden Markov model by calling the two pieces, the HMM initializer and Baum-Welch. Then we're gonna create the log probability functions, that's the LOD, or log-odds, and use the Viterbi algorithm again to do the evaluation, to find those log probability measures. And then we're gonna evaluate the performance of our motifs in terms of correctly detecting the promoters. So it's a fairly complicated algorithm, and we probably don't have time, in fact, we don't have time, to go into a lot of the details.

As with most things we've seen in machine learning, we call up NumPy and Pandas. We have the dataset reading: our motif sequences are in a comma-separated file format, structured in a certain way, and we're using a dropna function to help clean up some of the data. We're also looking at the reverse sequences as well as the forward ones. So we've read in our dataset, and now we transform our features. In this case, we're converting A, T, C and G into the numbers zero, one, two and three. This is a form of encoding to help simplify things, and again, we're working with the forward sequence as well as the reverse sequence. We've chosen to solve this using an HMM; as I said, in the future we'd probably use recurrent neural nets or LSTM models, but historically, this is how people would use HMMs.

So we initialize our parameters: the transition probabilities, the emission probabilities and the initial state probabilities. Basically, we're calling random numbers to put in random starting points. This, as I say, is what you do when you don't have an alignment; you're letting Baum-Welch do the work for you. And you can see various functions, like np.random.rand, that help create those initial random distributions.

Now we encode the forward algorithm; Baum-Welch uses both a forward and a backward pass. The forward algorithm is defined here. I'm not gonna get into the details, it's pretty complicated math, but that's what it looks like. It uses the current state to determine the probability of the next state; this is where the Markov concept comes in. We have the visible states, the match states if you wanna call them that, which are the squares, and we calculate the probability that the Markov model will be in a particular state at a particular step. It's fairly costly in terms of calculation; it goes as order N squared in the number of states. And there's the backward algorithm, which is the other half of Baum-Welch; it's used to calculate the probability of the observations given the hidden states. Consider the pair of them the equivalent of the back-propagation we had in neural nets. So we've initialized the HMM, we've written the forward algorithm, and we've written the backward algorithm. Then you combine the two to create the forward-backward algorithm, which is essentially the learning algorithm: the Baum-Welch function we're creating.
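To give a feel for the forward half, here's a stripped-down sketch of the standard forward recursion. This is not the notebook's code; the toy transition, emission and initial arrays stand in for the randomly initialized ones:

```python
import numpy as np

def forward(obs, trans, emit, init):
    """Probability of an observed symbol sequence under the model.

    trans: (n_states, n_states), emit: (n_states, n_symbols), init: (n_states,)
    """
    n_states = trans.shape[0]
    alpha = np.zeros((len(obs), n_states))
    alpha[0] = init * emit[:, obs[0]]  # initialize with the first symbol
    for t in range(1, len(obs)):
        # Push the previous column through the transitions, then weight
        # by how well each state emits the current symbol.
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    return alpha[-1].sum()

# Toy example: 2 hidden states, alphabet {0,1,2,3} for A,T,C,G.
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]])
init = np.array([0.5, 0.5])
print(forward([0, 2, 3, 1], trans, emit, init))
```

In the real notebook, the same kind of recursion runs over 50 hidden states and thousands of sequences, usually with scaling or log-space arithmetic to avoid numerical underflow.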
In the lab code, this is where we calculate the forward and backward passes; those are the calls we're making here. Then there's some normalization, and some dot products calculated over the matrices we're building out. It carries on; again, fairly complicated math, and we're not expecting anyone to understand it all. But in the end, we create our transition and emission probabilities: tables of transition and emission probabilities for this particular hidden Markov model. So that's the learning part; as I say, think of it as equivalent to the back-propagation algorithm, but we call it the Baum-Welch function.

Then we have to create the Viterbi function, which does the dynamic programming. It's used in what we call the decoding, and also in the evaluation. Again, fairly complicated math, with lots of probabilities determined through it; it's not a short program. It calculates various maximum values, and it does what most dynamic programming algorithms do, which is a traceback step to find the best path. So this is the Viterbi algorithm; we're not gonna go into details, but trust me that it is a dynamic programming algorithm.

So we've built all the functions that are key for hidden Markov models, and now we can actually start putting the hidden Markov model together; we've built the base, now we can put on the icing. We start by calling the initializing function we created, and then the Baum-Welch function; these are highlighted in blue just to show you we're calling them. This is where we're essentially training: we start off with a random set of probabilities, then go through the iterative training with Baum-Welch, and the result is the probabilities that come out of that training. These are the state-to-state transitions, essentially a seven-by-seven matrix; then the emission probabilities, a four-by-seven matrix for the different bases; and the initial probabilities, an array of seven numbers.

From that calculation of the transition, emission and initial probabilities, we then call the Viterbi dynamic programming algorithm to calculate the log probabilities. This function calculates the log probabilities, and it also has a cutoff to draw the line. By running the Viterbi program, we can test our sequences: the 1800 promoter sequences, along with the thousands of non-promoter sequences we also had. The cutoff essentially asks: does this sequence fit the promoter probability or not? If you recall, that's how we were doing our log-odds evaluation, where we either multiplied or added and asked whether the result was above or below some cutoff, and accepted the sequence as a member of our motif or not. This is the same sort of thing: is it above the cutoff, is it a member of the promoter motif, the Pribnow box motif, or not?
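To make the decoding step concrete, here's a minimal sketch of the Viterbi recursion in log space. Again, this is not the notebook's function, just the standard algorithm, with the same toy parameters as the forward sketch above:

```python
import numpy as np

def viterbi(obs, trans, emit, init):
    """Best log-probability path of hidden states for an observed sequence."""
    log_t, log_e = np.log(trans), np.log(emit)
    v = np.log(init) + log_e[:, obs[0]]  # best log-prob ending in each state
    back = []                            # back-pointers for the traceback
    for symbol in obs[1:]:
        scores = v[:, None] + log_t      # every previous-state -> state move
        back.append(scores.argmax(axis=0))
        v = scores.max(axis=0) + log_e[:, symbol]
    # Trace back from the best final state to recover the hidden-state path.
    state = int(v.argmax())
    path = [state]
    for ptr in reversed(back):
        state = int(ptr[state])
        path.append(state)
    return v.max(), path[::-1]

trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]])
init = np.array([0.5, 0.5])
print(viterbi([0, 2, 3, 1], trans, emit, init))
```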
So: a lot of work, and really complicated. And this is the result. We've got 1800 promoter sequences and thousands of other non-promoter sequences, and the confusion matrix is shown on the right. Recall that in a good confusion matrix, the diagonal, where prediction matches observed, should be something on the order of 0.9 and 0.9. Instead, we're getting 0.2 and 0.7, and the off-diagonals are 0.8 and 0.3.

So in some respects, this hidden Markov model is a failure. It just underlines how complicated they are, and the fact that we were somewhat naive in taking just the first 50 bases, because it turns out that some of the promoter motifs lie outside that 50-base scan. At some level, it's a lot of work for nothing, but at least the diagonal is not all zeros: it is picking up promoter sequences. And it highlights the fact that promoter sequences are actually pretty noisy, that the conservation of the Pribnow box is actually quite poor, and that the position of the Pribnow box varies tremendously across the zero-to-50 window. It also highlights some of the weaknesses, I think, of hidden Markov models, which are, A, as you've seen from the code, they're really complicated, and B, we didn't do sufficient feature engineering in this particular example to really build on what's known, or find a smarter way of assembling or compiling the data, to get better performance. So I guess what I'd basically say is that we have an algorithm that sort of works. It's a correct hidden Markov model, and it doesn't crash, but because we weren't careful in assembling our data and putting in a sufficient number of features, the performance is not great.

Even so, we can still create sequence motifs with this trained hidden Markov model, and there's additional code in the program to actually do that: to propose a minus-10 box and a minus-35 box from a particular sequence. Overall, it's a big program, hundreds of lines long; it takes a long time to train and a fair bit of time to run. We've also written the program in R, and the performance in R and in Python is about the same. R is a little faster, but the R code is a little longer; in R, it's pure R, so we aren't using any extra matrix or data-reading libraries like NumPy and Pandas. So whether it's in pure Python or pure R, we've built a hidden Markov model. The performance isn't great. We think there's still possibly a bug in the code, which is why it's not doing perfectly, but we've never been able to find the bug in the last year, so maybe that's just the actual performance. It is possible to take this structure, the hidden Markov model structure with the Baum-Welch and Viterbi algorithms, and potentially apply it to more sophisticated things, but you'd really have to understand the code. And as I say, very few people truly understand hidden Markov models. We also, as I say, have the R code, and if you're more comfortable in R, you can run that.