Alright, welcome back everyone, whether you're watching this on Twitch or later on YouTube. This is part two of the sequence alignment lecture, so let's just continue. We talked about pairwise alignments a little bit. A pairwise alignment is when I have two sequences and I want to align them together, so I want to know how these two sequences are similar or how they are different. The question is: if I have a sequence of interest and a sequence with a known function, for example, what is the similarity between these two sequences? Just looking at them I can see that there seems to be some similarity. This one has three Ts followed by an AC, and this one has three Ts followed by an AC as well. So the last part is more or less identical, but the part before is not. The hypothesis here is that if these sequences are similar, then we can also infer that there might be a similar function for the protein or the microRNA that is being produced.
And here we have the concept of sequence identity. Sequence identity is the number of characters which match exactly between two different sequences. That is the definition. So you can have a sequence identity of 10 or 15. It's not a percentage, it's just a number.
There are two major ways of aligning sequences together. The first is global alignment. A global alignment attempts to align every residue in both sequences, and you use it when you are aligning two sequences which are more or less equally long. Equally long means, for example, that one is 100 base pairs and the other one is 110. But if there is a very big difference in the length of the sequences, for example we have a gene which is 5000 base pairs long and we want to find the best position for it in a genome, which is generally billions of base pairs long, then we need to use local alignment.
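The sequence identity defined above can be counted in a few lines. This is a minimal sketch that assumes the two sequences are already aligned to the same length; the function name and the example sequences are made up for illustration and are not the ones on the slide.

```python
def sequence_identity(seq_a: str, seq_b: str) -> int:
    """Count positions where two equal-length, already aligned sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


# Hypothetical sequences: the last five characters (TTTAC) are identical.
print(sequence_identity("GCATTTAC", "AGCTTTAC"))  # 5
```

The result is a count, not a percentage, which matches the definition given in the lecture.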
A local alignment describes the most similar regions within the sequences to be aligned, and typically a short sequence is compared against a longer one. So how does this look? Imagine that we have sequences S1 and S2. When we do a global alignment, we try to match the entire string, so we try to match S2 to the full length of S1. When we do a local alignment, we try to match the optimal substring. We try to find the region in S1 where most of S2 aligns, and this generally means that we are ignoring the ends of S1. In this case, you can imagine that S1 might be a genome, billions of base pairs long, and S2 is just a 500 base pair strand of DNA that we want to look for. In local alignment you kind of slide the sequence that you want to find along the sequence that you are searching against. In global alignment you don't do that. A global alignment always tries to start S1 and S2 at the exact same position, and then it tries to optimize the alignment by inserting gaps into S2 while keeping as many matches as possible. That is the major difference: local alignment finds the optimal substring, while global alignment tries to match both strings from the start of S1 all the way up to the end of S1.
There are many, many different ways of aligning, so we need a scoring function to say this is a good alignment and this is a bad alignment. The very simplest score that you can come up with is just the percentage of matches. If I align like this in my first alignment, I'm matching five out of 17 positions. In alignment two, I'm matching 11 out of 16 positions. The slide is not entirely perfect because the font got changed; this should be an equally wide spaced (monospaced) font.
But now we have a way of determining which alignment is better. If I look at the percentage of matches, alignment two is better than alignment one. Of course, this is the most basic scoring function that you can have. People thought about this a lot, and around the 1980s they said: no, just looking at the percentage is not good enough. What you want to look at is the similarity of two sequences. So you have a similarity score for the aligned letters: if S1 at position X matches S2 at position Y, we give a plus one, and if there is no match, we give a minus one. So we increase the score for matches and we subtract for mismatches. Additionally, when you do the alignment, sometimes you have to introduce gaps, because it's better to introduce a small gap in sequence S2 so that the rest aligns better. So they introduced something called a gap penalty: a penalty for introducing a gap. You define the score as the sum of the similarity scores plus the number of gap positions that you needed to introduce times the gap penalty. The gap penalty is usually set to minus one, which means that the more gaps you introduce, the more you subtract from the alignment score. This is the most simplistic scoring function used in biology to score similarity between two sequences. Of course, there has been a lot of improvement since then.
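The simple scoring function described above (plus one for a match, minus one for a mismatch, minus one per gap position) can be sketched as follows. This assumes the alignment is given as two equal-length rows with `-` marking a gap; the function name and example rows are illustrative only.

```python
def linear_gap_score(row_a: str, row_b: str,
                     match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Score an existing alignment with a linear gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # every gap position costs the same
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# Three matches and one gap position: 3 * (+1) + 1 * (-1)
print(linear_gap_score("ACGT", "AC-T"))  # 2
```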
One of the improvements since then is that nowadays we use additive scoring with something called an affine gap penalty. If you look at how sequences mutate and how insertions and deletions occur in sequences, the length of a gap doesn't matter all that much, but introducing a gap in the alignment at all is kind of a big event that deserves a big penalty. With the previous scheme we had the number of gap positions times minus one. But of course it matters whether you have 10 gaps introduced at 10 different positions in the sequence or a single gap which is 10 base pairs long. Biologically speaking, a single gap of 10 base pairs is much more likely to occur than 10 separate deletions of a single base pair. That's why we have the affine gap penalty. An affine gap penalty means that you have a gap opening penalty, which is relatively high, and a gap extension penalty, which is relatively low. That means that introducing a gap of size five or a gap of size 10 costs more or less the same. But opening five gaps at five different places in the alignment leads to five gap opening penalties. So having five gaps spread across the whole sequence penalizes the alignment score more than having a single gap which is five base pairs long. Nowadays a variant of this is generally what people use. So again, the similarity is defined as the sum over how many base pairs match or mismatch, plus one for a match and minus one for a mismatch. Then the number of gap openings is multiplied by the gap opening penalty (GOP), which is relatively high, so minus five or minus 10, and the number of gap positions is multiplied by the gap extension penalty (GEP), which is generally very low.
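To make the affine gap penalty concrete, here is a small sketch that scores an existing alignment as matches and mismatches, plus the number of gap openings times the gap opening penalty, plus the number of gap positions times the gap extension penalty. The default values (minus five and minus 0.1) follow the orders of magnitude mentioned in the lecture; real tools differ in their exact conventions, so treat this as one possible convention rather than the definition.

```python
def affine_gap_score(row_a: str, row_b: str, match: float = 1.0, mismatch: float = -1.0,
                     gap_open: float = -5.0, gap_extend: float = -0.1) -> float:
    """Score an existing alignment with an affine gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0.0
    previous_gap = None  # which row ("a" or "b") had a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            current_gap = "a" if a == "-" else "b"
            if current_gap != previous_gap:
                score += gap_open   # a new gap starts here
            score += gap_extend     # every gap position pays the extension cost
            previous_gap = current_gap
        else:
            score += match if a == b else mismatch
            previous_gap = None
    return score


# Same five matches and five gap positions, but one long gap versus five short ones.
one_long_gap = affine_gap_score("ACGTACGTAC", "AC-----TAC")
five_short_gaps = affine_gap_score("ACGTACGTAC", "A-G-A-G-A-")
print(one_long_gap > five_short_gaps)  # True
```

With these settings the single long gap costs one opening penalty, while the five scattered gaps cost five, which is the behaviour the lecture describes.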
A gap extension penalty is in the order of minus 0.1. This means that an alignment is penalized for having a lot of little gaps, but not penalized so much for having a single big gap. And this is more in line with how biology works: if 15 base pairs are inserted into a sequence, and I compare the sequence without the insertion to the one with the insertion, then of course these 15 base pairs are going to be next to each other. So that is the difference between the linear gap penalty, where every gap position that I introduce gets a minus one, and the affine gap penalty, where opening a gap is relatively expensive but extending a gap, so making it bigger, is a relatively cheap operation that I don't penalize too much.
When we think about alignments, we have to realize that on the DNA level the alignment can give a very different answer than on the protein level. If we look at S1 and S2, and we assume that both code for proteins, then we can have this alignment of the two sequences. If we now calculate the percentage of matches on the DNA level, we would say: this one matches, that one matches, so two out of three match for the first codon. Here we also have two out of three matching, so that's already four out of six. Then here we have no matches, so that's zero, which makes four matches out of nine base pairs. And then we have another three base pairs of which two match. So in total, on the DNA level, we have six matches out of 12, a 50% identity between the DNA sequences. However, if we then look at the protein level, we have to work out the proper amino acid for each codon. In the case of CAC, we start with C-A and then go to C, and that codes for histidine.
But if we look at C-A-T, which is C-A-U in RNA, that also codes for histidine, so there's a perfect match for the first codon. If we do the same thing for the second codon, G-C-G is alanine, and G-C-A also codes for alanine. So on the protein level there is again a perfect match for the second codon. You can see that matches on the DNA level are generally not as good as matches on the protein level, because the third base in the codon is degenerate. So far two out of four amino acids match. If we continue, we have TCC, so UCC, which codes for a serine, and then we have AGT, so A-G-U, which also codes for a serine. So at least three out of four amino acids match on the protein level. And for the last one, again, the only difference is in the third base, so these also produce the same amino acid. That means that on the DNA level we only have 50% identity, so only 50% of the bases have a direct match, but on the protein level we have a 100% match. These two sequences code for the exact same four amino acids. So when you are considering alignments and looking at things like evolution or how far two proteins are apart, there might be a big difference between comparing two animals on the DNA level and comparing them on the protein level. Keep in the back of your mind that DNA alignment is not going to give you the exact same answers as protein alignment, because of this third base degeneracy, the wobble base in the codon.
Not just that, but if we look at DNA and an A changes to a G, then because of the chemical structure of the bases this is a relatively common change, which we call a transition. Transitions are very frequently observed in DNA, while a C changing to a G is biochemically much harder.
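The codon walkthrough above can be reproduced in code. This sketch uses only the three codon pairs that are spelled out in the lecture (CAC/CAT, GCG/GCA, TCC/AGT); the fourth codon pair is not named in the transcript, so it is left out here. The codon table is the standard genetic code; the helper names are made up for this example.

```python
import itertools

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA codon by codon (one-letter amino acid codes, * = stop)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identity_fraction(seq_a: str, seq_b: str) -> float:
    """Fraction of positions that match between two equal-length sequences."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


s1 = "CACGCGTCC"  # CAC GCG TCC
s2 = "CATGCAAGT"  # CAT GCA AGT
print(identity_fraction(s1, s2))                        # 4 of 9 bases match
print(translate(s1), translate(s2))                     # HAS HAS
print(identity_fraction(translate(s1), translate(s2)))  # 1.0
```

On these three codons the DNA identity is four out of nine, while the protein sequences (histidine, alanine, serine) are identical.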
A C changing to a G doesn't happen as often. Transversions, so going from a C to a G or from a G to a T, are much rarer. When we do the alignment, we of course want to take this into account. We still say that when two bases match, an A matches an A, that is a plus one score. But if you have an A aligned to a G, then the penalty for this should be lower than when you align an A to a C, because an A changing to a C is just much more uncommon in how biology and biochemistry work. If you have a more or less random mutation, which occurs because of chemicals or whatever, then chemically speaking the purines, adenine and guanine, are much closer together. So it's chemically much easier to change a guanine into an adenine than it is to change a guanine into a thymine. So not only do we consider matches and mismatches; for the mismatches we also have to consider the substitution probability. The substitution probability of an A turning into a G is higher than that of an A being substituted by a T. So the original scheme, where we said plus one for a match and minus one for a mismatch, should be adapted: we should not always score minus one for a mismatch. Sometimes we should score minus a half and other times minus two, for example, because transitions are frequent and transversions are rare in the genome. I hope that that's clear.
If we talk about amino acids, this is even more complicated, because some amino acids are very similar to each other while other amino acids are very different. Here we have the Taylor diagram of amino acids, which is just a Venn diagram of how closely related amino acids are.
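The transition versus transversion idea can be written as a small scoring function. The values minus a half and minus two are the example numbers from the lecture, not values taken from a particular tool, and the function name is an assumption for this sketch.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_substitution_score(x: str, y: str, match: float = 1.0,
                           transition: float = -0.5, transversion: float = -2.0) -> float:
    """Score one aligned pair of bases, penalizing transversions more than transitions."""
    if x == y:
        return match
    pair = {x, y}
    # A transition stays within the purines (A, G) or within the pyrimidines (C, T).
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    return transversion


print(dna_substitution_score("A", "G"))  # -0.5, transition
print(dna_substitution_score("A", "C"))  # -2.0, transversion
```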
This tells me that substituting an isoleucine by a leucine is a commonly observed thing in biology. If I have a protein sequence, then at amino acid position 44, say, seeing either an isoleucine or a leucine is very common. But if instead of an isoleucine I all of a sudden see a proline, that is a big difference, because the distance between isoleucine and proline in the diagram is much bigger. So the chance of seeing an isoleucine change to a proline is much lower. When I align two sequences, I want to give a bigger penalty for a mismatch between an isoleucine and a proline (P) than for an isoleucine in the one sequence and a leucine in the other, because those two are very close together. Similar amino acid substitutions occur with higher probability, and when we align two protein sequences to each other we need to compensate for this. For a mismatch, we need to be more relaxed when it's an F to a Y than when it's an F to a Q.
Of course a computer cannot understand this figure, the Venn diagram of amino acids, this Taylor diagram. So for a computer we have a different representation. There are two different types of matrices which represent the substitution probabilities for amino acid exchanges. The first one that I wanted to tell you guys about is the PAM matrix, the point accepted mutation matrix. It holds the match scores and mismatch penalties for when you align two protein sequences together. On the diagonal you see the score which is given to a match. So if a cysteine matches with a cysteine, you increase your score by 12 points. If you then see that at another position the first sequence has a proline but the other sequence has a leucine, you score it minus three.
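For a computer, a substitution matrix is just a lookup table. The sketch below holds only the three PAM250 entries quoted in the lecture (cysteine with cysteine, proline with leucine, proline with tryptophan); a real matrix has an entry for every pair of the 20 amino acids. The names `PAM250_EXCERPT` and `pam_score` are made up for this example.

```python
# Excerpt of PAM250 using the values quoted in the lecture; not the full matrix.
PAM250_EXCERPT = {
    frozenset("C"): 12,    # cysteine aligned with cysteine (diagonal entry)
    frozenset("PL"): -3,   # proline aligned with leucine
    frozenset("PW"): -6,   # proline aligned with tryptophan
}


def pam_score(residue_a: str, residue_b: str) -> int:
    """Look up the score for an aligned residue pair; the matrix is symmetric."""
    return PAM250_EXCERPT[frozenset((residue_a, residue_b))]


print(pam_score("C", "C"))  # 12
print(pam_score("L", "P"))  # -3, same as P aligned with L
print(pam_score("P", "W"))  # -6
```

Using a `frozenset` as the key is one simple way to express that the score for P against L is the same as for L against P.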
If you instead see a proline turning into a tryptophan, you score minus six. So this gives you the scoring matrix. A computer can use the scoring matrix and just look up: I'm aligning, I now have these two residues which are matching or mismatching, what score do I need to give? The PAM250 matrix shown here is based on the idea that similar amino acids are substituted with higher probability. If both are small amino acids, you penalize the mismatch less than when a small amino acid is substituted by a big amino acid.
That said, nowadays almost no one uses PAM anymore. Nowadays people almost always use the BLOSUM matrix. The BLOSUM matrix is very similar to the PAM matrix and has the same kind of goal, but BLOSUM stands for blocks substitution matrix. Amino acids in the table are grouped according to the chemistry of their side chain. Each value in the matrix is calculated by dividing the frequency of occurrence of the amino acid pair in the BLOCKS database by the frequency that you would expect by chance, and taking the logarithm of that ratio. The BLOCKS database is a big database of aligned protein sequence blocks: they take known protein sequences, align them together, and then see how often each substitution occurs. A score of zero in a BLOSUM matrix indicates that two amino acids were found aligned in the database as often as expected by chance. So there's no positive or negative evidence for this substitution; in terms of our simple scheme, think of this as the standard minus one penalty. A positive score in the BLOSUM matrix indicates that the pair was found aligned more often than by chance. So we don't want to give the full minus one penalty in this case; we want to give a slightly smaller penalty, like minus 0.2 or minus a half. Negative scores in the BLOSUM matrix indicate that the pair was found aligned less often than by chance. This is a very uncommon substitution, so it should be penalized more heavily.
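The sign convention of BLOSUM scores follows from how they are computed: a log-odds ratio of observed versus expected pair frequency. The sketch below assumes the half-bit scaling (two times the base-2 logarithm, rounded) that BLOSUM matrices are commonly reported in; the frequencies in the example are invented for illustration and are not taken from the BLOCKS database.

```python
import math


def log_odds_score(observed_frequency: float, expected_frequency: float) -> int:
    """Log-odds score in half-bit units: 2 * log2(observed / expected), rounded."""
    return round(2 * math.log2(observed_frequency / expected_frequency))


# Hypothetical pair frequencies, chosen only to show the three cases.
print(log_odds_score(0.01, 0.01))    # 0: seen as often as expected by chance
print(log_odds_score(0.04, 0.01))    # 4: seen more often than expected
print(log_odds_score(0.0025, 0.01))  # -4: seen less often than expected
```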
For such an uncommon substitution, instead of minus one you would give a minus two. BLOSUM matrices come in a whole bunch of varieties. Since the BLOSUM matrix is just built from the occurrence of amino acid pairs in the BLOCKS database, you can say: build me a BLOSUM62 matrix. That means that when I build my matrix, I only consider sequences which are no more than 62% identical. I can also build a BLOSUM80 matrix, which means the input is not limited to 62% identity, but I also allow proteins which are up to about 80% identical. And I can have a BLOSUM45 matrix, which is built with the emphasis on distantly related proteins. Which BLOSUM matrix you use when you align two sequences depends on how far apart the two sequences are that you are aligning. If I have two sequences and I know that both of them are from cows, one from cows from Europe and one from cows from America, then I'm going to use a BLOSUM80 matrix, because cows and cows are very similar. In that case I want to look at closely related proteins and see what substitution frequencies are commonly observed between them. Normally, if you don't know how far apart your species are, you use a BLOSUM62 matrix. BLOSUM62 is more or less the standard matrix. But if I'm doing an alignment and I know that the sequences should be closely related, because the animals that I got the sequences from are related animals, then I use BLOSUM80. If I'm comparing, for example, proteins from mouse to proteins from ants, then there's a big evolutionary distance between these two species.
Then I generally tend to use the BLOSUM45 matrix, which means that I want substitution probabilities from a matrix that was built on very dissimilar proteins. So it allows me to have different substitution probability matrices for very recent evolutionary events, for more or less unknown evolutionary distances, because I don't know how far apart my species are, and for very unrelated species. It has to do with evolution: the further apart two species are, the more likely it is that their proteins start differing as well, and I want to compensate for that.
Fun fact about BLOSUM62: it is the default matrix when you do a protein BLAST. So when you BLAST protein sequences against other protein sequences, you use BLOSUM62, and it has been the standard for many years. But in the Henikoff and Henikoff article from 1992 they made a calculation mistake, and this mistake caused the matrix to not be exactly the way they describe it. They had this algorithm for how you build the matrix, they built the matrix using their algorithm and made a computation mistake, and surprisingly this mistake actually improved search performance. So it actually made protein comparison better. Just a little fun fact.
Okay, so when we compare PAM matrices to BLOSUM matrices, these are more or less the differences. A PAM matrix, so a point accepted mutation matrix, is based on global alignments of closely related proteins, while the BLOSUM matrix is based on local alignments. So if you have sequences of very different length, you use BLOSUM; if you have sequences which are very similar in length, you can better use PAM. The PAM1 matrix is the matrix calculated from comparisons of sequences with no more than 1% divergence.
So PAM1 is the matrix which looks at very, very closely related sequences. And BLOSUM62 is the matrix calculated from sequences with no more than 62% identity. So they are each other's inverse, in a sense: a BLOSUM99 matrix would be similar to a PAM1 matrix. All the other PAM matrices are extrapolated. They only do one computation from data: they compute PAM1, and then PAM2 is computed from the PAM1 matrix, and so on. In BLOSUM that's not true, because every BLOSUM matrix, BLOSUM62, BLOSUM40, BLOSUM80, is computed from observed alignments. They are not extrapolated, whereas the PAM matrices are more or less mathematically generated from PAM1. So higher numbers in the PAM naming scheme denote a larger evolutionary distance, while for BLOSUM it's the other way around: larger numbers mean a smaller evolutionary distance.
All right. Finding the optimal alignment was more or less solved in 1981. If I have two sequences, I want to align them together and I want to find the best alignment, then this is a solved problem. You don't have to study this anymore. You can't do a PhD in finding the optimal alignment, because this was solved once and for all mathematically by Smith and Waterman. In 1981 they published their algorithm for finding the optimal local alignment, and the same dynamic programming idea gives the optimal global alignment, which is the Needleman-Wunsch algorithm from 1970. Under all circumstances it will find the optimal alignment between two sequences for the scoring scheme that you give it. The way it works is a systematic construction of all optimal solutions, so it's an algorithm which is guaranteed to give you the best alignment given two sequences. The big problem with the Smith-Waterman algorithm is that it takes time proportional to the product of the sequence lengths. So the longer your sequences are, the longer you're going to wait for the optimal alignment.
And of course this is not always feasible. Smith-Waterman works very well when your sequences are short. If you have a 100 base pair sequence which you want to compare to a 50 base pair sequence, then Smith-Waterman will work. But when one of these sequences becomes very large, like the size of a genome, like 2 billion base pairs, the runtime becomes a problem. The runtime of the algorithm is not exponential; it scales with the product of the two lengths. So aligning a sequence of 2,000 base pairs against a sequence of 100 base pairs, compared to aligning a sequence of 2 billion base pairs against that same 100 base pair sequence, gives you a million-fold increase in runtime. It will take a million times longer to do 2 billion versus x than 2,000 versus x. That's the multiplication in there.
The way the Smith-Waterman algorithm works is more or less a visual thing. It creates a dot plot and then tries to find the best-scoring path through the dot plot. I did my best to show you guys how this looks. Here we have the human myoglobin sequence and here the mouse myoglobin sequence. We write all of the letters of one sequence along this axis and all of the letters of the other along that axis, and then for each pair of letters we write down whether they match or not. If we zoom in a little bit, say we want to align Paranas with Parana. Then we just write down the letters and say: the P matches the P, the A matches the A and it also matches this A, the R matches this R, the A again matches both positions, the N matches this one, and the A again matches both positions. So we just make a matrix, and every time the row letter matches the column letter, we say this is a perfect match. Of course, when using DNA or protein sequences, we don't write a perfect match as a one in the matrix; we write the score from the substitution matrix.
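The dot plot described above is simply a matrix with a one wherever the row letter equals the column letter. A minimal sketch, using two short made-up DNA strings rather than the words on the slide:

```python
def dot_matrix(row_seq: str, col_seq: str) -> list[list[int]]:
    """Return a matrix with 1 where the row letter equals the column letter, else 0."""
    return [[1 if r == c else 0 for c in col_seq] for r in row_seq]


# Hypothetical example sequences.
rows, cols = "GATTA", "GATA"
print("  " + " ".join(cols))
for letter, row in zip(rows, dot_matrix(rows, cols)):
    print(letter, " ".join(str(cell) for cell in row))
```

A letter that occurs more than once in the column sequence lights up in several columns, which is why an alignment has to choose one path through the matrix.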
For DNA, we take the transition and transversion scores. And when we use a BLOSUM matrix, we take the value which is located in the BLOSUM matrix and put it here. So how do we do this? The optimal alignment is the alignment that you get when you go from the source, the green dot here, all the way to the sink, the red dot, which has gone missing from this slide. The sink is over here, in the opposite corner. Any pairwise alignment can be represented as a path through the matrix, and we look for a path from the source to the sink. Every time there is a perfect match, we take the diagonal. When there is not a perfect match, we can do three things: we introduce a mismatch, or we introduce a gap in sequence one, or we introduce a gap in sequence two. So what the Smith-Waterman algorithm does is consider all possible paths from the beginning to the end, and each path gets a score, because introducing a gap or a mismatch just adds a negative score. One of the ways of aligning would be to say that these two sequences don't match at all. We start at green and walk all of the way along here, so we introduce gaps in sequence two, and then we go down here, so we introduce gaps in sequence one, and then we're at the sink. But this is a very bad path, because it gets a very negative score. But this is how it works. It computes the scores of all of the paths through the matrix and then takes the one with the highest score, and that is the optimal path. This algorithm is very time-consuming.
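The path search described above is usually implemented with dynamic programming: each cell stores the best score of any path that reaches it, so all paths are considered without listing them one by one. The sketch below is a simplified version with a linear gap penalty and plus one / minus one scoring. With `local=False` it scores the source-to-sink path (the global, Needleman-Wunsch style variant); with `local=True` it applies the Smith-Waterman changes (scores never drop below zero and the best cell anywhere is reported). It returns only the score, not the alignment itself, and the function name is an assumption for this example.

```python
def best_alignment_score(s1: str, s2: str, match: int = 1, mismatch: int = -1,
                         gap: int = -1, local: bool = False) -> int:
    """Best alignment score via dynamic programming; time is len(s1) * len(s2)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps along the edges are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            gap_in_s2 = score[i - 1][j] + gap   # letter of s1 aligned to a gap
            gap_in_s1 = score[i][j - 1] + gap   # letter of s2 aligned to a gap
            cell = max(diagonal, gap_in_s2, gap_in_s1)
            if local:
                cell = max(cell, 0)             # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


print(best_alignment_score("ACGT", "ACT"))                    # 2: three matches, one gap
print(best_alignment_score("TTTACGTTT", "ACG", local=True))   # 3: ACG found inside s1
```

The two nested loops make the cost proportional to the product of the sequence lengths, which is the runtime problem mentioned in the lecture.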
So in 1990, Altschul et al. published a paper in the Journal of Molecular Biology, which is nowadays one of the most cited papers in bioinformatics and describes one of the most popular bioinformatics programs. It is called the Basic Local Alignment Search Tool, so BLAST. The tool aligns two sequences together and it is the most popular alignment tool. Why? Because it speeds up searching by orders of magnitude. If you wanted to compare one sequence against all of the sequences in GenBank, I think it's like 12 billion or something, using Smith-Waterman, you would wait until the end of the universe before it comes up with the best alignment. By using BLAST, you can have one sequence searched against all of the sequences in the GenBank database within minutes. That is because it uses smart pre-processing: only sequences with similar substrings are compared. Instead of taking your whole sequence, going through each sequence in the database and aligning them, it chops up your sequence into words. It takes the first five base pairs, then the next five, the next five, and so on. And it only compares your sequence to sequences in the database which have a match with one of these substrings. This quickly excludes many, many possible alignments. If there's no match, and this has to be a perfect match, between any substring of your sequence and the other sequence, then the whole comparison is discarded.
Funny fact: the Journal of Molecular Biology, let me say it correctly, is a journal with an impact factor of five. So one of the most popular bioinformatics tools, used by millions and millions of people across the world every day, was published in a journal with an impact factor of only five.
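The word-based pre-filtering idea can be illustrated with a toy example. This is not BLAST's actual implementation: it only shows the principle that a database sequence is kept for full alignment when it shares at least one exact word with the query. The sketch uses overlapping words of length five, and the database contents and function names are invented for illustration.

```python
def words(sequence: str, k: int = 5) -> set[str]:
    """All overlapping substrings (words) of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def candidate_hits(query: str, database: dict[str, str], k: int = 5) -> list[str]:
    """Names of database sequences sharing at least one exact word with the query."""
    query_words = words(query, k)
    return [name for name, seq in database.items() if query_words & words(seq, k)]


# Hypothetical mini database.
database = {
    "seq_a": "TTTTACGTACGGG",  # contains the word ACGTA
    "seq_b": "CCCCCCCCCC",     # shares no word with the query
}
print(candidate_hits("ACGTAC", database))  # ['seq_a']
```

Only the sequences that survive this cheap filter would then be handed to the expensive alignment step.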
Just to remind you guys: not everything published in Nature is great, and stuff published in low impact journals can be awesome.
BLAST provides many different ways of aligning sequences. If you want to align a DNA sequence against a DNA sequence, you can use BLASTN. If you align a protein sequence against a protein sequence, you can use BLASTP. If you want to align DNA sequences against proteins, this of course uses the amino acid translation table, and it uses six frames. Six frames means that the DNA is translated in all six possible reading frames: three starting positions on the forward strand and three on the reverse strand. Each of those translations is compared against the proteins, and for this we use BLASTX. So I can take a DNA sequence, BLAST it against the protein database and still find the protein which has the highest similarity to the DNA sequence. If I want to do it the other way around and align proteins against the DNA database, I can also do that; the tool that you need is called TBLASTN. And I can also compare DNA via six-frame translation against DNA via six-frame translation, and this is called TBLASTX. This is based on the fact that if you compare two DNA sequences, you may want to know whether they are coding for the same protein. That's why TBLASTX exists, because the third base is the wobble base, which we learned about in the RNA lecture.
Good. When you do BLAST, every alignment gets a quality number, and this quality number is called the E value. The smaller the better, and the closer to zero the better. An E value of (nearly) zero means that the two sequences match so well that you would essentially never see it by chance, for example because they are identical. An E value of 50 means that the match is no better than what you would find at random. The E value in a BLAST alignment is very similar to a P value that we compute in statistics.
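A six-frame translation, as used when DNA is compared against proteins, can be sketched like this: three reading frames on the given strand and three on its reverse complement. The codon table is the standard genetic code; incomplete codons at the end of a frame are dropped. The helper names are assumptions for this example, and a real tool handles ambiguous bases and alternative genetic codes that this sketch ignores.

```python
import itertools

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate complete codons only (one-letter codes, * = stop)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna: str) -> str:
    """Complement every base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna: str) -> list[str]:
    """Frames +1, +2, +3 on the forward strand, then -1, -2, -3 on the reverse strand."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


print(six_frame_translations("ATGGCC"))  # ['MA', 'W', 'G', 'GH', 'A', 'P']
```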
The E value tells you how many alignments of that quality or better you would expect to find in a random database of sequences of the same size. So if I get an E value of 0.02, you can more or less interpret it like a P value. We say that E values lower than 0.05 are considered homologous and E values higher than 0.5 are not considered to be similar. Besides the E value, BLAST also tells you the percentage of identity, the higher the better. It's just the number of base pairs that match out of the total number of base pairs in the alignment. And it also tells you the length of the alignment, so the largest matching stretch: how many base pairs of sequence 1 were aligned to sequence 2? Because it could be that the alignment is really good, but only when you chop off the first 10 base pairs of sequence 1.
When we talk about homology, the rule of thumb is that the E value should be lower than 1 times 10 to the minus 5. There should be a continuous stretch of around 100 base pairs, or amino acids when you are doing a protein search, matching between the two sequences. And there has to be an identity between the sequence that you're searching with and the sequence that you found of around 70% on the DNA level, while on the protein level we still say there is homology when the identity is 25%. But this is really a rule of thumb. There's no hard rule that says two sequences are homologous. But generally, when we annotate a genome, so when we have a sequence in an unknown genome, we BLAST it against the database, and when we find another sequence with an E value below 1 times 10 to the minus 5, a stretch of 100 base pairs and 70% identity, then what people do is take the sticker of the known gene and put the same sticker on the gene in the new genome.
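The rule of thumb above can be written down as a simple filter. The thresholds are the approximate ones from the lecture (E value below 1e-5, a stretch of about 100 positions, about 70% identity for DNA or 25% for protein), and the function name and parameters are invented for this sketch; it is a heuristic, not a definition of homology.

```python
def passes_homology_rule_of_thumb(e_value: float, percent_identity: float,
                                  aligned_length: int, is_protein: bool = False) -> bool:
    """Apply the lecture's rule-of-thumb thresholds to one BLAST hit."""
    minimum_identity = 25.0 if is_protein else 70.0
    return (e_value < 1e-5
            and aligned_length >= 100
            and percent_identity >= minimum_identity)


# Hypothetical hits.
print(passes_homology_rule_of_thumb(1e-30, 85.0, 450))                  # True
print(passes_homology_rule_of_thumb(0.02, 85.0, 450))                   # False, E value too high
print(passes_homology_rule_of_thumb(1e-12, 31.0, 180, is_protein=True)) # True
```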
So for myostatin this would be that, okay, I have a sequence in an unknown genome, it matches the known myostatin gene, and you just put the myostatin sticker on it, but this is the rule of thumb. All right, so that was everything that I wanted to tell you about pairwise sequence alignment. So the next step is, of course, multiple sequence alignments. What do we do when we have more than two sequences, right? So multiple sequence alignment is one of the most essential tools in molecular biology, and the aim is to find highly conserved subregions or embedded patterns in a set of biological sequences, and this is because conserved regions are generally regions where stuff happens, right? It's where transcription factors bind, and the sites at which they bind have a very similar sequence across different species. But also you can think about SARS-CoV-2, right? The reason why we designed the original vaccines against parts of the spike protein is because of course these sequences were found to be homologous, right? Because in many different coronaviruses, all of the coronaviruses more or less had a similar amino acid sequence at this position, so that's why we used it to develop an antibody against that. Not only that, but what multiple sequence alignment allows us to do is that it allows us to estimate the evolutionary distance between sequences but also between species, and we can use it in the prediction of protein secondary and tertiary structure, because if we know that two protein sequences are highly similar and are highly homologous, then we can also assume that the 3D structure of these two proteins is very similar. The first practical method for multiple sequence alignment was developed in 1987 by Sankoff, and before 1987 alignments were constructed by hand. So if you were a bioinformatician in 1983, the year that I was born, then around 70% of your time was spent on multiple sequence alignments, and the way that you would do it is you would have like these pieces of paper with these squares, right?
You would write down the sequences that you have and then you would by hand start aligning these sequences. So you move the little pieces of paper on top of each other and with your eyes you would look if there is similarity or not. Fortunately we don't do that anymore, but back then no one was actually able to do this with a computer, and that is because dynamic programming was still very expensive. So using a lot of memory and remembering all of the different paths through this graph, through this 3D space or 4D space, is really hard, but in 1987 the first practical method became available by Sankoff. So the idea about multiple sequence alignment is that you perform successive pairwise alignments. So instead of having a single pair, you compare sequence one to sequence two, sequence one to sequence three, sequence one to sequence four, and then you do the same thing for sequence two, right? And then the first thing that you want to do is kind of build a consensus sequence. Like, where is there no variation in all of these alignments? And then of course further alignments are done based on your consensus sequence. So there are again very crucial parameters when you do multiple sequence alignment, so which scoring matrix you use has a big, big effect on your alignment. Your gap opening penalty and your gap extension penalty also make for a big difference in the alignment. So when you align more than two sequences you can use the same strategy as when aligning two sequences, right? So hey, if you have a dot plot, so sequence one, sequence two, now instead of using a two-dimensional matrix you have an n-dimensional matrix in which each axis represents a sequence to align. So when I have three sequences it ends up being a cube, right? So I have piranhas here, piranhas here and I have another piranhas or something on this axis.
And again we do the same thing, we go from source to sink and we calculate all of the paths possible through this 3D matrix, and then the path which has the best score, which hits the most matches and which avoids most of the insertions, deletions and these kinds of things, that is the optimal path and the optimal alignment. So of course for three sequences of length n the computational time here is n to the power of three, right? Because it used to be the length of sequence one multiplied by the length of sequence two. And of course for three sequences it's just length of sequence one times length of sequence two times length of sequence three. So that is n to the power of three. But the problem here is when you have k sequences, imagine that you're trying to align 16 different sequences together. Then you need to build a k-dimensional matrix and then the computational time becomes n to the power of k, which is just an insane computational problem. So dynamic programming approaches for aligning two sequences are easily extended to k sequences, but this is very impractical because of this exponential runtime. So in many cases, hey, by just using the Smith-Waterman algorithm on like five sequences of length a hundred, you already end up with a computational time which is like a hundred to the power of five. And so the amount of waiting time quickly becomes more or less similar to like the time that the universe has existed. Good. Yeah, let's do one or two more slides. So multiple sequence alignment is very useful for more distant alignments because you get a view of which positions are conserved within DNA or which positions are conserved in proteins. The most commonly used algorithm today is Clustal W or Clustal Omega, which is the updated version of the algorithm.
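To put numbers on the n to the power of k growth described above, here is a short Python sketch. It follows the lecture's approximation of n cells per axis, while a real table has one extra row per axis for the empty prefix. The helper names are made up for illustration.

```python
def dp_table_cells(length, k):
    """Approximate number of cells in a k-dimensional alignment table.

    Follows the lecture's n**k estimate for k sequences of equal length n.
    """
    return length ** k


def predecessors_per_cell(k):
    """Each cell looks back at every combination of 'advance or gap' per
    sequence, except the combination where no sequence advances."""
    return 2 ** k - 1


if __name__ == "__main__":
    for k in (2, 3, 5, 16):
        cells = dp_table_cells(100, k)
        print(f"k={k:2d}: {cells:.3e} cells, "
              f"{predecessors_per_cell(k)} predecessors per cell")
```

For five sequences of length 100 this gives 10,000,000,000 cells, each with 31 predecessor cells to check. That is why the exact dynamic programming approach is dropped in favour of heuristics such as progressive alignment.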
The original Clustal algorithm was written by Higgins and Sharp in 1988, and it is a progressive algorithm which first tries to identify the best high-scoring pairs. So which two sequences are aligning the best together, and then it identifies the closest pair of pairs. So first you identify which sequences are pairwise most similar, and then in the next step you see which of these groups are similar to each other. And then you continue this. So you do this iteratively. So you take one sequence, you add the sequence which is most similar, and then you add the next sequence which is most similar, and the next one and the next one. So you're building up kind of an alignment tree. And then you do this until all the sequences that you want to align are aligned. So it's a multiple alignment scheme where corresponding amino acids are in one column. So how does this look, right? So if we look at the myostatin gene of 10 different animal species, so human, pig, mouse, rat, sheep and so on, then this is how the multiple alignment looks, right? So what is multiple alignment? Well, given k sequences, we want to align them in a way that finds the best-scoring path through this 10-dimensional matrix. So for these sequences that we had, it looks kind of like this. But what does this now tell us? Well, it tells us for example which residues are conserved, right? You can see that across all of the species that we looked at, there is always a C at position number five. There is always a W at this position. So this means that this C and this W are important for the functioning of the protein that we're looking at, right? Because otherwise, there would have been changes when looking across like a million years of evolution, right? And that's kind of what you do when you compare like humans, pigs, mouse and sheep together. The more distant the species that you add, the further that you're looking back in evolution, right?
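The "start with the best pair, then keep adding the most similar sequence" idea can be sketched in a few lines of Python. This is a toy illustration, not the Clustal implementation, which builds a guide tree and merges alignments. It only decides the order in which sequences would be added. The scoring values (match +1, mismatch -1, gap -2), the function names and the four short sequences are all assumptions made up for the example.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty (score only)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def progressive_order(seqs):
    """Return sequence names in the order they would join the alignment.

    seqs maps a name to its sequence. The best-scoring pair comes first,
    then repeatedly the remaining sequence that scores best against any
    sequence already included.
    """
    if len(seqs) < 2:
        return list(seqs)
    score = {
        frozenset((x, y)): global_score(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)
    }
    best_pair = max(score, key=score.get)
    order = sorted(best_pair)
    remaining = set(seqs) - best_pair
    while remaining:
        nxt = max(
            sorted(remaining),
            key=lambda r: max(score[frozenset((r, o))] for o in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    toy = {  # made-up sequences, not real myostatin data
        "s1": "ACGTACGTAC",
        "s2": "ACGTACGTAA",
        "s3": "ACGTTCGTTA",
        "s4": "TTGGCCAATT",
    }
    print(progressive_order(toy))
```

The design choice here is the greedy join order. It avoids the n to the power of k table entirely, at the price that an early bad pairing is never revisited.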
So if you find that something is conserved between humans, mouse, rat, sheep and all of these things, then you know that for, like, how do you call that? Not marsupials, but animals that have a womb. I forgot the word, but for all of these animals together. Mammals, that's it. So if you see that all mammals have a C, all mammals have a W, then you can more or less infer that this is very important for mammals. Not only can you see which residues are conserved, but you can also see which regions are conserved, right? So in the first four species that we looked at, we found that all these four species have QLPG, right? And all of the other species that we included in the alignment do not have this, right? So this is a conserved region, but it's not conserved throughout the whole evolutionary tree. It's only conserved throughout a part of the evolutionary tree. So if we are aligning like mammals and reptiles and marsupials and these kinds of things, and we find that all of the marsupials have this common pattern, but this pattern is not shared with the mammals and the reptiles, then we can infer that, okay, so this part of the protein might be very important in marsupials, but it's not that important in other species like mammals or reptiles. Not just that, but we can also look at patterns, right? We know that amino acids are sometimes very similar to each other, right? So here we see valine, leucine, and the other ones, right? And these are all hydrophobic residues, right? So it's not important which amino acid you have at this position. As long as you have a hydrophobic residue at this position, the protein is probably working, right?
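Reading conserved residues and hydrophobic positions off an alignment can be sketched like this in Python. The alignment below is a made-up toy, not the myostatin alignment from the slide. The hydrophobic set is one common grouping (classifications differ between textbooks), and the function name and the `*` / `h` / `.` symbols are choices made for this example.

```python
# One common grouping of hydrophobic amino acids; other sources also
# include residues such as C, Y, P or G.
HYDROPHOBIC = set("AVLIMFW")


def column_summary(alignment):
    """Summarise every column of a multiple alignment with one symbol.

    "*" = the same residue in every sequence (fully conserved)
    "h" = residues differ but all are hydrophobic
    "." = anything else, including columns with gaps
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    symbols = []
    for column in zip(*alignment):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            symbols.append("*")
        elif residues <= HYDROPHOBIC:
            symbols.append("h")
        else:
            symbols.append(".")
    return "".join(symbols)


if __name__ == "__main__":
    toy_alignment = [  # made-up rows for illustration only
        "MCVLW",
        "MCLLW",
        "MCIAW",
        "TCVMW",
    ]
    for row in toy_alignment:
        print(row)
    print(column_summary(toy_alignment))
```

For the toy rows the summary line is `.*hh*`. The C and the W are conserved everywhere, the two middle columns vary but stay hydrophobic, and the first column is not conserved.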
So, putting all of this together, using multiple sequence alignment we can identify regions of proteins which are conserved, we can identify patterns like this is always hydrophobic, this is always hydrophilic, and we can find these conserved regions, which tell us something about evolution and that this part of the protein might be very important in certain branches of the evolutionary tree. Good, so now it's three. I forgot the word mammals, so I'm going to take a break, just to kind of clear my mind, and after that we will be back with more multiple sequence alignment. So if you're watching this on YouTube, see you in the next episode.