So welcome back, everyone who's watching on Moodle. Again: the percentage of matches on DNA and on protein level. And be aware, something like this will definitely be on the exam. OK, Commando says 50% protein. And how much DNA? The same. Ooh, that's interesting. All right, so we have at least one answer. Let me do this on the board, then, since I have the board. So we have C-A-C and C-A-T, right? And then we have G-C-G and G-C-A. And we have T-C-C, and the other one is A-G-T. And then we have G-A-A and G-A-G. Is that readable? No, that's not readable for you guys. All right. Let me switch to full screen mode, then. I hope my overlay is active there as well. I will move myself to the middle a little bit. I think you should be able to read the board now, right? So very basically what we want to do is translate both of them to protein level, right? So C-A-C codes for a histidine. C-A-T also codes for a histidine. We have G-C-G, which is alanine, and G-C-A, which also codes for an alanine. Then we have T-C-C, so U-C-C, which codes for a serine. Then we have A-G-T, so A-G-U, which codes for a serine as well. And then G-A-A codes for Glu, and G-A-G also codes for Glu. So you can see that on the protein level there is a 100% match, right? So the mismatch is 0%. If we look at the DNA level, there is one mismatch here, one mismatch here, three mismatches here, right, so all these three don't match, and here there is one more mismatch. So we have 1, 2, 3, 4, 5, 6 differences out of 12 positions. So 6 out of 12. So the percentage similarity on DNA level is only 50%.
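The board exercise above can be checked with a few lines of Python. This is only a sketch for illustration, not something shown in the lecture: the names CODON_TABLE, translate and percent_identity are made up here, and the table is the standard genetic code written with T instead of U.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string codon by codon (frame 1, trailing bases ignored)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a, b):
    """Percentage of positions that are identical in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# The two sequences from the board.
s1 = "CACGCGTCCGAA"
s2 = "CATGCAAGTGAG"

print(translate(s1), translate(s2))                      # both give H, A, S, E
print(percent_identity(s1, s2))                          # DNA level
print(percent_identity(translate(s1), translate(s2)))    # protein level
```

Because the sequences are compared position by position without gaps, this only works for sequences of equal length, which is all the exam-style question needs.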
On protein level, by contrast, the similarity is actually 100%. Is that clear? Why, Commando, do you say S1 is His, Ala, Ser, Pro, Thr, Ala? OK, I get it. Good. Good, good, good. Yeah, so you just go through sequence 1, you write down the protein sequence, and you do the same thing for the second sequence. All right, so then everyone's happy. Everyone's there. So this wheel is a very common wheel, and you use it a lot when you translate from DNA to protein. Of course, there are things like stop codons in there as well, and you have the methionine here on the bottom, so this one here, which is actually the start codon. "It's the wheel of DNA fortune." Kind of, kind of. "But on the wheel there is no T." No, because the wheel actually works on mRNA level, and on mRNA level the T is a U. So if you say T-T-T, then it actually is U-U-U, which codes for a phenylalanine. All right, makes sense. Good. So this will definitely be on the exam, at least a couple of them, because they're easy questions to make. "But on genetic level, T is different from U." Yeah. Well, in DNA a U does not occur. A U is just what takes the place of T in RNA. So in RNA, the base that corresponds to T, you call that U, and it also pairs with the A. And then, of course, you have the modified U as well, which is written with the psi sign, Ψ, right? That one. That's actually a U, but with a chemical modification. All right, so good. If that's all clear, then we move on to the next one. So I told you guys that when you compare DNA sequences and you want to score an alignment, you have to take into account that some things are very common and some things are very uncommon. So here we see the four different bases, right? So we see A, G, T, and C. And we see here that transitions are very frequent. So an A very frequently mutates into a G, and a C very frequently mutates into a T.
And you can see that that is because these chemical structures are very similar, right? So just chemically speaking, when an A is being incorporated into the DNA, it is much more likely for the polymerase to make a mistake and put a guanine at this position, because biochemically they look very similar. The same thing holds for cytosine and thymine. These chemically look very similar as well. But it is very uncommon for a C to change to a G without some kind of external influence. And so transitions are usually caused by the polymerase taking the wrong base and inserting it at that position, while transversions are relatively rare and generally occur when you have real genetic mutations. And a single nucleotide polymorphism generally is a C/G SNP or a C/A SNP, but almost never do you see a C/T single nucleotide polymorphism, because these are just mistakes by the polymerase when putting the base in. So these things you have to take into account. So when you are scoring whether something is similar on DNA level, most algorithms nowadays take the substitution probability as a weighting factor. So instead of just scoring every mismatch as minus one, they score a transition at something like minus 0.25 and a transversion at minus one. I hope that's clear. And based on the chemical structure you can actually see very easily that for these two, the polymerase can just make a mistake, because biochemically they look very similar. So putting in an A or a G at a certain position doesn't matter too much. The same thing holds for amino acids: some amino acids are very similar and some amino acids are very, very different, right? So here we see the Taylor diagram of the amino acids, and we see more or less the distance between them.
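The transition and transversion weighting mentioned above can be sketched as follows. The penalties of minus 0.25 and minus one are the ones quoted in the lecture; the match score of plus one and the function names substitution_type and weighted_score are assumptions made for this example only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(a, b):
    """Classify a pair of bases as 'match', 'transition' or 'transversion'."""
    if a == b:
        return "match"
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


def weighted_score(s1, s2, match=1.0, transition=-0.25, transversion=-1.0):
    """Score an ungapped alignment of two equal-length DNA sequences.

    The match score of +1 is an assumption; the two penalties follow the lecture.
    """
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    weights = {"match": match, "transition": transition, "transversion": transversion}
    return sum(weights[substitution_type(a, b)] for a, b in zip(s1, s2))


s1 = "CACGCGTCCGAA"
s2 = "CATGCAAGTGAG"
print(weighted_score(s1, s2))                   # transitions penalised lightly
print(weighted_score(s1, s2, transition=-1.0))  # every mismatch costs the same
```

With the weighted scheme the same pair of sequences scores higher than with a flat mismatch penalty, because four of the six differences are transitions.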
So in this diagram the distance between the P and the G is not that big, but the distance between a P and an F is really big, right? Because they are more or less on opposite sides. The same thing holds for a Q going to a V, while a Q going to an E you want to penalize less, right? So the idea is that when you are comparing sequences, you have to take into account that some things are biochemically more similar than others, and if they are biochemically similar, you don't want to penalize as hard for them as when they are very, very different. So this leads to these scoring matrices. So here we see the PAM250 matrix, which is one of the scoring matrices you can use when you compare two amino acid sequences. So when you compare two amino acid sequences with each other, you see here these numbers, right? And these numbers are based on experimental evidence of more or less how often a mutation is seen from one amino acid to the other. And of course you want to give a very big positive score if the things are the same. And of course the question always comes up: why is a W to a W scored as 17, and why is a valine to a valine scored as four? That just has to do with the way the algorithm works. But you can see that there are positive scores and that there are negative scores. So a score of zero means that this is a very likely mutation. So going from a lysine to an E, which is glutamate, is very commonly observed, while the more negative the score, the less these things are observed in real protein alignments. And so what you can learn is, for example, that a cysteine is very unlikely to be replaced by a tryptophan, but it is relatively likely to be replaced by a serine, because they are very close to each other. If you look here, then you have the serine, which is the S, and then you have the cysteine, which is more or less here, and these two are relatively close together.
So the point accepted mutation matrix, the PAM matrix, takes into account how similar and how dissimilar certain amino acids are from each other. And when you align two things together, it will take this into account. And the positive scores on the diagonal, you shouldn't really care about them; they are just based on the fact that some things are more conserved than others. So the cysteine itself is relatively well conserved, and that also holds for the tryptophan, because they are amino acids which are very special. But don't worry about those. So another type of scoring matrix is the BLOSUM matrix, which is the blocks substitution matrix. So the amino acids in the table are grouped according to the chemistry of their side chains, which is different from the PAM matrix, because the PAM matrix is just built on point accepted mutations. So that is based more or less on the wheel that you see here, right? The glycine is coded by G-G, and then it doesn't matter what the third base is. While here, the Glu and the Asp have the same two initial letters, but the third letter is different, right? So Glu and Asp are relatively close to each other in the PAM matrix, but they can have very different biochemical functions, and that is why in the BLOSUM matrix there are different scores. And so each value in the matrix is calculated by dividing the frequency of occurrence of the amino acid pair in the BLOCKS database by the frequency you would expect by chance, and taking the logarithm of that ratio. So the BLOCKS database is one of these databases which is built from homologous sequences. And so here they just look at the data and see how often a certain amino acid is replaced by another amino acid in a homologous protein. And so a score of zero indicates that the frequency with which two given amino acids were found aligned in the database is what was expected by chance.
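How the sign of a BLOSUM-style score comes about can be illustrated with a small log-odds sketch. The frequencies below are invented purely for the example, and the function name log_odds_score and the scale factor of 2 (half-bit units, with rounding to whole numbers) are assumptions of this sketch rather than something stated in the lecture.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded, scaled log2 ratio of observed to chance-expected pair frequency."""
    return round(scale * math.log2(observed / expected))


# Hypothetical pair frequencies, chosen only to show the three cases.
print(log_odds_score(observed=0.0100, expected=0.0100))  # seen as often as chance predicts
print(log_odds_score(observed=0.0400, expected=0.0100))  # seen more often than chance
print(log_odds_score(observed=0.0025, expected=0.0100))  # seen less often than chance
```

The three calls correspond to a score of zero, a positive score and a negative score.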
A positive score means that the pair is found more often than by chance, so the two amino acids are very likely to be substituted for each other, and a negative score means that it's very unlikely for one amino acid to be substituted by the other. So BLOSUM matrices come with a number, right? BLOSUM r, and that means that you build the matrix from this BLOCKS database using sequences with no more than r% similarity. And so the BLOSUM62 matrix is built using sequences with no more than 62% similarity. So that's what the number stands for. And so you have different BLOSUM matrices, and these allow you to do alignments based on what you expect. So if you use the BLOSUM80 matrix, you're telling the algorithm: I expect these proteins to be closely related. If you use the BLOSUM45 matrix, you're saying: well, these proteins might be related, but if they are related, they are very distantly related, right? So that's why you only take sequences which are up to 45% similar. And so the number behind BLOSUM tells the algorithm more or less what your expectation of the alignment is. And BLOSUM62 is more or less mid-range and is the most frequently used. There's a fun fact about the BLOSUM62 matrix. So BLOSUM62 is the default matrix when you do a protein BLAST. It has been used for many years as the standard, but it is not exactly accurate according to the algorithm described by the people that came up with it. So they made a mistake in calculating the original BLOSUM62 matrix, but surprisingly the wrongly calculated matrix, the one that's not calculated according to the original algorithm described by Henikoff and Henikoff, is actually better. So it improves search performance. Why that is, no one knows exactly, but the original BLOSUM62, so the default matrix, is actually not entirely accurate.
So for some amino acids, the frequency at which they were observed to change did not match what was in the database, but the matrix still worked better in alignments. So the miscalculated BLOSUM62 is still the default because it improves search performance, and why that is, no one knows exactly. All right, so when we look at PAM matrices and BLOSUM matrices, the PAM matrix is based on global alignments of closely related proteins, while the BLOSUM matrix is based on local alignments, because you're finding the best matching substring. So the PAM1 matrix is the matrix calculated from comparisons of sequences with no more than 1% divergence, while for the BLOSUM matrix it is the other way around. So a BLOSUM1 matrix would be a matrix calculated from comparisons of sequences with no more than 1% identity. So a BLOSUM50 matrix is similar to a PAM50 matrix, but a BLOSUM75 matrix is similar to a PAM25 matrix, because they are defined in opposite ways. And the BLOSUM matrices are all based on observed alignments. They are not extrapolated from comparisons of closely related proteins, while the PAM matrices are extrapolated. So they made the PAM1 matrix, and then the PAM2, PAM3, PAM4 matrices are all more or less extrapolated from it. So they don't redo the calculation, while for the BLOSUM matrices the calculation is redone on the BLOCKS database every time. So for PAM, a higher number in the naming scheme denotes a larger evolutionary distance. And so if you are comparing a Bos taurus with a Bos indicus, so a cow from Europe with a cow from India, then you assume that these are very closely related. So would you use a higher-numbered or a lower-numbered PAM matrix? A lower one. So you would probably use PAM1.
But if you are comparing a cow with a dolphin, then the evolutionary distance between them is larger, and then you would definitely use a PAM matrix with a higher number, like a PAM20. And then for BLOSUM, larger numbers in the naming scheme denote higher sequence similarity and therefore smaller evolutionary distance, right? So if you are comparing two cow species, you would use a BLOSUM99 or you would use a PAM1. And so that's the way that they work. But they are very similar in what they do. Why exactly BLOSUM is the default I don't know, because PAM matrices work just as well. But it's just because it's not a solved issue, right? Because what is similar and what is different is not something that is easily answered. Especially on protein level, where amino acids can be very similar chemically speaking, or they can be very similar in how they are encoded in the DNA, you have two different ways of looking at similarity. All right, so finding the optimal alignment is actually a solved problem. So once you have your sequences, and you have decided how you are going to score differences, for example, I'm going to use BLOSUM62 for comparing proteins, then the best algorithm to perform the alignment is the Smith-Waterman algorithm, which was invented in 1981. It is a systematic construction of all optimal solutions. So you get all the best solutions, and it is actually mathematically proven that you cannot do a better alignment than with the Smith-Waterman algorithm. The Smith-Waterman algorithm works for global and local alignments, but it takes time proportional to the product of the sequence lengths. So L1 times L2, that is the number of operations that the Smith-Waterman algorithm needs to do the alignment.
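To make the cost concrete, here is a minimal sketch of the local-alignment version of Smith-Waterman. The scoring values (match plus one, mismatch minus one, gap minus one) and the example sequences are assumptions chosen for illustration; a real run would look the scores up in a substitution matrix such as BLOSUM62.

```python
def smith_waterman(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned part of s1, aligned part of s2) for a local alignment."""
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]   # score table, first row and column stay 0
    best, best_pos = 0, (0, 0)

    # Fill every cell once: len(s1) * len(s2) steps in total.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # diagonal: match or mismatch
                          H[i - 1][j] + gap,       # vertical: gap in s2
                          H[i][j - 1] + gap)       # horizontal: gap in s1
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    a1, a2 = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a1.append(s1[i - 1])
            a2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


# Hypothetical example: the shared stretch ACG is recovered.
print(smith_waterman("TTACGTT", "GGACGGG"))
```

The two nested loops visit every one of the L1 times L2 cells of the table, which is where the running time comes from.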
And this is of course very bad, because if you think about how you would do a local alignment nowadays, you would go to Ensembl and you would do a BLAST search. But a BLAST search is not using the Smith-Waterman algorithm, so it will not give you the optimal alignment. And the reason why it doesn't is the size of what you are comparing to: you have your own little piece of DNA sequence, for example, and you want to BLAST it against all known genomes in the database, right? If you would use the Smith-Waterman algorithm, this would involve billions and billions and billions of computer operations, multiplied by the length of your input sequence. If you think about the human genome, the human genome is like three billion letters. If you then take an input sequence which is 10 letters long, you would have to do 30 billion steps in the algorithm. And if your input sequence is 100 base pairs long, then you would do 100 times two to three billion steps. So it is the optimal algorithm, but it is computationally very, very intensive, because the time taken is proportional to the product of the sequence lengths. So L1 times L2. So the idea is very easily explained, because what they do is they create a dot plot and then they find the minimal path through this dot plot. So a dot plot is something which looks like this. So here we have, for example, two sequences. We have one sequence for human and one sequence for mouse on the x-axis. So you see that they start at zero. So in human, we have around 90,000 base pairs on the top. Here in mouse, we also have like 90,000 base pairs of mouse sequence. And then the idea is that the white areas are mismatches and the black areas are matches. And what you are trying to find is the minimal path.
So you want to walk from the (0, 0) point to the (90,000, 90,000) point. And you want to walk in such a way that you hit as many of these black dots as possible and as few white dots as possible. So in this case, you would look at this plot and you would say, well, you would walk this path. So up until here, the sequences are similar. And the mouse sequence here, which is 90,000 base pairs, is more or less the same sequence which is encoded in the human sequence up to around 75,000 base pairs. So the last 15,000 base pairs or so of the human sequence have no real homology to the last 15,000 base pairs here in the mouse sequence. So that is what a dot plot is. So how do you find a path through the dot plot? Well, here we have the word piranhas and the word parana, stupid word. But the first thing that we do is we mark all perfect or high-scoring matches between sequence S1 and S2. So the P matches the P, the A matches the A, and the other A matches the A. Of course, when you're doing this for DNA or for proteins, you would score them based on transitions and transversions or on a substitution matrix. So here we are creating a matrix just filled with zeros and ones, because we're looking at normal letters and there's no biological background. But if you would use a BLOSUM matrix, then you would say, well, a P to a P is a score of 15, a P to an I is a score of minus 2, a P to an R is a score of minus 6. So you would just use the entries of the BLOSUM matrix. And now, of course, what we do is we follow the Smith-Waterman algorithm to walk through this dot plot. So we start all the way here at the beginning. And any pairwise alignment can be represented as a path in the matrix. So for the optimal pairwise alignment, if there is a line going through your box, then you follow that line. When there is no line, then you pick where you want to go. So you can go this way, you can go that way, or you can go that way.
So this way would introduce a gap into parana. This way would introduce a gap into piranhas. And going like this would be a mismatch. Is that clear, that this is a deletion, or that here you insert a gap? So the horizontal and vertical steps introduce gaps, and the diagonal steps introduce matches or mismatches between the two sequences. All right, so I told you guys that the Smith-Waterman algorithm is very good and finds the optimal alignment, but it is computationally really, really expensive, and you don't want to do that. So nowadays we almost always use the BLAST tool, which stands for Basic Local Alignment Search Tool. So the BLAST tool only works for local alignment. So when you have, for example, a genome sequence, and you have a small sequence that you want to find in the genome sequence. It was invented by Altschul et al. and published in 1990. And it is the most popular bioinformatics program in the world, because alignments are more or less the key to bioinformatics, and BLAST is nowadays the most used tool. And it speeds up searches by orders of magnitude, so you can compare one sequence against all of the sequences in GenBank, and it will only take you a couple of minutes. If you would use the Smith-Waterman algorithm, it would take you almost 100 years for a single search to complete. And the reason why that is, is that BLAST uses a very smart pre-processing step. So the whole algorithm is based on something which is called k-merization, and only sequences that share k-mers are being compared. So do I have a slide about k-merization? No. So k-merization is a very interesting technique. Let's go back to the board. Why not? The thing is there anyway, and I don't have that many slides for today. So k-merization is when you have a sequence, right? So let's say A, T, T, A, A, T, right?
So now when you want to k-merize the sequence, what you're saying is: I'm going to create k-mers of, for example, length four. So I'm just going to say k equals four. So I have this sequence, and this is a sequence in my database. And I take the first four bases. So this is A, T, T, A, right? So what I now do is I just write down the k-mer, A, T, T, A, and I write down where this k-mer starts. So this k-mer starts at position one, right? Then I do the next k-mer, and the next k-mer is T, T, A, A, starting at position two. The next one is T, A, A, T, and this starts at position three. The next one is A, A, T, T, starting at position four. And now here we find the first repeated k-mer, because this is A, T, T, A, right? And we already had A, T, T, A. So we write down that A, T, T, A also starts at position five, right? So instead of storing the whole sequence, what I'm doing is breaking down the sequence into very small sequences and then recording where their start positions are. And of course you should do this with the whole sequence, so you get all possible blocks. So now imagine that I'm looking for a sequence which is, for example, C, A, T, T, A, G. So now I k-merize this sequence as well. So I have C, A, T, T, then I have A, T, T, A, and I have T, T, A, G. Now what it does, it just searches for these k-mers in the database. So it searches for this one in the database, and for this one, and for this one. And now the best alignment is the one which has the most k-mer matches to the database, with the positions taken into account, right?
So instead of having to compare a lot of different sequences and making a lot of comparisons, introducing gaps and these kinds of things, I now know just by looking at this sequence that the only position in the k-merized original sequence where it could bind is actually at position five, right? Because at position five there is an exact match between these four bases and these four bases, which means that this is a position where I should look and do the alignment, right? So I should try to align C, A, T, T, A, G, not starting at position five, because the match is with the second k-mer of my query. So I now know that I should look from position four to position nine. So I have to do an alignment from four to nine, and I don't have to do the alignment from one to six, from two to seven, and so on. So it skips a whole bunch of possibilities, and this k-merization is a very nice trick when you're working with very long sequences, very long pieces of text. And so the basic idea behind BLAST is that instead of testing all possible alignments, you first try to figure out where you should try the alignments, and that is based on exact k-mer matches. And of course the whole GenBank database is k-merized for different k-mer sizes. And so when you input a very long DNA sequence, it will use a large k-mer, and when you input a very short DNA sequence, like 20 base pairs, then it will use a very short k-mer. And of course a short k-mer will yield more matches, so you have to do more real alignments at those positions, but a long k-mer will quickly exclude large parts of the genome, so you never have to do the alignment there. And this is the smart pre-processing step which is done in BLAST.
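The k-mer lookup from the board can be sketched in a few lines. This is an illustration of the seeding idea only, not BLAST's actual implementation. The database sequence ATTAATTA is an assumption that reproduces the five k-mers written on the board, the query CATTAG is taken from its three k-mers, positions are counted from one as in the lecture, and all function names are made up.

```python
from collections import defaultdict


def kmers(seq, k):
    """List every k-mer of seq together with its 1-based start position."""
    return [(seq[i:i + k], i + 1) for i in range(len(seq) - k + 1)]


def build_index(seq, k):
    """Map each k-mer to the list of positions where it starts in seq."""
    index = defaultdict(list)
    for kmer, pos in kmers(seq, k):
        index[kmer].append(pos)
    return dict(index)


def seed_hits(query, index, k):
    """Return (k-mer, query position, database position, implied alignment start)."""
    hits = []
    for kmer, qpos in kmers(query, k):
        for dbpos in index.get(kmer, []):
            hits.append((kmer, qpos, dbpos, dbpos - qpos + 1))
    return hits


index = build_index("ATTAATTA", 4)   # assumed database sequence from the board
print(index)                         # ATTA is stored with two start positions
for hit in seed_hits("CATTAG", index, 4):
    print(hit)
```

In this toy database A, T, T, A occurs at position one as well as at position five, so two seeds come back. The seed at position five implies an alignment starting at position four, as on the board, while the seed at position one would place the query before the start of the sequence.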
So that is why BLAST is so popular and why it is so quick: in the first step, after k-merizing your input sequence, it just looks through the database to see if there is any k-mer that matches somewhere in the database, and if yes, then it only does the alignments at those positions and not at any of the positions where there's no k-mer match. And that is the winning trick which made Altschul one of the most cited bioinformaticians in history. Is that clear? K-merization, clear. Good, all right, let me continue. So when you look at BLAST, the BLAST tool is very smart, right? So you have different versions of BLAST. So if you have DNA against DNA, then you use BLASTN, for BLAST nucleotide. You can BLAST proteins against proteins, which is called BLASTP. Then you have DNA in six frames against proteins. So this means that my input is a DNA sequence, but I can search against a protein database. This is called BLASTX, for translation or something X, I don't know exactly why it is X. And then you can have an input sequence which is a protein sequence and search against a DNA database, and of course six frames here means the six reading frames, three on each strand, and this is called TBLASTN. So this is the translated BLAST nucleotide. Then you have DNA in six frames against DNA via six-frame translation, which is called TBLASTX. So this is very similar to DNA versus DNA, but it takes into account the fact that DNA comes in codons. And then you have MegaBLAST, and MegaBLAST nowadays does many, many different queries, but it's more like a smart algorithm that figures out, based on the input, which kind of underlying BLAST algorithm it wants to use. But these are the options. So just remember that if you want to search DNA versus DNA, you use BLASTN, unless you want to do a protein translation in the middle, because you're saying: I'm not interested in DNA sequences which are very similar.
I'm interested in DNA sequences which code for a similar protein, because then you are doing TBLASTX. All right, so when you do BLAST and you have alignments, they get a quality number. So instead of getting a percentage of matches or a gap opening penalty and all of these things, very basically they just give you a single quality number. So the quality number is called an E value in BLAST, and the smaller this number is, so the closer to zero, the better. So it works kind of like a P value, right? So a P value of one is not significant, but the smaller the P value, so the closer to zero it gets, the more significant it is and the more likely that this is not due to chance. And the E value is a representation of how many alignments of that quality or better are expected to be in a random database. And so it's kind of a P value with a permutation approach, but in BLAST they call it an E value. And so if you have an E value of 1 times 10 to the minus 6, that means that there's a chance of one in a million that you would find an alignment of similar or better quality in a random database. And of course, depending on how many sequences are in the database, this could mean that this is very significant or not significant at all. Besides the E value you always want to look at the percentage of identity. The percentage of identity tells you how many exact matches there are, right? So it ranges from 0% to 100%. And then you have the length of the alignment, because it's a local alignment, so of course it can chop off the edges of the search string that you are looking for. So it might be that only the first 50 base pairs of the search string are found with 100% identity, but the next 20 base pairs are not found at all. So that's the thing that you have to look at.
So the percentage of identity, the higher the better, and the length of the alignment, the longer the better. So when you are dealing with homology and you're using BLAST to figure out if two sequences are homologous, then we generally say that two sequences are deemed to be homologous when the E value is less than 1 times 10 to the minus 5 and there's a continuous stretch of 100 base pairs which is matched, right? So the length of the alignment is 100 base pairs. On protein level, a rule of thumb is that 40 amino acids should match. And the identity on DNA level needs to be 70%, and the identity on protein level is generally taken to be 25%. So if 25% of these 40 amino acids match, or 70% of these 100 base pairs match, then you generally assume that two sequences are homologous. "Man chicken." All right, thank you for that insightful comment. All right, so that was pairwise sequence alignment. So when we have two sequences, how do we do that? But of course nowadays we're not just dealing with a single sequence and looking for the most homologous sequence to it. Multiple sequence alignment is one of the most essential tools in molecular biology, because we want to find highly conserved subregions or embedded patterns in a set of biological sequences, right? And so conserved regions are usually key functional regions. So for example, if you think about something, all right, so that is going to be a ban. Where's my moderator? I'm gonna give a timeout first before we give a ban. All right, thank you for the deletion. Very good, and I hit the hammer. Yeah, yeah, yeah, yeah, very nice.
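The homology rule of thumb from a moment ago can be written down as a small check. The thresholds are the ones quoted in the lecture; the function name looks_homologous and its parameters are hypothetical, and real BLAST output would have to be parsed into these three numbers first.

```python
def looks_homologous(evalue, aln_length, pct_identity, molecule="dna"):
    """Apply the lecture's rule of thumb to one BLAST hit.

    DNA:     E value < 1e-5, at least 100 aligned base pairs, at least 70% identity.
    Protein: E value < 1e-5, at least 40 aligned amino acids, at least 25% identity.
    """
    if molecule == "dna":
        min_length, min_identity = 100, 70.0
    elif molecule == "protein":
        min_length, min_identity = 40, 25.0
    else:
        raise ValueError("molecule must be 'dna' or 'protein'")
    return (evalue < 1e-5
            and aln_length >= min_length
            and pct_identity >= min_identity)


# Hypothetical hits, for illustration only.
print(looks_homologous(1e-20, 150, 82.0))                    # long, similar DNA hit
print(looks_homologous(1e-20, 50, 100.0))                    # perfect but too short
print(looks_homologous(1e-8, 60, 31.0, molecule="protein"))  # protein-level hit
```

This is a filter on summary numbers only, so it says nothing about where in the query the aligned stretch lies.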
All right, so if you think about the coronavirus, which everyone's thinking about nowadays, then of course, if we want to develop drugs for the coronavirus, if we look at all of these different new mutants that are occurring, and then the regions which are not mutating in the coronavirus, those are the regions which are important, right? Because mutations there either make the virus completely not active or it would change the target of the virus. So the regions that stay the same during evolution are usually the regions which are of interest or are the regions which are, for example, the active site of a molecule, right? So if you think about a big protein and like hundreds or 200 amino acids, and then there is part of this protein which does the chemical transformation of substance A to substance B when it's an enzyme, right? And this part of the protein can of course not change because as soon as it changes, the protein is not able to do its core function anymore and if you can't do your core function anymore, especially if you're an essential protein or an essential enzyme, then of course, something goes wrong and it leads to lethality early on in life or even earlier. So one of the other reasons why we want to use multiple sequence alignment instead of pairwise sequence alignment is that we want to estimate the evolutionary distance between sequences, right? Hey, imagine that I have the myostatin gene and I have the myostatin gene for six, 700 different animals. Then using multiple sequence alignment, I can figure out how these things are related to each other and how closely or how distantly things are related. One of the other things where we use multiple sequence alignment for a lot is to predict protein secondary and tertiary structure. 
We already talked about this: if you have a protein which has, for example, an alpha helix, and you know that this is an alpha helix in this protein, then if you find a region in another protein which is highly homologous, you can more or less assume that this will also fold into an alpha helix. Now, multiple sequence alignment is one of these things which is very, very difficult for computers. Computers have only been able to do multiple sequence alignment since 1987. Before that, people used to do it by hand. So if you go to Google and you Google things like "protein alignment by hand", then you will find these images where you see scientists with pieces of paper which have single letters on them, which are then kind of stapled together, which they put on the floor, and then they put the next one underneath. It's really, really, really funny to see how people used to do this, but it used to be just little pieces of paper taped together with pieces of string, and then you would move one. So you would just sit there, you would move it, and then you'd say, okay, so these match better, especially on a protein level, and then you would move it and all of the others would move with it. So you would physically do the alignment more or less on the ground. And especially for longer proteins, this takes hours and even days to kind of figure out what the best alignment is. So before 1987, you would do it by hand. Nowadays, we can actually do it by computer, and that is because of a technique called dynamic programming. Before that, dynamic programming was very, very expensive for computers to do, but I don't want to go into a lot of detail about what dynamic programming actually is. Just remember that multiple sequence alignment by computer has only been possible since the late 1980s. And so the idea behind multiple sequence alignment is that you have to perform successive pairwise alignments.
So you take sequence one, pairwise align it to sequence two, pairwise align it to sequence three, pairwise align it to sequence four. Take sequence two, pairwise align it to sequence three, pairwise align it to sequence four. Take sequence three, pairwise align it to sequence four, right? So you do all possible pairwise alignments. And then from this, you want to build a consensus sequence, and then you use this consensus sequence to align the other sequences that you have, right? So there are some crucial parameters when you're dealing with multiple sequence alignment. Things like: which scoring matrix am I using? That has a massive influence. Hey, if you go from using a BLOSUM62 matrix to using a BLOSUM80 matrix, your multiple sequence alignments will be completely different, because the pairwise alignments will be scored differently. The other two parameters which have a massive influence on multiple sequence alignment are your gap penalties. So how much do I penalize opening a new gap? And how much do I penalize extending a gap? And so these three parameters can completely change the outcome of a multiple sequence alignment. So in the end, it's more or less the same. If you're aligning more than two sequences, you can follow the same strategy as aligning two sequences. But instead of having a two-dimensional matrix, like the dot plot, right, you now have a three-dimensional matrix when you're aligning three. When you're aligning four, you have a four-dimensional matrix. When you're aligning five, you have a five-dimensional matrix, right? And so you use an n-dimensional matrix, with each axis representing a sequence to align. So if you have three sequences, you get a cube. If you have four sequences, then you get like a hypercube, like a four-dimensional cube. And the same thing holds here. If you use the Smith-Waterman algorithm, you do the same thing. You go from the source, which is (0, 0, 0), the start of all of the sequences, to the sink, which is (n, n, n), the end of the sequence in all of them.
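A small Python sketch of two ideas from this part: listing all possible pairwise comparisons, and how the gap opening and gap extension penalties change the score of one and the same alignment. The function names, the toy sequences and the penalty values are assumptions made up for illustration, and the simple match/mismatch score stands in for a real scoring matrix such as BLOSUM62.

```python
from itertools import combinations


def all_pairs(names):
    """Every sequence against every other one: 1-2, 1-3, 1-4, 2-3, ..."""
    return list(combinations(names, 2))


def score_alignment(row_a, row_b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two already-aligned rows of equal length; '-' marks a gap.

    Simplification: the first column of a gap run costs gap_open, every
    following gap column costs gap_extend, regardless of which row has the gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


print(all_pairs(["s1", "s2", "s3", "s4"]))       # 6 pairs for 4 sequences
print(score_alignment("ACGTTTAC", "ACG---AC"))   # default penalties
print(score_alignment("ACGTTTAC", "ACG---AC", gap_open=-2, gap_extend=-2))
```

With the default penalties the single three-column gap costs 5 + 1 + 1; with the second setting it costs 2 + 2 + 2. An aligner choosing between candidate alignments can therefore end up preferring a different one when only these two numbers change.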
So the idea is to find the best path going from the source through the matrix in such a way that you end up at the sink, and you want to hit as many positive points as possible and as few negative points as possible. Now think about this: the original Smith-Waterman algorithm in two dimensions already took the length of sequence one multiplied by the length of sequence two. If you think about doing it in three dimensions, then of course it is n to the power of three, right? So if all sequences have a similar length, then the computational time is O(n^3). And if you have k sequences, you build a k-dimensional matrix, and then you have a runtime of (2^k - 1) times n^k, which is just insane. These are exponential algorithms, and it is completely unworkable to do a Smith-Waterman algorithm on more than three sequences in a reasonable amount of time. Just thinking about doing a multiple alignment of like a hundred amino acids in three different proteins is going to take like months of runtime on a computer just using the basic Smith-Waterman algorithm. And so the dynamic programming approach for alignment between two sequences is easily extended to k sequences, but it's impractical due to the exponential runtime. So hey, you don't want to do exponential runtime. Of course, we can do multiple sequence alignment nowadays. And it is very useful for more distant alignments, because it simultaneously shows you which positions of the different proteins are conserved. In 1988, a progressive algorithm was designed, Clustal, which we know today as Clustal W. And Clustal W still exists today and is still very much used today. Nowadays it's rebranded and called Clustal Omega, but people always talk about Clustal W because it's essentially the same algorithm.
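To see where these numbers come from, here is a short Python sketch. It reads "two to the power of k minus one" as 2^k - 1, which is the number of neighbouring cells you can arrive from in a k-dimensional matrix (each sequence either advances one position or takes a gap, and at least one sequence has to advance). The function names are made up for illustration.

```python
from itertools import product


def predecessor_moves(k):
    """All ways to step into a cell: each of the k sequences advances (1)
    or takes a gap (0), excluding the all-gap move."""
    return [move for move in product((0, 1), repeat=k) if any(move)]


def dp_cells(n, k):
    """Number of cells in a k-dimensional matrix for sequences of length n."""
    return n ** k


def dp_work(n, k):
    """Cell updates for the full matrix: (2^k - 1) * n^k."""
    return len(predecessor_moves(k)) * dp_cells(n, k)


print(predecessor_moves(2))   # the three moves of the pairwise case
for k in (2, 3, 8):
    print(k, dp_work(100, k))
```

For two sequences the three moves are the familiar ones from the pairwise matrix: a gap in the first sequence, a gap in the second, or the diagonal step.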
And so what it does: it first identifies the best high-scoring pairs, and then it identifies the closest pair of pairs, and so on, until all sequences are built in. And I will have an example of that. So it's a multiple alignment scheme where corresponding amino acids are shown in one column, and I have an example of that as well. And so more or less, how does this start? If you look here, then you have the alignment of myostatin for 10 different species. So humans and pigs and mouse and rat and sheep. And so here you then see the multiple sequence alignment, right? And you see that in many species there's a gap; this one species here actually has this YG insertion which none of the other species have. Clustal W can be found at this link. Nowadays it will link to Clustal Omega, I think. So how do you read this? How do you do this? What is multiple sequence alignment? Well, imagine that we have k sequences. So one, two, three, four, five, six, seven, eight. So here k is eight, right? So in theory, if you would do the full box, then you would create an eight-dimensional cube. But since we don't want to create an eight-dimensional cube, we want to do the Clustal W kind of algorithm. So the multiple sequence alignment of these looks like this, and what we now see is that we have things that are called conserved residues, because all sequences have a C or a W at these two positions. We also see that there's a conserved region here which is conserved in half of the species. And what we also see is that there are more or less patterns that we can observe based on how similar amino acids are. So here we see valine, leucine, alanine, and valine again. And these are all amino acids which are hydrophobic. And the same thing is seen at position three, where we see isoleucine, leucine, valine, and leucine again. And so these are all hydrophobic residues. So this might mean that there is a region here which does something in, like, the first four species.
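Reading an alignment column by column like this can be sketched in a few lines of Python. The toy alignment below is invented and is not the one on the slide, and the set of hydrophobic amino acids is one common, simplified choice; both are assumptions for illustration.

```python
# One common, simplified set of hydrophobic amino acids (one-letter codes).
HYDROPHOBIC = set("AVLIMFW")


def classify_columns(rows):
    """Label each alignment column as conserved, hydrophobic or variable."""
    labels = []
    for column in zip(*rows):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            labels.append("conserved")       # same residue in every sequence
        elif residues <= HYDROPHOBIC:
            labels.append("hydrophobic")     # different residues, similar chemistry
        else:
            labels.append("variable")
    return labels


# Invented toy alignment: four sequences, five columns, '-' is a gap.
toy_alignment = [
    "VICKW",
    "LLCRW",
    "AVC-W",
    "VLCEW",
]
print(classify_columns(toy_alignment))
```

The first two columns hold only valine, leucine, alanine and isoleucine, so they come out as hydrophobic, while the C and W columns are identical in all four sequences and come out as conserved.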
And there are residues here which are conserved, which are very important for the functioning of the protein. And then of course we have these two hydrophobic residues at the beginning of these sequences, which might be due to the fact that this is the part which sits in the cell membrane and is not exposed to the water. And so in the end, you get like an overview of what is conserved and what is different and where the different regions are. All right, that is going to be then. Good. How long have I been talking? 48 minutes. Yeah, so it starts here. So Clustal W is one of the most popular multiple alignment tools today, or today it is actually Clustal Omega for doing everything. But the W here stands for weighted. So different parts of the alignment are weighted differently. It's a three-step process where you first construct pairwise alignments between all sequences. Then you build a guide tree using a neighbor-joining method. And then you have a progressive alignment which is guided by the tree. So the sequences are aligned progressively according to the branching order of the guide tree. So I think after the break we will go through one of these examples. I made a very small example, I think, with a very, yeah, so very, very short sequences. So sequences which are length four. So that's good. All right, so then I will stop here. So I will stop the read.
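The three steps can be sketched in Python on sequences of length four. This is not the Clustal W implementation: the pairwise step just counts mismatches between equal-length sequences instead of doing a real alignment, and the guide tree is built with a simpler average-linkage clustering swapped in for neighbor joining. The sequences and all names are made up for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Step 1 stand-in: fraction of positions that differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Step 2 stand-in: repeatedly merge the two closest clusters
    (average linkage, NOT neighbor joining).

    Returns the tree as nested tuples plus the list of merges in order.
    """
    dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }

    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    # Map: tuple of member names -> subtree built so far.
    clusters = {(name,): name for name in seqs}
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(*pair),
        )
        subtree = (clusters.pop(c1), clusters.pop(c2))
        clusters[c1 + c2] = subtree
        merges.append(subtree)
    (tree,) = clusters.values()
    return tree, merges


toy = {"s1": "ACGT", "s2": "ACGA", "s3": "TCCA", "s4": "TGCC"}
tree, merges = guide_tree(toy)
print(tree)
# Step 3: the progressive alignment would follow the merges in this order.
for step, subtree in enumerate(merges, start=1):
    print(step, subtree)
```

The merge order is what drives the third step: the closest pair is aligned first, then the next closest pair, and finally the two resulting alignments are aligned to each other. The actual profile-to-profile alignment is left out of the sketch.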