So in principle we could use dot plots to characterize proteins too. This is a real one, of the hemoglobin alpha versus hemoglobin beta chains. Here you see some very long lines along the diagonal, and then lots of smaller lines that are more or less random matches. The problem is that it's hard to quantify this because it's binary. It's all or nothing: either we have a match and then it's a dot, or it's close but not a match and then it's nothing. So nobody uses this for proteins anymore, only for DNA.
Let's look at some real sequences. I think this is myoglobin. HU is for human, DG is dog, then we have whale, which is sperm whale, PN is pony I think, and so on. Do you see the patterns here? There are lots of positions where we have exactly the same residue in all these species, and then there are a handful of positions where there are slight differences. You might even have some insertions here; I'm not sure whether there is one. We could try to calculate some sort of average, or at least a consensus. But if I demand that I only assign a consensus residue when it's exactly the same residue in every species, then as the number of species grows (remember, there could be hundreds of thousands), at some point I will never have any position where every single species has the same residue. So I'm going to need to make do with saying that if, say, 90% of them are alanine, I call it an alanine and allow a little bit of change. That, in particular, is why dot plots are not very good. Second, not all changes are equal, right? If I change something that's small and hydrophobic for something else that's small and hydrophobic, you'll probably agree that that's not likely to change the protein structure a lot. But if we change something small and hydrophobic to something large and charged, that's going to be a very large change.
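To make the 90% idea concrete, here is a minimal sketch of a threshold-based consensus. The function name `consensus`, the placeholder character `x` for positions without a clear majority, and the toy sequences are all assumptions made up for illustration; they are not real myoglobin data and not the method used in the lecture slides.

```python
from collections import Counter

def consensus(aligned, threshold=0.9):
    """Return a consensus string for equal-length, already aligned sequences.

    A column gets its most common residue if that residue makes up at least
    `threshold` of the column; otherwise it gets 'x' (a placeholder chosen
    for this sketch).
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    out = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        out.append(residue if count / len(column) >= threshold else "x")
    return "".join(out)

# Toy data, invented for illustration: nine identical sequences and one
# that differs at a single position (D replaced by E).
toy = ["GLSDGEWQLV"] * 9 + ["GLSEGEWQLV"]

print(consensus(toy))       # GLSDGEWQLV  (9 of 10 is exactly 90% D)
print(consensus(toy, 1.0))  # GLSxGEWQLV  (demanding 100% identity fails there)
```

With a strict 100% requirement a single deviating species is enough to lose the position, which is the problem described above once the number of species gets large.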
So one way or another we're going to need to find a way to characterize and calculate these changes and see how important they are, and ideally be able to go from these sequences and predict what the structure is, because remember, all of these sequences have the same structure. How do we do that? Well, if I just show a small fake pair of sequences here, do you see the similarities? There are some things that are similar, but it's not entirely easy to see. If I insert a little bit of a gap in one of them, now I get C, K, F lining up. So first I have the C there; that's a match. There are more matches, but I won't show all of them. Then I have this gap or insertion: a gap in one sequence is an insertion in the other sequence, and that can happen. And then I have things that are mismatches. In theory, instead of a mismatch, I could first introduce a gap in one sequence and then a gap in the other sequence, but it's unlikely that evolution worked that way, that it kept inserting gaps. It was probably rather that an alanine was changed for a glycine; both of them are fairly small and similar residues.
So rather than saying that this looks good because I did it manually, I would like to find a way to let the computer do this and assign a score to it. We're going to need a way to calculate sequence similarity, and in particular to score it. How do we do that? Well, if I take that example alignment of myoglobin with the different sequences, we can calculate that. It's not particularly difficult. The exact way to do this goes beyond the class again, but just to show you that this is not something magic: I can calculate what the probability is of matching, say, an alanine to a glycine.
Let's say that's 0.1. Then I can ask what the probability is of matching an alanine to an alanine. That's probably much larger; or rather, the first one should be 0.01 and this one should probably be 0.9. And then what is the probability of matching an alanine with a tryptophan? That should probably be pretty low, and so on. If I do that, the probability of matching everything should be the probability of the match in position 1, multiplied by the probability of the match in position 2, multiplied by the probability of the match in position 3, and so on, all the way up. The problem is, first, that we need to multiply a lot. Multiplications can be a bit expensive on computers; that's not so bad anymore, but in particular when these numbers are small we can actually end up with underflow. Once we get to 10 to the minus 30 or so, the computer can no longer represent the number. So in practice what we typically do is introduce a score, where s is the logarithm of the probability. And these probabilities literally just come from counting: how common is it that an alanine is replaced by a glycine versus not? In that case I get s_tot = s1 + s2 + s3 + s4 and so on. So I literally get score numbers that I can add up, and this is the entire basis of bioinformatics. Just know that we could actually derive this from probabilities; it's nothing magic.
The idea with those scores is that I can look at small fragments of existing sequences that I know match, and then use those to derive scores: how common is it that I replace alanine with tryptophan? That leads to so-called substitution matrices, or scoring matrices, which are literally matrices with all the amino acids on the rows and columns, and then scores. There are some extra letters here, and this is just to tell you what they are: in some cases it can be difficult to tell asparagine from aspartic acid.
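The step from multiplying probabilities to adding log scores can be sketched in a few lines. The probabilities in `p_match` reuse the illustrative numbers from the lecture (0.9 and 0.01) and are not measured values; the function names are hypothetical. Note also that Python floats are double precision, so underflow only sets in far below the "10 to the minus 30 or so" mentioned above (that figure is closer to single precision); the sketch therefore uses a long run of unlikely matches to trigger it.

```python
import math

# Illustrative match probabilities (assumed numbers, echoing the lecture).
p_match = {("A", "A"): 0.9, ("A", "G"): 0.01}

def product_of_probs(pairs):
    """Probability of the whole alignment: multiply the per-position values."""
    return math.prod(p_match[p] for p in pairs)

def total_score(pairs):
    """Same information as a sum of scores, with s = log(p) per position."""
    return sum(math.log(p_match[p]) for p in pairs)

# For a short alignment, the two views agree: exp(s_tot) equals the product.
short = [("A", "A"), ("A", "G")]
print(product_of_probs(short), math.exp(total_score(short)))

# For a long run of unlikely matches the product underflows to zero,
# while the summed score stays a perfectly ordinary number.
long_run = [("A", "G")] * 200
print(product_of_probs(long_run))  # 0.0 (the true value is 1e-400)
print(total_score(long_run))       # roughly -921.03
```

Once the product has collapsed to 0.0, two very different alignments can no longer be ranked against each other, whereas their summed scores still can.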
What this gives us is a way to introduce more biological information, right? Rather than just looking at residues and guessing whether alanine is more stable than leucine here, this is derived from how frequently nature in practice replaces alanine with leucine. So in a way I'm relying more on evolution. Or am I? Yes, I am, but evolution also encodes the physics, right? The reason why alanine is reasonably easy to replace with, say, glycine is the physics we talked about in the first few lectures of this class, and the reason why it's less likely to be replaced with, say, tryptophan is of course also the physics. But here, instead of looking at the physics, I cheat and let nature do the physics for me, and I just look at the outcome. Apparently, in practice nature tends not to replace tryptophan with other residues so much. Who am I to question nature's judgment? If nature does that, I will just trust nature, and I will use those rules when I try to identify similar proteins. So the ultimate information is the same; I just choose a different way to approach it. That's going to be a theme we come back to: virtually every single significant advance in predicting structures and understanding things in bioinformatics has come from introducing more biology, not necessarily fancier algorithms.
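As a rough illustration of "letting nature do the physics", the sketch below counts which residues end up in the same column of a trusted alignment and turns the observed pair frequencies into log scores. Everything here is an assumption for illustration: the function names, the four toy sequences, and the choice to take the logarithm of the raw pair frequency. Real substitution matrices such as BLOSUM use a log-odds ratio against background residue frequencies and far more data, which this sketch leaves out.

```python
import math
from collections import Counter
from itertools import combinations

def pair_counts(aligned):
    """Count unordered residue pairs seen in each column of an alignment."""
    counts = Counter()
    for column in zip(*aligned):
        for a, b in combinations(column, 2):
            if "-" in (a, b):  # skip positions where either sequence has a gap
                continue
            counts[tuple(sorted((a, b)))] += 1
    return counts

def log_scores(counts):
    """Score each observed pair as the log of its observed frequency."""
    total = sum(counts.values())
    return {pair: math.log(n / total) for pair, n in counts.items()}

# Toy alignment, invented for illustration: A and G swap now and then,
# while W is never replaced by anything.
toy = ["AAGW", "AAGW", "AGGW", "AAAW"]

counts = pair_counts(toy)
scores = log_scores(counts)
print(counts[("A", "A")], counts[("A", "G")])      # 9 6
print(("A", "W") in counts)                        # False: never observed
print(scores[("A", "A")] > scores[("A", "G")])     # True
```

A pair that is never observed, like alanine against tryptophan in this toy data, gets no score at all here; handling such pairs sensibly is one reason real matrices need large data sets.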