So I'm gradually going to take us back to protein structure, because in this class in particular we want to use bioinformatics to better understand protein structure. One very important measure that you've already used in the lab, but that we're going to talk more about today, is RMSD, the root mean square deviation (or displacement), which I define like this. I'm going to need a large square root here. I take the sum over all N atoms i in a particular protein, and for each atom I calculate (x_i - x_i,ref) squared plus (y_i - y_i,ref) squared plus (z_i - z_i,ref) squared. Then I divide that sum by the number of atoms N (which is the number of residues if we only use one atom per residue, such as the C-alpha) and take the square root. The reference here is my target, the known X-ray structure, while x_i is my prediction, or where I am in a simulation.
The whole idea is that this measures roughly the average error between two structures. If this error is one angstrom, that means that the average error of any atom corresponds to roughly the length of the bond between a carbon and a hydrogen. That's nothing; it's an amazing structure. If it's two to three angstroms, it's a pretty good structure, quite usable. At three, pharmaceutical companies will start complaining. At five, all bets are not quite off, but it's bad, and at six or seven you don't have any similarity at all. So when we talk about models, we're going to relate them to RMSD to say how good they are.
There are a few other ways to score models that I won't define, but there's one called GDT, the global distance test. It goes from zero to a hundred, and I would say that at zero we don't have any similarity at all, 70 to 80 is really good, and 90 is what you would get for an experimental structure. So if two structures have a GDT of 90, they agree about as well as two experimental structures do; that's as good as it gets. There is a reason for mentioning that now. I will come back to it later; until then you can forget about it.
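
The RMSD definition above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's actual tool: the function name `rmsd` and the toy coordinates are made up for the example, and it assumes the two structures have the same atoms in the same order and have already been superimposed (no fitting step is performed here).

```python
import math


def rmsd(coords, ref_coords):
    """RMSD between two equally long lists of (x, y, z) atom positions.

    Assumes atoms are listed in the same order in both structures and that
    the structures are already superimposed; no fitting is done here.
    """
    if len(coords) != len(ref_coords) or not coords:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x, y, z), (xr, yr, zr) in zip(coords, ref_coords):
        # Squared distance between an atom and its reference position
        total += (x - xr) ** 2 + (y - yr) ** 2 + (z - zr) ** 2
    # Average over the number of atoms, then take the square root
    return math.sqrt(total / len(coords))


# Hypothetical toy data: every atom is shifted by exactly 1 angstrom along z
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
print(rmsd(model, reference))  # 1.0
```

In the toy data every atom is off by one angstrom, so the RMSD is one angstrom, which on the scale above would count as an excellent model.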
The RMSD, it turns out, is intimately related to sequence similarity, not one-to-one but on average. So if we plot the expected RMSD against the sequence conservation, well, of course, if I'm at 100% sequence identity the RMSD should be zero, right? But already when I have, say, 75% of the residues in the core of the protein identical, I should expect the RMSD to be on the order of 0.5 angstrom. So if only 25% of the residues differ, it's essentially the same structure. The RMSD will of course go up as the identity drops, but even if only 25% of my residues match, I actually expect on average to be within 1.5 angstrom RMSD.
The take-home message of this plot is twofold. First, you don't need as much identity as you think. Three out of four residues can be different and the proteins will still have the same structure on average, with very few exceptions. Second, it's a sliding scale: 80% identity is of course better than 20%, but you can be fairly certain about the structure fairly early on that scale.
This is probably worth writing down. We occasionally talk about this as a sort of scale of percent sequence identity. Now of course there is a subtlety here, right? Identity just counts whether things are really identical, alanine to alanine. There is a difference between replacing an alanine with a tryptophan and replacing an alanine with a glycine, but for now this is an average, so we don't care. So we start from zero, and the first zone goes up to, say, 20, then the second zone might go up to, say, 40, and the last one goes all the way up to 100. Down here we are sad. This is the so-called midnight zone. Below 20% sequence identity I can't really say anything; I wouldn't trust it. That doesn't mean that the sequences are not homologous, it's just that I can't say for certain.
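
To make "percent sequence identity" concrete, here is a small sketch that counts identical residues in two already aligned sequences. The function name `percent_identity` and the example sequences are hypothetical. Dividing by the number of aligned, non-gap columns is one convention among several, so treat that choice as an assumption.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) columns where both residues are identical.

    Assumes the two strings come from the same pairwise alignment, so they
    have equal length. Columns with a gap in either sequence are skipped;
    other denominator conventions exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gap columns are not counted
        compared += 1
        if a == b:
            identical += 1  # pure identity: alanine to alanine, and so on
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * identical / compared


# Hypothetical aligned sequences: 9 non-gap columns, 7 of them identical
seq_a = "ACDEFGHIKL"
seq_b = "ACDQFGH-KV"
print(round(percent_identity(seq_a, seq_b), 1))  # 77.8
```

As in the lecture, this only counts exact matches, so an alanine-to-glycine swap is penalised just as much as an alanine-to-tryptophan swap.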
From roughly 40% sequence identity we are very happy. This is the so-called safe zone. If you have 40% sequence identity, those sequences will be homologous. I would almost say that I'm going to eat my left shoe if that's not true. Remember that I showed you in a previous lecture that it might be possible for us to design something up here, but again, that's an exception. If you have 40% identity between two sequences, they're going to have the same fold.
That leaves this interesting area in between. This is the so-called twilight zone, or it used to be. In the twilight zone we might be able to determine structures of proteins if we're lucky. I would actually argue that these limits have moved down and that the zone has narrowed a bit, so that the twilight zone is probably from 20 to 30 percent. From 30 percent I would consider it quite safe, and I would probably trust homology already at 30 percent, mostly because we've had better and better algorithms to detect sequence similarity and evolutionary patterns. But we'll come back to that.
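
The zones described above can be written down as a simple rule of thumb. This is only a sketch: the function name `identity_zone` is invented for the example, the handling of the exact boundary values (20 counts as twilight, the upper limit counts as safe) is an assumption, and the default upper limit of 30 reflects the narrowed twilight zone argued for here, with the classical 40 available as an option.

```python
def identity_zone(percent_identity, twilight_upper=30.0):
    """Classify a percent sequence identity into midnight, twilight or safe.

    twilight_upper=30.0 follows the narrowed twilight zone from the lecture;
    pass 40.0 for the classical limit. Boundary handling is an assumption.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity < 20.0:
        return "midnight"  # cannot say anything for certain
    if percent_identity < twilight_upper:
        return "twilight"  # might work if we're lucky
    return "safe"  # expect the same fold


for value in (15.0, 25.0, 35.0):
    print(value, identity_zone(value), identity_zone(value, twilight_upper=40.0))
```

Note that landing in the midnight zone does not say the sequences are unrelated, only that identity alone is not enough evidence.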
The short story here: if somebody shows you two sequences and they have 30 percent sequence identity or better, you should believe that they have the same fold. Once they only have 20 percent, you're looking at noise; don't trust it. This twilight zone is fairly narrow, but this is of course where a lot of the fascinating research happens. That's how we moved down from 40 to 30 here.
So how similar is similar? If a couple of proteins share 30 percent sequence identity, what will they look like? I have an example of exactly that for you: three proteins here that share just under 30 percent of their residues. As you will see, they're not exactly identical. There are some minor differences; in particular, that helix is a bit turned. The fold is the same, you have the same number of secondary structure elements, but there are some minor changes relative to this one. This is quite typical, but remember, this is 30 percent. It might even be a bit of a large change for 30 percent. If these are so similar at 30 percent, you can guess what happens if two things are 80 percent identical: you would not be able to tell them apart.
This will now enable us to use bioinformatics, in particular homology and sequence similarity, as a way to detect in practice the things we've talked about in physics. Because remember, all my discussions about folds, similarity and evolution: it's completely true what I told you in the physics lecture, but the only way for us to know it then was if we actually had those entire folds, right? What bioinformatics now enables me to do is get those directly from sequence, if I only have 30 percent identity, and I have 207 million really good sequences in the UniProt database today