As amazing as homology modeling is, there is one challenge. In science and research in particular, we tend to focus most of our attention on the white spots on the map, because that's where we want to explore. And on the white spots on the map, homology modeling usually doesn't work, because there are no known homologues. So what do we do then? Well, then we have to resort to various forms of ab initio modeling. There are two methods that stand out from the crowd here and perform significantly better than anything else. The first one is a very long-running effort by David Baker's lab, and it's a program called Rosetta, obviously named after the Rosetta Stone. Rosetta can do many things, homology modeling among them, but the main power of Rosetta, I would say, is fragment-based modeling.
So if I have a brand new sequence, the idea is that I can take my sequence, whose structure I don't know, and test how well it fits known folds. So maybe I test fold A, then fold B, then fold C, and ask: if I force this sequence into fold A, how good is the score? If I force it into B, how good is it? If I force it into C, how good would it be? That concept is called fold recognition, and for a while we did that with entire proteins. What Rosetta and some other programs do, though, is that they do this with fragments instead. You start by kind of stealing things from the Protein Data Bank. I know, it's horrible. I take fragments that are 3, 5, 7 or 9 residues long from the Protein Data Bank, and I start by picking the most common ones. Instead of forcing the sequence onto entire structures, which didn't quite work, I try to assemble small fragments. So if I see that this part is alanine, alanine, glycine, tryptophan, then I check all the places in the Protein Data Bank where alanine, alanine, glycine, tryptophan occurs, and then I say, aha, they occur.
There are 14 different structural segments here that they tend to occur in. Sure, technically it could be something else, but in all likelihood it's going to be one of those fragments. So then I pick fragment A for the first part here, and similarly fragment B for the second part, and then maybe fragment C. The problem, of course, is that there is not going to be just one fragment for each part. There will be fragment B, B prime, and so on. There will be many, many fragments I have to combine, so this ends up being a fairly large search problem in the end: first testing the fragments, and then combining them in different ways, twisting the torsion angles a bit. So this is not quite a simulation, but it's a similar kind of search problem that I'm trying to solve with the fragments. But I've made the problem more tractable. I don't have to look at each amino acid individually; I now need to somehow fold up these fragments with more or less physics-based scoring functions. It turns out that this works remarkably well, mostly because of the effort David Baker has put into Rosetta.
So what you would do is, I think currently they assemble fragments of three, six or nine amino acids. The fragments themselves will of course not form proteins, but they will be parts of secondary structures. Then you make a database of them, and this database is included with Rosetta. Now that I have a sequence, I query the database and see what the candidates are for my first nine residues here, and then the fragments need to overlap a bit. The other advantage is that my fragments are locally correct. There are no clashes within the fragments, so I no longer have to move every single atom as I would have to do in a molecular dynamics simulation. And if we're lucky, we can even use a Monte Carlo optimization, move an entire fragment relative to another one, and use some sort of implicit solvent model.
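To make the fragment assembly idea a bit more concrete, here is a minimal Python sketch of a Monte Carlo fragment-insertion search. This is not Rosetta's code or API. Everything in it is an assumption made up for illustration: the tiny `FRAGMENT_LIBRARY`, the 3-residue fragment length, the `PREFERRED` torsion table, and the `toy_score` function, which only stands in for a real physics-based or knowledge-based scoring function. The one thing it does show faithfully is the loop itself: pick a window of the sequence, look up candidate fragments for that window, paste one in, and accept or reject the change with the Metropolis criterion.

```python
import math
import random

FRAG_LEN = 3  # assumed fragment length for this toy example

# Idealized (phi, psi) backbone torsions in degrees.
HELIX = (-60.0, -45.0)
STRAND = (-120.0, 130.0)
EXTENDED = (-150.0, 150.0)

# Hypothetical fragment library: sequence window -> candidate torsion fragments.
# A real library would be mined from the Protein Data Bank; these are made up.
FRAGMENT_LIBRARY = {
    "AAG": [[HELIX, HELIX, HELIX], [HELIX, HELIX, STRAND]],
    "AGW": [[HELIX, HELIX, STRAND], [STRAND, STRAND, STRAND]],
    "GWA": [[STRAND, STRAND, HELIX], [HELIX, HELIX, HELIX]],
    "WAA": [[STRAND, HELIX, HELIX], [HELIX, HELIX, HELIX]],
}
DEFAULT_FRAGMENTS = [[HELIX] * FRAG_LEN, [STRAND] * FRAG_LEN]

# Made-up per-residue torsion preferences used only by the toy score.
PREFERRED = {"A": HELIX, "G": HELIX, "W": STRAND}


def candidates(window):
    """Return the candidate fragments for a sequence window."""
    return FRAGMENT_LIBRARY.get(window, DEFAULT_FRAGMENTS)


def angle_diff(a, b):
    """Smallest signed difference between two angles, in degrees."""
    return (a - b + 180.0) % 360.0 - 180.0


def toy_score(torsions, sequence):
    """Stand-in scoring function: lower is better."""
    total = 0.0
    for (phi, psi), residue in zip(torsions, sequence):
        pref_phi, pref_psi = PREFERRED.get(residue, HELIX)
        total += (angle_diff(phi, pref_phi) / 60.0) ** 2
        total += (angle_diff(psi, pref_psi) / 60.0) ** 2
    return total


def assemble(sequence, steps=2000, temperature=1.0, seed=0):
    """Monte Carlo fragment insertion with Metropolis acceptance."""
    rng = random.Random(seed)
    n = len(sequence)
    torsions = [EXTENDED] * n  # start from an extended chain
    score = toy_score(torsions, sequence)
    best, best_score = list(torsions), score
    for _ in range(steps):
        # Pick a random window and one of its candidate fragments.
        start = rng.randrange(n - FRAG_LEN + 1)
        fragment = rng.choice(candidates(sequence[start:start + FRAG_LEN]))
        trial = torsions[:start] + list(fragment) + torsions[start + FRAG_LEN:]
        delta = toy_score(trial, sequence) - score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            torsions, score = trial, score + delta
            if score < best_score:
                best, best_score = list(torsions), score
    return best, best_score
```

Calling `assemble("AAGWAAGW")` returns the best torsion list found and its toy score. Note that the search moves whole fragments at a time rather than individual atoms, which is exactly why the problem becomes tractable.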
At some point, when we're starting to be reasonably content, we switch over to a more physics-based function, really similar to a molecular dynamics simulation. The idea is that if we think we are really close to the actual local minimum of the protein, at some point I should be able to just go downhill and minimize the energy. Fundamentally it's not that different from the molecular dynamics simulations you're doing. The key difference is that in a molecular dynamics simulation you try to actually model the kinetics, the natural motion of the system. Here we don't care about the natural motion. We don't care about how the protein folds. All I care about is trying to predict what the end result should be, and that occasionally enables you to take some shortcuts.
If I haven't praised Rosetta enough, I will keep doing it. The key thing about Rosetta is that it's not a rocket-science method, you could argue; the effort is in having consistently spent 20 years tuning everything and really making all these algorithms work. A couple of years ago David published a paper in Science where they picked 16 small proteins and showed that they could predict all of their structures to within a few angstroms. I think five of the 16 were within 1.5 angstroms. Again, you might not have any idea how good that is, but 1.5 angstroms is better than the resolution you would have in a cryo-EM experiment. A really good x-ray structure might be one angstrom, but at 1.5 angstroms you're good enough. There is simply no point in getting anything better. So at least for these small proteins, Rosetta was competitive with experimental structure determination.
Now, if this was not enough, they continued a few years later and added more information. So now we need to find a way to add more biological information. How can we do that?