Two years ago, something fascinating happened. In this biennial competition for predicting protein structures, there was a team from Google entering, or strictly speaking not Google itself, but DeepMind, Google's research branch in the UK. And DeepMind used deep learning to solve this problem. This is not a class on deep learning, so I won't have time to go through the details, but they largely used methods from image recognition, so-called convolutional neural networks, which are based on identifying small pieces in an image, assuming they are translationally invariant within the image, and then grouping these together to literally train the machine to recognize, say, faces, based on images with many faces. What they did for proteins is that they trained networks to predict all the pairwise distances between residues, usually between the C-beta atoms, but that's a detail. Again, if you have a protein with 100 residues, that is a matrix of 100 by 100 pairwise distances; think of it as the green matrix on the slide. And that's really similar to an image, right? So once you have that, you can apply roughly the same methods. The devil here is in the details, of course; there was a huge amount of training they had to do. At some point you end up with a first prediction: given my sequence, what is the predicted distribution of all the pairwise distances? Now, a long list of 100 by 100 distances is certainly not a protein structure.
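To make the distance-matrix picture concrete, here is a minimal sketch in plain Python (my illustration, not DeepMind's code) of how a set of residue coordinates maps to the image-like, symmetric matrix of pairwise distances that the network is trained to predict:

```python
import math

def pairwise_distances(coords):
    """Compute the N x N matrix of pairwise distances between residues.

    coords: list of (x, y, z) positions, e.g. one C-beta atom per residue.
    Entry [i][j] is the distance between residues i and j. The matrix is
    symmetric with zeros on the diagonal, which is exactly what makes it
    look like a grayscale image that convolutional networks can process.
    """
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Toy example: three residues on a line, 3.8 Å apart (a typical CA-CA spacing).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
D = pairwise_distances(coords)
```

For a real 100-residue protein this gives the 100 by 100 matrix mentioned above; the network's job is the inverse direction, predicting this matrix from the sequence alone.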
So what they then had to do was combine that with a smoothed potential, where they simply turn and optimize all the backbone torsions, and couple this with an optimization algorithm, not entirely dissimilar to the simulations you've been running, to let the computer effectively fold the protein: start from whatever the distances currently are, and create a potential that pushes them toward the distances the convolutional neural network says they should have. This instantly beat all the other groups by a fair bit, a couple of percentage points. Remember those GDT_TS scores I talked about? I don't remember the exact scores they got, but suddenly these scores were so high, even for relatively unknown proteins, they were in the 70s or so, that ab initio folding was suddenly not an outlier, but likely to work. There were two groups that did really well. One of them was David Baker with Rosetta, but they were beaten by the Google AlphaFold team. It was a landmark discovery. So, given another 10 years this should help us really crack the problem, or so we thought, because it didn't take 10 years; it took two years, until the next CASP competition, when they presented AlphaFold 2. AlphaFold 2 is also a neural network, but now we're going far beyond the classics. It builds on a newer class of networks called transformers, which was originally developed for language translation: networks trained to predict, say, the next word in a sequence. There were a number of new features. First, they had orders of magnitude more training data, which is always important in machine learning.
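The folding-by-optimization step can be sketched in the same spirit. As a simplification, assume a plain harmonic potential on each pairwise distance and gradient descent directly on Cartesian coordinates (the actual pipeline optimized backbone torsions against predicted distance distributions; this toy version is mine):

```python
import math

def fold_by_distances(coords, target, steps=2000, lr=0.01):
    """Minimize E = sum over pairs of (|xi - xj| - target[i][j])^2.

    coords: list of mutable [x, y, z] positions (the current structure).
    target: matrix of predicted pairwise distances to push toward.
    Plain gradient descent: each pair acts like a spring whose rest
    length is the predicted distance.
    """
    n = len(coords)
    for _ in range(steps):
        grad = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                d = math.dist(coords[i], coords[j])
                if d == 0.0:
                    continue  # gradient undefined for coincident points
                # dE/dxi = 2 * (d - target) * (xi - xj) / d
                coef = 2.0 * (d - target[i][j]) / d
                for k in range(3):
                    diff = coords[i][k] - coords[j][k]
                    grad[i][k] += coef * diff
                    grad[j][k] -= coef * diff
        for i in range(n):
            for k in range(3):
                coords[i][k] -= lr * grad[i][k]
    return coords

# Toy example: pull three points into an equilateral triangle of side 3.8 Å.
start = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
target = [[0.0, 3.8, 3.8], [3.8, 0.0, 3.8], [3.8, 3.8, 0.0]]
folded = fold_by_distances(start, target)
```

The point of the sketch is only the logic: the predicted distances define a potential, and an optimizer relaxes the chain until the current distances match them.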
Second, they designed everything as an end-to-end predictor, meaning that the input is the sequence and the output is literally the structure, the coordinates, with no separate step to fold the sequence. They also, perhaps surprisingly, encoded a lot of physics in the model, for instance that the prediction is rotationally and translationally invariant: a rotated set of coordinates should correspond to the same prediction. We could, of course, correct for that afterwards, but having it built into the model itself makes it much easier to let the model work as one big black box and do the entire prediction. So how well did this work? People were in shock and awe in December 2020, when this was presented. It's quite horrible, in a way, because they completely got rid of all detailed knowledge about proteins. They literally turn the amino acids into small pieces and optimize the positions of these pieces in space. Even the connectivity of the chain is something they trust the neural network's output to get right, which it does. Based on that, here are some examples of predicted and experimentally determined structures. Do you see how good they are? They overlap almost perfectly. We're talking about RMSDs that are just over one angstrom. In many cases, the deviation between the predicted and the experimental structure is so small that it's in the same ballpark as if you had a second group determine another X-ray structure. So it's not just that the structure predictions are good. They are so good that the computer is now as good as the experiments, and that leads to another problem: how do we know that the experimental result is the best one? If, on average, the computer is roughly as good as the experiment, and there is now a deviation, should we trust the experiment or should we trust the computer?
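As an aside, the RMSD numbers quoted here are simple to define. A minimal sketch, ignoring the optimal superposition step that is done first in practice (typically with the Kabsch algorithm):

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two already-superposed structures.

    coords_a, coords_b: equal-length lists of (x, y, z) atom positions.
    A value just over 1 Å is in the same ballpark as the difference
    between two independently determined experimental structures.
    """
    assert len(coords_a) == len(coords_b)
    sq = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# Toy example: the same two-atom structure shifted by 1 Å along x.
a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
b = [(1.0, 0.0, 0.0), (4.8, 0.0, 0.0)]
# rmsd(a, b) → 1.0
```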
Maybe on average the computer does better, and it's the experimentalists who now and then make a mistake. Those were small protein structures. Here's a large one they got right: a gigantic polymerase. It's not perfect, there are a few helices that are a little bit twisted, but on average it's probably an RMSD in the ballpark of two angstroms, perfectly valid for drug design. In fact, remember when I told you about GDT, the Global Distance Test score, and said that 70 to 80 was really good? That's where they got with AlphaFold 1. With AlphaFold 2, they got an average of 92.4. And remember, the difference between two experimental structures would be maybe 90. So suddenly, in the 15 years since I made that original slide, we have gone from a position where ab initio prediction was pretty much impossible to having solved the problem. This is no longer a problem. Sure, there are outliers. There will be membrane proteins we haven't determined yet. There will be alternative states. There's a huge amount of science left to do, but the fundamental theoretical problem is solved. We know how to solve this in theory and in practice, for some cases at least. There is only one problem: Google has not made this available. There is no web server where you can just upload your sequences and let AlphaFold do the prediction, in part because it requires quite a lot of GPU resources. And that's a concern for many research groups. It also means that we can't look into the details of how they did it, whether they made any mistakes, or whether they cheated. I don't think they cheated. But not having all results openly available is a problem for science, and of course for scientists in particular. So today, if you need to make an ab initio prediction, you should go with fragment-based building in Rosetta, for now at least.
But who knows, other groups might build similar servers that are available two or three years from now.
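For completeness, the GDT_TS score behind those 70s and 92.4 numbers averages, over four distance cutoffs (1, 2, 4, and 8 Å), the fraction of residues whose predicted position falls within that cutoff of the experimental one. This is a simplified sketch of my own: the real score searches over many superpositions, whereas here I assume the two structures are already aligned:

```python
import math

def gdt_ts(coords_pred, coords_exp):
    """Simplified GDT_TS: mean over four cutoffs of the fraction of
    residues within that cutoff, scaled to 0-100.

    coords_pred, coords_exp: equal-length lists of (x, y, z) positions,
    assumed pre-superposed (the real score optimizes the superposition).
    """
    n = len(coords_pred)
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    score = 0.0
    for c in cutoffs:
        within = sum(1 for a, b in zip(coords_pred, coords_exp)
                     if math.dist(a, b) <= c)
        score += within / n
    return 100.0 * score / len(cutoffs)

# Toy example: every residue off by 3 Å passes only the 4 and 8 Å cutoffs.
exp = [(i * 3.8, 0.0, 0.0) for i in range(5)]
pred = [(x, 3.0, z) for (x, y, z) in exp]
```

A perfect prediction scores 100; the shifted toy structure above passes two of the four cutoffs for every residue, giving 50.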