We have got a great session lined up here, and it's my pleasure to introduce our first speaker, Michael Nielsen from the Astera Institute. We'll be talking about how AI is impacting science. Michael, take it away. Good afternoon, everybody. Thanks for coming along to this afternoon session. My talk is about AI as a kind of general purpose technology, one where I wonder about the extent to which it's going to have a very broad impact across the sciences. That's the relevance to metascience in particular. And as a specific focus, I'm going to talk about the impact over the last couple of years on protein biology. So probably many of you have heard that back in 2020, biologists were very surprised when a deep learning system, AlphaFold 2, was shown to routinely make correct, near atomic precision predictions of protein structure based just on the linear sequence of amino acids making up the protein. So here, each letter in the sequence represents a single amino acid. You have a linear chain, and somehow it folds up into this protein shape. This is human insulin over here. The first thing you might wonder, and obviously you might be a little bit skeptical of this, is whether the modelers were fooling themselves, whether this was done in a cherry-picked way. It wasn't. In fact, it was done in a blind, adversarial competition with more than 100 other modeling groups. In some cases, in fact, AlphaFold was better than experiment: it caused known experimental results to be reevaluated and actually improved. This wasn't generically the case, it was just the case for a few of the structures. It looked at the time, and I think subsequent events have borne this out, like a major breakthrough in biology. So while that's kind of impressive, I think really the right way to think about it is as a bridge to a new era in protein biology. It opens up many, many questions. Some of those questions are metascientific questions: questions about what we expect a good theory or a good explanation to provide, how we can or perhaps cannot validate that understanding, and what we humans can learn from these systems. And those are the subject of my talk today. It's also, and this I think is really the broader of the two reasons why I was interested in the subject, a question of whether and how such systems may impact the progress of science as a whole, a kind of systemic intervention. So I think it's valuable for metascientists to engage with AlphaFold as a concrete prototype of how AI can be used in science, even if you have no prior interest in proteins or even in biology. That's certainly me: I am not a molecular biologist. I've just been learning this over the last few months, so my apologies for any errors on that front. But it's been enjoyable to learn a little bit about protein biology. So I want to take a few minutes just to go over some background, for those who, like me, are not biologists. The first thing, which I didn't actually know a year ago, is that we know hundreds of millions of different proteins. In fact, in the big metagenomics databases, there are billions of proteins. And some of them are probably familiar to you; they're these beautiful little nanoscale molecular machines. Here's the kinesin motor protein, which is capable of carrying big molecules, those are the big globes in the picture, along microtubules throughout the cells in your body. Here's hemoglobin, which carries oxygen, used to power cellular respiration.
Here's green fluorescent protein, which, when you expose it to UV light, will fluoresce green. It was discovered in jellyfish, and it's now used as a way of tagging cells so you can follow where they are. So you can think of molecular biology as a bit like wandering into this really, really large workshop full of all these wonderful machines, which have been created and sorted by evolution by natural selection. Each one of these machines could be the subject of, well, certainly I think a lifetime's study. There are thousands and thousands of papers, for example, about the kinesin family of proteins, and yet we're really only just beginning to understand it. But while there's a wealth of these biological machines, we don't, a priori, know what those machines do or how they do it. We have no instruction manual for the machines. So for the vast majority of those hundreds of millions of proteins which I mentioned, all we can easily determine is the basic blueprint, the amino acid sequence, which we can determine using genome sequencing at a cost of no more than a few cents. In fact, I think the average cost now is probably below a cent. But what the proteins fold into, these incredibly tiny, nanometer-scale 3D structures, is very, very difficult to image. And so, as a result, it will routinely take months of work to determine the 3D structure. One example was the Protein Structure Initiative: in, I think, its first phase, $250 million was spent, and structures were found for about 1,100 proteins. That's a cost of about a quarter of a million dollars each. So there's a big difference, a big gap. And that discrepancy in cost and time really, really matters. It matters because understanding the protein's shape is absolutely crucial for answering questions like: what antigens will an antibody or other protein bind to as part of an immune response? Or what can the protein carry around, the way hemoglobin carries oxygen? Or how do proteins form larger complexes, like the way the ribosome is formed? There are many, many questions like that, and while understanding the shape of the protein certainly doesn't tell you everything about its function, it is a key part of understanding what the protein can do and how it does it. So ideally, what we'd like to be able to do, and there are sound reasons from chemistry to expect this is often possible, is to determine the shape from the amino acid sequence alone. We'd like to be able to compute that mapping. In the 1970s, people began doing physics simulations to try to determine what shape a protein would fold into from the amino acid sequence. And many, many techniques have since been developed, some coming more out of physics and some more out of biology, sort of evolutionary approaches. A problem, very similar to many studied by people here, is that it's very easy to fool yourself if you're a modeler. It's very easy to cherry-pick results and to convince yourself that your systems are better than they actually are. For example, here's a really remarkable press release from Johns Hopkins University in 1995. It's a very long, very enthusiastic press release claiming to have solved the protein folding problem. And I should say, these were excellent people who have done a lot of important work, but they had just fooled themselves. To address this problem, in 1994 a competition was begun named CASP, the Critical Assessment of Protein Structure Prediction.
And every two years since then, CASP has asked modelers to do blind predictions of protein structure. What I mean by that is they are asked to predict structures for proteins whose amino acid sequence is known, but where the experimental structure determination is in progress. Some structural biologists are currently working on it; they've been sworn to secrecy, but the determination is expected to be complete soon after the competition, so entries can be scored. Roughly speaking, the right way to think about the scores I'm about to quote, which run from zero to 100, is as the percentage of amino acids whose 3D position is within some very demanding threshold. So I'll just show you the CASP results from 2006 to 2016. This is the winner in the hardest category, the proteins thought to be most difficult: the median free-modeling accuracy. The winner would typically place roughly 40% of amino acids within that very demanding threshold. There are typically more than 100 entrants, so there's a lot of groups, and an enormous amount of computing power. Let me actually say what the score is a little bit more precisely (a small code sketch of it follows below). My understanding is that it's an average over percentages of the alpha-carbon atoms in the protein backbone: you compute what fraction are within one angstrom of the experimental determination, what fraction are within two angstroms, four angstroms, and eight angstroms, and you average those four numbers together. So it's a little bit complicated, but one angstrom is an extremely good determination, and two angstroms is pretty good. Hopefully that gives you some flavor. So that's kind of the background. And then Google DeepMind entered the competition in 2018 and 2020, and really just made a massive improvement. In fact, it's the second improvement, in 2020, which is by far the more impressive. They scored 87 in this most demanding category. And just to put this in a form that's maybe slightly easier to visualize and reason about: the median root-mean-square distance that they obtained for their predictions was just under a single angstrom, shown here on the left, and then you see the gap to second place, which achieved a root-mean-square distance of, I think, 2.8 angstroms. And then everybody else was about that as well; this graph would continue across the 100-plus groups. Okay, so two CASPs back, prior to that, the winner was actually at more than 10 angstroms. So volumetrically, you've got roughly a thousandfold improvement. It's obviously a very large improvement in accuracy. For comparison, the diameter of a carbon atom is about 1.4 angstroms, so it's sort of atomic precision. In fact, it's also comparable to many experimental determinations. It's possible to do better than an angstrom, but I gather, talking to biologists, that one to two angstroms is often fairly typical. So in announcing the results of that 2020 CASP, the co-founder of CASP, John Moult, said that to see DeepMind produce a solution for this, having worked personally on this problem for so long, and after so many stops and starts wondering if we'd ever get there, was a very special moment. So he regarded it, at least, as a solution, and there's been a lot of discussion of to what extent it is or is not a solution. Obviously that's a very strong statement, and it's worth digging into in what sense it really is a solution. There are some senses in which it is, and I think there's a bunch of senses in which it is not.
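Before going on, here is the small code sketch promised above, just to make that scoring rule concrete. It's a minimal sketch of the averaging scheme as described in the talk, assuming the predicted and experimental structures have already been superposed; the real CASP score (GDT_TS) also searches over superpositions to maximize each fraction, so treat this as an illustration rather than the official calculation.

```python
import numpy as np

def gdt_ts_like(pred_ca: np.ndarray, exp_ca: np.ndarray) -> float:
    """Simplified GDT_TS-style score for two already-superposed
    structures, given as (N, 3) arrays of alpha-carbon coordinates
    in angstroms. Returns a score from 0 to 100."""
    # Distance between each predicted and experimental alpha carbon.
    dists = np.linalg.norm(pred_ca - exp_ca, axis=1)
    # Fraction of residues within each of the four thresholds,
    # averaged, exactly as described above.
    fractions = [np.mean(dists <= t) for t in (1.0, 2.0, 4.0, 8.0)]
    return 100.0 * float(np.mean(fractions))
```

On this scale, a perfect prediction scores 100, and a prediction with every alpha carbon between 2 and 4 angstroms off scores 50.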
But even the competitors to AlphaFold were very laudatory. Here's just one, I think fairly typical, very thoughtful statement, part of a very long blog post about this question by Mohammed AlQuraishi, who developed the first end-to-end models for predicting protein structure. Does this constitute a solution of the static protein structure prediction problem? He thinks so, but there are all these wrinkles, and he wrote thousands of words about those. Honest, thoughtful people can disagree here, and it comes down to one's definition of what the word solution really means. But the bulk of the scientific problem is solved; what's left now is execution. So for what it's worth, in retrospect, a couple of years later, it's clear AlphaFold is a huge leap, but much remains to be done even on the basic problem, and many new vistas and important new problems have been opened up. Okay, it would take me several hours, unfortunately, to describe AlphaFold's architecture in detail, but I want to just tell you a couple of interesting things. It's a deep neural network, basically, meaning just a hierarchical model, with 93 million learned parameters. The input is just the sequence of amino acids, the linear sequence; maybe you get it from genomics. And the output is, in part, a three-dimensional structure, so the three-dimensional coordinates of all of the amino acids. It also outputs confidence scores, so it will tell you where it thinks it's got it right, and it will tell you where it thinks it's gotten it wrong. That's also learned by the model. The basic training data used is from the Protein Data Bank, which is humanity's repository of protein structures that have been experimentally determined, going back to the 1970s, by those very difficult experiments I mentioned before. At training time, that was about 170,000 proteins; it's about 200,000 today. A small fraction of those were omitted for technical reasons, but basically that's what they used. And as is always the case with these machine learning techniques, the parameters in the network are gradually adjusted: they start out random, and they're adjusted by gradient descent to ensure the network gives the correct output structure on PDB inputs. There are many, many other ideas which are used; that's just the broad picture. Let me mention probably the most important other idea. It's an old idea, going back about 15 years, which is not just to learn from the structural information which is known, but to also learn from the genetic information which is known, of which there's much, much more: those hundreds of millions of known sequences. It's a very clever idea. It's to say: all right, we've got this sequence. Let's look for other similar proteins; maybe it's the same protein, but in a different species, so a few of the amino acids have maybe changed, but there's a lot of overlap. And so they construct these very large sequence alignments, and they look for correlated changes in the amino acids. If there are two amino acids which are very distant from one another in the chain, but they seem to change together in a correlated way, then they're likely to be evolutionarily related; that's just a fact. The intuition is that they're also likely to be close in space, despite being distant in the chain, because they need to co-evolve together to preserve the shape and the function of the protein. That's just an intuition, not necessarily a fact, but it turns out that using that idea really significantly improves the results.
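Here's a minimal sketch of that correlated-change idea, using mutual information between alignment columns as the measure of "changing together." This is the classical intuition behind contact prediction, not anything AlphaFold literally computes, and the toy alignment is invented purely for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def column_mi(msa: list[str], i: int, j: int) -> float:
    """Mutual information between columns i and j of a multiple
    sequence alignment (a list of equal-length sequences)."""
    n = len(msa)
    pair_counts = Counter((s[i], s[j]) for s in msa)
    pi = Counter(s[i] for s in msa)
    pj = Counter(s[j] for s in msa)
    mi = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Toy alignment: positions 1 and 5 mutate together across sequences
# (K pairs with E, R pairs with D), so on the contact intuition they
# are candidates for being close in space.
msa = ["MKLAVE", "MRLAVD", "MKLAVE", "MRLAVD", "MKIAVE"]
scores = {(i, j): column_mi(msa, i, j)
          for i, j in combinations(range(6), 2)}
print(max(scores, key=scores.get))  # prints (1, 5)
```

Real pipelines are much more careful than this (they correct for shared ancestry among sequences, for example), but the core signal is the same: correlated columns hint at spatial contact.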
Okay, so in this way, AlphaFold learns both from the known structural information, the known physics in the PDB, and from the known evolutionary information in the protein sequence databases. John Jumper, the lead on the paper, has a nice way of saying it: the physics is informing our understanding of the evolution, and the evolution is informing our understanding of the physics, and the network just talks backwards and forwards between these two pictures. Okay, so there's an obvious question one would have after seeing that. Oh, actually, let me first tell you something I was just reminded of yesterday. I said there are many other ideas in the way this network learns. Something I love, something that's just kind of buried in the network, is actually a language model, much like GPT-3 or GPT-4; it contributes a small part of the loss function. One of the things they do with those sequence alignments I mentioned is mask out some of the amino acids and try to predict them. That's part of the loss function. So they treat the amino acids a little bit like words are treated in the language models which have so many people so excited. I just wanted to mention that as a parenthesis. Okay, so the obvious question to ask is: is AlphaFold merely memorizing its training data, or is it able to generalize, and to what extent can it generalize? Obviously, CASP provides a basic validation. The competition structures weren't in the PDB, and thus not in the training data, at the time AlphaFold predicted them. So that's good. It's actually slightly stronger than merely good: CASP is in some sense a natural sample. After all, the structural biology community isn't choosing which structures to solve for CASP; they're choosing them in part because they're biologically significant. They also choose them in part because they're tractable, which is not quite the same thing. But in some sense, at least, this demonstrates an ability to generalize to proteins of interest to biologists at large. Still, that's kind of a basic sanity check, because you'd ask: does deep learning work for proteins which don't occur in nature, maybe because they're the result of mutations, or because they're designed proteins, or for many other reasons? I can't summarize that work today; it would take hours. There is a huge amount of work going on, and really, it's a mess. It's very easy to find juxtapositions of papers that apparently come to opposite conclusions. You have papers saying that you can use it to see the effects of point mutations, others saying that you can't, others doing all sorts of design work. It's very interesting. A result that I particularly like is from OpenFold, which is an open source near-clone of AlphaFold. One of the many things they did was take the full PDB data set and remove 95% of it. In fact, they removed a very particular 95% of it: there's a topology classification of protein folds, and they removed 95% of the training topologies completely. So there were just a whole lot of common topologies that were not in the training set.
And then they retrained, and the performance they obtained, while not as good as AlphaFold 2, was roughly the same as the first AlphaFold; it would have been state of the art before AlphaFold 2. So somehow it was able to generalize just from those 5% of the topologies. And there are a lot of results in that vein. Actually, one other thing I'll mention, something else they do that's very nice in this paper: they also consider eliminating things from the PDB more randomly. And they found that they got results comparable to the first AlphaFold even with almost 99% of the training data eliminated. And if they eliminated only 90%, I think it was, of the training data, they got results only a tiny bit worse than AlphaFold 2. So you can certainly eliminate a lot of the PDB data. Anyway, the core takeaway there is that it's going to keep biologists busy for many years figuring out the shortcomings of these systems and improving them, but quite a lot is, I think, already known, and there have been some increases in capability. Part of the reason this matters, well, there are many reasons it matters, but one reason is that DeepMind and EMBL released AlphaFold DB. This is a database of 215 million protein structure predictions, including the confidence scores. It also includes the near-complete proteomes of 48 species, including humans, mice, fruit flies, and many of the other usual suspects. It's a little bit cheap to say, but in some sense you can almost view the gathering of the PDB as having been the gathering of training data for these machine learning systems. You get this amazing thousandfold increase in structures. Of course, you don't necessarily believe them; to what extent you should believe them is still a little up in the air. But it is remarkable: no additional experiments were done by AlphaFold, no additional data were taken, and yet just by thinking, it was possible to obtain a very large number of additional predictions that people expect to mostly be very high quality. Talking with a biologist about this, he just laughingly made the comment that no one would take an AlphaFold prediction as true on its own, but it's an extremely helpful starting point and might save you months of work. He also made the comment that if somebody gave him a drug designed using AlphaFold, and it hadn't been experimentally validated, he certainly wouldn't yet be willing to take it. I can understand that. But this line between abstract model and experimental reality may, I think, eventually become quite blurry. In fact, the traditional experimental ways of determining structures require an enormous amount of theory, as is often the case across all of the sciences, both implicit in the tools and just to do the data processing. If you believed that in some sense AlphaFold or some successor offered a better theory, you might believe the results from that deep learning system more than you believe a traditional, experimentally determined structure. It sounds a little bit like science fiction, but actually there are hints of this beginning to happen. In the CASP assessment, AlphaFold performed quite poorly on a few structures, several of which had been determined using NMR; this is one of the three main techniques used, though a relatively uncommon one. And a 2022 paper, from one of the people who in fact did the first NMR determinations of protein structure, suggests that AlphaFold is actually often significantly better than NMR. They looked at 904 human proteins, and they concluded that the AlphaFold predictions are usually more accurate than the NMR structures. I think that's really quite interesting.
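As an aside before going on: the AlphaFold DB predictions mentioned above are easy to pull down programmatically. Here's a minimal sketch; the URL pattern and the convention of storing the per-residue confidence score (pLDDT) in the B-factor column are from the public AlphaFold DB, but the "_v4" suffix tracks database releases and may have changed, so check the site for current details.

```python
import urllib.request

# Fetch AlphaFold DB's prediction for human hemoglobin subunit alpha
# (UniProt accession P69905), in standard PDB format.
uniprot = "P69905"
url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot}-F1-model_v4.pdb"
pdb_text = urllib.request.urlopen(url).read().decode()

# In these files, the per-residue confidence (pLDDT, 0-100) is stored
# in the B-factor column of each ATOM record; read it off the alpha
# carbons (atom name "CA", columns 13-16 of the record).
plddts = [float(line[60:66])
          for line in pdb_text.splitlines()
          if line.startswith("ATOM") and line[12:16].strip() == "CA"]
print(f"{len(plddts)} residues, mean pLDDT {sum(plddts)/len(plddts):.1f}")
```

A confidence readout like this is exactly the "extremely helpful starting point" the biologist was describing: it tells you which parts of the prediction to lean on and which to treat with suspicion.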
How can a prediction be better than an experiment? There are many ways of understanding it, but an analogy I like is to think about how you interpret images from a telescope: what you take them to mean depends on your theory of optics. If you change, or improve, your theory of optics, it changes the meaning that you ascribe to your so-called raw data. In fact, something very similar has been required to understand the gravitational lensing of galaxies. Experimental protein structure determination depends on theory in a similar but much, much more complex way. In X-ray crystallography, for example, you need to purify your protein sample, you need to crystallize the proteins, which is unbelievably difficult, then you do the X-ray diffraction and obtain these two-dimensional images, and you need very complicated procedures to invert those and solve for the 3D structure. And there are criteria for when the inversion is or is not good enough. A lot of theory is involved in all of these stages. In fact, at the final stages, the solution often involves finding a good candidate search structure to start from, a good guess; this is called molecular replacement, and sometimes it's very hard to do. And in fact, AlphaFold has sometimes been used to provide that search structure when it was hard to find in any other way, and has enabled the solution of some particularly challenging structures. So I think there's already starting to be a somewhat blurry line between the model and the experiment. I think figuring out how to validate AI solutions, not just here but across the board, is going to be a really significant topic of metascientific interest in coming years. Okay, just to change topic a little, let me check my time. Oh okay, a couple of minutes behind where I wanted to be. Any model with 93 million parameters is obviously very complicated. It's obviously not a theory or explanation in a conventional sense. Can AlphaFold or a successor be used to help discover such a theory, even if only a partial one? Might, for instance, a simple set of principles for structure prediction be possible? And what, anyway, is AlphaFold learning? We don't know the answers to such questions, but I think pursuing them is a very useful intuition pump. I'm going to skip over this, some things about what we've learned in chess which are really beautiful, and finish with just this question: can we instead look inside neural networks and understand how they do what they do? It's been done a little bit for AlphaFold, but I want to tell you about some striking results in a much simpler system, due to Neel Nanda, now at DeepMind; he was independent at the time. He trained a very simple single-layer transformer neural net just to add two numbers, modulo 113. And the first thing the net did when it was learning was to just memorize all the examples in its training set, the obvious thing to do. It could add those very well, but it did terribly on everything else, just produced random garbage as output. But as he kept training, without changing the training set, it began being able to add examples which were not in that training set. Somehow, with no additional data, it was learning how to add. He spent several weeks looking at the weights inside the neural network to reverse engineer what it had learned to do. And he found, to his own surprise, basically a bunch of trigonometry. He says: this algorithm was purely learned by gradient descent; I did not predict or understand this algorithm in advance; I did nothing to encourage the model to learn this way of doing modular addition; I only discovered it by reverse engineering the weights. What that trigonometry amounts to is this: the network creates waves at a handful of frequencies, phase-shifted by an amount proportional to x, then further phase-shifted by an amount proportional to y. And then finally, at the output, it just looks for the inverse phase shift, the candidate answer, that lines the waves back up to give the strongest possible signal. That's what it was learning to do. It's kind of a radio-frequency engineer's, or group representation theorist's, approach to modular addition.
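Here's a minimal sketch of that trick with the neural network stripped away: fixed waves, the trig addition formulas, and an argmax over candidate answers. The particular frequencies below are arbitrary choices for illustration; a trained network settles on its own handful.

```python
import numpy as np

P = 113                # the modulus Nanda used
KS = [17, 25, 32]      # a few "key frequencies" (arbitrary here; the
                       # trained network picks its own set)

def add_mod_p(x: int, y: int) -> int:
    """(x + y) mod P, computed in the style of the reverse-engineered
    network: represent x and y as waves, combine them with trig
    identities, and pick the output c whose wave lines up best."""
    logits = np.zeros(P)
    c = np.arange(P)
    for k in KS:
        w = 2 * np.pi * k / P
        # Cosine/sine addition formulas give cos(w(x+y)) and
        # sin(w(x+y)) from the waves for x and y separately.
        cos_xy = np.cos(w * x) * np.cos(w * y) - np.sin(w * x) * np.sin(w * y)
        sin_xy = np.sin(w * x) * np.cos(w * y) + np.cos(w * x) * np.sin(w * y)
        # This equals cos(w(x + y - c)), which is maximal (equal to 1)
        # exactly when c = (x + y) mod P.
        logits += cos_xy * np.cos(w * c) + sin_xy * np.sin(w * c)
    return int(np.argmax(logits))

assert add_mod_p(95, 44) == (95 + 44) % P  # 139 mod 113 = 26
```

Because P is prime, even a single frequency suffices mathematically; summing over several just sharpens the peak, which is roughly what the trained network does for robustness.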
The question is: why did the network switch to this algorithm from memorization? After all, it was seeing no more training data. And the answer is that during training, the neural network pays a cost for more complex models. The loss function is chosen so gradient descent prefers lower-weight models, and the wave algorithm is actually lower weight. So it is preferred: it is simpler, and it is more able to generalize. It's almost a kind of mechanical implementation of Occam's razor, preferring the simpler and more general approach. And by varying the loss function, you can potentially impose this kind of Occam's razor idea in many different ways. So I wonder whether one day we'll be able to do similar things for systems like AlphaFold, gradually simplifying them and perhaps discovering new basic principles of protein structure. That's everything I wanted to say. There are, I think, many fundamental scientific and metascientific questions raised by AI systems as they get better and better, both at doing cognitive operations and at sensing and actuating in the environment. How is that going to impact science as a whole? Are they eventually going to systematically change the practice of science? Will they speed up the overall rate of scientific discovery? And if so, what benefits and risks does that carry? Thank you very much. And my apologies for going a couple of minutes over. We have time for a couple of questions. I'm Miyazono from Protocol Labs. I was wondering if there's something you see that is particularly special about protein folding, or whether, given your experience with machine learning generally, you see this abstracting to other problems that are difficult to do in one direction but fairly easy to check for validity. I could think of this being like drug discovery, or drug-drug interactions, or even computer program generation, where you can run it and see if it actually works. I think you can very naturally give two completely opposing answers to that question. One, you can say, well, it's only going to work for these very complex systems where we have access to very good data, and so we're able to learn from that data. So bioinformatics is a really good target, and maybe there are some other areas as well. That would be the skeptic's response. Six months ago, that's the answer I would have given. Now, I think I've been convinced otherwise, actually, thinking about general-purpose reasoning, and language as a substrate for reasoning, and models like Gato in particular, these multimodal models.
I'm starting to see, for example, people starting to build protein systems that actually incorporate text as well, right? As sort of a general substrate for cognition. So I think more and more of these problems are going to get folded into foundation models. And there I start to think that, in fact, we might see progress in a lot of areas which don't seem to have access to really high quality data in this kind of way. That's, I think, where I am at the moment: gradually changing my mind towards believing that. Very interesting. Just a quick comment on the last couple of slides you gave: I think one of the reasons why it found its wave functions is because the activation function was an exponential. So essentially, you already told it: please use an exponential to approximate this problem. The other question I have was, in your AlphaFold problem, what if AlphaFold were perfect? Would we then sort of stop doing research on many-body quantum systems, and sort of accept that the problem was solved by a 93-million-parameter model? Or are we still then striving for simplification? You're certainly still striving for understanding, if you want to do things like... What do you mean by understanding? Well, in the ten seconds available, I won't try and answer that. Just a comment on your question, which is a very complex and very interesting one. It starts to move to where... Chatting with biologists about how their own interests have moved, they're moving to things like protein design, for example, where it might turn out that, in fact, similar systems are very useful, but that remains to be, I think, really comprehensively demonstrated. But still, in order to inform the design of those systems, you need to understand protein folding as well as possible, protein structure as well as possible. So there's still a lot of understanding to extract from the system, even if it's perfect. Basically, the point is that if you have an oracle for doing perfect predictions, that's somewhat interesting scientifically, but it's not all you want out of science. Last question? I have a question that goes kind of across disciplines: across the physical chemistry you've described, and also the computer science AI techniques, and also into social systems, such as law, in particular intellectual property issues and patent law. It's the question of how we should determine, especially where new AI techniques allow an increase in knowledge about physical systems, whether we have a new invention or not. Is a computer program that uses AI something that could be considered part of the knowledge base of someone skilled in the art? The issue is: is it a non-obvious step, or is it a new invention, if the AI is actually coming up with most of the knowledge which solves a problem that hasn't been solved before? How should we deal with this, to be able to determine: is this a patentable new invention, or is it just basic knowledge that we can add to our knowledge base? Another very simple, easy to answer question. Of course, there's at least two ways of thinking about this. People are just beginning to sue companies that are doing generative art and generative language, and there are at least two ways of thinking about that. One is just: what does the law say, and to what extent is it applicable?
But the lawyers who I've talked to about this seem to feel that at some level you actually need to go back to something prior to that: what was the original intent behind the law? Because of course it wasn't written for a situation where you can gather hundreds of millions of items of training data and potentially repurpose them. People will say things like: oh, well, ChatGPT, or Codex, just spits out pieces of code which are already known, therefore it's going to be found to be in some kind of copyright violation. Maybe, maybe not. And even if it is found to be, that's a very easy thing to patch in the system, so it's not going to stop these systems. So the fundamental underlying issue, I think, is going to remain, and it's going to be very interesting to see where people land over the next 10 or 15 years in terms of what the underlying legal principles should be. That's a very wandering answer, but hopefully the intent was clear.