So it gives me a lot of pleasure to introduce the second keynote speaker for the conference, Dr Kate Michie. Kate, I could read out her bio. She provided me with an exceedingly long bio. So thank you. But Kate is a senior lecturer at the University of New South Wales with over two decades of experience and expertise in protein structural biology. And she's done a lot of other things, and you can go on our website to read that. So what I thought I would do, because it came up yesterday when Kate gave a fantastic training session, is reflect on how Galaxy Australia connects with people that are really good for Galaxy Australia, and we hope we're really good for them. So Kate and I looked at each other blankly and went, how did we meet again? It was during COVID, which obviously makes it electronic. So I did the obvious thing: I searched through my emails. When AlphaFold came out, it was an obvious game changer, and there was a push from our directors to get it onto Galaxy as soon as possible. The Galaxy Australia team worked with Galaxy Europe, we worked with Azure, and we got the tool on. One of the things we had been learning at the time was the importance of user experience and user interface, so we arranged to interview some of our early adopters of AlphaFold. And thanks to Steve Manos, wherever he is in the room here... back there. Hi, Steven. Steven linked us into some people at the University of New South Wales, who linked us through to Kate, who got on and used AlphaFold. And then myself and Maddie, one of our UI/UX people, had the pleasure of interviewing Kate for about an hour. We were absolutely blown away by her level of knowledge of the field, a field that we were blissfully unaware of at the time, and by the impact that AlphaFold was having on a community that we were fostering through Galaxy. So I'm very confident that Kate is going to reflect some of that passion for what we've managed to do together today. 
The talk and session is an hour. There will be time for question and answer. There will be the bell that we're all loving so much to go off to remind us for the talk and questions. So with that, thank you, Kate, for coming up, and we look forward to your talk. Can everyone hear me? So thank you, Gareth. Thank you for the invitation. It's a bit daunting as a structural biologist to come to a meeting where I probably don't speak the language that most of you guys do. I don't speak code. I never learnt any code, and I found myself in this position accidentally. So forgive me if I make some gaffes in the conversation and get the acronyms wrong, because there's plenty of them. I trained as a structural biologist, which means that my job is to solve the atomic structures of proteins. And that means that you have to be able to clone, so you have to be able to do some bioinformatics. You have to be able to do PCR. You have to be able to overexpress proteins in bacteria. You have to purify them. Then you normally put them into crystal trays and hope to goodness they will crystallize. And then you stick them on an X-ray source or the synchrotron, and then you have to collect that data and then you have to solve the structure. And that could take many, many years. And then, of course, there was a recent revolution. The 2017 Nobel Prize for cryo-electron microscopy came, and the whole field of structural biology switched from crystallography, mostly overnight, to cryo-electron microscopy. And that came with a whole new set of caveats. And then during lockdown in 2021, AlphaFold was released, and that's another game changer. And so I find myself speaking to a group of people that probably aren't aware of what's happened in structural biology, because half the structural biologists aren't aware yet of what's happened in structural biology. So today I'm going to move quite quickly because there's a lot of material to get through. 
And I realize that none of you speak protein (assuming none of you speak protein, so please don't be offended if you do). I need to give you a little bit of background about what a protein is and what the problem is that we're trying to solve, so that you can follow the magnitude of the advances. In the last half of the lecture I want to go through the deep learning changes that have happened, and they're happening every day. And this is something that I would really appeal to you guys about: the structural biology discipline doesn't understand the difficulty of the compute problem that's coming for it. I'd really like you to pay attention to the scale of the problem so that you're ready when they need you. Yeah. All right. I'm just going to work out how to go to the next slide. All right. So firstly: what is AlphaFold? How does it work? How do you use it? I'll tell you why it's exciting, and then how it has progressed. So, the central dogma of structural biology, or molecular biology. This is a very basic overview; there are certainly some more intricate pathways. But the general idea is that you have your DNA, and that's a template pattern for the little protein machines that carry out the jobs in your cells. So you have this master plan, and when you need a protein to do a particular job, it gets copied into RNA and it gets taken to the protein factory called the ribosome. There it's translated into a string of amino acids, like beads on a string, that fold up into a protein. And the protein is the final machine that does the job. So you've got a pattern, a temporary pattern, and the final machine. Before we go much further to talk about the final machines, the proteins, I need to show you and make you familiar with the different ways we present protein structures, because we use them in different ways, because they're complicated molecules. It's really, really big chemistry. 
So the space-filling diagram just shows little baubles on a string, and they're really hard to understand; we don't use this representation very much. The stick diagram shows every single carbon, nitrogen and oxygen atom in the molecule, and that's the second one from the left. You can see it's quite complicated; it's really hard to understand a protein when you look at it just like that. We only use this way of describing a protein when you're looking at very fine details of active sites of enzymes and functional components. The ribbon diagram is the most common one, which we will see, and that's the one most people will look at when they look at AlphaFold results. It shows you the secondary structures of the protein, and it enables structural biologists to see the general fold of the protein, which is the core of how that protein is assembling; it lets us see gross features very quickly. And then the last diagram is the surface diagram, which shows you, effectively, the surface of the molecule. That's the thing that becomes important when you're designing drugs: you need to make a molecule that fits that surface intimately, and it has to match not just the shape but also the charge behaviors, the electrostatic behaviors, of the surface of the protein. So just to sum up protein folding, I've got this little video. It's quite low resolution, so apologies for that, but it really sums up the problem quite well. We've got a string of amino acids; it's the same chemistry along the backbone, and then there are little functional groups. There are 12 of them highlighted in white here and they're all different. There are 20 general amino acids, and they're all slightly different. To make it easy, and because we're lazy, we give each amino acid a letter from the alphabet. So you can see we're already transferring to a language here. 
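The DNA-to-RNA-to-protein flow and the one-letter amino acid code described above can be sketched in a few lines of Python. This is a toy illustration only (the codon table is a hand-picked subset of the standard genetic code, not the full 64 codons):

```python
# Toy sketch of the central dogma: DNA -> RNA -> protein.
# CODON_TABLE is a small subset of the standard genetic code,
# just enough to show how nucleotides become one-letter amino acids.

CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "AAA": "K",
    "GAU": "D", "UAA": "*",  # '*' marks a stop codon
}

def transcribe(dna: str) -> str:
    """DNA coding strand -> mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read codons (3 bases at a time) until a stop codon,
    emitting one-letter amino acid codes."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "X")  # 'X' = unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate(transcribe("ATGTTTGGCAAAGATTAA")))  # -> "MFGKD"
```

The string of letters this produces is exactly the "language" the speaker refers to: the input AlphaFold takes is nothing more than such a one-letter sequence.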
So then if we color that same string from blue to red. That's important because you're going to see that, when the protein folds up, the front and the back of the protein could end up in any particular place, and so the coloring helps us understand how it folds. Because of the chemistry of the amino acids, they have a propensity to form secondary structures, which are the alpha helices and beta sheets. And so here the molecule is assembling into these curly sections; this is just an alpha-helical protein. The folding process is how those sections fold up into the three-dimensional structure. Now, AlphaFold doesn't answer this folding problem, and I'll talk about that later. What AlphaFold does is calculate the structure of the protein at the end, based on the sequence you give it. Now we're just going to switch the alphabet back on so you can see each individual amino acid, and then we're going to turn on the chemistry, so you can see every single individual bond. You can see why looking at it in this way is very complicated and not very helpful. But you can now see that the red and the blue, the front and the back of the protein, have now folded up to be in exactly the same place three-dimensionally. So this is a big problem: you don't know where a part of a protein is going to be when it's folded. And then here we've turned on the surface so you can see the arrangement of how that protein folds. So it's been a really big, long-term problem understanding how a protein folds and what the final structure is. Then, just briefly, that's saying it in schematic form. We've got strings of amino acids. We've got two major types of secondary structure, the beta sheets and the alpha helices; that's called secondary structure. How they arrange and fold up gives us our tertiary structure. And those proteins can associate, sometimes with themselves or with other proteins, to carry out much more complicated tasks. 
And so that's called quaternary structure, or complex formation. So why in the hell do we do this? Why are they useful? It helps us really understand how things work. For instance, here's the picture of the spike protein, which is probably the molecule that you guys are most familiar with. The one on the left is from SARS and the one on the right is from COVID, and you can see that they're very, very similar in the way they look. That really helped us with the pandemic, because we already understood a closely related relative of COVID; we already had people working on vaccine production for SARS. So knowing this information really trims down the sorts of places we need to be looking for drug development and for understanding function. So this is the first structure of the main protease. Viruses make their proteins in a really long string, and they're fairly unusual in this way. There's normally a protease which then cuts the string into little pieces, little folded units that then go off and do their jobs. So if you can stop the protease that breaks the protein into individual parts, you can stop the virus from working. That's why proteases are normally the targets of drug development, and this is the structure behind the first development of COVID drugs. You can see the structure on the right; there's an overlay of the SARS protease and the COVID one, and you can see in the bottom panel that there's very little difference between the two of them. That's really showing the active site of the enzyme and how it functions. So if we know all of this, we can make much better drugs and we can also understand how it works. Previously we used to do this by X-ray crystallography: you'd have to make the protein, crystallize the protein, take it to a synchrotron source or an X-ray machine, get a diffraction pattern, solve it by Fourier transform, and build the model. 
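The "solve it by Fourier transform" step is worth writing down, since the speaker returns to it later when discussing phases. In standard crystallographic notation, the electron density is a Fourier sum over the measured reflections:

```latex
\rho(x,y,z) \;=\; \frac{1}{V}\sum_{hkl} \lvert F_{hkl}\rvert \, e^{\,i\varphi_{hkl}} \, e^{-2\pi i\,(hx+ky+lz)}
```

The diffraction experiment measures only intensities, which give the amplitudes $\lvert F_{hkl}\rvert$; the phases $\varphi_{hkl}$ are lost in the measurement. That is the "phase problem": without some independent source of phases, the density $\rho$ on the left cannot be computed.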
And the process costs probably in the order of about $100,000 on average per protein. It's very slow. I spent years solving one and never got there. The ribosome took 30 years, numerous post-docs who completely failed, and three different research groups, and got a Nobel Prize. It's a big problem. It's hard science. Recently we moved to cryo-EM. Cryo-EM hasn't made it any easier; it's just made the data sets bigger. So that's a different computing problem again, and it's significant: the data sets are terabytes in size, and that's currently the sort of headache that the universities have. It's a brute-force method. You purify your proteins, you image them with a microscope, you collect millions of the particles, you basically do 2D averaging of the individual particles, and then you can do 3D averaging to build an atomic structure. So it's really cutting edge, but it's really, really brute force on the compute. Because structure determination is so expensive and so difficult, computing has always been kind of a holy grail. Could we compute the structure? You've got the sequence; you should be able to compute it. And so there's a whole field of structural biologists just working on the computational problem: can we do this by calculation? And so for about 30 years we've run CASP, the Critical Assessment of protein Structure Prediction competition, where we hide a bunch of experimentally solved structures which no one's ever seen, and then we release the amino acid sequences, and the coders take the sequences and try to predict the structures. Then independent structural assessors come back, compare the calculated models to the real structures, and rank everyone's estimates. And what happened in 2020, at CASP14, was something really, really quite shocking. And that's this. You can't see it, because it's hidden by the axis of the graph, and that's the performance of AlphaFold2 versus every other competitor at that competition. 
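To make the CASP-style ranking concrete: assessors score how closely each predicted model matches the hidden experimental structure. The real assessment uses metrics like GDT_TS computed over optimally superposed structures; the sketch below (plain Python with NumPy, and made-up coordinates) assumes the two coordinate sets are already superposed, and computes RMSD plus a simple GDT-like score.

```python
import numpy as np

# Toy model-vs-experiment comparison. Real CASP scoring (GDT_TS) also
# handles superposition search; here we assume the structures are
# already aligned and just measure per-atom agreement.

def rmsd(a, b):
    """Root-mean-square deviation between two superposed N x 3 arrays."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def gdt_like(a, b, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Mean fraction of atoms within each distance cutoff (GDT_TS-like)."""
    d = np.linalg.norm(np.asarray(a, float) - np.asarray(b, float), axis=1)
    return float(np.mean([(d <= c).mean() for c in cutoffs]))

# Three hypothetical C-alpha positions (angstroms); one model atom is
# off by 0.3 A from the experimental position.
model = [[0, 0, 0], [1.5, 0, 0], [3.0, 0.3, 0]]
experiment = [[0, 0, 0], [1.5, 0, 0], [3.0, 0.0, 0]]

print(rmsd(model, experiment))      # ~0.173 A
print(gdt_like(model, experiment))  # 1.0: every atom inside every cutoff
```

"Within experimental confidence", as the speaker puts it, roughly means these scores for AlphaFold2 models became comparable to what you would get comparing two independent experimental determinations of the same protein.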
So this was an absolute lightning-bolt moment for us: we basically can now calculate structures to within experimental confidence, and that's really, really quite shocking. You can see the Nature headline in the bottom corner, "It will change everything", and that really is an understatement. And I think half the structural biologists in the world still haven't cottoned on to how serious this is. It's a really big change and it's really exciting. So we had to wait another seven, eight months before DeepMind, who developed AlphaFold, published the paper. They promised they would tell us how they did it, and we had to wait a very long time. They put the code nicely on GitHub and they gave us a beautiful paper, which I can almost read, and then they gave us 62 pages of supplementary data, which I have tried to read and haven't done a very good job of. But I encourage you, if you want to understand the architecture behind these sorts of programs, that's a really good place to start. So how does it work? It's a bit of a game changer. It's really mining evolutionary data in, I think, quite a cool way. The first thing it does, when you put your protein sequence in, is a multiple sequence alignment. So it's really doing a lot of bioinformatics, and it makes a really, really large multiple sequence alignment. Google says you need to have at least 30 friends in your multiple sequence alignment before you can get a reliable structure. When I download the results they're often 10 gigabytes in size, so we're not talking a little multiple sequence alignment; we're talking a big multiple sequence alignment. The second thing it does is make a pairwise array of each amino acid versus every other amino acid in the chain: residue A versus residue B versus residue C, for every single residue. And it feeds that into two transformers. 
So the first one is called the Evoformer, and it is basically looking for co-evolution of amino acids across evolution. It looks at each amino acid, and if one changes and it sees that another residue also changes across evolution, it concludes that those amino acids must be close together in three-dimensional space. You can imagine, if you've got a folded protein and two amino acids that are next to each other: if this one gets big, then to accommodate the same structure this one has to get smaller. So it's using this co-evolution approach to map the distances, theoretically, through evolution, to feed into the structure model later on. And so that's what it does. It has lots and lots of attention, and if you don't know anything about attention, there's a cool paper about it which I'll talk about in a minute. So it's looking at that pairwise evolution, maps it into that pairwise array, and feeds it back in a circular fashion. It's reading the data, feeding it into the array, then looking at the data again and asking: is anything close by also contacting? So it's continually updating. 
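The co-evolution signal the Evoformer exploits can be illustrated with a deliberately tiny example. The real Evoformer uses learned attention over enormous alignments; this toy just computes mutual information between columns of a hand-made four-sequence alignment, which is the classical way co-varying (and therefore likely contacting) residue pairs were detected before deep learning.

```python
import math
from collections import Counter

# Toy MSA: one aligned sequence per homologue. Columns 0 and 2 change
# together (A<->S tracks K<->R), mimicking two residues in 3D contact;
# column 3 is invariant ('G') and so carries no pairing signal.
msa = [
    "ALKGE",
    "AVKGD",
    "SLRGE",
    "SVRGD",
]

def mutual_information(col_i, col_j):
    """Mutual information (bits) between two alignment columns."""
    n = len(msa)
    pi = Counter(s[col_i] for s in msa)
    pj = Counter(s[col_j] for s in msa)
    pij = Counter((s[col_i], s[col_j]) for s in msa)
    return sum((c / n) * math.log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

print(mutual_information(0, 2))  # columns co-vary perfectly: 1.0 bit
print(mutual_information(0, 3))  # invariant column: 0.0 bits
```

A real contact map would compute a score like this for every residue pair, giving exactly the kind of pairwise array the talk describes being fed into, and refined by, the transformers.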
And then the next step after that is the second transformer, which is the structure module. The structure module takes the distances that come from the first transformer and does a bunch of cool things. It does a sort of geometrical arrangement in three-dimensional space: it doesn't link the amino acids together, it puts them into a gaseous box of unbonded amino acids, and it tries to make all the amino acids link together to obey the rules of the pairwise evolution data that came from the first transformer. It builds a structure de novo from that, and it feeds that back into the first transformer to look for more pairwise interactions and to hone those. So it's the weights from the attention, from training, that are telling AlphaFold how to fold a protein, and we don't really quite understand the details of what it's done; that, I guess, is what's kind of exciting about the deep learning part. And then the most important thing, I guess, to the field of deep learning is this particular paper called "Attention Is All You Need". This is one of the most pivotal papers in deep learning, so if you're really into code, this is something you should read. It explains what attention is and why it's made language models, and really speaking protein, so important. So this really is a critical paper for machine learning. And I have a handout, which I've given to Gareth and which we can share, that has all the references, so you don't need to write them all down in a hurry. Okay, so how do you run it? It's a really big piece of code, and that's something that freaked me out when I first read the paper. I was like, we immediately need to have this, and then I read what was required and went, holy hell, I can't do this. So you need a big machine. You need GPU; it's not negotiable, you need GPU. And it needs three terabytes of disk space, because it uses the entire Big Fantastic Database (BFD) to do your multiple sequence alignments, and it's super intensive read-in, read-out, so 
it's always referring to that database, in and out and in and out and in and out. So you really need that on SSD, and you want it close to your GPU, so that's quite a challenge. It's not the sort of thing you can run on your laptop. Where can you run it now? Galaxy, as Gareth said, runs it here, and I've been told you can run it on Galaxy US and Galaxy Europe. You can run a free version that Google has released on Colab notebooks; if you Google "Colab notebook AlphaFold" you'll come to the site. However, it's CPU-, GPU- and time-limited, so we find that for large protein structures it tends to time out. You can install it locally, and that's how I started off doing it: I got a grant to install it as a Docker install. It's also pushed by a lot of computing consortia; there's a consortium called SBGrid which pushes a lot of structural biology software to make it easier for system admins to handle, and fortunately they push it. You can install it in the cloud, and it works quite well in the cloud, because the data you're uploading is quite small and the data you download is quite small; it's actually the compute that's the problem here, not the size of the dataset. And I'm recently moving across to Google Vertex, and the advantage of that is that we can strap modules onto the side, because there are lots of feed-in programs coming from deep learning that use similar sorts of compute but feed data in and need to take data out, so you want an expandable system, ideally. So here we go. This is from the week after the code was released, the day of the paper: I emailed the head of scientific computing saying, I'm wondering if we could run this code here at UNSW. We were all in lockdown, we had nothing to do, and he very kindly said, look, we can give you a small grant and some cash to, like, burn in AWS, and I'll give you a cloud engineer that knows how to do the install, because you don't know what you're talking about. And I was like, thank you very much, sir. And we set up our 
first instance of AlphaFold in AWS, which was super exciting. And then a day later Google published another paper saying that the human proteome had been freely released and solved, and I thought, oh no, maybe I've misused the university's resources, this is really embarrassing, maybe I'm wasting it. But the first release only came with the human proteome, and so if you weren't working on humans then you were really out of luck. So you can now download the structure of any human protein solved by AlphaFold from EBI. Now, the limitations of the EBI database: it's changed quite a lot, they've updated it many times since, and there are now millions and millions of structures, so they've basically put the whole of UniProt on EBI. However, it doesn't include proteins that are very, very short, and for very, very long proteins it's too much to calculate, so anything over about two and a half thousand residues is not there. For the human proteome only, they've broken the really big ones down into smaller subunits and run those independently, because they figure they're still a major target for research groups. So those are there, but if you're working on another organism you won't see them. There are proteins that contain a few non-natural amino acids, which we denote with the residue X, and if that's in your sequence it will not run and it will not be in the EBI database. So be mindful, if you're running your own, that you need to replace that with something like an alanine, so that you can run the code and get what you expect the protein to be. And then there are a few things about UniProt: if it's not in the one-sequence definition then it won't be there, and if it's been modified it won't be there. But most importantly, for a lot of my colleagues who work on viruses: if it's a viral protein, it's not there. And that's a problem, because there's a lot of viral research that's obviously quite relevant, and I'm sure you 
understand the reasons for that; I don't need to elaborate. So, some insight into how this affects my discipline. I went to my friend, who's a structural biologist, and said, hey, do you have any cool examples of structures that you've never submitted to the PDB, that you've never published? Because I want to test the AlphaFold machine and see if it's any good. At this stage the community didn't know how good AlphaFold was, because we'd never been able to test it, and most of us had seen lots of modeling programs before and they're all pretty rubbish. So we weren't very excited about it; we thought, it can't be that good. So my colleague provided me with a small protein that had never been submitted, I ran it through AlphaFold, I sent him the result, and I asked him how it went. And this is the response I got. Now, mind you, he's a professor in physics and he generally doesn't overstate things, so for him to come back with the statement "my mind is blown" is, I think, a really clear indication that we're onto something quite significant here. He goes on to some boring technical details, but the very last line, "I find this amazing", is exactly what's happened to the field: it really is amazing. And these are the results of the experiment we ran. The first one, on the left, is showing our full AlphaFold from our home instance and the Colab AlphaFold, which was the truncated database version running for free on Google. You can see that the two structures don't overlay and there's quite a lot of difference between them: the purple one has got these alpha helices, and you'll see that the orange one doesn't. For a structural biologist that's a fundamental difference; it's a big problem. It's telling you that the two folds are completely different; it's not right. And then you can see, on the right, that the experimental structure is in pink and the AlphaFold structure is in orange, and you can see that the backbone is 
overlaid really, really well; in fact it's spectacularly good. We had never seen a program that could calculate a structure to this accuracy, ever before. And it showed me quite quickly that even using the truncated version, where they trimmed down the database and didn't use all the bells and whistles, was not giving us the results we needed. So that was a good warning very, very early on, and it told me immediately that we were onto something that was important. So how good is it? Well, AlphaFold can predict some structures amazingly well. Or maybe it's predicting all of them well, and just not all of them are structured. We see really, really good, high-confidence predictions that turn out experimentally to be the same. We do see a bunch that really don't look great, and there are definitely cases where it's got it wrong: we have solved the structure and we know the prediction is not correct. And there's a whole bunch of things that turn out to look unstructured, and that's caused quite a lot of questions in the field as to whether those proteins really are unstructured or AlphaFold just hasn't got them right. So it's pretty good for most things. If you have a look at whole proteomes of organisms, in the table on the right, you can see in blue the experimental data, and in the orange and the lighter orange how much AlphaFold has contributed. And you can see that, for the majority of every single organism, AlphaFold is contributing more structural data to the community than the whole history of human endeavor at solving the structures of proteomes. So it's massive. Here's an example, the first experimental example we did, an in-house case at the university, in the first month we had the AlphaFold machine running. This is before we had the convention of blue meaning good and red meaning bad for AlphaFold; at this point it's just coloring the structure 
by confidence, inversely, in the B-factor field, so red is good in this case. So it folded this particularly important retrovirus protein. The virus is endemic in the Australian Indigenous population here, it causes cancer, it's really horrible, and we'd never been able to crystallise it or solve the structure. Straight away we had a structure for this particular protein. It told us that the protein folded into two domains and that there was probably a flexible bit in the middle, and so my colleague cut the protein in that location, made both components, and immediately crystallised them. And it had been in the literature that this protein was impossible to crystallise prior to this, so we were pretty excited. It diffracted to better than one angstrom; both parts were really well ordered, highly, highly ordered, with great diffraction data. Just going back to another caveat of my field: this is the equation for solving a protein crystal structure, and the density on the left, that's the solution. The experiment gives us everything in the middle, the green section; all the diffraction data provides that part of the equation. And there's one other part, the bit at the very end, which is the phases. Without both the phases and the intensity data we can't solve the structure. Crystallographers have done this either by stealing a structure that looks really similar and stealing the phases from it, which we can use as an approximation to then start improving the phases to pull out the structure, or we'd have to solve it by soaking in something like a heavy metal, getting an anomalous signal, calculating where the heavy metals were, and pulling out the phases from that to then drag out the structure of the protein. So that was how we used to do crystallography. We weren't able to do that with this particular protein. So we got the data, and we were really excited. We chose all the closest relatives, we jammed them into the programs, nothing solved. We soaked the crystals, tried it with heavy metals, 
nothing worked. And we were like, well, this is not helpful. So then we took the AlphaFold model, took the phases from the AlphaFold model, and immediately we got the structure. This told us, for structural biology, that AlphaFold is useful not just for predicting the domains of a protein; it also showed us we could solve unsolvable protein problems by stealing the phases from AlphaFold. So there's been a flurry of data in the PDB that looks experimental but has been phased from AlphaFold models: all sorts of proteins that had been in the cupboard for years, that we've worked on, came straight out and the structures have been solved. So this shows you how good AlphaFold is in the hands of a real lab, with real human beings who, at that point, didn't know very much about what they were doing with AlphaFold, and you can see that very quickly we were able to produce useful data. And you can see the difference between the AlphaFold structure and the experimental structure is just in these tiny loop domains, so it's very, very accurate. This, unfortunately, is not great if you need to design a drug and the drug binding location is where those loops are. But it's telling us that AlphaFold has brought us orders of magnitude closer to almost every problem that structural biology can solve, with the caveat that there are cases where you're going to want more detail. So then more came, and this is really, really important. They published a paper in October; it's still in preprint, and I don't think it's going to go anywhere, I don't think they care, but this paper is really, really exciting. They showed that you can look at co-evolution not just within the fold of one protein, but across two proteins that interact. So you're just mining: does residue A in protein A change with residue B in protein B? And if they co-change, you're mapping the interface of a complex. So for the first time there's a completely new tool to mine protein-protein 
interactions in silico, at the structural level. This is really, really profound. So now we use AlphaFold almost exclusively in this mode, to mine protein-protein interactions. Why is that important? Because proteins assemble into big machines that do jobs. Here are examples: hemoglobin is four subunits, they need to interact together, it doesn't work by itself. ATP synthase, which makes all the energy in everyone's cells, is made up of numerous protein components. One of the biggest problems in biology is mapping the interactions of how all these complexes assemble, so now, for the first time, we can do that in silico. And here's an example from one of my projects. I just chucked in three proteins we think are probably dynamins, and it immediately assembled them into a really complicated complex. This is the overlay of the top five models, and you can see that every single time it calculates that structure independently it calculates the same structure, which gives you lots of confidence that the structure is correct. And it's super easy to run. How do you run it? You simply provide the protein sequences that you wish to run, and there you go; it's really almost that simple. The only thing it can't tell you is the stoichiometry of the proteins. So if you have two proteins that bind together, you have to tell AlphaFold to look specifically at those two proteins. If you have one protein that needs to form a channel of six copies of itself, you need to give it the six copies for it to do that calculation for you; it can't search and find that. There are a couple of different types of AlphaFold run: there's the monomer run; there's a slightly tweaked one which can give you some alignment data for the monomer; and then there's the multimer version. The multimer version is by far the most important, and that's mostly because the EBI database has been updated with every single protein sequence on UniProt, so you almost never need to run a monomer protein now. Everyone just wants to 
It outputs a whole swathe of files. Most of them are the multiple sequence alignments, there's a whole lot of ranking data, and there's a bunch of pickle files, which make things a little uneasy for biologists to access; the confidence data is hidden in those. It outputs a bunch of models, and that's basically what the structural biologist, biologist or medical practitioner will take and look at. It does show differences across the models. Here's an example of four models, ranked best, second, third and fourth, and you can see there's one that's well structured on the left. This particular case is interesting, because the top-ranked model is actually not the right structure; the correct structure is the second one. That's something I flag to people: it's not always the best model that AlphaFold calculates. It's just doing the calculation, and we need to use biology, and really tie in the real data, to work out which is the right model. So it doesn't always get everything right. It comes with confidence values, and these are really important. There are two types: there's the pLDDT and there's the PAE data. These are output almost by default; they're on the EBI website, Galaxy outputs them, and it's quite simple to code your Docker output to produce them as well. The first one tells you how confident each individual residue is. The first graph at the top is also replicated in the colour coding of AlphaFold, so AlphaFold output now comes coloured blue, cyan, yellow and orange. If it's dark blue, it means it's at above-experimental confidence. If it's a little more hazy, between 90 and 70 per cent confidence, it's saying the backbone or the fold of the protein is correct but the side chains might be wrong, and that's in cyan. Anything from yellow down to orange is a bit sketchy and you should be very careful. And what we are seeing is that there are really large unstructured sections of proteins that don't seem to have any apparent fold.
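The colour bands described above follow the standard AlphaFold/EBI pLDDT convention, which can be sketched as a simple lookup (the exact phrasing of the labels is mine):

```python
# Sketch: mapping per-residue pLDDT scores to the usual AlphaFold confidence
# bands and colours. Thresholds follow the standard AlphaFold/EBI convention
# (>90 very high, 70-90 confident, 50-70 low, <50 very low / disordered).

def plddt_band(score):
    if score > 90:
        return "very high (dark blue): above experimental confidence"
    if score > 70:
        return "confident (cyan): backbone likely right, side chains may be wrong"
    if score > 50:
        return "low (yellow): treat with caution"
    return "very low (orange): likely disordered"

for s in (95.2, 78.0, 61.5, 30.1):
    print(f"pLDDT {s:5.1f} -> {plddt_band(s)}")
```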
Most of the time these are explainable, but there are also a lot of really interesting cases as to why this is happening, and it explains why we couldn't crystallize a lot of proteins for a long time. So it's a really good predictor of disorder. There's a whole field of protein study called intrinsic protein disorder, and that field is really frustrated, because AlphaFold predicts their protein family better than their own best programs did. That's been a bit of an eye-opener. The second measure is the predicted aligned error, and it tells you the relationship of one part of the structure to another part of the structure. It's basically aligning one residue against all the other residues in the models and asking: does that residue land in the same three-dimensional space every time? If it is fixed, the two are related to each other. It describes domains, and it actually ends up showing us the dynamics of the protein. I think this paper here shows quite clearly that you can see full domains, and you can see flexible components of the protein, by looking at the pairwise aligned error data. That's probably not all that interesting to you guys, but it's very important for structural biologists to interpret that data. I think you learn best by seeing, so here's an example of one protein, and you can see it's a long string with little bits that are folded along it. The first chart is showing little globular domains with little flexible linkers between them, and the second one shows a much more extended complex where lots of proteins interact with each other, and you can see there are many more green cross marks, showing that this protein has a very fixed orientation relative to that protein. So these are giving you insights into the strengths of protein-protein interactions and domain interactions. Here are a few more examples: two little proteins, where you can see the confidence of them is quite high.
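For a two-chain model, the "green cross marks" being described are the off-diagonal blocks of the PAE matrix: low inter-chain PAE means the two chains sit in a fixed orientation relative to each other. A minimal sketch of that check; the tiny 4-residue matrix is made up for illustration, and real runs would read the matrix from AlphaFold's PAE output instead:

```python
# Sketch: judging an interface from the predicted aligned error (PAE) matrix.
# For a model with chain A followed by chain B, the off-diagonal (inter-chain)
# blocks tell you whether the chains have a confident relative orientation.

import numpy as np

def mean_interchain_pae(pae, len_a):
    """pae: (N, N) matrix; first len_a residues are chain A, the rest chain B."""
    block = np.concatenate([pae[:len_a, len_a:].ravel(),
                            pae[len_a:, :len_a].ravel()])
    return float(block.mean())

# Hypothetical 4-residue PAE matrix: residues 0-1 are chain A, 2-3 chain B.
pae = np.array([[1.0, 2.0, 4.0, 5.0],
                [2.0, 1.0, 5.0, 4.0],
                [4.0, 5.0, 1.0, 2.0],
                [5.0, 4.0, 2.0, 1.0]])
score = mean_interchain_pae(pae, len_a=2)
print(f"mean inter-chain PAE: {score:.1f}")  # low values suggest a real, rigid interface
```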
And you can see that they're both interacting; the green cross marks tell you that they're interacting. So how is it being used? It's being used to look at clinical data. Now, AlphaFold can't mine a missense mutation, and that's because of how AlphaFold folds a protein: it collects pairs of residues that are co-evolutionarily conserved, and it keeps doing that across the whole protein, and it wants to collect hundreds of those, not just one. If you put a point mutation into a protein, which can break the protein, you're just removing one of the little staples, and it's very insensitive to the removal of a single staple. It'll still fold the protein, because it needs to make the other 462 pairwise alignments and it's just missing one of them. So when you put in a residue change, it doesn't really see much of a change, and that's a big problem with it. But what it can do is provide us with structures for proteins we've never had any structural information for, and we can mine that. I work with a clinician who sees these terrible cases, kids with really nasty muscular dystrophies. They sequence the kids, they find these weird proteins that have errors, and we don't know what they do. We can now map those errors onto the AlphaFold structure and work out whether each is likely to be detrimental to the fold of the protein or the function of the protein. And we've actually got diagnoses, and we're starting down the path of developing therapeutics. So already, within a year and a half, we've got more insight, treatment and understanding of disease than we've ever had before. It's totally changing lots of fields. The panel on the left was a predicted structure of a viral protein, how we thought it looked, and the whole field had been working on this little individual domain of the protein for a long time, and this is a big virus, like hepatitis, that people really think they
know. When you AlphaFold it, you can see that that protein domain actually folds out across a much larger part of the protein. So that little domain the whole field has been focusing on is actually a complete in vitro fabrication, and the protein itself is spread across a very much larger part of the structure. The whole field had it wrong for a long time, and now they're going off in a completely different direction, and we wouldn't have known this unless we'd solved the structure of the whole thing, or if we didn't have AlphaFold. Then there's probing protein-protein interactions. We have friends in the malaria field who pulled out some proteins saying, we think these interact. I kind of assumed it was going to be the front end of the protein. I AlphaFolded them, and AlphaFold says they do not interact in any way, shape or form. I then did it with the other end of the protein, and all of a sudden AlphaFold is really confident that these parts of the protein interact at the other end. So that fast-tracks them in the lab: they know immediately where to start looking for the interaction, and we can start looking at how this is affecting cells. We can use it to reconcile a whole bunch of biological data. There are these medics who've been working on a lung problem. They had these two proteins they'd known about for a long time; they didn't know what they looked like; they knew there were important glycosylation sites; they didn't understand why. I AlphaFolded them together, and the two glycosylation sites, the little pink dots, are right at the dimerization interface. That explains why the glycosylation was important for the protein-protein interaction. So immediately all sorts of biological data could be reconciled that we could never understand before, because we didn't understand the protein-protein interaction and we didn't understand the structure. And this all happened within a few hours: the next day they had the answer, and their lab has gone off in a
totally different direction. So what cases doesn't AlphaFold support? It doesn't handle single-chain predictions where the protein may exist in different conformations; we don't have control over the actual output of AlphaFold. A protein may change shape when it binds into a complex, and AlphaFold may calculate the shape of the complex-bound form or the monomeric form. We know that moving proteins, like machinery, have got to have lots of moving parts, and we have no control over what AlphaFold puts out: it might put out five different moving models, or it may put them all out in one conformation, and you don't know which conformation that is. For intrinsically disordered structures, we just know that it has low confidence; we don't really understand whether that's relevant or an aberration, but we do know that it's the best method we have to predict disordered proteins, better than any other method. This is the conversation I had before about it not being able to predict the effect of mutations: there's a bunch of papers showing you can put point mutations in and AlphaFold folds the protein perfectly well, and these are very well-characterised medical examples. It doesn't predict any non-protein parts. This is a big caveat, because people go, oh, I want to look at my protein bound to a drug. Well, it doesn't do drugs; it doesn't speak drug. I want to look at my protein bound to RNA: it doesn't speak RNA. I want to bind zinc at the fingers of my zinc finger: it doesn't bind zinc. So it doesn't tell you that information. But one thing we have discovered is that it solves structures so accurately that it leaves the hole, and positions the residues that would coordinate a zinc atom in a zinc finger; it leaves a hole that perfectly fits it. It shows you exactly the binding face of the RNA. So it's telling you things just by what's missing at those sites. You can take two proteins whose interaction would involve a post-translational modification at the
interface, and if you mine that interface and cut through it, you can see there's a hole inside. It's very good at that, and there's a whole bunch of papers showing it time and time again for different examples. It's not good for performing docking for drugs. This paper is quite interesting: it shows that AlphaFold finds the pockets perfectly well, it calculates the pockets very well, but when you perform traditional docking methods on a model, the result is poorer than if you perform the same docking methods on an experimentally determined structure. So there is some variation in the side chains that's still affecting it; this is still a working problem. And the last one, which is really important, is engineering. A whole bunch of engineers went, great, we can now fold synthetic sequences: we just make a synthetic sequence, see if it folds in AlphaFold, and if it does we can make it; we're off, biotech's flying. And they've shown now that lots of really confident AlphaFold structures are not able to be made in the lab. There's an important point here: AlphaFold doesn't speak protein folding; it speaks the stable protein fold at the end. The folding process is going from that string, through a difficult pathway, to produce the final structure, and it's kind of like squeezing past the door to get into the room: the low-energy state might be quite hard to attain. If you build a particular synthetic sequence, it may not be able to reach the low-energy stable structure, because it has a high-energy impediment along the folding path. That's a big problem that's being addressed at the moment in deep learning. So yes, it totally doesn't understand folding; that's a totally separate problem. When people say AlphaFold has solved folding, they are wrong: AlphaFold has solved structure. The folding problem is a separate problem, and it's a nuance that you need to understand. Okay, so: advances on the back of AlphaFold2. There's a whole bunch of code
that's just using AlphaFold as it is, but we're now using it to totally re-annotate primary sequences. We can look at the AlphaFold structures and correctly classify protein names in databases, so there's already a huge effort in UniProt to rename all their hypothetically named proteins, to reassign them to the correct group they belong to, which had been got wrong. The EBI database now contains some 240 million structures, so you probably never need to run a monomer anymore; that's a huge service. Originally it was very small, now it's very large, so we don't really need that mode very often. Although there is one thing I can say: the EBI just gives you a single protein structure, it doesn't give you all five models, and there is information to be gleaned from the different models that AlphaFold produces. So if you need to really look at a particular protein, you might need to run it yourself. There's the Foldseek lookup. This one is super cool and I use it a lot. You AlphaFold a weird protein, you come up with a super weird structure, and you're like, wow, that's cool, what the hell is that? You can feed that into a reverse lookup engine and it will search the entire AlphaFold database and say, hey, I've seen that fold before, and it's here, here, here and here. Sometimes those proteins have assigned jobs or functions, and you can then call that back to your organism and go, hey, I think this protein might be involved in this process, and it's very, very illuminating. You can now copy over existing cofactors from PDB files: that problem I mentioned, where you had a zinc atom and you wanted to get it into the AlphaFold model, you can now just use a program that reads and copies the ligands directly into your AlphaFold model. So it's very good for researchers to get a good working model for their own organism. And now, this is the exciting part, which I think you guys need to pay attention to, because this is the change in the field that's happening right now.
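The reverse lookup being described matches Foldseek's `easy-search` workflow: a predicted structure searched against a structure database. A hedged sketch of driving it from Python; the file names and the database label are hypothetical placeholders, and the search only actually runs if the `foldseek` binary is installed:

```python
# Sketch: a Foldseek "reverse lookup" - searching a predicted structure
# against a structure database to find similar folds. Command layout follows
# Foldseek's easy-search interface; paths here are made-up placeholders.

import shutil
import subprocess

def foldseek_command(query_pdb, target_db, out_file, tmp_dir="tmp"):
    return ["foldseek", "easy-search", query_pdb, target_db, out_file, tmp_dir]

cmd = foldseek_command("weird_protein.pdb", "afdb", "hits.m8")
print(" ".join(cmd))

if shutil.which("foldseek"):  # only attempt the search when the tool exists
    subprocess.run(cmd, check=True)
```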
There are a lot of deep learning applications coming out on the back of this, to answer all those limitations that AlphaFold has. I need to warn you that a whole bunch of these papers are preprints, so we can't necessarily trust them, but they're coming out daily. I want you to look at the impact factor of the journals, the quality of the journals that are publishing this stuff: it's moving really fast, and it's considered high impact in the fields it's affecting. And look at the dates; this is really fast. I'm going to move through this quickly, because there's a lot; all of the references are in a PDF that Gareth can give you afterwards. Okay. Mutational analysis: yep, we've got solutions to that, now using graph attention networks. There are going to be a bunch of these, so I'm just going to move through them. Antibodies: yep, we can do that too. Antibodies are a particularly unique case, because the base of the antibody structure is always the same, but the whole point of antibodies is that the very top section is hypervariable, because you need to be able to identify thousands of different foreign things, and so you can't mine that by evolution. But if we train deep learning networks deeply enough, we can now calculate structures without doing the multiple sequence alignment, so the co-evolution is not needed anymore, and that's a really significant difference, which I'll talk about in a minute. RNA structures: yes, we've got a whole bunch of deep learning algorithms mining the RNA databases as well as the protein ones, so we can start to map protein complexes, and we've got the same for RNAs. RNA is a big issue in medical treatment now, and we can calculate it. This particular paper is interesting because they only used 18 structures for RNA. The RNA field told me, you can't do this, it's never going to work for RNA, because AlphaFold needed to train on 100,000 protein structures and we don't have that many
for RNA, so it's never, ever going to work. Well, these guys trained on 18 RNA structures and were able to produce experimentally predictable structures, so it's coming for RNA as well. This is a really, really big change, and it's all about how you train your network, so training is becoming critical, and the code is very GPU-hungry. This is the whole problem: there are going to be researchers who want to use this stuff, and they're going to need resources to do it. This one's really interesting: it used 23 million coding sequences. What we're finding with these natural language models is that you just need a hell of a lot of data, and the more data you have, the more fluently you speak the language, and in the language is the structure. Which kind of makes sense, because we've been taking proteins from one organism and expressing them in another for years, and they fold up, so the language of the fold is inherently in the amino acid sequence; you just need to be fluent enough in the language. Engineering: okay, this one's really interesting. The goal now is to focus on this folding problem, because we know it's a big problem, and there are a whole bunch of different models now. A lot of the work is being done by the Baker lab, who are really big in the folding problem; that's what they do, they're engineers, and they've got a whole bunch of different machines for this. There's one using diffusion, there's another using a message passing neural network; there are a whole lot of different architectures achieving the same sort of thing, building synthetic sequences. But I think the most exciting one I want to talk to you about is hallucination. This particular method enables you to reach new sections of protein fold space that are possible but have never been explored by evolution. Evolution is cheap, biology is cheap: it steals the same solution again and again and again, and just perverts it,
twists it, bends it to readapt it. It isn't very good at building something brand new de novo, and so there are whole sections of foldable protein space, we're discovering, that have never been accessed by biology. Using hallucination, with a deep understanding of the protein language, you can access folds that would be possible but that biology has never experienced. So that's what they do: they learn the language of folding and they keep selecting from noise. The paper is quite an interesting one to read, so you should have a look at it. They show that you can invert the existing language networks, the ones trained to solve the folding problem, so you can make a completely different structure that's never been tested before. You start with noise and keep optimizing using Monte Carlo; you guys probably speak that better than I do. But the important part is that there are whole sections of fold space that have never been accessed before, and the structures they're producing from these networks can be solved experimentally and proven to be correct, so we can access structures we've never seen before. This is a very exciting time for biotech. There are lots and lots of language models here showing that you can follow evolution: just feeding in heaps and heaps of data shows us that evolution is entirely predictable, so you can follow whole-family evolution across protein folds and understand exactly what you would expect your protein to do. And now you can push that to an extreme, to design a biotech enzyme for the first time. We can now engineer classes of lysozymes, all sorts of enzymes; you can choose the function you want, using the sort of deep understanding these networks have given us insight into. Ligands, modifications, docking: yep, there are plenty of those too, it's all coming out. This is another Nature paper; this one shows you modifications; this one shows you lysine
acetylation, this one shows you ubiquitination, this one shows you succinylation. There is a deep learning piece of code for every single thing you can think of now, and this all happened around May 2022; it's moving really, really fast, which is very scary. Computational efficiency: this is one of the biggest problems. It's computationally heavy, so we need to be able to make it faster and cheaper. OpenFold is a PyTorch implementation of AlphaFold, completely retrained, that performs on par with it, but it has a whole bunch of little tricks; they've rewritten a bunch of code. There's this really nice paper from Google saying that self-attention doesn't need that much memory, so this one is much less memory-intensive than AlphaFold, and they also wrote some CUDA kernels to speed up the calculation. So we're working on making this cheaper and faster. Why is that important? Because we need to start mining complexes, and complexes are big and heavy, and we need as much compute power as possible free for mining complexes. Now we're also looking at accuracy with complexes. Very early on we used traditional approaches of just mixing the results of lots of different neural nets together, and combining that with structural biology and biological data, to show that we can get much better results. This concept of ensembling, we know it works much better, but there are now deeper deep-learning approaches telling us we probably won't even need to worry so much about that. This one is quite interesting; it's about understanding that we can predict protein evolution, and I like this line: we found the genetic rules learned by large language models were sufficient to predict the evolution of a specific protein. You just need to feed in all the information we have, and these new neural nets can work out where a protein has come from and where it's going. This is a very exciting outcome, and maybe it's a bit fuzzy as to why this becomes exciting.
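The "self-attention doesn't need that much memory" idea mentioned above can be illustrated with a small sketch: instead of materializing the full attention matrix, you stream over chunks of keys and values with a running max and normalizer, getting the same answer. This is a toy NumPy illustration of the general technique, not the actual OpenFold or Google implementation:

```python
# Sketch: memory-efficient (chunked) self-attention. The chunked version
# never holds the full N x N score matrix for all keys at once; it streams
# key/value chunks with a running softmax, yet matches naive attention.

import numpy as np

def naive_attention(q, k, v):
    s = (q @ k.T) / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def chunked_attention(q, k, v, chunk=3):
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full((q.shape[0], 1), -np.inf)          # running max of scores
    l = np.zeros((q.shape[0], 1))                  # running softmax normalizer
    acc = np.zeros((q.shape[0], v.shape[-1]))      # running weighted sum
    for start in range(0, k.shape[0], chunk):
        kc, vc = k[start:start + chunk], v[start:start + chunk]
        s = (q @ kc.T) * scale
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        alpha = np.exp(m - m_new)                  # rescale previous partials
        p = np.exp(s - m_new)
        l = l * alpha + p.sum(axis=-1, keepdims=True)
        acc = acc * alpha + p @ vc
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((7, 4)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), chunked_attention(q, k, v)))
```

The point is that peak memory scales with the chunk size rather than the full sequence length, which is exactly what matters when folding very large complexes.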
But this particular paper, which is from Facebook, showed that by providing 200 million sequences, rather than the 100,000 structures AlphaFold was trained on, you can produce this program called ESMFold, and you can now predict the structure of a protein by providing the protein sequence alone, as you do with AlphaFold, but without doing the first multiple sequence alignment step. It completely removes that, which is a third of the compute problem, and that's a really big improvement. It now speaks protein so well that it doesn't need to compare a protein with all the other proteins; it knows how to fold. That's telling you that the fold information is contained within the language of the protein sequence itself. It's quite a subtle change, but it tells us a lot: folding is native to the protein language, which makes sense, because there's nothing else telling a protein how to fold other than the sequence you provide it with. Then there's this huge metagenomic atlas: they've got 772 million predicted structures that they folded this way. It's not as accurate as AlphaFold, and I'm not sure exactly what that means, but why this becomes important is that we're freeing up GPU now for complexes: you can use a method like this to brute-force metagenome-scale pairwise complexes, randomly folding protein A with protein B. That's an order of magnitude that's out of the park; it's a big computational problem. Then there's one that was published last Thursday. Just look at this; this is a typical case of what we're seeing. This one came from China, it came out last Thursday, it trained on a cluster for 13 days, with 96 A100 40 GB GPUs, and consumed a trillion tokens. This is not baby compute; it's ginormous. And every piece of code I showed you is trained in a similar way, maybe not as big as this one, but it's coming for you. The only saving grace you have, computing people, is that the structural
biologists haven't cottoned on yet, but when they do, they're going to be knocking on your door, so I suspect you should set your price high. I just wanted to show you this one: they say they can do all these things; they trained it to do all sorts of crazy stuff. It can do solubility prediction, secondary structure, fluorescence, fitness, localization; it can learn by ingesting huge amounts of data, and you can characterize a whole protein in silico. And then they just go, oh, just for giggles, we'll also solve antibody structures as well, because we speak protein so well. So much of this stuff is in preprint; the field is just moving on, and I don't even know if some of it is ever going to come out. It's big. These are the references; I have them all in a PDF, you can just have it, I don't care; these are them, you've got them all. These are the real killer papers, I would say, that are changing the field. Pay attention to Attention Is All You Need: that paper is probably the paper of the century. I really think you should pay attention to it, because it's what made ChatGPT possible, it's what made AlphaFold possible, it's what's making the big difference here in these deep learning networks. And then these other papers, on the deeper deep learning and the hallucination, are really interesting concepts for how we can explore the deep learning space more fully. More references; I've got them all, so you'll get them all if you want them. And then I just wanted to say thank you for having me, I hope we can have an interesting conversation, and to thank my colleagues for some good work. Okay, we have a question online from Prash, who says: an excellent talk, Kate, thank you. Their question is: how does AlphaFold work for intrinsically disordered proteins, and what are the current implications? Okay, so it just makes it intrinsically disordered: it'll calculate a
structure that's orange all the way. The intrinsically unstructured protein field has a database of proteins that they say are intrinsically disordered by every method we know about, and when AlphaFold was first released, before it was released publicly, I think that database was fed through AlphaFold, and AlphaFold pulled three structures out of their database that had a fold. So the intrinsically disordered structure people went, huh, AlphaFold's no good, it's got it wrong. And then someone showed experimentally that those proteins they thought were unstructured were in fact structured. I don't know why it's better at this than any other method, but it will just show you an unfolded string, that kind of spray of amino acids that looks like a beautiful ballerina turn. That's not really what's going on; it's just telling you, I don't know what to do with these residues, so I'm putting them out of the way of the model. It's probably also related to why some proteins only obtain structure when they bind to their partners: if we need to form a machine, I will adopt my structure to grab hold of it. We've got bendable fingers; let's think of it that way. If a protein needs to change its shape dramatically to interact, then it's not surprising it may be unstructured on its own. But whether all the other disordered ones work that way, I don't know; that's a different question, and a big problem in biology, and I guess that's one of the things AlphaFold has opened up: the sheer number of disordered sections. You see lots of them on the EBI database, because the front half of a protein often has signalling sequences and flexible domains which are normally processed away. The EBI database entry is the raw, unprocessed protein, and for a lot of them we already know the front half of the protein gets trimmed off, so that makes sense why it's
unstructured, or it's floppy and we don't care about it; most structural biologists already know that and just ignore it. So we know in those cases what that's about, but I don't speak intrinsically unstructured protein, because I work on protein machines that we know fold, so I think that's a question probably better directed to the intrinsic disorder field. But it gets it right, and when it says it's unstructured, that's what we see. Does that answer? Thumbs up. Thank you very much, a really interesting talk. I want to ask: you did mention the multimer mode, the one where you can predict complexes, but does it only work on copies of the same protein, or can it predict two proteins that interact with each other? If I give it a whole proteome, can it find which proteins interact with each other? Yep, and that's how we use it. It's constrained by compute, so the bigger you get, the harder it is and the more computationally expensive it gets, and I'm sure Gareth will tell you about the bill he is footing for people who are folding big stuff, and how annoying that is. But we can mine protein-protein interactions this way. Someone comes to me and goes, I'm pretty certain that protein A interacts with protein B, can you AlphaFold them together? And yep, that's exactly what we do, and we see that, oh look, the front half of the protein doesn't but the back half does, and so immediately you can start to see the formation of complexes. In the tutorial I did yesterday, I showed you can assemble a massive complex of five proteins. I restricted the folds from the databases so that all the previously solved structures were removed, because someone solved the structure of the complete complex this year; I removed that from the AlphaFold runs, and we reassembled the entire complex from a different organism, completely de novo, and then compared it with the structure that was solved experimentally and published by cryo-EM this year in Nature.
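Mining a proteome for interactions this way comes down to enumerating candidate pairs, each pair becoming one AlphaFold-Multimer job. A minimal sketch with a made-up three-protein "proteome"; the job dictionary layout is illustrative, not any real submission format:

```python
# Sketch: generating the candidate list for a brute-force pairwise interaction
# screen - every unordered pair of proteins from a (hypothetical) proteome.
# Sequences are placeholders; n proteins yield n*(n-1)/2 pairwise jobs,
# which is why freeing up GPU time matters so much.

from itertools import combinations

proteome = {
    "protA": "MKTAYIAKQR",
    "protB": "MLSDEDFKAV",
    "protC": "MGSSHHHHHH",
}

jobs = []
for (name_a, seq_a), (name_b, seq_b) in combinations(proteome.items(), 2):
    jobs.append({"name": f"{name_a}_{name_b}",
                 "fasta": f">{name_a}\n{seq_a}\n>{name_b}\n{seq_b}\n"})

print(f"{len(jobs)} pairwise jobs from {len(proteome)} proteins")
```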
It showed it was the same structure. So yes, you can, and at the moment it's about trying to create as much free GPU compute time as possible, so that you can mine as many of these as possible, because these interactions are really hard to do experimentally in the lab: you have to make each protein, or you have to engineer the organism with all sorts of tricky little tricks, tags, labels; it's hard to image them, it's hard to understand what's going on. This is infinitely faster, and I would never work on a protein, or a protein complex, again without AlphaFolding it, and I absolutely mean that. Absolutely, it's shocking; it's coming for us. I have a question about when the last poor student will do the last yeast two-hybrid. I reckon they're almost done, man. Have we done the last one of those? Is that now officially obsolete? I don't think I would run one again, to be honest; I think we're approaching that. Yeast two-hybrid, for those who don't know the method: you basically clone two bits of proteins and test in vivo whether they interact in the yeast, by bringing two markers together. It's quite cumbersome, and it doesn't work for lots of reasons: there are lots of false negatives, lots of false positives, and it works for two proteins at a time. I just folded five in AlphaFold the other day for the tutorial, and it took eight hours. I spent four years of my PhD trying to do one component of that complex and failed; it was 20 years after that before they solved the complex, and we did it on Galaxy in eight hours and 45 minutes last Sunday night. How about generally floppy proteins, like leucine-rich repeats, or transmembrane proteins? Yeah, so transmembranes work sometimes; we see that you get full transmembrane domain folding. I folded a really cool one from the malaria field, a protein that they had a drug for. They didn't have any idea what it does; they were not protein people, they didn't know what to do with it. It's only a hundred amino
acids; how could that possibly be a drug target, how could that possibly be interesting? The first thing I did was look at the sequence. Hmm, okay: two transmembrane domains, it's very, very small, looks like a channel; let's fold it as a multimer. It forms a perfect channel. The channel is lined perfectly with charge, it's got a hydrophobic gate, and all of a sudden it fully explained all the biological data, and why and how the drug they had functions. Now we just need to prove, in vitro and in vivo, that it forms this particular pore. So it gives you insight into things we never had. For really large transmembrane proteins, we often just chop half the protein off at the transmembrane location, because the half of the protein in the other compartment will never be involved in the interaction. So there are tricks for trimming down the compute when running your multimers, by using the smallest part of the protein you can, in a way that makes sense: if you've got one half of the protein in one compartment, just look for protein-protein interactions that way. So yeah, you can, and it does sometimes show you the whole structure of the membrane section. Again, really, really cool. I have one question: do you think it might be problematic for future modelling and prediction efforts to have mixed experimental data augmented with predicted data? You mentioned that you can take the phases from a different protein to solve a structure; I don't know if that's recorded in the metadata, so that future modelling efforts aren't just reproducing the biases. Yeah, so the crystallography field dealt with this problem for a long time, and so we have very explicit databases with very high standards: you've got to show exactly how you solved things. But I'll tell you, when we do molecular replacement for phasing, what you typically do is you have
your protein of interest from organism X and you want to phase it, so you choose organism Y's protein. You take organism Y's protein and you remove all the side chains; you just keep the backbone of the protein. So you have what we call a poly-alanine model of the protein, it's just alanines, and we use that for phasing. And then, when you've solved it correctly, when you walk down the density of your protein, the phenylalanines pop up and the histidines pop up in exactly the right locations, so you know for sure that you've got the right thing. So I'm not concerned by this with AlphaFold, using AlphaFold in this particular way, mixing in experimental data; in fact, I would encourage you to mix in experimental data. I do have some concerns about the deeper learning methods that are taking AlphaFold models to improve the training of deeper networks: if AlphaFold is making mistakes and we're training those deeper networks on erroneous structures, that could be more problematic, and so that's my bigger concern. But at the moment it works really, really well, and you saw the first example: we did it experimentally; we didn't believe it; we were disbelievers. We were structural biologists going, no, crystallography rules, and we were shown very quickly and very rudely that, you know, AlphaFold got this. And so I would always argue that in any part of science, 30 per cent of it is wrong and we don't know which bits. Mix your data, all types from every source, to try to build the best model you can, whether it's AlphaFold, whether it's EM, whether it's a cell biology experiment, or whether it's some, you know, sick kid in hospital. Use all the data you have to build the best model, because that's how we get the best models, right? I mean, we loved Newtonian physics until Einstein came along, and then his model was better, right? So I don't think this is any different, but it gives us insight, deeper insight than what our little brains can handle, and the
millions and millions of sequences and millions of years of evolution can tell us a lot, and that's what is so exciting about those deep learning networks.

Thanks for a really interesting talk. I'm a computer scientist and software engineer, so I'm going to ask something from that perspective rather than the other side.

Remember, I'm not trained in that!

So is AlphaFold essentially a solved problem, or are they continuing to refine that aspect of it? Is there going to be an AlphaFold 3, for example? You know, what's happening with that?

Google's been tight-lipped about what they're doing. They've definitely formed a company called Isomorphic Labs to go off and use this sort of deep tech to move into drug discovery and docking, which makes sense. I believe the DeepMind team is still doing something, although they're fairly quiet. But you can see the activity in the space: people have taken the ideas from AlphaFold and have retrained their own, OpenFold; we're trying to make it cheaper; there are definitely code tweaks. And you can also see that AlphaFold sometimes gets it wrong, and by training on larger datasets we've actually understood that we can get a better understanding of the protein language, so I suspect there are going to be more improvements. At the moment AlphaFold does such a good job for the one task it was originally designed for, and for more, that, you know, we're not seeing a huge reason for people to leave the platform, and I would still recommend it as the best one to use. I'm very keen to try OpenFold, because they say it performs as well as AlphaFold, and nothing else performs as well: all the other deep learning ones predict for a lot less of the compute, and they've been able to brute-force in millions more structures, but they're considered less accurate than AlphaFold. And I'm a bit of a, you know, I only deal with a small number of things; I'd much prefer to work with the best model I can get my hands on, and if I can do a bigger compute I'd prefer to do that and
wait to get the best answer I can before I move on, because there's so much more downstream that hinges on the correctness of those structures that I don't want to take the risk. But I think that there's definitely space in this area to add to, and I strongly believe that mixing the biological data together is going to help more.

Okay, I know that there are lots and lots more questions for Kate, online and in the room, but unfortunately we are going to have to leave it there for this session. So once again, thank you, Kate, for sharing your experiences with AlphaFold.
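As an aside for the programmers in the audience, the poly-alanine search model Kate describes (strip a homologue's side chains down to the backbone, then use it for molecular replacement phasing) can be sketched in a few lines of Python. This is an illustrative sketch only, not a crystallography tool; real pipelines use dedicated programs such as CHAINSAW (CCP4) or Sculptor (Phenix), and the embedded PDB fragment here is a made-up example.

```python
# Sketch: build a poly-alanine search model from PDB ATOM records.
# Keeps only backbone atoms (N, CA, C, O) plus CB, and renames every
# residue to ALA, mimicking the "just keep the backbone" trick used
# to prepare a molecular replacement search model.

KEEP = {"N", "CA", "C", "O", "CB"}  # CB is retained in poly-Ala models

def polyalanine(pdb_lines):
    out = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            out.append(line)            # pass non-ATOM records through
            continue
        atom_name = line[12:16].strip() # atom name field (PDB cols 13-16)
        if atom_name not in KEEP:
            continue                    # drop side-chain atoms beyond CB
        # overwrite the residue name field (PDB cols 18-20) with ALA
        out.append(line[:17] + "ALA" + line[20:])
    return out

# Hypothetical fragment: a single phenylalanine with two side-chain atoms.
pdb = [
    "ATOM      1  N   PHE A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  PHE A   1      11.639   6.071  -5.147  1.00  0.00           C",
    "ATOM      3  C   PHE A   1      12.759   7.095  -4.974  1.00  0.00           C",
    "ATOM      4  O   PHE A   1      13.560   7.321  -5.884  1.00  0.00           O",
    "ATOM      5  CB  PHE A   1      10.521   6.286  -4.120  1.00  0.00           C",
    "ATOM      6  CG  PHE A   1      10.985   6.190  -2.678  1.00  0.00           C",
    "ATOM      7  CD1 PHE A   1      11.828   7.168  -2.148  1.00  0.00           C",
]

model = polyalanine(pdb)
print(len(model))  # five atoms survive: N, CA, C, O, CB
```

After phasing with such a stripped model, the side-chain density (those phenylalanines and histidines "popping up") serves as the check that the solution is correct, exactly as described above.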