Well, good afternoon everyone. Thanks to those of you who are here in the room, but we know from the registration process that there are hundreds of you who are watching via Zoom, which we did open up to anyone who wanted to join. I can't think of a better way of having a first seminar of 2023, at least for me, than to be able to introduce Ewan Birney, a good friend and colleague of both NHGRI and really the whole NIH. Ewan is here meeting with various people at NIH for three days on a variety of topics, and he was eager and willing to also give a talk, and we were eager and willing to host such a talk. So, if we could just put up his first slide while I'm introducing him. There you go.

As you can read, Ewan's current position is Deputy Director General of the European Molecular Biology Laboratory, or EMBL. He's also the director of EMBL's EBI, the European Bioinformatics Institute, with Rolf Apweiler, and in addition to those major leadership responsibilities he is still an active researcher and runs his own research group. Historically, way back when, Ewan completed his PhD at the Wellcome Sanger Institute working with Richard Durbin, and already in 2000 he took his first leadership position, becoming head of nucleotide data at EMBL-EBI. I first got to know Ewan, I don't know if it was 25 or 26 or 27 years ago, somewhere around there, so probably about the midway point of the Human Genome Project, where Ewan came on the scene at just the right time and made seminal contributions to helping make sense of all the data being generated, especially about the human genome sequence. It became really apparent at that time that he was going to be a major leader in the field, not just for years to come but literally for decades to come. Now, the Genome Project ended, as you know, in 2003. What I can say is it was not surprising, therefore, that Ewan over the years has since picked up all sorts of leadership roles. In 2012 he took over as associate
director of EMBL-EBI. He then became director in 2015, and in 2020 he became the Deputy Director General of EMBL. In this role he assists the EMBL Director General in relation to engagement with all the EMBL member states and external representation, really criss-crossing Europe on a regular basis. There are some analogies, as we've been meeting today, between criss-crossing the NIH at the different institutes versus criss-crossing the different countries of Europe. Some things are similar, some things are different, but some of the similarities are interesting.

Research interest-wise, Ewan has always been very interested, since the end of the Human Genome Project, in functional genomics, DNA algorithms, and statistical methods to analyze genomic information. These skills were applied in particular, shortly after the Genome Project ended, on NHGRI's ENCODE project, the Encyclopedia of DNA Elements, where Ewan played a very important role not only contributing data but also in really leading the effort, and especially writing up some of the key papers at the early and middle stages of that important project. And as if he didn't already have enough to do, Ewan is now a non-executive director of Genomics England, and he's also a consultant and advisor to a number of companies.

I'm not Ewan's only fan. There are thousands of people who think highly of him, and as a result he has received many awards. As an example, he was elected an EMBO member in 2012, a Fellow of the Royal Society in 2014, and a Fellow of the Academy of Medical Sciences in 2015, all huge honors for an investigator in the UK. So we're delighted that he's here today. For those who are in the room, when we're done with the seminar, come to the microphones that are halfway up the stairs in the two aisles. In addition, if you are listening in by Zoom, feel free to... do we want the questions in the Q&A? In the Q&A. If you could, please put your questions in the Q&A on Zoom.
And we have somebody in the room who will be monitoring those and will be relaying them through the microphone at the end of the talk. So with that, Ewan's going to talk about "Genomics, imaging and AI: three technologies that are changing biological research through to clinical practice."

So, thank you very much, Eric. It's a real pleasure to be here at NIH. My science has always been very collaborative with colleagues here in America. I have fond memories of Rock Bottom, which I hope is still going, and the beer here. I have less fond memories of windowless ballrooms hired by NIH institutes, which I have sat in at hideous hours for my time clock. But, you know, in international science you have to make some sacrifices.

I could have talked about lots of different things. I've decided here to give a very strategic talk. Come on Friday to Cold Spring Harbor, where I'll give my research talk. And for those people who had [unclear], you heard my research talk already. But this is going to be quite strategic, and I'm going to try and get through a lot, so apologies for the speed.

I just want to introduce EMBL first. This is the organization I am Deputy Director General of. It is Europe's only intergovernmental laboratory for life science research. We're a kind of sister of CERN: CERN does high-energy physics, we do molecular biology. We have five different missions: to perform fundamental research, to deliver scientific services, to provide advanced training, which really means PhDs and postdocs, to do innovation and translation, and to integrate the life sciences. Not unambitious projects. We're headquartered in the lovely town of Heidelberg, Germany. It's on the Neckar, which is a tributary of the Rhine, a beautiful German Rhineland town, and that is our main laboratory. Our second biggest site is EMBL-EBI, which I'm director of, and that is completely dry. It's bioinformatics only there, and it kind of occupies a similar role to NCBI over here.
We also have sites which have a stronger service delivery side, around the synchrotrons of Grenoble in France and Hamburg in Germany. Then, of our two later sites, one is in Rome, which focuses on epigenetics and neurobiology, and our very latest site is in Barcelona. This is on the beach. It is beautiful, and they study tissue biology and do disease modeling and mesoscale imaging.

My boss, Edith Heard, came in as the new Director General, and she has really challenged us to step out: to keep one foot very much in the lab, very much anchored to the atomic, molecular and cellular understanding of life, but to step with the other foot outside of the lab and think about life in context. There's one aspect here about thinking about the diversity of life, which I'll come back to, and species in different places. But it is also about thinking particularly about humans: the way humans interact with the environment, the way the environment impacts humans, and the way humans are an environment. We're an environment for many species that live in our guts and alongside us. That's been a tremendous kind of renewal and reimagining of where the end points of molecular biology are. It's quite exciting.

But I want to focus here, strategically, on what you might think of as data science. There's a great paper by Kim Nasmyth, who was the head of the Institute of Molecular Pathology in Vienna, very close to Brno, about Mendel. He gave a review about Mendel, and he said that Mendel was the first data scientist in biology. He was the first scientist who collected data and then from that data derived a scientific insight, rather than doing it through observations of just individuals, as Darwin did. So, for Mendel:
This was, of course, his beloved peas in their different varieties. His data collection was about the phenotypes in the F1 and the F2 generations, and his model was Mendelian genetics, dominance and recessiveness. What I'm going to claim here is that in some sense we are still walking the same path. It's just that we're generating high-dimensional, multivariate, multimodal datasets, we're analyzing them in more sophisticated ways, and we're writing down the tables in a variety of different ways. But we are still doing this fundamental process of deriving models, or understanding, and making accurate predictions of the world.

This world has been changing. Many of you will know about these changes, but I want to step you through them for the two technologies, and then AI.

So this is a busy slide about the successive technology innovation around DNA sequencing. On the left-hand side there's Sanger dideoxy sequencing. Perhaps only Eric and myself have ever done this in this room, maybe some others have, the old-school way of doing it, where you generated a large number of molecules and you separated them on a gel with radioactivity.
It's great fun. This got automated in the 1990s by Lee Hood and the founding of the company ABI, where they used fluorescence and, rather than a slab gel, eventually capillary gels, and that was the bedrock for a lot of the Human Genome Project. I've now skipped a couple of technologies which were really good and really powerful. Lots of love for 454, a soft spot for Helicos. But really the dominant technology was Solexa, which was bought up by Illumina. This was reversible termination: pretty much the same process as Sanger dideoxy, except that you could reverse the stop and extend the strand by one more. The other trick that Solexa worked out, and which is obviously now in Illumina, was that rather than doing a one-dimensional, single-lane interrogation, you could do that interrogation on a 2D plane with imaging. That 2D plane has become very, very dense, so you get many, many spots and you can sequence many things at the same time. As many of you know, that's been a tremendous boon to all sorts of aspects of genomics.

But that innovation has absolutely not stopped, and I've put up these two new technologies, which both work off single molecules. The first one is nanopore sequencing, where a single strand of nucleic acid goes through a pore and you measure the ionic current, which is changed by the transit of the sequence through the pore. This was working conceptually in the 1990s. It's just that the strand of DNA went way too fast through the pore, so you could tell that it went through and that it was complex, but you had no hope of reading the sequence. Really one of the big breakthroughs was to slow down the progression of the DNA through the pore using what is, rather counterintuitively, called a motor. It's really a brake, a unidirectional brake on the nucleic acid going through, so it steps one nucleotide at a time at a slow enough rate that you can measure the change in current. So that's Oxford Nanopore.
The second one is equally whizzy in technology, which is PacBio. I love this phrase from PacBio: "the only moving part in our machine is the polymerase." It is a cute way of saying that they are measuring individual pieces of DNA being synthesized by the polymerase, using this optics trick: there's a quantum effect where, if you pass a very strong laser just underneath the surface, it will only excite the fluorophores which are tethered right above it. These are called zero-mode waveguides. Both of those technologies have been stretching nucleic acid technology. And just a note here: I have a honking great big conflict of interest. I'm a consultant and shareholder to Oxford Nanopore.

The animation has not quite worked out here, but I just wanted to go through the axes of improvement over these different years. The first axis is the number of simultaneous molecules you can measure in, say, a fixed window of time, and that's gone from one to, would you believe, 1 times 10 to the 9. Then the length of reads. Actually, we went backwards with Solexa, and I remember processing the very first 19 base pair Solexa data and working out whether it was useful or not, and it was indeed useful. Now it's one, or even more than one, so four times 10 to the 6 base pairs from a single molecule. If you had told me 10 or 20 years ago that we would get megabase reads out of a sequencing machine, I would have said that you were fantasizing.

One interesting thing about Oxford Nanopore is that it can see many different types of nucleic acid. I've been using the words "nucleic acid", not "DNA", for this reason. It can sense both DNA and RNA, and for DNA it can sense different types of modification just as well as it can sense the four bases. So we've got a diversity of molecules these technologies can sense. One of the big drivers is naturally the cost, or more accurately the price for base
pairs, and that has dropped precipitously over the years, faster than perhaps any other technology over the last 20 years. The error rate has gone up and down on this arrow. It does range from one in ten, which is pretty eye-watering at some level but doable, since it's still got signal, down to, in some scenarios, one in 10 to the 4. So one in 10,000 letters is incorrect in some scenarios. And then finally there's the all-in time to result. Just to stress, "all-in" here doesn't mean total number of base pairs divided by total amount of time. It really means: if I have a sample on my desk now, how quickly do I get a result out in the computer or on the device? That has gone from months, or a month or so, down to, would you believe it, two hours. In Norway they tested for a gene fusion during an operation for glioma using Oxford Nanopore technology.

A lot of people feel there's some kind of inherent trade-off in this space. That actually isn't there: there isn't some law of physics that doesn't allow you to be on the edge of all of these things. It's a sort of remarkable fact, and people often forget about it. Now, of course, when companies make different devices they often find that their technology is maxing out in one direction or another. But it's worth stressing there's nothing inherent in the way these things are traded off in these technologies.

So what have we done with this?
So of course one thing is we've created reference genomes. This is one of my favorite representations of the reference genome, the human genome, which is a bookshelf that you can see at the Wellcome museum on Euston Road, next door to the Wellcome Trust, where they have printed out, in not-that-small text, the entire reference genome as books. It's quite a big bookshelf. Every cell in our body has that bookshelf, asterisk, not red blood cells. So every nucleated cell in our body has that bookshelf inside it.

But of course there has been a tremendous benefit when we've done molecular biology: if we can make that molecular biology have a nucleic acid readout, in particular a DNA readout, then we can scale that technology massively. That's about measuring RNA abundance, which is RNA-seq, and measuring epigenetics, chromatin, transcription factor binding or histone modifications, which is ChIP-seq. And there are these increasingly complicated "da-da-da-da"-seq methods. The trick is this: you do something in solution or in the cell, you get the readout to be DNA, and suddenly you can scale your molecular biology into a new space.

And then of course it's been very exciting to see the way we've integrated this already into clinical practice. I'll come back to this in one or two areas. But it really works in rare disease.
It works really well in rare disease, which is not actually, in aggregate, that rare. It works well in cancer, and I think there's a huge amount of confidence that it will continue to really inform cancer clinical practice. And of course in COVID-19 we discovered that tracking these nasty infectious agents using their sequence, in the case of COVID the RNA sequence, has been tremendously informative about infectious biology. Again, you can see that just becoming a routine part of how we think about infectious biology.

So I'm going to switch gears and talk about a different technology, which is imaging. Imaging has had this beautiful explosion across different scales, different time domains, different fields of view, and then finally different fundamental measurements, which I'll come on to. It's not just scattering and reflection and these other things. There's a host of different technologies that underpin that. At the atomic scale, it's when we use electrons rather than photons as our probe. These are electron microscopes, and the incredible advances in detection devices have allowed us to get to near-atomic-scale resolution from direct imaging. It's just amazing, really amazing.

There is a big space of techniques around super-resolution microscopy.
This is where you trade the information resolution that we have in time to gain resolution in space. You can go below the diffraction limit of light because you know that the molecules you're looking at are often sparse or have certain particular properties. That way you can take multiple photons that you are pretty sure come from the same molecule, and you can locate where that molecule is better than the diffraction limit. Rather amazingly, we can now track little microtubule motors live, not quite in a cell but in an in vitro solution, using super-resolution microscopy, and watch these little motors drag little things around. It really is remarkable.

Then we have a series of other higher-resolution schemes. That includes a lot of very clever use of photons, in particular two-photon and three-photon technology. It uses things such as the way scattering works, which is optical coherence tomography. And it uses other ways of interrogating living matter, for example magnetic resonance imaging.

Now, one thing that really spans all of these is that in none of these cases do we use our eye anymore as the image collection device. You do not look at these things with your eyes anymore. Of course, you visualize the result on the screen, but the capture is happening via some detector, and then it's being processed inside a computer. This is most obvious in super-resolution technologies. In super-resolution technologies, the things you capture are just blinks. They're just tiny little blinks of photons. So there's absolutely no direct transformation from the video that you receive into the three-dimensional or two-dimensional space.
It must go through a computer to be visualized, basically.

Now, there are too many things to talk about here, and I just want to pull out some of the cases that come from EMBL. EMBL Heidelberg is particularly strong in this. The left-hand side here is from Julia Mahamid from EMBL Heidelberg. She did in situ structural biology, so EM tomography on whole cells. In this case they were bacterial cells, and she was able to get near-atomic resolution of the ribosomes. These weren't just collections of in vitro ribosomes. They were the ribosomes actually in this bacterial cell. So she could then also do things like introduce an antibiotic to a different sample, do the same procedure, and see exactly what the antibiotic changed structurally in these ribosomes in vivo. This is being redone for many other cellular components, so we're getting cellular or sub-cellular resolution of how these molecular machines work using this EM tomography. It really is quite remarkable.

The second one is again from colleagues, including Cornelius Gross in Monterotondo, using high-resolution X-ray imaging. What's quite exciting here is that there is quite a lot of space where we want more resolution than light microscopy, but we don't need the full-on resolution of EM, and we want a wider field. A great example is tracing axons in brains. Axons are small things. We want to know how the connections work, but we want to do a lot of them at the same time, and high-resolution X-ray imaging is a very good solution to this.

And the final one is this absolutely amazing thing, Brillouin. It's a French name. I've murdered it.
I'm sorry for the French colleagues. Brillouin microscopy. This is where you measure the stiffness of a material in vivo by imaging the material and passing a sound wave through it at the same time, and you look at the deflections of the wavelength of the light as the sound wave moves through the material. This is a non-invasive way of getting a property out of your living material, and it's quite an interesting property: how stiff is this material, and does it change? So here's a biophysical property that you can extract out of this particular device. We call it imaging, but it's stiffness. It's no longer a photon scattering component.

So what is the third technology? I'm sure you're all aware that those two technologies have been welded to computers for a very long time. No one aligns Illumina reads by hand. No one puts their eye to an electron microscope and works out what's going on. You no longer use us as the data collection device and the data interpretation device. You have to have computers, and you have to have systems to build and interpret it. But I think something different has happened over the last five years, and that has been the development of artificial intelligence technologies, deep learning technologies. I'm going to spend a bit more time on this and take you through it.

So what is this sort of artificial intelligence? This is the way I think about it. There are two different key things which are inputs to it. One is very good data engineering and very good collections of data. This is incredibly easy to put up in PowerPoint slides. It is really annoying to do. It's the grungy work of data science. Every tech industry person knows that this is some of the really hard yards, where you actually just get your data straight.
It sounds completely trivial. Lots of things fall over at this stage.

The second breakthrough was this development of what I like to call calculus engineering. What is deep learning at its core? It's gradient descent. This is where you take a function, you differentiate it, and you want to try and find a minimum, so you follow the gradient downhill in a high-dimensional space. Now, this technology has been around almost since Newton. What is new is doing it not merely in a multivariate way, but in a bonkers-big multivariate way with very, very complex functions. It's that bonkers-big multivariate gradient descent which is at the core of neural networks. To make this work you need good hardware, which has turned out to be the same hardware that you use for computer gaming. I mean, who knew? If you get a lot of these computer gaming chips, that is good hardware to do this. You can have specialized hardware for it, but actually the ray-tracing GPUs do really, really well.

Then I've got three different flavors of deep learning: one which I describe as labeling, another which I'm going to describe as modeling, and a third which I'm going to describe as alternative uses of this gradient descent.

Now, this does feel very black-boxy, and for me this was an annoying black box until only two or three years ago. Some of you are ahead of me, but I suspect most people are with me or behind me in terms of the frustration about this kind of weird "magic happens". I'd really recommend this short video called AlphaGo.
It's on YouTube, and it's about the case where Demis Hassabis's team from DeepMind created a deep learning program for playing Go, which was considered to be a game where humans could use their intuition and other aspects and beat any computer. In fact DeepMind created a deep learning architecture that could do this. It's a lovely story. There's a lovely aspect of Korean culture and Go about it. It's really beautiful in many ways. But if you watch it, wait for the moment where they realize that AlphaGo has a new way of thinking about Go. So humans made a program that came up with a new way of playing Go, and that set of strategies has become an important part of Go strategy since.

So I just want to step you through some of this. As I mentioned, you've got this data engineering aspect of pulling data together and getting it ready. There are kind of two flavors of this. One is when you're doing it because you've generated the data, but now you want to generate lots and lots of data, and it's incredibly easy to get lost and do silly things when you have these large datasets. It's a bit embarrassing. Nobody talks about the fact that "oh, I lost track of my data and don't know where it all is." It's a surprisingly common failure mode in data science, and a very important thing to do well. That goes up to the biggest level. I've got a picture here of the UK Biobank cohort. That is a big cohort in some sense.
That is a massive data organization project, longitudinal over time, keeping all these things in sync in many, many different places. Oops, sorry.

Another source of data is the data deposited by many different scientists over many different years. Here colleagues at NCBI and ourselves at EBI represent that collective memory for the scientific community, and allow these datasets, which have been generated by many different people over many different years, to be reused in these methods.

I won't talk too much about the calculus engineering. You can try to write it out like this. My colleague Moritz Gerstung says we can't even write these out using mathematical notation. The only sensible way to do this is to show someone the PyTorch code. Although it's maths, it's sort of impossible to digest as maths with notation. Really, the succinct way of doing it is to show it through code kits such as PyTorch or JAX.

So let me go through the first application, and this is where we take a high-dimensional dataset and we label it. What do I mean by that?
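A minimal sketch of the gradient descent idea described above, written in plain Python rather than PyTorch or JAX so that it runs without extra libraries. The toy function `loss`, its minimum at (3, -1), the step size, and the finite-difference gradient are all illustrative assumptions and not anything shown in the talk; real deep learning toolkits compute gradients by automatic differentiation over millions of parameters.

```python
def loss(params):
    """Toy function to minimise (an assumption for illustration).

    Its minimum is at x = 3, y = -1.
    """
    x, y = params
    return (x - 3.0) ** 2 + 2.0 * (y + 1.0) ** 2


def numerical_gradient(f, params, eps=1e-6):
    """Central-difference estimate of the gradient of f at params.

    A stand-in for the automatic differentiation a real toolkit provides.
    """
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def gradient_descent(f, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient to walk downhill."""
    params = list(start)
    for _ in range(steps):
        grad = numerical_gradient(f, params)
        params = [p - learning_rate * g for p, g in zip(params, grad)]
    return params


# Start far from the minimum and let the gradient lead the way down.
fitted = gradient_descent(loss, start=[0.0, 0.0])
```

With two parameters this is trivial; the "bonkers big" version is the same loop with a far larger parameter list and a far more complex function.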
So here's a picture of the human heart from an MRI scan, and the labeling here is to label the left ventricle and the right ventricle and the blood pool and all of these other things. This has become almost routine using deep learning. You set this up with some training data, there are some tricks of the trade about how you do this, and you can pretty consistently come up with a good model that will label future hearts. It has completely transformed the way we think about images, because I can now look at a big set of images and say to myself, well, that's fine, I'll make some kind of sensible image-derived phenotype from that big set.

There's something which is again labeling, but where you're letting the computer, the network, do a bit more. This is a very good example from Platynereis, looking at cells that grow in this little organism, where they had to label cells but also cluster them. The interesting thing is that my colleague Anna Kreshuk did not tell the computer what the right answer was. It was more like: please find a space for the clustering of these cells, but segment them at the same time.

And then, just to go over to the nucleic acid technology, this is Oxford Nanopore. The Oxford Nanopore signal is turned into base calls by a deep neural network. So it's using a deep neural network to go from the signal to the base calls, which in some sense is a labeling problem. I've got signal.
I want to label it with the base pairs. Sometimes there are just four: A, T, G and C. Sometimes it's five: A, T, G, C and methyl-C. Sometimes it's six: A, T, G, C, methyl-C and hydroxymethyl-C. And so you can start to build up this labeling.

A real step forwards has been this thing which is closer to AlphaGo, where we get the computer to make models of the world, and sometimes, just like AlphaGo, the computer makes a model that we ourselves don't truly understand. The breakthrough here, and it was jaw-dropping when I first saw it, was AlphaFold. This is the work again from DeepMind, Demis Hassabis and John Jumper, where they took multiple alignments, not very deep multiple alignments, just about 10 or 20 sequences, and they trained a neural network that they had designed themselves to predict the three-dimensional structure. That neural network did two things. Firstly, it very often got the right result, and it met John Moult's criteria for a method nearly equivalent to experimental results, through the very rigorous CASP competition, which is great. The second thing is that it has an internal view about whether it's got the right answer or not. AlphaFold itself has a calibrated view of whether it's right or wrong. There are times when it says "no, no, no, I'm wrong," or it says "no, no, no, I'm definitely right," and that calibration is accurate. So AlphaFold itself knows whether it's right or wrong.

Now, a lot of very, very clever people, and I've met some of them and they're definitely cleverer than me, have tried to solve this problem over many, many years. We've had about 30 to 40 years of biophysicists, bioinformaticians and computational people attacking this problem, and it's really interesting that it has fallen at this point.

We have similar things in the prediction of genomic tracks. This is using data often from ENCODE, or from other kinds of chromatin assays, where you try and work out what it is about this DNA sequence in this particular cell type that means that you get
this particular ChIP-seq call, this particular open chromatin. That again has been a very robust process. You can train models that do this and work well. You can also do this for splice sites. You can make models that predict where splice sites are in transcripts.

Let's take the splice site ones, because it's a case where we understand the molecular machine that does it, the spliceosome. What's interesting about the deep learning model that does very well is that it has some of the classic things that we know. The five-prime splice site and the three-prime splice site look like they're in that model. Somehow the model has learned about those things. It also has a lot of weird and fuzzy stuff. There was a point when I first looked at these things where I was very frustrated about these models. Why isn't it clean? And then you have to realize that molecular biology is allowed to use quite a lot of fuzzy stuff itself. So these models, which are accurate, generalizable predictors of molecular biology, are I think tremendously useful in lots of different areas.

Then we can do other things, and this is more of a data fusion thing, where we're bringing images and genomics together, trained to classify cancer tissues and to classify the kind of mutation type that's happening in the cancer. So there's a class of scenarios where you're doing this for a kind of data fusion task.

Just diving into AlphaFold a little bit more, it's been really interesting. I should mention, I don't know, do I talk about this? No. I should mention that AlphaFold themselves used the databases from the EBI, which are done in collaboration with colleagues here at RCSB at Rutgers and in other places: the worldwide PDB.
It's been running since the 1970s to do this. So this database was absolutely critical to AlphaFold's ability to learn, to train. We were very, very lucky that DeepMind partnered with us for another task, which was to distribute the results. DeepMind decided to make the method, the code and the data from AlphaFold all completely open, and they partnered with us to make all the data open. With Sameer, who did this, we did this for 29 key species including, unsurprisingly, human. Then we did it for a hundred thousand proteins, and then Sameer said, sod it, let's just do it for everything, every single known protein. It's worth saying that computers scale in a way that experiments struggle to scale. We can't always solve these problems in computers, but when we can, we get this remarkable piece of scaling.

It's been interesting to ask why this has been achievable, and there are some very technical aspects about the way the neural network has been put together. I do not understand all of it. One important thing is that this whole process is one differentiable function, so they can do end-to-end training. They don't have steps.
Although this is written out as a pipeline of steps, it is presented as one function to the training system, and that one function is differentiated and trained by gradient descent. The other aspect of this is that John Jumper stresses that the diversity of data in the PDB, the diversity of structures in the PDB, was absolutely critical. It wasn't simply about the number of structures. It was that the structures represented such a diversity of different scenarios where proteins fold that the neural network had to kind of learn the physics, presumably the evolution and the physics behind it.

And then I finally want to touch on a different thing, and just to stress: although we often use these gradient descent functions for neural networks, which are really these combinations of linear models, there's nothing in the maths that says that they must be used for neural networks. You're allowed to use them for many other different things. And this is from Moritz Gerstung, who, you can see, I love working with. He has repositioned the Cox partial likelihood on these maths engines, this calculus engine. And this means he can do bonkers big Cox proportional hazards, this classic 1970s epidemiological model that helps you understand risk factors that underlie a disease. But now we can do it with 10,000 different variables simultaneously. It's kind of trivial. If you're an epidemiologist, you might say, number one, are you sure you want to do that? Number two, where on earth do you get the data to train a 10,000 variable epidemiological model?
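[Editor's sketch] The idea described above, fitting a Cox partial likelihood by plain gradient descent rather than with a special-purpose solver, can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not the method used in the work mentioned in the talk: the function names (`neg_log_partial_likelihood`, `gradient`, `fit`) are hypothetical, the six-patient data set is made up, ties in event times are ignored, and the gradient is written out by hand here, whereas an automatic differentiation framework would derive it from the loss.

```python
import math

def linear_scores(beta, X):
    # One linear predictor per subject: x_i . beta
    return [sum(b * x for b, x in zip(beta, row)) for row in X]

def neg_log_partial_likelihood(beta, times, events, X):
    # Cox partial likelihood (no tied event times assumed):
    # for each observed event i, compare subject i against everyone
    # still at risk at time t_i (all j with t_j >= t_i).
    scores = linear_scores(beta, X)
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects only appear in risk sets
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        log_denom = math.log(sum(math.exp(scores[j]) for j in risk))
        total -= scores[i] - log_denom
    return total

def gradient(beta, times, events, X):
    # Hand-derived gradient of the loss above (an autodiff engine
    # would produce this automatically from the loss function).
    scores = linear_scores(beta, X)
    grad = [0.0] * len(beta)
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        weights = [math.exp(scores[j]) for j in risk]
        denom = sum(weights)
        for k in range(len(beta)):
            expected = sum(w * X[j][k] for w, j in zip(weights, risk)) / denom
            grad[k] -= X[i][k] - expected
    return grad

def fit(times, events, X, lr=0.1, steps=500):
    # Plain gradient descent starting from beta = 0.
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(beta, times, events, X)
        beta = [b - lr * g_k for b, g_k in zip(beta, g)]
    return beta

# Made-up toy data: one binary risk factor, one censored subject.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 0, 1]          # 1 = event observed, 0 = censored
X = [[1.0], [1.0], [0.0], [1.0], [0.0], [0.0]]

beta = fit(times, events, X)
print("fitted log hazard ratio:", beta[0])
```

The same loop works unchanged for many covariates; the naive risk-set computation here is quadratic in the number of subjects, so a version meant for a national registry would need sorted cumulative sums and mini-batching, which this sketch does not attempt.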
And the answer there is you get it from Søren Brunak and the wonderful, amazing Danish National Patient Registry, which, would you believe, has got pretty much complete data of the Danish population since 1979 as they've interacted with hospitals across Denmark. There's about five million Danes alive at any point in time. Some of the people in the database have now died, many of them are alive, and we have the data from 1979 to do this. So we have the scale of data. It's about a billion or two billion interactions of those people with the healthcare system, or more, I think. And so we can both computationally scale it using this calculus, and we have the data set to derive this. And that has been really interesting. There's a preprint on this about individualised cancer prediction. We are rolling this out for common disease. And there's lots of obvious things, and then there's some non-obvious things. And of course, it's always a bit of a head scratch about whether the non-obvious things are new or misleading, or what's going on. So we're at that stage of working out these things, but I recommend the preprint.

So what do we need to make this future continue? And that is a future where biology is done at scale with technology, using these neural networks and deep learning technologies. So one is very close to my heart: we need open, organised, fundamental biomolecular data. I wanted to stress AlphaFold would not have been feasible without the wwPDB, without the worldwide PDB. When the structural biology community came together on Long Island, in the Protein Data Bank, to start that off, they did not foresee this use case. They could not write down that this was going to happen. But they did know that what they were measuring was really important for understanding life, and they were not going to have their experiments go to waste by simply having them on their own. There weren't even hard disks at that point.
They were probably floppies or some iron drive or something. So just as our colleagues from 40, 30, 20 years ago did, we need to propagate our information correctly into the future. It needs to be as open as possible, as restricted as necessary. The only restriction which I think we really should totally honor, I mean, we should obviously honor all restrictions, but the one which is most understandable is patient and citizen privacy and laws. But we should try and make it as open as possible.

And I should stress that we've got a kind of new skill set, which I like to call data engineering. And that is about just helping flow, manipulate, structure, store or reflow to somewhere else these very big data sets. It's again often something that we skate over because it feels so trivial, and yet if we don't get it right, nothing works.

And of course, this is close to my heart because of what EMBL-EBI does, and just like NCBI, we go through this loop of having scientists that generate data and make discoveries. They deposit on publication. We archive and share the data with global collaborators and all scientists. We classify, enrich, combine and analyze, so there's this data engineering, which is more than just storing the data but sorting it out. And then that allows us to distribute not only the raw data or the deposited data, but also value-added data resources. And then scientists can go and build on that in the future. And this is a little map. It's not live, because I'm not that bold, but this is all the different people who use the EBI at a particular snapshot in time.

And I should say, people often wonder how this is done. And there's a feeling, I think, that somehow it's somebody else's problem. Like somehow the pixies of the data world will look after our data, the kind of house elves, if you're into Harry Potter, sort all this out. That's sort of true, but we must do it. This is what science is about.
It's propagating this information into the future for scientists to build on, as has happened all through science. And the Global Biodata Coalition, the GBC, is an international forum to help coordinate the funding of that. This is a very deliberate pitch to the program officers and people in the NIH to join this forum as a place to discuss this, and credit to Eric Green for helping set this up.

But we also have another amazing opportunity coming our way in the future, and that is because our science of molecular measurement and imaging has become really useful in clinical practice. And that means we have an entirely new industry, which is not research, which is measuring things on one organism, those organisms are humans, and they're doing it at remarkable scale. And this is a picture, which I'll come back to a bit, about Genomics England, which is the clinical genomic sequencing in the NHS. So all the sequences done in Genomics England are for clinical benefit of the patient. But at the same time they store the data to allow for future ongoing research, when patients give consent, and patients can withdraw. And the genomes, by the way, are kept in a kind of walled garden, a private cloud that researchers can get access to.

This is true not only for genomics, but also for imaging and of course for healthcare data, just as I showed you with the Danish data. And I like to think of hospitals as massive phenotyping centers for humans. So we can phenotype humans, we can image humans, we can measure the molecules that are present in humans, not least the DNA. And we, the researchers, do not have to do the heavy lifting. The heavy lifting is done by the healthcare service. And that is a tremendous opportunity for tapping into those data flows.

Now to do that, we need standards that span this research world through to this healthcare world. And because these are about citizens and patients, basically one has to be a lot more compliant to national laws and the
way people operate healthcare. And although, if we have mouse data, we can have these data commons like the EBI, where people deposit data for the PDB or for mice or for whatever, or for human cell lines, into a single place and then somebody can download and use them, or a situation where they're deposited and the analyst comes and talks to them on the cloud, when it's about clinical data the tendency, and I think this is just going to become the norm, is that you must go and visit this data in their separate locations. And definitely my group does this. For UK Biobank we use the UK Biobank RAP, which is their name for their cloud. For Genomics England we use the Genomics England TRE, trusted research environment, which is their cloud. For the Danish data we use the Danish Computerome, which is their cloud for accessing the data. It is quite irritating. My postdocs have about four different laptops sometimes, because some of the paranoia involves you must have a Danish laptop. But it is doable. It is a doable way of doing research.

And just as the internet needed standards to enable good interoperation across a kind of federated peer-to-peer world, we also need standards in this world, and this is that standard-setting body. I'm very lucky to chair this, which is the Global Alliance for Genomics and Health, for responsible sharing of genomic data for the benefit of human health. And we have tight collaborations with HL7, which does that for, in some sense, just the healthcare data; that is very much driven from the healthcare side, whereas GA4GH I think is much more evenly driven from the research and the healthcare side. And I won't bore you with the wonderful graphic about GA4GH things. But please do use GA4GH standards, and as CRAM, BAM, SAM and VCF are GA4GH standards, if you do genomics you are already using GA4GH standards.
So just be particularly proud at that moment.

But we also need to evolve our workforce, and I think this is a really interesting challenge for many, many different institutions. Many, many of our discoveries are going to happen inside of the computer, sort of by definition, because our measurement devices are so wedded to computers in genomics and in imaging and other things, but also because of this AI aspect. And it's important to realize we don't need to have just one type of skill set. There's actually quite a lot of different types of people we need.

So at one level there are, I'll do the middle one first, the people who come from maths, often. This is Virginie Uhlmann from the EBI. She's deputy head of research. She has a maths background. She's into AI, deep learning and machine learning, and theory. She likes to think about images in a mathematical way. It's all very clever stuff, and she really does the methods development that is appropriate. And then those methods have to be used. Now sometimes these people will use them and deploy them and make discoveries, but very often they're motivated by the methods, and that is great. We should celebrate that. We should not ding them for being methods developers. It's a real mistake to say to these people, you're not quite a scientist in biology because you're doing methods only, you know, you're a computational person. That is shrinking our space and shrinking our ecosystem. We need to be very generous about that.

We definitely need a lot of these people. This is Evangelia. I'm afraid I've taken lots of people from EMBL. She leads a data science group. So she's a faculty member who leads this group of data scientists at EMBL-EBI. She's obsessed with pathways and cellular switching and cellular decisions, proteomics and phosphorylation and that sort of stuff. And she of course has to have a good interface with people like Virginie. But she wants to make biological discoveries.
She wants to discover, understand bits of biology by using a computer. She's totally happy if she uses a chi-squared test or an AI machine learning test. The focus is: do I understand this bit of biology better?

But as I said, this gets underpinned by this class of people, and here I'm showing the associate director for data resources, Jo McEntyre. She leads basically the 400-strong set of predominantly data engineers at EBI to deliver the data resources behind this. So the ability to flow, shape, deliver and integrate data at scale. It's engineering, but it's not software engineering. It's data engineering. How do you put these big bricks of data together? It involves an awful lot of software, but the end result is a data set and databases and data sources.

And then finally, this is Jan Korbel, who's at EMBL Heidelberg. Probably many of you know many of these people, Jo and Jan and Evangelia and Virginie. Just to stress that we need these people at all levels: at the PhD level, at the staff level, at the staff scientist level, at the faculty level and in our leadership. And actually I think there is a gap here, because by definition our leadership tends to be older, and so as we go through this change it is difficult to bring in people who can understand: what does a good data engineer look like? What does a good methods developer look like? And so it's very important that we accelerate the track of these people up to leadership levels.

So I'm about to finish. Some of you might be skeptical and say, well, does it really matter? I find the skepticism, you know, "that was a really easy problem." There was no way the protein folding problem was an easy problem. Absolutely no way that was an easy problem, and it fell to a computational method using these kinds of techniques. But I just want to make it really explicit, and I'm going to give you two examples. Example one is an example from basic research.
This is work from Jan Kosinski at EMBL Hamburg, where he developed a very accurate and good model of the nuclear pore. Now this is one of these kind of crazy big Death Star-like complexes in our cells. It's got multiple components. It's got multiple proteins. And although we had EM maps of these things, we really didn't understand how the different proteins fitted in there, because we didn't have atomic-scale resolution of the components. And with AlphaFold, and with experimental work, and with EM models, and with chemical footprinting, hydrogen-deuterium exchange and single-molecule fluorescence resonance energy transfer, he was able to put together a credible model, a really credible model, of this big beast of a complex. And you can see that we're going to walk around a lot of the things in the cell using this kind of integrative structural biology, and then the work that I showed from Julia Mahamid about that in vivo captured biology of these complexes in action.

And then I just want to give the opposite end of this, which is these clinical operations, going back to Genomics England. And I just want to talk you through what's happening here. So for the genomics people, we should all be very, very proud about how we have changed clinical practice for rare disease. So I'm just going to make this concrete for the UK case. This can map to the US: times the numbers by five. So there's about six hundred thousand live births a year, and about 50,000 of those in the UK within the first six months are drawn to the concern of a clinician. The pediatrician is concerned about the baby and thinks that the baby has possibly a rare genetic disease. Very often that diagnosis starts 24 hours after birth, when the pediatrician first looks at the baby.
It's a kind of remarkable process. In Genomics England at the moment, those children can be scheduled to have their genome sequenced. And wherever possible the NHS tries to recruit both the mother, which is usually feasible, and the father to have their genome sequenced at the same time. So for each child we get three genomes, kind of. On average it's about 2.5. And so I'm just giving you the number of genomes per year that come through this process. By the way, at the moment we're only seeing about what we think is, not one percent, we're seeing 50 percent of the two percent, so one percent overall, coming through. So we think this number is going to double once we get full penetration. About 25 to 30 percent of these children are diagnosed. And here there's a component where we're rolling in bits of deep learning into the variant calling, but also the interpretation of the variants, for example splice sites or protein structure from deep learning techniques. And that's where the computational techniques will make a gain.

Now let me just tell you about the outcomes. So there's quite a good publication here in the New England Journal of Medicine. If a child is diagnosed, post-diagnosis, on average, they have 50 percent fewer visits to the hospital over the remainder of the follow-up time in that study. So it's very clear that the clinical practice for those children has improved. 25 percent of the time there's an immediate change to how those children are being treated. And then five percent of the time there are these transformative changes, really transformative changes to the child's life. And also this has a big impact on the families. Families who have a child with a suspected rare genetic disease, and then they get a diagnosis, have much better informed choices if they want to have future children.
They can make better choices about this. And the second thing is that the family and clinicians have closure, an end to this diagnostic odyssey. And that is actually a non-trivial thing, about removing the worry and concerns from this. And just to say that many of these people also consent for their data to be used for ongoing research. Not everybody, but many, many people; that's now more like 90 percent.

Now the interesting thing here is every time we roll in a new enhancement in the interpretation of the mutations, we get more diagnoses and more children end up on the right-hand side of this picture. And just to give you some of these profound changes in outcome. So the most amazing one is this RPE65 loss of function, which is a gene in the retina, in the retinal pigment epithelium. If you have a loss of function in this gene, the child nearly always goes blind. And there's a gene therapy that will fix this, and the children don't go blind. Now of course that therapy only works if you have an accurate diagnosis that the gene that's causing the blindness is this gene. This will not cure any blindness; it will only cure the blindness caused by this particular gene. Now that's about 30 children a year. The next gene coming through this process, which is RPGR, is about 500 children a year in the UK. And I hope that many more of these things will come through, especially with CRISPR.

The second example is a deficiency in this gene. This is a type of severe immune deficiency, but actually it wasn't severe enough for the clinicians to really recognize it at birth. And so this child was coming back and forth into hospital with infections. Scheduled for genome sequencing.
They said, oh my gosh, you've got a defective immune system, and so scheduled the child for a bone marrow transplant. And another example is maturity-onset diabetes of the young. This looks like diabetes, well, it is diabetes, but it's not one of the classic type one or type two diabetes, and you have a completely different treatment for that type of diabetes, with a completely different set of drugs.

So thank you very much for listening. I hope I've helped you think about AI, genomics and imaging, and I'm very, very happy to take questions.

Okay, well, thanks, Ewan. I know there's over 300 people listening by Zoom, so I suspect there might be questions coming in. There's a bunch of questions. But people should also come to a microphone. I'll kick the first one off. Ewan, some of the last data that you were talking about from Genomics England: the intervention to get genome sequencing was when the patient was symptomatic, they were of concern.

Yes, it was of concern.

And was there any modeling as to trade-offs in terms of getting it just outright?

Yes. You're foreshadowing Genomics England's project, precisely to do a really good study on this, which is to sequence 100,000 newborns, completely flat ascertainment, and just to understand to what extent could we augment the very successful metabolic screening by blood spots. If we did genomic sequencing, how many more very clear-cut cases would we get, where you would be confident of doing an intervention for an apparently not sick child? And that's in the planning. So not launched yet. 100,000. It's agreed, if you know what I mean. Yeah, it's not recruiting.

Go ahead. It's on.

After such a forward-looking talk,
I hate to ask an in-the-weeds kind of question. But when you were talking about AlphaFold and saying, well, let's just apply it to everything, I mean, there you have this interesting opportunity that the model kind of knows when it's wrong. What was the rate on that like? Have we surveyed, like with crystallography, most of the structures, or is there a big chunk that's missing? Is it kind of direct?

So there's some good papers on this by my colleague Christine Orengo and ex-EMBL-EBI colleague Pedro Beltrao. There's a bit of a kind of argument about whether new novel folds have come out and whether AlphaFold can see novel folds. It all hinges on the definition of the words "novel fold". It's a bit boring. Two things, though, which are really clear. It can definitely predict structures where you're like, well, maybe it's an alpha beta beta, whatever they come up with, but it's very different. So there's definitely that. And it's also really quite good at doing membrane-bound proteins. It doesn't seem to have a problem with membrane-bound structures. These things are nearly always alpha-helical bundles, so in some sense it's not a fold. But for the membrane people, it's like, this is amazing, I'm so excited. Membrane people never get a break. So they did, yeah. I've had quite a few "this has changed my life" conversations with people in that space.

What is interesting is this sort of self-aware, this calibrated error model.
And there's a big debate we had, Sameer, with the people at DeepMind in particular, about whether AlphaFold should show the low-confidence regions. Now if you look at an AlphaFold model, these are the bits that look like spaghetti, and they're red. And there was a big thing: no, let's trim it back to only the things that AlphaFold knows are good, because that looks kind of pretty and, you know, it knows it's good. It's ended up being, and I'm very glad that Sameer said, no, it's really important we show the whole thing and we talk about it. It looks like a lot of those are intrinsically disordered regions, and so they really do have no structure, at least in isolation. And I think that's stimulating a lot more thinking about these intrinsically disordered regions. So it's quite an interesting question about whether you even call that a misprediction by AlphaFold or not. In some sense, it's not a globular fold; AlphaFold's kind of doing the right thing to draw spaghetti, even though it looks somewhat counterintuitive. So there's quite an interesting story behind all of that.

Behind you, and then we'll go over here. Right here first.

Thank you very much for your talk. It was very interesting. My question is more a broader question about the interpretability of AI methods. I guess for things like AlphaFold this might be more straightforward, but for some of the other, like, diagnostic applications you described, how important is that?

Yeah, so for starters, it's not straightforward in AlphaFold. It's actually still a bit mysterious why AlphaFold works. Just like AlphaGo was making moves in Go where people were like, that's a crazy move, and then they're like, wait a second, that move was really important, how did it know that was such a good move? And so there's a similar thing with AlphaFold.
We don't really understand it. In terms of interpretability of a lot of these models, I have grown much more comfortable with it, where, say, let's take the splice site model. You build the splice site model, you train it, you convince yourself through a variety of methods that it really is generalizable, and then you study the model. So you start deleting bits of the model, or changing some of the inputs, or seeing how it behaves in different scenarios in the computer. That makes sense. And so you stop asking the model itself to be obviously interpretable, but you think of the model as almost like a synthetic data generator, as a thing to study, an artifact, a kind of proper artifact in itself. And I think that solves a lot of problems.

I will say, I mean, I know this is obvious, I haven't stressed it, but when we launch any type of machine learning, but in particular deep learning, on data that comes from observational humans, we get all the biases and all the complications that we have in society represented in the data. So all the stratification, or the race and ethnicity, or the social structuring, or the language stuff, all of that comes through, and these techniques are very good at picking up on stuff. So we have to be very, very careful about how we use these models, because we must remember that very often we will replicate the world that we live in now, not the world we want to live in. And so we've got to be super, super careful about that when we're talking about observational human stuff. Does that make sense?

Thank you.

Hi, great talk.
So my question sort of relates to the fact that you've talked about technological advances and also the way that the computational part comes in. So my question is, will we ever get to a point where you could get a blood sample from one individual, put it on one machine and get the sort of full biological spread of DNA, RNA, proteins, everything? Or will it be that you get one thing and the computer will be able to tell you all the biological parts with some degree of certainty?

So I'm worried that if I said yes to that I'll get sued in a Theranos-style way, and I definitely don't want to go down that route. I mean, I don't think that's going to be it. You know, it's going to be a messy middle. We'll get multiple measurements over our life course. I definitely think the future population will have genomes sequenced at birth; that will be one thing. But we'll probably check in, have a bit of metabolomics, a little bit of immunogenetics or whatever at some key moments in our check-ups. And then there'll be some kind of massive learning machine that's crunching the numbers, being very careful about not trying to replicate the world we live in now, but rather the future world that we want people to live in. So that's going to be an interesting problem. Oh my gosh, there's a lot of moving parts to that.
Let me remind you about the data engineers. I mean, you think that this problem is a statistical problem, but the problem before the statistical problem is just: is all the data straight, in the right place at the right time? I'm pulling my hair out. But still, it's a future we can imagine, I think, quite well, and it's really interesting seeing this starting to happen, baby steps starting to happen, be it in Denmark or the UK, being two places that I know.

Thanks, that was a fantastic talk. You talked a lot about how, over many years of hard work, the data was ingested by various organizations that are using it to produce unbelievable stuff like AlphaFold. And you've talked about some of the organizations that are producing the code that does this, and it's amazing that it's available, whether it's TensorFlow or PyTorch or AlphaFold. So I guess, what is your advice to all of us about making sure that things flow both ways, in terms of data going in and especially code coming out, so that it can be available to everybody, out of these private organizations?

I mean, I think, although there's a lot of people who can kind of complain about the commercial companies, realistically I think they've played a blinder, to use the British phrase, be it DeepMind or Facebook or all these other places. So I think we have to realize that molecular biology and the biological life sciences have always had this close relationship with pharmaceuticals. We kind of understand how that interface works. Some people interchange with that. We are going to have the same relationship, we do have that relationship, with the tech companies. And of course, it's different. Different career structures. They pay more.
That's one headache, a massive headache. But, you know, the way we have this kind of, we're in it together, but of course they're commercial, with pharma, we've got to have that same kind of positioning with tech. Be very confident about that open academic side of us. But I think we're really open to when these companies, be it pharma, be it tech, want to play with us, because we want to play with them.

Adam, you want to ask two or three questions from... okay?

Yeah, there's a lot, 11 or so that came online. So I'll give the first askers the priority. The first one is with regard to the Cox proportional hazards models, and whether those variables are being tracked across time to identify individual transitions between higher and lower outcome states. So that was just kind of a temporal question.

Yeah, I mean, the question is going to push my understanding of Cox proportional hazards to its limit. But I think the default way of thinking about this is just that time is one of your axes and you're looking for the shift in risk. And so that doesn't come out of the vanilla model. The vanilla model is just: what is your probability of having an event on this time axis, relative to each other? So the vanilla model doesn't have transitions in it. I think there are non-vanilla models, but we haven't deployed them.

Next question, from Laura Gorell: how do we enable the scientific community to use technologies when they might not have the technological background to write code workflows? Kind of to your growth.

Yeah. Yeah.
So we've got to grow skills. I think the first thing to say is, I've always said this, you'll be amazed how far just a little bit of Unix, a little bit of Python and a little bit of R will get you. And I think for a variety of reasons we make the computer scarier than the chemical hood or the tissue culture, when really it's quite a learnable skill. So the first thing, for the colleagues who are like, oh, I can't do the computers: just spend a little bit of time. You don't have to go very far. It's like the one little baby step outside of Excel, and then the world is open to you in a completely new way. So do that little conversion course. And then, yeah, we've got to train and allow people to move skills. As well as retraining people, I think one has to really focus, or at least have a similar amount of focus, on the future pipeline of people coming through.

And then I'll choose this one because it's pertinent, with regard to ChatGPT, which has been on the Twittersphere a lot.

Yeah, I love my large language models.

And the question, which I'll summarize, is that some of these large language models can make very confident garbage as an output. So the summary of this question is: do you have any concern that AI generation of data, such as AlphaFold models, has the potential to poison the source data for future studies by being correct-looking but untrue?

Yeah, so the first thing about it is, I don't know about you, but I'm unimpressed by ChatGPT. I am impressed about some of the things, you know, it can compose poems and limericks and all of that; the creative side it does well. But when I ask it a scientific question, boy, does it get it wrong sometimes.
And I think that says something important about the difference: where the output is "do something creative that a human can do," it's a very different task to "please correctly summarize the medical records of this individual." They're very, very different tasks, and we should not kid ourselves that they're the same task. And on what the questioner just mentioned, it is striking that an important part of AlphaFold is that it came with an error model, and that error model was well calibrated. I do think that's one of the criteria we should have for what is a good model: not only do you make a good prediction, but do you have a well-calibrated error model of your prediction? And then the data poisoning stuff, it's kind of obvious, which is: just keep track of why you've recorded things. It again goes to this data engineering. In the parlance of databases, this is evidence codes. So it's totally fine to fuse prediction and experiment side by side, but you must keep track of it. You must keep track of what is prediction and what is experiment, and why, and stuff like that. And that's a really good example of quite a subtle piece of data engineering. If you don't get that right from the start, five years later you're hosed, because you can't untangle it afterwards.
Carolyn, I'm going to give Carolyn the last question, if that's okay.
Yep. So one of the things that I spend a lot of time thinking about is when things are still in a developmental phase and when they are ready to be deployed at the size and scale that you're talking about with a lot of these examples you're giving. And I guess, do you think there's a way that machine learning can help inform that decision? Or do you think that that's going to lead to bad decisions?
I trust you over the machine. So I don't know that that's a well-formulated problem.
Yeah, this goes to an interesting thing: one of the reasons why DeepMind chose the protein folding problem. Not only was there a good data set, not only was it a well-defined problem, there's a very good, very well-run competition, CASP, by John Moult, funded by NSF. And it was the presence of that competition that was so...
So you have a score to compare it to.
Yeah, and so they knew when they were doing better, and stuff like that. And that goes to these problems where you think, "Ah, surely a computer can integrate all of this," and you're like, "Well, can I define the problem?" If I can't even define the problem, I can't get a scoring metric, and it's not going to happen. And I feel that the problem there is the score: as much as the problem's interesting, I don't think there's a very well-defined scoring function for...
Thanks.
Okay, I think we're going to cut off there. We have got to keep you on a very busy schedule. So on behalf of the dozens of people in the room and the hundreds of people who watched by Zoom, thank you, and what a great start for a seminar in 2023. Much appreciated.