Good morning, good afternoon, or good evening to everybody. It's my pleasure to welcome you to the NHGRI Machine Learning in Genomics workshop. My name is Mark Craven. I'm on the faculty at the University of Wisconsin, and I serve on the NHGRI Genomic Data Science Working Group and the NHGRI Council. It is my honor to be co-chair of this workshop along with Trey Ideker, although really it's the NHGRI staff who've done all the hard work in organizing the workshop. We have a truly stellar lineup of speakers and session moderators who are working at the forefront of machine learning in genomics, and in just a minute we'll get to the program. But before we do that, I'd like to introduce and turn the floor over to my co-chair, Trey Ideker.

Hi. As Mark said, I'm Trey Ideker, and I am, with Mark, the co-chair of this meeting. I am a professor in the Division of Genetics in the Department of Medicine at UC San Diego. You'll be hearing a lot more from both Mark and me as the meeting goes on, in the various sessions and in the session wrap-ups at the end of every day. So with that, and without further ado, it's my great honor to introduce our Institute director, the Director of the National Human Genome Research Institute, Dr. Eric Green, who will give some welcoming remarks.

Well, thank you, Mark and Trey, for your introductions and, actually, for your remarkably valuable leadership and help in putting this workshop together. I want to welcome everyone to day one of NHGRI's Machine Learning in Genomics workshop. For those of you who may be a little less familiar with NHGRI, the Institute is one of the 27 institutes and centers that make up the US National Institutes of Health; the NIH, of course, is the world's largest funder of biomedical research. As an Institute, we every once in a while put together and publish strategic visions that help guide the field of genomics, and our latest was published late last year: in 2020 we released our new strategic vision, which we think details the most compelling areas to pursue in human genomics in the coming decade. The strategic vision is organized into four major areas: guiding principles and values for human genomics; sustaining and improving a robust foundation for genomics; breaking down barriers that impede progress in genomics; and, finally, compelling genomics research projects in biomedicine. Needless to say, artificial intelligence and machine learning came up multiple times during our strategic planning process, and they are identified in the strategic vision as areas that present tremendous opportunities for genomics, by prioritizing novel statistical methods and considering aspects of machine learning that could be complementary to more traditional analyses of genomic data.
It seems very likely that machine learning will become integral to the next breakthroughs that we expect in genomics in the coming decade. In another area of focus, though, it's important to understand how machine learning is going to be used and how it fits into the context of the important ethical, legal, and social implications of human genomics. That will be another really important area to consider as we think about the productive use of genomics and the implementation of genomic medicine. If you want to read more about our strategic vision, it can be found on NHGRI's website, genome.gov.

But it isn't just NHGRI that does strategic planning in this area. At the NIH level, there's been increasing interest in artificial intelligence and machine learning, and back in March of 2019 the Artificial Intelligence Working Group, which was established by the NIH director, Francis Collins, as part of his Advisory Committee to the Director, finalized its report detailing the potential role of artificial intelligence and machine learning in the future of biomedical research. The report identified opportunities, challenges, and outcomes of implementing artificial intelligence and machine learning approaches across all the NIH institutes and centers, in all of the areas of biomedicine that NIH is interested in. The working group's final report concluded that the computational and biomedical communities are poised to jointly drive transformative progress in biomedical research, leading to new insights into how all living systems work, and in care delivery, leading to improvements in the health of all humans and all communities. A specific recommendation of the working group worth highlighting is recommendation number eight: the NIH should continue and expand support for engagement with the wider artificial intelligence and machine learning communities. This approach was piloted at NeurIPS in December of 2019 and should be expanded to other conferences and other opportunities for convening experts from different fields, and thus it makes a lot of sense relative to this workshop. And so at this week's workshop we hope to hear from different members of the artificial intelligence and machine learning fields, to get a more complete picture of the opportunities and challenges that will guide future NHGRI directions in our area of interest, that of course being genomics.

Related to the recommendations of the Artificial Intelligence Working Group of the NIH Director's Advisory Committee, the NIH recently issued important funding opportunity announcements for something known as Bridge to Artificial Intelligence, or Bridge2AI. The FOAs are listed here, and if you want to read more about this, I would send you specifically to the website listed on this slide, which I'm sure you could also find by googling. So for this workshop, what is our intention?
In light of all these developments, both at the NHGRI level and at the NIH level, we have decided to spearhead efforts to push the boundaries of machine learning in genomics. We understand the value of engaging the community for this, as we always do, which is the reason behind convening this workshop, and in fact the initial feedback from the community was a guiding factor in shaping how we put this workshop together. The major goals of the workshop are to stimulate discussion around the opportunities and obstacles underlying the application of machine learning methods to basic genomics and also genomic medicine; to define the key areas in genomics that would benefit from machine learning analyses; and to identify and shape NHGRI's unique role at the convergence of genomic and machine learning research. The four sessions of the workshop will reflect broad topics of the machine learning field, with ample time after presentations for questions and comments, and all the feedback from the question-and-answer sessions and the post-meeting survey will be recorded and available for review subsequently.

Now, workshops like this do not happen without a lot of work from a lot of people. So I want to say thanks, starting with the co-chairs, Mark and Trey, but also the other members of the Genomic Data Science Working Group of the National Advisory Council for Human Genome Research, for their support in planning and also moderating this workshop. For those unfamiliar with this working group, it was created as a subcommittee of NHGRI's advisory council to provide ongoing guidance related to data science as it relates to genomics and NHGRI programs. I would of course also like to thank the NHGRI organizing committee, as well as two very critical components of the Institute, our Information Technology Branch and our Communications and Public Liaison Branch, for their hard work on all of the IT, communication, and other logistical aspects of putting a workshop like this together. And finally, of course, I want to thank the speakers listed here for presenting their research throughout the workshop and making themselves available for questions and feedback from the audience. This could not come together without the generosity of your expertise and knowledge and your willingness to join us throughout the next two days.

And lastly, of course, I want to thank all of you for joining us today and tomorrow. The interest in this topic turned out to be, well, probably the best way to describe it would be huge: I can tell you that as of Friday there were over 3,400 people from 73 countries registered for this event. Overall, this workshop promises to be highly influential in guiding the field of machine learning in genomics, and I really do look forward to the next two days as we see this unfold. And so with that in mind, I'm going to hand the session and the screen over to Dr. Shannon McWeeney, who's going to start the keynote session. Thank you very much for joining us, and I look forward to spending time with you over the next two days.

Hi, I'm Shannon McWeeney. I am a member of the Genomic Data Science Working Group.
I'm also a professor and division head at Oregon Health & Science University in Portland, Oregon, and it is my honor to be the moderator today for the keynote session. Starting off we have Eric Topol. He is the director and founder of the Scripps Research Translational Institute and the Gary and Mary West Endowed Chair of Innovative Medicine at Scripps Research.

Well, thanks for having me in this machine learning and genomics workshop. I'm pleased to join and discuss how genomics is really moving forward with AI. First, to point out, of course, that as you know we've hit the 20-year mark for human genome sequencing, and this incredible milestone has led to extraordinary amounts of data. This is really interesting because we've gone past yottabytes, and data sets are going to start to exceed levels that we had never anticipated, with genomics being a part of that. It's even been suggested, only half in jest, that we may need to name a new unit beyond the yottabyte to describe the level of data that we have today.

So how are we going to deal with all this massive data, some of which comes not just from DNA sequences but from all the different layers of biology? That makes us turn to neural networks. These deep neural networks, although they've been compared with the brain, really are not all that similar: they basically use layers, really hidden layers, of artificial neurons to process input data, which can be of massive scale, and to come out, through a proportional number of layers, with outputs of interpretation. Genomics is just one of the many areas in life science going through an AI revolution, and that's of course the focus of the workshop. But to give a sense of the timeline, this is very early: it was only in 2015, not even six years ago, that we first started to see these convolutional networks for genomics, like DeepSEA and DeepBind. And already now, just in recent days, we've seen the emergence of a dedicated institute, the Schmidt Center at the Broad Institute, which has a bunch of partners as shown here. It is basically taking into account that this field has now developed to where the data is massive and we need AI tools to process it and extract all the valuable information and knowledge.

So first let me point out that most of AI in medicine so far has been with images.
That is, something like this chest X-ray or a scan, and that's a whole lot easier for AI interpretation than a DNA sequence. The reason is that the pixels in a scan have only limited interaction with their neighboring pixels, whereas in a genome there are 3D relationships that can be long range, and so the ability to interpret a genome accurately is a far more challenging task. But we have seen imaging being used for AI in genomics, and a great example is this one in children: DeepGestalt, a deep learning algorithm developed in Israel, which, trained on hundreds of thousands of images, can from a smartphone picture accurately make the diagnosis, at least a tentative diagnosis, of a chromosomal or genetic abnormality. It has over 90% accuracy, and the number of syndromes for which it can yield an accurate diagnosis keeps increasing. So we have seen some image interpretation used in genetics and genomics.

But what we're really talking about today is the actual biologic data, whether it be DNA or RNA or methylation or 3D structure or transcription factor binding sites, all these things, applied in specific instances like tumors and cancer, being able to distinguish the regulatory genome; functional genomics has been one of the main applications of deep learning in genomics to date. There are generally four different classes of neural networks. The most straightforward one is just classification with a fully connected, or feedforward, network; most of what we're talking about is convolutional, for DNA sequence; some, which use time series, are recurrent neural networks; and then there are graph convolutional networks. An example of these different neural networks being used in genomics is shown here: this one determines binding for a single transcription factor, but it can be multitasked for two transcription factors, and then it can be integrated with DNA sequence and chromatin accessibility. This is from a really good review article in Nature Reviews Genetics, and even though it's almost two years old, I certainly would recommend it to you.

That same review notes the idea of haplotypes, and extending that, there is work from just a couple of weeks ago whereby a haplotype-based variant caller using a Bayesian model, so-called Octopus, achieved much higher sensitivity and specificity than prior variant callers. This is one of the big issues when we have genomic data: being able to call what is a variant and what isn't. This is the output of Octopus compared to its predecessors, like DeepVariant or GATK and others, and Octopus consistently outperforms them. That's just to show you how we continue to eke out iterative improvements with these different deep learning algorithmic efforts. Another part of DNA sequencing is the ability to predict the effects of variants, and here you see AMBER and how it predicts better than the original ones, DeepSEA and DeepBind, on heritability enrichment. So this again is another recent paper exemplifying this staged improvement in the ability to predict variant effects.
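To make the convolutional idea concrete, here is a minimal sketch in the spirit of DeepBind/DeepSEA-style models: DNA is one-hot encoded, and a bank of convolutional filters acts as learnable motif scanners whose strongest match anywhere in the sequence feeds a fully connected head. This is an illustrative toy, not the published architectures; all layer sizes and names are assumptions.

```python
# A toy DeepBind/DeepSEA-style network for transcription factor binding.
import torch
import torch.nn as nn

class TFBindingCNN(nn.Module):
    """Conv filters act as learnable motif scanners over one-hot DNA."""
    def __init__(self, n_filters=128, motif_width=19):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, n_filters, kernel_size=motif_width, padding=motif_width // 2),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),   # strongest motif match anywhere in the sequence
            nn.Flatten(),
            nn.Linear(n_filters, 1),   # fully connected head -> binding logit
        )

    def forward(self, x):              # x: (batch, 4, seq_len), one-hot A/C/G/T
        return self.net(x)

def one_hot(seq):
    """Encode a DNA string as a (4, len) float tensor; Ns stay all-zero."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq.upper()):
        if base in idx:
            x[idx[base], i] = 1.0
    return x

model = TFBindingCNN()
seq = "ACGT" * 25                                        # a 100-bp toy sequence
prob = torch.sigmoid(model(one_hot(seq).unsqueeze(0)))   # predicted binding probability
```

The max-pooling step is what lets the network report a motif hit regardless of where it occurs in the sequence, which is the core trick these models share.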
There are also many other areas where we see continued jumps in the use of deep learning. Here is an example of differential gene expression, and this was seen across all the different tissues in the body. We have also seen it for predicting transcription factor binding, and this is AgentBind; a lot of these algorithms have "deep" as the first part of their name, but here we're starting to see slightly more creative names, like AgentBind. There is also splice prediction from the sequence through a deep learning algorithm, published a couple of years back in Cell, and the ability to distinguish rare variants for undiagnosed diseases, published last year in Genetics in Medicine.

Now, one of the areas where genomics is becoming more used in the medical space is in cancer: understanding and deconstructing the genomics of a person's cancer. What's striking here is more reliance on the liquid biopsy, that is, cell-free tumor DNA, plasma DNA that comes from the tumor, which now through deep learning algorithms can be analyzed to determine the primary source of the cancer. Previously it was a yes/no, "there's cancer"; now there's the ability to take this approach to understand its source, and that's a step in the right direction. We're increasingly seeing the liquid biopsy used in cancer, and this is a welcome sign of progress. There is also the ability to pick up mutations: this is an example in prostate cancer, with the standard method on one side and deep learning on the other, and as you see, the deep learning picks up far more variants by comparison. So our ability to extract the relevant information through deep learning in cancer has certainly been enhanced.

But also, unexpectedly, there is our potential ability to predict the evolution of the cancer. When you know the particular variants that are present in a tumor sample, you could actually predict when and what variants will appear over time. This is just the beginning of that work, and it's an exciting direction.
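As a toy illustration of the liquid-biopsy classification idea above, the sketch below trains a simple multiclass classifier to call a tissue of origin from hypothetical cell-free DNA methylation features. The deep learning systems described in the talk are far richer; every feature, label, and size here is a made-up assumption.

```python
# Toy tissue-of-origin classifier on hypothetical cfDNA methylation features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 300, 50                 # e.g. methylation beta values at 50 CpG sites
X = rng.random((n_samples, n_features))
y = rng.integers(0, 3, n_samples)               # 0=lung, 1=colon, 2=breast (made-up labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))           # ~chance on random data
print("per-tissue posterior:", clf.predict_proba(X_te[:1]))  # probabilities over sources
```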
Unfortunately, like so many other things, when this gets into news reports it gets somewhat exaggerated, and here is the UK coverage of that: "Robot war on cancer: AI predicts tumour growth." Well, that's of course a little bit of hyperbole. Now, on the cover of the New York Times Magazine at the end of March was the SARS-CoV-2 virus sequence, which has about 30,000 bases, bringing it into the mainstream. We haven't paid enough attention to pathogen sequencing; we've been thinking and talking so far about whole human genome sequencing. But what's interesting, again just like that prediction of where a cancer is headed in a particular patient, is the idea that we could predict a strain of influenza, or where SARS-CoV-2 is going, by learning the language of the virus. This was published earlier this year in Science, with influenza being the rich data set we have, and one data set that's developing right now, of course, is for COVID.

Now, another tool that we have, before getting into some examples of how these can be used, is transfer learning, or meta-learning. In a very interesting New Yorker piece, a deep neural network for recognizing what type of pastry is on a tray was ultimately used, via transfer learning, to understand cancer. We have also seen this meta-learning, that is, learning from learning, being applied in genomics, such as in single-cell sequencing, or going from the adult Cancer Genome Atlas to pediatric genomics. And this is just an example of transfer learning in single-cell RNA-seq, which produces big data sets that can be hard to interpret, putting them through a transfer learning approach. The idea, of course, is also that we could analyze samples from large numbers of individuals and integrate that into the immunome; here is work on 11 million T cells from 40 patients, controlling for things like batch effects and sample preps, and this is yet another example of what you can do with a deep multitasking neural network. And of course there's the ability to analyze gene regulation and understand epigenetic regulation, as was published just in March in Nature Computational Science.

Now, CRISPR obviously is the biggest breakthrough in life science of our era, and the idea that we could use AI to guide CRISPR has now been firmed up through many different studies. This was from the Microsoft group, using AI to predict off-target effects of CRISPR guide RNAs, and here is a series of articles that I've highlighted on AI-guided CRISPR editing, whether it's in T cells or deep learning with network-based gene features for guide RNAs for gene therapy. So not only has AI helped CRISPR by predicting off-target effects and the right guides, but also in coming up with the best approach for gene therapy. I just want to point out that sequencing is getting more complex: now we're seeing in situ sequencing, which is going to increasingly generate massive data sets, again why we need deep learning tools to interpret these data. And just for those who are intimidated by equations, this is an article about multi-omics, and this is a problem we have: integrating multiple layers of different omics, like the genome sequence, single-cell RNA, the epigenome, the proteome, and all these different layers. This is a complex task that has yet to be fully actualized.
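For a concrete picture of that multi-omics integration problem, here is a minimal "late fusion" sketch: each omics layer is scaled on its own and then the layers are concatenated into one feature matrix for a single model. This is only one of many integration strategies, and all data, shapes, and names here are synthetic assumptions.

```python
# Toy late-fusion multi-omics integration: scale each layer, concatenate, train.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 200
layers = {
    "genotype":    rng.integers(0, 3, (n, 100)).astype(float),  # 0/1/2 allele counts
    "expression":  rng.lognormal(size=(n, 100)),                # RNA-seq-like values
    "methylation": rng.random((n, 100)),                        # beta values in [0, 1]
}
# Scale each layer separately so no single omics type dominates by sheer magnitude.
fused = np.hstack([StandardScaler().fit_transform(m) for m in layers.values()])
y = rng.integers(0, 2, n)                                       # made-up phenotype label
model = RandomForestClassifier(random_state=0).fit(fused, y)
```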
Now, one type of AI, nearest-neighbor analysis, deserves mention, because this is the idea that we could have digital twins from a massive resource infrastructure. This is something that is really coming into play now because of Tempus, a company based in Chicago that has largely been dedicated to cancer; I'm an advisor for that company. It has generated a massive amount of data in cancer, now with 200,000 patients with tumor sequencing and multi-omic data, along with electronic health records, treatment outcomes, pathology, and relevant scans that were digitized, though not everyone has the full data set. The idea here is that we could do nearest-neighbor analysis and, at the time one person is diagnosed, find the individuals who match as closely as possible, to then be able to predict treatment and outcomes. So instead of clinical trials, which often include only a small percentage of patients who are relevant for the patient in question, we would have very precise potential matching. This has to be validated, and it has not yet been to date, but it's a rich resource, and cancer is just the beginning of where this can go.
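A minimal sketch of that nearest-neighbor "digital twin" idea: index historical patient profiles, find the k closest matches to a newly diagnosed patient, and summarize their outcomes. The features, sizes, and outcome coding below are hypothetical stand-ins for the kind of fused clinical-genomic profiles described above.

```python
# Toy nearest-neighbor matching of a new patient against historical profiles.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
profiles = rng.random((1000, 30))     # 1,000 historical patients, 30 fused features
outcomes = rng.integers(0, 2, 1000)   # 1 = responded to a given therapy (made up)

index = NearestNeighbors(n_neighbors=10).fit(profiles)
new_patient = rng.random((1, 30))     # the patient presenting at diagnosis
dist, idx = index.kneighbors(new_patient)
print("response rate among 10 nearest neighbors:", outcomes[idx[0]].mean())
```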
Now I want to go through two projects, to close, that are real-world projects exemplifying some of these issues. One, at Scripps Research, is our quest to take sparse data, namely the data found in arrays such as the Affymetrix Axiom, with four to five million variants, and to then be able to predict the 80 million genetic variants. This is a task that has not yet been accomplished, and it's a very challenging one that we've taken on. Raquel Dias in our group, a KL2 scholar in our CTSA program, along with Ali Torkamani, has been leading this effort, starting on 9p21, a region of the genome which is extensively enhancer-rich, so it's a very complex area, and currently moving on to chromosome 22, which has about a thousand very complex regions; but not yet, and you'll see why, to the level of the whole genome. The whole idea is to use unsupervised learning. So rather than the types of AI like a hidden Markov model or Bayesian methods, or even just regular supervised learning with deep neural networks, the idea is to use an autoencoder, which works through a bottleneck to get the data out, and also GANs, generative adversarial networks, which really originated with Ian Goodfellow, here shown distinguishing real from synthetic genomic data.

Well, we have taken the approach of using an autoencoder, but not just any autoencoder: what's called a denoising autoencoder, for this imputation project at Scripps. What the denoising accounts for is the ability to deal with missing data; that's what the X's show on the input, and everything is about the inputs if you're going to go through this autoencoder and impute the right variants at a much greater level of granularity or depth. So here is a comparison with what has been the accepted standard, which uses a hidden Markov model, so-called Minimac, versus our denoising autoencoder. What you can see, surprisingly, is that it didn't do very well: it underperformed, something like 79 versus 21 on the correlation. So we said, well, why is that happening? Why was our denoising autoencoder, which should have been this great deep neural network, not working as well as a more primitive form of AI? It turned out it was all related to the inputs, as you might have suspected. When we adjusted the inputs using 30,000 "virtual babies", synthetic data with far better admixture, that is, the ability to generate a much bigger data set with much more diversity, then this genomic data augmentation gave us the ability to match Minimac. And that is important, because an autoencoder like this can run at much faster speed, with much less computing time, and it can continue to be tweaked to get far better. So this gives us a double benefit: not just dealing with the imputation, but also then applying it to things like polygenic risk prediction and multitask deep learning, which is the end goal of this initiative. You can see here the advantage in runtime of the autoencoder that we use versus Minimac: while getting equivalent accuracy, look at how much the time to process this data is reduced. It's really an exciting advance, because the more genomic data we have, the more we can use these unsupervised learning tools of deep neural networks.

But to point out, this is a GPU (graphics processing unit) guzzler. We're using four models per GPU, running 18 GPUs, for this project on chromosome 22, and each model takes three to five days to train. To extrapolate that to a whole genome, we'd need 500,000 GPUs; that's why we haven't gotten there yet, and we're just getting chromosome 22 nailed down. The point being, this computing power is not needed by the end user once the validation is done, but to get the imputation work accomplished takes a lot of it. This is what the A100, the current state-of-the-art GPU from NVIDIA, looks like; these are the specs, and it costs over $13,000. Putting a whole bunch of those in your cart would really add up very quickly, especially if you're trying to do a whole genome.
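For the curious, here is a toy-scale sketch of the denoising-autoencoder imputation idea just described: genotypes are coded as dosages, random masking plays the role of the "untyped" array positions, and the network is trained to restore the full vector. The architecture, sizes, and training loop are illustrative assumptions, not the Scripps model.

```python
# Toy denoising autoencoder for genotype imputation (masking = the "noise").
import torch
import torch.nn as nn

n_variants, hidden = 500, 64
model = nn.Sequential(                  # encoder -> bottleneck -> decoder
    nn.Linear(n_variants, hidden), nn.ReLU(),
    nn.Linear(hidden, n_variants),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

full = torch.randint(0, 3, (256, n_variants)).float()  # complete reference genotypes (toy)
for step in range(100):
    mask = (torch.rand_like(full) < 0.1).float()       # keep ~10% of sites, as on a sparse array
    corrupted = full * mask                            # zero out the "untyped" variants
    recon = model(corrupted)
    loss = loss_fn(recon, full)                        # learn to restore the masked genotypes
    opt.zero_grad(); loss.backward(); opt.step()

# At inference, feed array genotypes with zeros at untyped sites; the decoder's
# output at those positions is the imputed dosage.
```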
Now, that's one project. The other one I want to highlight is with our partner Rady Children's Hospital, the largest children's hospital and really the pediatric center of San Diego County, which has three and a half million people, centralized in this one facility. Stephen Kingsmore and his group, working with us in our CTSA, have developed the leading program in the world for rapid genomic sequencing, interpretation, and management change for sick neonates, and not just newborns but also children. Typically this is in a neonate, and what's fascinating here is that every minute counts, because if a diagnosis isn't made quickly, it could lead to brain damage or to the death of the newborn; that's why time is essential. This group has now been able to go from a sample from a critically ill infant to management of that infant in 13 and a half hours, using multiple different AI tools, and to me it is the best example of AI in medicine today, which is interesting because it upends the usual story of starting in adults and only eventually getting into pediatrics. This is just the opposite.

So the steps are: number one, taking the structured and unstructured data from the medical record, the electronic health record, which takes less than a minute, actually 20 seconds; then the variant calling, which happens concurrently and takes a bunch of hours; then the automated diagnosis; and eventually the automated management. This all can be done in a time frame of just over 13 hours. So just to take you through the steps: Rady is using Clinithink, or more recently CLAMP, to do the clinical natural language processing, which works at the unstructured level of data in the electronic health record to pick up all the relevant terms. Then there's the automated interpretation, using two different tools, Fabric Genomics' GEM and Invitae's Moon, which prioritize and rank the variants and give a score; as the score gets over one and a half, there is a very high likelihood of an accurate diagnosis. And then the third step is genome-to-treatment management. This is a home-grown effort at Rady that incorporates multiple different AI algorithms, Alexion, which has the data from all the literature, as well as Rancho BioSciences, compiling all the resources and then coming up with a management plan for the neonatologist or pediatrician on how to manage the condition that's been diagnosed. This is really extraordinary: a whole bunch of different deep neural networks to get the answer to manage a sick neonate.

And that takes me to the end, which is that we in genomics are too genome-centric. We want to see as much integration as possible with other layers of biologic data, the electronic health record, and the environment. Eventually we will get to the point of genomics being a fundamental layer in a multi-dimensional effort of real-time processing of a person's data, be it for a virtual medical assistant or for a clinician caring for a patient. We aren't there; deep learning is not enough, and hybrid models will be required. But eventually, this is an exciting frontier in the years ahead. So with that, let me thank all my colleagues at our institute, SRTI, within Scripps Research, with whom I have the real pleasure of working on a day-to-day basis, and all the different funding support that we have from NIH, from NHGRI, and from All of Us. I look forward very much to your questions and our interaction. Thank you.

Thank you so much.
That was fantastic. Just as a reminder, we will be taking questions at the end for both of our keynote speakers. Our next keynote is Bradley Malin, who is the Accenture Professor of Biomedical Informatics, Biostatistics, and Computer Science, as well as the Director of the Health Information Privacy Laboratory, at Vanderbilt University.

Morning. I'm Brad Malin from Vanderbilt University and Vanderbilt University Medical Center, and I'm speaking with you this morning about various challenges and opportunities for machine learning in genomics. So let's jump in. One of the first things we need to talk about is the process: the entire learning process is facilitated by the data that we collect, and this data comes from a variety of institutions, individuals, and organizations all around the world. Once you collect all the data, you're going to be pushing it through some type of a magic box, and I don't think that everybody completely understands all of the frameworks that machine learning facilitates. But at the end of the day, what it's doing is learning over the information to try to generate a model, and to give you a relationship between all the different facets of the individuals that have been fed into the system, whether it be about genes, about single nucleotide polymorphisms, or just general variants. However, the process is not one-way. Once you get a model, you need to go back and figure out: is this actually a good model? Did the system learn in a manner that provided me with intuition that supported the biology we're aware of already, or intuition that gave us new directions we weren't aware of before but that still make sense? If it doesn't, we go back and check whether the model can be made any better. However, it may also be the situation that we don't have the right data, and if you don't have the right data, now you need to go off and augment the system as it's been designed. What that means is trying to collect additional variables about individuals, but also possibly trying to collect new data about people that you haven't seen yet.

There are four different topics that I want to cover today. We're going to dig into them, but I'll leave a lot of time at the end so that we can have more of a discussion during the Q&A. The first topic I want to discuss is that bigger is not always better, but it can be; I call this the safety-in-numbers problem. First, I think everybody recognizes the process that I just explained a moment ago, but when a single site is performing machine learning, at the end of the day they want to know: does what I see actually make sense at other institutions? So it makes sense for an institution to say, hey, what is everybody else seeing? It might be that I don't have sufficient information to verify that what I saw was statistically significant, so I want to go check and see what others have. There are a number of other organizations that are going to be collecting information, and we realize that there are a lot of barriers to sharing this information, but we have various networks that have been established to facilitate this process. Now, when you do this and you check to see whether what you have replicates at other institutions, or with other people's data, this is about robustness: ensuring that everything you have is either replicable or generalizable to a certain degree.
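That replication step can be stated very simply in code: fit at one site, score at another, and compare. The sketch below uses synthetic data and standard scikit-learn calls; a real multi-site study would add harmonization of variables and far more careful statistics.

```python
# Toy external validation: train at site A, check whether performance holds at site B.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X_site_a, y_site_a = rng.random((500, 20)), rng.integers(0, 2, 500)
X_site_b, y_site_b = rng.random((400, 20)), rng.integers(0, 2, 400)

model = LogisticRegression(max_iter=1000).fit(X_site_a, y_site_a)
auc_internal = roc_auc_score(y_site_a, model.predict_proba(X_site_a)[:, 1])
auc_external = roc_auc_score(y_site_b, model.predict_proba(X_site_b)[:, 1])
print(f"site A (training) AUC: {auc_internal:.2f}, site B (replication) AUC: {auc_external:.2f}")
# A large gap between the two suggests the finding does not generalize.
```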
Now, this is not easy to accomplish. We've had several consortia that have pulled this off, and we're moving in this direction. But to really do this in a meaningful way, we're not just going to have to replicate what we've come across; we need to broaden the data. What this entails: typically we're dealing with the DNA of an individual, and everybody has recognized that there are other resources that need to be brought to the table in order to facilitate deeper investigations. This could include the socioeconomic status of the individuals to whom the data corresponds. It could be information that comes out of their medical records; for the last 15 years, we've been doing this within the Electronic Medical Records and Genomics (eMERGE) network that NHGRI has sponsored. It also might be about taking clinical trials that have been performed and expanding them with real-world evidence. Regardless, this is about linking all of this together. Now, these are only a couple of the resources; there are a number of others that vary in the amount of information we have and in the comfort level people have in providing them. And then there are the non-traditional pieces of information: an individual's over-the-counter or retail purchases, possibly at a pharmacy or a grocery store; what they're doing in a social environment, whether through social media or just general social interactions, in terms of who they're hanging out with or where they're receiving influence from; how much energy they have and what they're doing on a daily basis in terms of moving their body around, which can actually come from fitness records such as an Apple Watch or a Fitbit; and then lifestyle decisions, things that we ask people about or just generally observe, such as how much alcohol they consume, or whether they smoke.

Now, this is still one little piece of the puzzle, because this is about what one person, who might be a parent for instance, has actually done and has had documented. But we don't just want to look at one person; we want to look at the relationships between individuals. So we want to look at mother-baby pairs or father-child pairs, and not just a single child; we want the entire genealogy, the pedigree, and then to track and integrate all this information to facilitate the type of investigation that is not just a one-off.
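Mechanically, that kind of linkage often starts as nothing more than joins on a shared identifier, as in the hedged sketch below; the table and column names are hypothetical, and real linkage also has to handle consent, de-duplication, and records that match only approximately.

```python
# Toy linkage of genomic, EHR, and wearable data on a shared patient ID.
import pandas as pd

genotypes = pd.DataFrame({"patient_id": [1, 2, 3], "prs_cad": [0.4, 1.2, -0.3]})
ehr       = pd.DataFrame({"patient_id": [1, 2, 4], "smoker": [True, False, True]})
wearables = pd.DataFrame({"patient_id": [1, 3],    "mean_daily_steps": [8200, 4100]})

linked = (genotypes
          .merge(ehr, on="patient_id", how="left")
          .merge(wearables, on="patient_id", how="left"))
print(linked)   # missing layers show up as NaN and must be handled downstream
```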
Now, I think everybody recognizes that over the past several years it's become clear that this going-broader perspective is not really sufficient to provide society with the type of insight that we need when we're performing any type of learning with this type of data. The notion of bias has crept in. What exactly does this mean? This is an example from a paper that Stephanie Devaney of the All of Us program put out a couple of years ago. What you can see here on the left is the number of people fitting a certain racial profile who have had their information used in gene association studies, and it's on the order of close to 80% of individuals of Caucasian background, of European ancestry, who have had their information used. It's a little better, in terms of the bias, when you move over to what studies have been done: even though approximately 80% of individuals in the resource, or resources, are white, the total number of studies in which they've been used is only around 50%. Now, this doesn't necessarily mean that these studies are all equal; this is just a count of the number of studies that have been performed, and some of them are of a larger scale than others. Regardless, it's still clear that there's a bias with respect to whose information is being used and studied.

Now, the implications of this have been non-trivial, because when you learn models, such as in this situation, where models were learned for 17 quantitative trait loci, and you make predictions based on this information, it's pretty clear that the generalizability of the models really falls by the wayside as you shift away from the population on which the model was based. If the model was based on European individuals and the predictions are made on European individuals, then the variance associated with its predictive capability remains pretty small, and so the point estimate you had in the initial studies holds true. However, as you start shifting towards other populations, such as Southeast Asians, East Asians, and African populations in particular, or populations of African American heritage, the predictive capability drops to a point where it could be as good as what you saw, but it could also be almost 75% worse than what you expected it to be. It's a known phenomenon; it's just that this happens in genomics at a scale we really haven't seen, simply because of the number of variables that come into these models. This was known, for instance, at least 30 years ago, when the Framingham Heart Study data was used to facilitate the design of heart attack risk, or cardiovascular risk, models.

So there has been some movement to try to change the situation, and I'll highlight that in the All of Us program, with which I have a relationship, this was a notion that was really taken to heart from the beginning of the study. It's about enhancing the resources we have to not just include people who have typically been involved in biomedical research, but to broaden towards underrepresented groups. As of about a year ago, among the quarter million people whose specimens and data we had collected, we had deliberately selected for populations that were not just white: over 20 percent of the population we had collected information on was of Black or African American descent, and around 17 percent were of Hispanic background. Now, this doesn't necessarily mean that we have completely changed or fixed the problem. What I'm showing is that you need to be deliberate about changing your perspective and creating data sets that are going to lead to greater equity in the research enterprise.
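One practical habit that follows from this is to report model performance within each ancestry group, rather than only in the pooled cohort, which is exactly where the generalizability gap above shows up. A minimal sketch, with entirely synthetic groups and scores:

```python
# Toy stratified evaluation: score the model within each ancestry group.
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)
y_true = rng.normal(size=1000)
y_pred = y_true + rng.normal(scale=0.5, size=1000)   # stand-in for model output
group = rng.choice(["EUR", "EAS", "AFR"], size=1000)

for g in np.unique(group):
    sel = group == g
    print(g, "n =", sel.sum(), "R^2 =", round(r2_score(y_true[sel], y_pred[sel]), 3))
# Reporting only the pooled R^2 would hide any group where the model degrades.
```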
At the same time, when I said bigger is better, you were probably thinking about data, but in reality it's the entire process, and that process includes the people who are performing the investigations. When you're going about learning models, and performing learning in general, people make conscious decisions about what models need to be used, as well as what pre-existing knowledge we bring to the table, and that knowledge is dependent upon the people who are actually performing the investigations. So you need to broaden the population performing these investigations. This has been recognized by NHGRI and NIH in general, but simply broadening the set of investigators at the table is one thing; you also need to broaden that population with a skill set that can actually do the types of analytics we're talking about. Otherwise you'll have scientists sitting at the table saying, you know, we should do studies in X, but they wouldn't really understand how exactly you do that study. Okay, so this is the first challenge that I believe is going to help drive where we go over the next five to ten years.

The second one I want to talk about is cost-effectiveness when we're performing machine learning. One thing that I hear over and over again is that we're going to move to the cloud, and we've been moving to the cloud for at least five to ten years. The cloud is wonderful: it provides dynamic and elastic compute capabilities. But at the same time, there's a cost associated with this, and I'm not talking about hidden costs; I'm talking about explicit, put-your-money-on-the-table types of costs. The process that I described earlier was really just a small summary of what's going on in the scientific enterprise, where you take data and push it into your machine learning framework, whatever it may be. But in the cloud environment, you're going to pay for every analysis that you run. So you pay a little bit, or you give a little bit of money to your graduate students and they pay a little bit, and they run their study, and out comes junk. Absolute junk. This is fully expected the first time around: you either didn't tune your parameters correctly, or the data wasn't loaded correctly. Whatever it was, the money is spent and junk has been generated. So what do you do? Next step: you wipe that data, you give your graduate student another hundred bucks, and they run the study again, and out comes something a little better. You know, it's still a little junky, but you look at it and you say, well, it's heading in the right direction, so let's rerun the study. And you do this over and over and over, and this is a normal scientific process. But unfortunately, in the context of the cloud, you're in a situation where you are constantly paying for compute, and so after you've done this on the order of a thousand times and spent a million dollars, then you get to the point where you go, aha, I think I've got an interesting paper. This is not effective. This is not scalable. This is not supportable.
In order to make this better, we need to create algorithms that make sense of data with a smaller model, or figure out how to more cost-effectively run lots of models simultaneously, so that we can sift through how we're generating our findings and direct workflows in a manner that makes it much less likely we'll spend our entire budget simply figuring out how best to use the cloud computing environments that have been established.

The third point that I'd like to talk about today has to do with data sharing, and making data more widely accessible. Before I do, I'd like to begin with a cautionary tale. In this tale, we are in an environment where people have begun sharing their genomic information in public settings, an artifact of the direct-to-consumer genomics revolution. One of the ways in which people have decided to take genomics into their own hands, to facilitate learning, is making their data accessible on websites where they get to discover relationships with other kin. Now, this sounds like a great idea, and it facilitates discovery, but at the same time it can lead to unintended consequences. One of those consequences is typified by the way law enforcement has begun to use these resources as well, and the example that I offer to you is the case of Joseph DeAngelo, the Golden State Killer. The long story short is that this was a serial killer who, starting in the 1970s, committed a number of crimes in California, and when the case went cold, they lost the ability to find him. However, the FBI did have DNA from the scene, and at a future point in time, actually several years ago, they were able to go to sites like GEDmatch, where they could put his genomic record in and discover who his relatives were. They were able to find a third or fourth cousin, build a pedigree, and figure out which individual did not show up in modern-day records, and eventually the family tree led them back to this individual in the state of California. Once they took his DNA, they were able to confirm that this was the individual they had been looking for. This sounds like a great opportunity for law enforcement, and yet it turns into a bit of a concern for the rest of society. GEDmatch, you know, is not just a single one-off: it turns out this type of search, for Caucasians in the United States, yields information on a third or fourth cousin for somewhere between 50 and 80 percent of the population.

So if we have concerns with making information public, then one of the ways in which we might be able to solve this problem is to make the information accessible only to the algorithms themselves, and learn over the data behind closed doors. One of the ways this can be done is through the notion of secure multi-party computation (SMC). In SMC, what happens is that you encrypt all the information and then compute over the encrypted information to generate aggregate results, without ever revealing what any particular record corresponded to.
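To give a feel for how a computation can run "behind closed doors," here is a toy additive secret-sharing example, the arithmetic building block under many SMC protocols: three sites jointly compute a total allele count without any site revealing its own value. Real protocols add encryption in transit, malicious-party protections, and support for far richer operations; this sketch shows only the core idea.

```python
# Toy additive secret sharing: compute a sum without revealing any single input.
import random

P = 2**61 - 1                                   # arithmetic is done modulo a prime

def share(secret, n_parties=3):
    """Split a secret into n random shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

private_counts = [12, 7, 30]                    # each site's private allele count
all_shares = [share(c) for c in private_counts]

# Each party holds one share from every site and sums the shares it received;
# combining the partial sums reconstructs only the total, never the inputs.
partials = [sum(col) % P for col in zip(*all_shares)]
total = sum(partials) % P
print(total)                                    # 49, with no single count revealed
```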
This sounds like a great idea, except that in order to make it really work in practice there are several things we need to do. First, we need software that facilitates the rapid reconfiguration of any computable model, so that we can evolve it as statistical techniques change. Secondly, we need to take advantage of the fact that some of these computations are best done in software and some are best done in hardware; it's probably going to be a combination of the two, and figuring out how to optimize that is going to require further investigation. Third, people are going to be using data, but they're not necessarily going to tell you what they're actually using it for, and so this creates concerns about accountability. One of the ways in which we might address the accountability problem is through distributed ledgers; I'm not necessarily a large advocate for blockchain technology, but it might actually serve as the basis of what we're looking to support in the future. And finally, if we're going to create an environment where we allow people to compute over data they can't see, we need to make sure they're comfortable with that. A lot of scientists like to be able to scratch and sniff and make sure that the data is what they think it is, but once you take that out of their hands, there are going to be lots of questions over how trustworthy this is.

Another way in which information can be shared, or there are expectations it can be shared, is that we move from sharing real data to providing synthetic data out into the world, and instead of allowing people to perform direct hypothesis tests, we allow them to perform hypothesis generation, to determine whether it's even worthy of moving forward with deeper analyses. Now, we've been conducting research on the notion of synthetic data for a couple of years, where we really cut our teeth on using deep learning frameworks to simulate electronic medical records data. The framework associated with this is what's called adversarial learning, but the details I'll leave to another time. What I will illustrate is that it does have the ability to scale up to high-dimensional environments. What you're seeing here is the correlation of the rates at which diagnosis codes show up in two random samples of electronic medical records, and on the y-axis, the rate at which this information shows up in the synthetic data that we generate. You can see visually that these are not exactly the same, but they're moving in the right direction. In terms of translating this into genomic data, fortunately, just about a month ago there was research illustrating that it appears to be a feasible application as well, but we're really at the beginning of the innovation curve.
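For a concrete picture of the adversarial-learning setup just mentioned, here is a bare-bones GAN sketch for binary diagnosis-code vectors. It is an illustrative toy in the spirit of, not a reproduction of, published medical-record GANs; all sizes and data below are made up.

```python
# Toy GAN: a generator learns to mimic binary diagnosis-code vectors while a
# discriminator learns to tell real records from generated ones.
import torch
import torch.nn as nn

n_codes, z_dim = 100, 32
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, n_codes), nn.Sigmoid())
D = nn.Sequential(nn.Linear(n_codes, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

real = (torch.rand(512, n_codes) < 0.05).float()     # stand-in for real code vectors

for step in range(200):
    # Discriminator step: label real records 1, generated records 0.
    fake = G(torch.randn(512, z_dim)).detach()
    d_loss = bce(D(real), torch.ones(512, 1)) + bce(D(fake), torch.zeros(512, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to make the discriminator call fakes real.
    fake = G(torch.randn(512, z_dim))
    g_loss = bce(D(fake), torch.ones(512, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

synthetic = (G(torch.randn(10, z_dim)) > 0.5).int()  # 10 synthetic records to share
```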
So the last point that I want to bring up has to do with the movement of machine learning from research into application, and the notion of decision support. We're moving into a world of genomic medicine, but in order to do that, we're going to need to take into account the fact that there are hundreds to thousands of variables being used to create these technologies. The Food and Drug Administration has recognized that these machine learning or artificial intelligence driven technologies are going to be useful, and they've already provided approval for over 100 of them; this has really picked up over the last two years. But if you're going to go in this direction, there are several things to keep in mind. As I already alluded to, these are going to be large frameworks; there are going to be thousands of variables, and lots of things can go wrong. We need these systems to be verifiable: we need to know that the right model was used at the right time, and who used it, and in that respect they must be auditable. We know that these systems are going to evolve over time, so we need to know which one was used when. And they must be explainable: we need the ability to tell people the reason why this technology was used, and building trust is going to require the ability to relate what the technology is doing to some understanding of the real world. And, as I alluded to, these systems are going to need to be equitable.

So, several parting thoughts. One, we do need big data, but this data cannot just be deep; it needs to be diverse. Secondly, we need systems to be cost-effective; we can't be spending millions of dollars just on computation. Third, we need to have trust in this environment; we need people to feel comfortable with being provided just the results of learning over the data instead of actually seeing the data itself. And finally, we need explainability; we need to ensure that the ways in which our systems make recommendations for action are actually in line with how the world works. So I thank NHGRI for their invitation, I thank you for listening to this presentation, and I thank the All of Us program and the Electronic Medical Records and Genomics network for providing a proving ground for some of these technologies as we begin to move forward.

So, we're going to begin the Q&A. You have the ability to ask questions of either or both of the keynote speakers using the Q&A polling session; only the Zoom participants can do this. We already have a couple of questions that have come in. The first one is for Dr. Topol, about the really beautiful example you showed at the end, the collaboration between Scripps and Rady Children's, and the question is: how can that be distributed to other cities or centers? The person asking this had a child who was offered this in the NICU at another hospital, but the experience was that they would have had to wait; there would have been a long delay, and they had to make a decision right away, and so getting samples from Spokane to Rady would have been a bottleneck in that case. Are there any thoughts or ideas on solving the challenge of getting this distributed to other centers, and what are the challenges in broadly distributing a pipeline like this one?

Well, thanks, Shannon. It's great to be with you and Brad in this session. I do think the example with Rady is perhaps the most advanced use of AI in medicine today, and it upends the model where everything starts with adults and moves into kids; this is just the opposite. What Stephen Kingsmore and the group are doing is trying to spread that throughout the country, so there are now multiple sites that are basically using the same tools, the same kind of flow of how to extract the data and get to the management side. The hope is that in the short term all these refinements, which are constantly being tweaked, will be universally available, not just in the U.S. but broadly. It's interesting, because just a couple of years ago this was more than a 24-to-36-hour story, and it wasn't to management, it was just to make a diagnosis. It's just getting so much better.
But as the questioner brings up, diffusion of this is equally important. We've seen enough validation that it really helps these sick babies and children; now we've got to get it to become the standard of care.

Fantastic, thank you. The next question is for Dr. Malin, in regards to cloud computing, and one of the questions is around trust. This is coming from the perspective of a medical institution being able to trust a private site, in terms of thinking about things like data leaks that have happened and continue to happen. Are there alternatives, or other ways we could think to mitigate this?

I think it's a fair question. One of the things I always encourage people to think about, when they're trying to decide whether or not to move operations into the cloud, is: is it more secure than what you're currently doing? In many ways, what we're currently doing by managing our own servers locally is not necessarily any more secure than moving data out into the cloud, where you have teams that are constantly dedicated to applying the most up-to-date patches and the best security practices. Does that mean that breaches won't happen? No. I still think that it comes down to best practice with respect to, you know, encrypting data when it's at risk, and making sure that the access and authentication protocols are correct. But there are also questions about trust that have to do with agreements more than anything else, in terms of what the cloud service provider is allowed to do with the information that's been uploaded into their resource. You know, most of the time people are using AWS, for instance, with virtual private machines, so Amazon doesn't really have access to the information itself; you're really just using their platform as a service, and in that respect there's a lot of control that you end up having. But as I said, no system is completely secure; it's really just a matter of where you think your energy is best allocated in terms of management.

Great, thank you so much. This question really could be for either one of you. The questioner asks: what do you see as the role of medical geneticists, physicians trained in residency in medical genetics, with the continued adoption of machine learning applications in genomics?

I'll take a shot at this, maybe, and I'm interested to get Brad's perspective too. One of the things people think about AI is that it's going to replace human expertise, and that couldn't be further from the truth. That extends to radiologists, pathologists, and medical geneticists: the human-in-the-loop thing is so important. That's why, with the whole idea we were just talking about with respect to sick children and neonates, you've got to have an expert overseeing this, because there are glitches. These algorithms, no matter how good they get, are always going to have imperfections, and human judgment, especially by people with expertise, is so critical.
We're talking about not just important but potentially life-or-death decisions here. So we need medical geneticists; we need a lot more of them, actually. What we're talking about now is just leaning on machines to deal with the data, and then having that oversight; that fusion is the best possible scenario.

I think Eric hit it right on the head, to be honest with you. The notion that machine learning is going to be a substitute for human intuition, it's not going to be the case all the time. There are some places where I think you will see it, with a very clear application; image interpretation, for instance, like radiological services, where we're already seeing that systems can do as well as, if not better than, humans. But you do run into these problems where what this does is free up opportunity, free up time, for humans to actually reason about the more difficult challenges. And so you're going to have computers that can do things 80, 90, maybe even 95 percent of the time, but you're still going to need humans that look at the problems that are not actually being addressed by the machine learning, or whatever type of artificial intelligence you've used. So you don't stop training people in specific areas; if anything, what you have to do is supplement their training to recognize that, while the machines are going to provide some services to them, they're not going to provide all services. In many respects, it's an evolution of a field. You know, it's not the case that physicians are doing bench-based pathology work themselves in order to serve their patients, right? This is something they send out to the lab; the pathologists do what they're supposed to do, and then the information comes back, and it's up to the clinician at that time to decide whether or not they trust what has been put in front of them and what to do next. Usually what will happen with machine learning is that you're not going to get just a single best response; in a complex environment, you're going to get a general rank ordering of what might be going on, and then there's going to be some further decision-making that needs to take place. Usually not all the information will be on the table, and so the computer will tend to say, you actually need to get more information, and here's the information I think you need to get. But the human now has to go do that. So, I totally agree with human-in-the-loop, but it's going to be a symbiotic relationship that will continue to evolve over time.

I love that phrasing, the symbiotic relationship. That's perfect. The next question is for Dr. Topol.
The next question is for Dr. Topol. You gave some beautiful examples of machine learning and AI in genomics in cancer and in pediatric or rare diseases. There was a question from the audience about where you see the initial implementation of machine learning for genomics in cardiovascular disease; this could be examples you've already seen or areas you think are ripe for it.

Yeah, it's an interesting question, Shannon, because the natural affinity of using genomics has been much more in cancer and rare diseases, not so much in other areas, whether neurodegenerative or cardiovascular. I think eventually we'll get there. The problem is that this is where multi-omics comes into play: we need these tissue-specific signatures, whether through RNA-seq, epigenomics, and all the others, in order to get a handle on the heart, the vascular system, and the brain, for example. What's so easy with cancer, now that we're moving forward in the liquid biopsy space, is that we've got very easy access, whether it's the tumor sample from a biopsy or a tube of blood. And with rare disease, we're basically looking at a whole genome sequence to understand that individual's clinical condition. So there certainly is a wealth of knowledge about cardiovascular genomics; the problem is we aren't nearly as cued in yet. I think over time, particularly as we take on this challenge of multi-omic, multi-dimensional layers of data, we'll get there.

Great answer, thank you. The next question is for Dr. Malin; actually, there are a couple related to this, so you might both want to comment, but this one is directed to you first. They're asking about the differences between interpretability and explainability in machine learning, and which one, if there is a way to decide, is more important for implementation in the clinical setting. I'll note that there are several questions related to this thread.

Yeah, that's a hard question. Okay, let's take the semantics apart. Interpretability means you actually have an understanding of what the machine learning model is doing. The deeper a network, for instance, the more difficult it becomes to determine the actual function that has been computed to make the decision. So interpretability just means: can you peek inside the box and figure out what is going on? Explainability is a different situation: it means you have the ability to provide reasoning for the human, to give intuition into why the system is making the decision it is making. They're both important, and they're not mutually exclusive. Explainability means the person making a decision at the end of the day has some intuition into why the system made the recommendation it did.
That's not the same as knowing exactly how the function works. For instance, imagine you're doing a diagnostic workup and you're reliant on both genomic and clinical information. The system might come back and say: I'm making this particular recommendation because this region of the genome has the greatest influence on the model that was learned. That's not the same as saying why or how, specifically, that factor influences the model; it's just known that if you change that variable, it will have some type of influence. So the explanation could be: rely on this piece of information. Interpretability, on the other hand, is about what happens when you don't have the simplest of rule sets. Did the system create an extremely complex representation that is not really representative of the way the world works? It can end up violating Occam's razor: imagine a ten-layer deep neural network to explain some function where in reality one or even two layers would suffice, but the computer overtrained and kept adding layers, or you dumped in more layers to squeeze a little more accuracy out of the system. If you ever want to tweak what the model is actually doing, you're going to need interpretability; you need the ability to merge what the computer has learned with what a human knows should be going on, or with an idea of where the system is going wrong, so it can be tweaked. But you're going to need explainability in order to get people comfortable with the use of the technology.

Fantastic. Anything you'd like to add, Dr. Topol?

Well, I think the simpler version is that we'd like to have everything explainable. What we're starting to see, not so much in genomics but in other areas, is using AI to deconstruct neural networks, to understand what other features they're seeing that humans are missing or can't grasp. We're just seeing the beginning of that, and what's interesting about AI is that for AI scientists, the answer to every AI problem is an AI solution. The problem we have right now is that it isn't clear we'll be able to deconstruct and provide explainability for everything AI can do, but that is what we would like to see. Wouldn't that be nice? There are some hints that maybe we'll get there, so I think it's interesting to follow this space. It's basically reverse-engineering the neural network, going backwards to find out what it is seeing. There are some pretty good examples on the medical side, not so much yet in genomics.

Great, thank you.
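As an illustration of the distinction: for a linear model, the learned function can be read directly off its coefficients (interpretability), while a technique like permutation importance only reports which inputs the decisions depend on (a form of explainability) and works even when the model is a black box. A minimal sketch using scikit-learn, with random stand-in data in place of real genomic features:

```python
# Contrast interpretability (read the function) with explainability
# (attribute the decisions) on the same model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # hypothetical features (e.g., variant dosages)
y = (X[:, 2] + 0.1 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Interpretability: for a linear model we can read the learned function
# directly off its coefficients.
print("coefficients:", model.coef_)

# Explainability: permutation importance asks which inputs the decisions
# depend on, without opening the box -- it applies to deep models too.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("importances:", result.importances_mean)
```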
There's a question that's near and dear to my heart, because at the center of it is how we nurture innovation. Something that was brought up, and Dr. Topol even put a price tag on it in one of his slides, is how expensive some of these learning models can be to train. The audience member is asking: does this pose a risk to innovation in health care, especially if only large companies can afford to train them? This individual is actually the CSO of a startup who is worried about the cost of these models, so they would appreciate your thoughts on how we can create a situation in which AI can be used by everyone for innovation, not just by the biggest players in the world.

Yeah, this is really a central point, and Brad touched on it with edge versus cloud computing. We wouldn't be in a position to use AI in genomics if it weren't for graphics processing units; GPUs basically set the whole deep learning space into high gear. The problem is they're expensive, and if you go ahead and buy the hardware, a few months later there's a new version. And if you rely on the cloud, that's expensive too; my analogy is that you're renting your house instead of buying it, and which is the right investment? It's a mess, because the big chip manufacturers want to keep coming up with better hardware, the cloud is expensive, and all the money you put into the cloud doesn't get you any ownership. So this is a problem: we've got guzzlers of GPUs, as I mentioned. How do we get around it? Well, if we keep getting smarter, as with the example I gave of the denoising autoencoder, we can cut down the computing time. It's reminiscent of what Google did in using AI to reduce the energy consumption across their server farms. We need to do that in genomics: we have to get smarter about computing time and resources so we can essentially democratize this, because right now it's kind of a rich person's sport, or science; it's very computationally consumptive. Hopefully in the years ahead we'll see GPUs come down in price and the cloud become more competitive; we need both, of course. But we also have to use our ingenuity to make this cheap, which it isn't, by any means, today.

That's a fantastic point. Dr. Malin, did you want to add anything to that one?

Oh, I could add lots of things. Eric's right. One of the biggest problems we're running into is that we've had rapidly decreasing sequencing costs and, at the same time, rapid uptake in the use of computational resources to analyze the data. We've gone from something like a million dollars to generate a genome down to the point where we're heading toward a hundred dollars, but the analysis costs are on the order of ten thousand to a hundred thousand dollars, depending on the size of the study. So I totally agree that we need more efficiency in the computation. But what we're seeing, in some of the work my group does and other groups do, is the creation of hybrid structures of computation, where you prototype on hardware you've bought, which is a fixed cost, and then once you're ready to do a large study, you send it out to the cloud and run it once. You test, test, test locally first. It's really about understanding software engineering principles and how to build automated workflow pipelines. Some groups have become quite adept at this, but it's something we have to continue to train people on. I don't think it's quite as simple as saying the technology is going to get better; to some degree, it's really about being smarter about the engineering.

That's a great point.
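For readers who want a concrete anchor for the compute-saving idea, here is a toy version of a denoising autoencoder of the kind Dr. Topol refers to, written in PyTorch. The layer sizes, noise level, and data are invented for illustration and are not taken from the work he described:

```python
# Toy denoising autoencoder: learn to reconstruct a clean signal from a
# corrupted input, compressing through a small hidden layer.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_features=1000, n_hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

clean = torch.randn(256, 1000)  # stand-in for, e.g., expression profiles
for _ in range(10):             # a few illustrative training epochs
    noisy = clean + 0.3 * torch.randn_like(clean)  # corrupt the input
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), clean)  # reconstruct the clean signal
    loss.backward()
    optimizer.step()
```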
One of the questions that came up, and I think this is really for both of you, is how we bridge the divide created by the slow adoption by the FDA of, and they give examples of, polygenic or genomic-based approaches. I think what's really underneath this question is the algorithms guiding these approaches as clinical tools, beyond experimental or research-only use, and the ability for these to become standard, health-insurance-sanctioned interventions.

All right. Well, this is one of my pet peeves, Shannon: we have this wealth of knowledge that has been developed in genomics, and it sits in a separate orbit from helping patients and people prevent diseases and conditions they might otherwise be at higher risk for. We've tried really hard to implement that at Scripps, where we're using polygenic risk scores, particularly for heart disease, to help guide whether someone should take a statin and what their overall risk is, for advising them. We'd like to get that to be universally available, and, as other polygenic risk scores are fully validated, to implement those in practice for the people who want the information, not to force it on anyone. Right now most people don't have that, and they're missing out on years of data and an extraordinary amount of effort. So I really hope we will; that's actually fundamental to the project I outlined that Raquel is leading in our group, to make polygenic risk scores widely available, very inexpensively, across the board. There are so many now that have really been studied. I do think it's part of the dream of preventing illness that we've talked about for decades and have basically never actualized. Now that we know so much about common variants that are cumulatively predictive of risk, I think we will eventually be able to help people who would otherwise be at risk to avoid those conditions, or at the very least sharpen our ability to manage them or select which medications might help. So I'm keen to move forward on this, but unfortunately, when I've talked at some of the genomics conferences, I get confronted by colleagues and esteemed genomicists who think it's not ready. But I've seen too many patients feel really helped by this, so I hope we'll move in that direction.
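For context, a polygenic risk score itself is computationally simple: a weighted sum of a person's risk-allele dosages, with the weights taken from association studies. A minimal sketch with fabricated numbers:

```python
# Polygenic risk score as a weighted sum of allele dosages.
import numpy as np

effect_sizes = np.array([0.12, -0.05, 0.30, 0.08])  # per-variant GWAS weights (betas)
dosages = np.array([                                 # 0/1/2 copies of the risk allele
    [2, 1, 0, 1],   # person 1
    [0, 2, 1, 0],   # person 2
])

prs = dosages @ effect_sizes
print(prs)  # one score per person; higher = greater estimated genetic risk

# In practice the score is standardized against a reference population and
# combined with clinical factors before guiding decisions such as statin use.
```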
The only thing I want to add to that is that there's an economics problem, and it's an unfortunate one: health insurers are slow to adopt new methodologies and new technologies for reimbursement. Maybe ten or fifteen years ago, Vanderbilt started doing prospective genotyping of patients who were at risk for cardiovascular disease and adverse events. I don't take credit for this; I give Dan Roden, Josh Peterson, and others a lot of credit. But we prospectively genotyped patients so that in the event they had some type of acute event and then had to go on warfarin, or something like simvastatin, a statin, we had an idea from the outset of how best to tailor the medication to them. Now, the prospective genotyping was not expensive, but getting insurers to pay for it was almost impossible. Even after numerous demonstration projects were run, trying to convince them to pay for that was still far harder than trying to convince them to pay for whole genome sequencing of a cancer patient, even when you can see that the return on investment at a population level is there. So it's not that it won't happen, or isn't happening; it's that it takes time to get these large industries to evolve. You need to continue doing the demonstration projects to prove that this is actually worthwhile to them, so that it becomes a no-brainer for them to change the way they do reimbursement.

That's a really important point. There's a follow-up question, and this is a very popular topic: they're asking both of you about examples of FDA-approved algorithms in the genomics space for diagnostics.

I'm not aware of any FDA-approved ones; that's my answer. There are many now that are cleared through 510(k) for images in medicine, but none that I'm aware of for genomic algorithms. How about you, Brad?

I don't know of any off the top of my head, so I'm not going to speak out of turn on that.

I do know of a really nice recent paper, which I'll post in the chat, that actually surveyed this, and it's exactly as you said, Dr. Topol: many of them don't have clearance yet, so it's a question of what's in the pipeline, and as you both mentioned, it's dominated by the earlier space, imaging, radiology, that type of application. I'll put that into the chat. Dr. Topol, there was a question for you specifically about the digital twin. They're asking about the difference between that and TCGA with respect to the data, and there's a follow-up: is it possible to access the data from Tempus?

Yeah, we're working on trying to access that data. We haven't cracked into it yet, but we're working on it. They've developed an incredible resource, but it hasn't been tapped; theoretically it will be of great value. To amplify and extend the comments there: we don't have a good example of an information infrastructure that would provide for medical digital twins, that is, a wealth of data, ideally from millions or billions of people, matched up every which way to predict the best treatments and outcomes. What we have relied on until now are clinical trials, but as you well know, even in the best of clinical trials maybe five out of a hundred people benefit. So this is a whole new way to derive insight about what would be good for a particular patient. Because cancer has so much data, it's become nominated as the number one area, and of course the problem we have in cancer is that treatments and outcomes are often uncertain, and they change all the time. Kai-Fu Lee, a leading AI scientist in China, and I wrote a Nature Biotechnology paper, "It takes a planet," and the reason we wrote it is that we should be doing this for all of medicine, not just cancer. It's a way to help each of us; it's the ultimate learning health system. If we had billions of people with all their data, knitted together through federated AI and homomorphic encryption, tools that keep the data in place, whether at the country or the health-system level, without any danger of privacy or security breaches, we could do this, and this is the future. I know it sounds a little zany to some of the listeners here, but it's a whole new opportunity. As I mentioned, it isn't validated yet, but it makes a whole lot of sense. It's simple nearest-neighbor analysis: if for each person you matched up several other people and could look at their treatments and outcomes, whether for cancer or other conditions, it would give you another level of insight beyond what we have today, which is based on randomized clinical or prospective trials.
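A bare-bones sketch of the nearest-neighbor matching Dr. Topol describes: find the patients most similar to a new patient and summarize their observed outcomes. The cohort, features, and outcomes below are random placeholders, not a validated procedure:

```python
# "Digital twin" nearest-neighbor sketch: match a new patient to similar
# patients and look at what happened to them.
import numpy as np
from sklearn.neighbors import NearestNeighbors

cohort = np.random.default_rng(1).normal(size=(1000, 20))      # multi-layered patient data
outcomes = np.random.default_rng(2).integers(0, 2, size=1000)  # 1 = responded to therapy

nn = NearestNeighbors(n_neighbors=10).fit(cohort)
new_patient = np.zeros((1, 20))  # feature vector for the patient in question

_, idx = nn.kneighbors(new_patient)
print("response rate among nearest matches:", outcomes[idx[0]].mean())
```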
That's a great answer; thank you for your insight on that. Dr. Malin, there's a question for you, and this goes back to the cloud platform aspect and to multi-party compute. Many of the cloud platforms are now imposing egress charges when data is moved out of the platform for the compute to happen elsewhere. Is that constraint taken into account when designing either distributed or leaner algorithms, and if so, how?

Yeah, it is. Basically, you pay for compute and you pay for bandwidth, and these are two factors that can be traded off in an optimization; that's not uncommon with federated learning models. One thing I want to point out about federated learning is that it's not new: the notion goes back forty or fifty years. The algorithms that have been designed do have some new things to them, accounting for new types of statistical imputation and some distributed regressions, but a lot of what we're talking about really just requires engineering, more than any real innovation in the mathematics and algorithms.

Fantastic. Thank you.
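To illustrate the compute-versus-bandwidth trade-off in a federated setting, here is a bare-bones federated averaging round: each site trains locally and ships only model parameters, never raw data, so egress scales with the size of the model rather than the size of the dataset. All data and settings here are illustrative:

```python
# Minimal federated averaging on a least-squares objective.
import numpy as np

def local_update(weights, X, y, lr=0.01, epochs=5):
    """Plain gradient steps on one site's private data."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ weights - y) / len(y)
        weights = weights - lr * grad
    return weights

rng = np.random.default_rng(0)
sites = [(rng.normal(size=(100, 8)), rng.normal(size=100)) for _ in range(3)]
global_weights = np.zeros(8)

for _ in range(20):  # communication rounds: more local epochs = fewer rounds
    updates = [local_update(global_weights, X, y) for X, y in sites]
    global_weights = np.mean(updates, axis=0)  # server only averages parameters
```

Doing more local epochs per round trades extra on-site compute for fewer, cheaper communication rounds, which is exactly the optimization the egress question points at.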
A question for both of you; there are actually a couple, so I'm going to try to synthesize. This gets at the idea of imputation, data augmentation, and synthetic data. There are concerns, or just questions, about what the issues will be in terms of misrepresenting the general population or populations that are underrepresented, and about misdiagnosis or poor predictions. As you both highlighted, these approaches are really needed because of how data-hungry these algorithms are, and they also allow us to get around some of the privacy and other issues around immediate data sharing, if we have synthetic data. So could we have your thoughts on how we can balance that so it's still meaningful and useful, and maybe what you see as the limits of synthetic data?

I'll jump on that real quick, because it's an area where we're still doing a lot of research. I think the issue with synthetic data is that you really have to understand the population you're trying to represent, and you have to understand some type of utility function. Eric provided a great example of what you can do with augmentation when you know exactly what you're trying to look for. Trying to create synthetic data as an all-comers type of dataset is a really challenging thing to do; I don't know if it's even completely possible. It's one of those grand challenges, in my opinion: we need a better understanding of the fundamental things about the people, the biological organisms, we're trying to represent in order to be able to generate synthetic data. But the thing is, if you understood everything about how the biology worked, you wouldn't need to generate synthetic data to pass along; so there's a bit of a chicken-or-egg problem here. I do think we have to think of synthetic data in two different ways. One is the privacy issue: we should talk about what we're actually trying to protect against when sharing synthetic data, what we think it will be useful for, and what guarantees we can offer. But the other aspect, this notion of augmentation, is not about privacy; it's about filling in the cracks so that a machine learning model doesn't get lost and create a framework or model that isn't properly representative of the data. There's a lot of evidence over the last couple of years that this notion of augmentation is one of the real wonders of what's going on in machine learning today, because it's facilitating breakthroughs in imaging informatics as well as genome science and lots of other areas. But it's still somewhat of a new venture; if you wanted to create opportunities for graduate students to have dissertations for years to come, this can go in many different directions.

I would just add that Brad gave such a great perspective on this, both during his talk and just now; he's been so thoughtful in advancing it. But we wouldn't need synthetic data if we had datasets from billions of diverse people. This is basically a default we've had to move to because we don't have the annotated datasets: we still have largely European ancestry in genomics, and we don't have other ancestries adequately or symmetrically represented. Synthetic data is a great way to compensate for that; it's working and it's important, but it doesn't substitute for the real deal. It's unfortunate; if we could go back twenty years, maybe we would have put more emphasis on this. It's certainly one of the issues behind the All of Us program of a million participants, more than half of whom are from underrepresented minorities. Ultimately, if they all have genomic and multi-layered data, that will be another rich dataset to cross-reference synthetic data against. So it's great; I don't know what we'd do without synthetic data, but it's out of necessity.
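A minimal sketch of augmentation in the "filling in the cracks" sense: enlarging a small training set by jittering real samples. The noise model and data here are placeholders; real genomic augmentation schemes are domain-specific:

```python
# Simple augmentation: add noisy copies of real samples to stabilize training.
import numpy as np

rng = np.random.default_rng(0)
X_real = rng.normal(size=(50, 100))  # small cohort of feature vectors

def augment(X, copies=4, noise_scale=0.05, rng=rng):
    """Return the original samples plus noisy copies of each."""
    jittered = [X + noise_scale * rng.normal(size=X.shape) for _ in range(copies)]
    return np.vstack([X, *jittered])

X_train = augment(X_real)
print(X_train.shape)  # (250, 100): five times the data for model training
```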
Absolutely. As kind of a follow-up: you've both just focused, quite eloquently, on the data component, but there was a question from the audience about how we think about diversity, and I'll add fairness, when we're thinking about the models and the algorithms themselves.

How to think about this: there are the knowns, there are the things you know you're missing, and then there are the things you do not know you're missing. One of the problems we run into is that you can develop risk calculators, models that look good in one population, but until you test them in another population, you have absolutely no idea if they work; this is what happened with Framingham. There's a calibration problem. So you have to ask: do you go for diversity from the outset, if you can, or do you build models and then test them on a diverse population to see if they hold? In many respects, you need both. But until you actually get the right samples to the table, you're stuck in a cycle where you develop the model, test to see where the holes are, recalibrate, and/or go solicit additional individuals accordingly. It's a tricky proposition, because you know there are going to be holes; you just don't necessarily know where they are. And we don't necessarily know all the factors that are going to influence a model, because it's not just about race or ethnicity; it's about, as I was alluding to, socioeconomic status and lifestyle. The more variables you bring to the table, the more challenging it will be to have a single model that is representative of the whole population.

Yeah, that's a great point. There are also a number of tools out there now for detecting and mitigating bias, but as we've seen with a number of high-profile cases, it also depends on the metrics we're using and how we're analyzing them when we judge whether an algorithm is actually biased. So it's very complex, but great answer. Thank you both.

Well, we should be clear that there is a difference between bias and fairness.

Absolutely.

Bias has to do with whether the data are trending in a certain direction such that they influence the model a certain way. Fairness is about whether, even if you recognize bias, you can correct for it so that you don't discriminate or provide opportunities that perpetuate a divide. The challenge you run into there is that, to some degree, you may have to accept that we won't have the best possible system for a particular subgroup at the end of the day if we want to bring the system into balance. How to convince people to accept that situation, that I don't know, or whether it has to be that way.
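One way to operationalize "test it in another population": audit a fixed model's discrimination and calibration separately in each subgroup. A sketch with synthetic data and hypothetical group labels:

```python
# Subgroup audit: compare AUC and calibration-in-the-large across groups.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                         # observed outcomes
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 1000), 0, 1)  # model risk scores
group = rng.choice(["cohort A", "cohort B"], size=1000)        # subgroup labels

for g in ["cohort A", "cohort B"]:
    mask = group == g
    auc = roc_auc_score(y_true[mask], y_score[mask])
    # Calibration-in-the-large: mean predicted risk vs. observed event rate.
    print(g, "AUC:", round(auc, 3),
          "mean predicted:", round(y_score[mask].mean(), 3),
          "observed rate:", round(y_true[mask].mean(), 3))
```

A model can have similar AUC in two groups yet be miscalibrated in one of them, which is the Framingham-style failure Dr. Malin alludes to.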
I would just add to Brad's point there: I think it's important to recognize that it's the humans who are responsible for the bias and the fairness, not the AI. It's what we put in as inputs that's biased, and the lack of fairness is in how we apply it without the thoughtful work to prevent unfair application. So AI gets blamed unfairly; it's really the ones and zeros, not the algorithms and the models, that are the problem.

Three thousand percent; the human context is critical, and I think we miss that. It's very easy to blame the data or the algorithm and not recognize our roles in this. This actually goes back to a comment both of you have made about education and training the practitioners, which is just as important; I know, Brad, you highlighted the NHGRI effort on that as well, that we have to make sure we have representation across the board. There's a very popular question in the Q&A around the failure of IBM Watson Health, and people are asking whether this is a prediction, or a crystal ball, for how long AI will be in the business of diagnosis, diagnostics, and health care. I'll leave that to both of you, however you'd like to comment.

Well, I've written about this a fair amount. I think what happened with IBM is that they claimed they could do things many years before it was possible. When they established the partnership with MD Anderson, for tens of millions of dollars, claiming they could extract unstructured and structured text out of electronic health records, they couldn't do it; there wasn't any way. Only now are we starting to see that, as with the Rady Children's project I mentioned. So the issue is that they had the right ideas, but they completely hyped them up and sold them, and it was a bust, and there's a very important lesson there for the AI industry: you can't be out there blitzing TV commercials about IBM Watson Health when it's a lot of air. It wasn't real, and it's unfortunate, because it could really hurt the field. Fortunately, I think a lot of the things they said they could do are now starting to get done. But all of us have to be worried about this, because these companies have different interests, and we have to know for sure, before there are any major investments in something like this, that it really works. It's unfortunate that we went through that; I think IBM regrets a lot of what they did, and hopefully we'll learn.

Fantastic. I'm getting a nudge that we're almost out of time; this has been an amazing discussion. But I do want to get in a couple of threads around training and education. There's one in particular from a resident physician who is thinking about how he can, quote unquote, get in the game. Do you have any advice for trainees, really trainees at any level, who are interested in learning about this or applying it?
This is a question for both of you.

I think there are several opportunities; it depends on what level of detail somebody wants to go into in the field. It's one thing to be knowledgeable; it's another to be a tool builder or an analyst, and I think you have to recognize where you want to be first of all. The reason I bring that up is that there are certainly training programs and tutorials that are more focused on the mathematics and the informatics themselves, and you can look for graduate training programs or postdoctoral opportunities in that regard. But there are also things like the T32 programs focused on genomic medicine, where some of it is research but some of it is really about the application of genomics in the context of a clinical environment. Those types of training opportunities are worthwhile for people who want to become practitioners, more so than hardcore genome scientists.

I would just add that I can't imagine a more exciting area to work in. As opposed to the medical side, where there's a lot of regulatory overhead and things move much more slowly, this moves so fast, and we're still at the earliest phase; that's what people should realize. This is just starting to get going, and it has an immense future. So I think it's great to attract as much talent and enthusiasm as we can into the field.

That's a brilliant way to close. I thank you both for your time and expertise; there were so many questions we didn't get to answer, but thank you both for your presentations today. As a reminder to the audience, we're going to take about a twenty-minute break, and then we'll start up with the next session. Thank you.

Thank you. Thanks for hosting, Shannon. That was great.