Well, it looks like we're right at the hour, so we'll go ahead and get started with session number four on machine learning and clinical genomics, with a couple of brief introductions. As co-moderator: my name is Casey Overby Taylor, and I'm an assistant professor of medicine and biomedical engineering at Johns Hopkins University. Good afternoon, everyone. My name is Eric Boerwinkle, and I'm the dean of the School of Public Health at the University of Texas Health Science Center at Houston. On behalf of Casey and myself, we thank you all for being here, and welcome. There's no other business from the organizers, so why don't we just dive in and introduce our first speaker, who is Su-In Lee from the University of Washington. The title of her talk is "Explainable AI for Cancer Precision Medicine."

Hi, I'm Su-In Lee, in computer science and engineering at the University of Washington. My lab develops explainable AI techniques for a wide range of problems in biology and medicine, including clinical genomics, which is what I'm mainly going to talk about today. When we have a set of patient features and a well-trained machine learning model, we can predict various kinds of clinical outcomes or clinical phenotypes. In many cases, however, we cannot answer a very natural question: why was a certain prediction made? Similar problems occur when we use a complex, black-box machine learning model to understand the relationships between molecular profiles and phenotypes, for example, gene expression levels in brain tissue and neuropathological phenotypes. The model may predict the phenotype accurately and describe the data accurately, but it does not give us mechanistic explanations of how gene expression levels influence the phenotype, or the other way around. Machine learning is basically a black box to many of us. So recently my group has focused on addressing the black-box nature of machine learning by asking three broad questions.
First, how do we learn or select features that are interpretable? Second, how do we explain a particular prediction by estimating the importance of each feature and their interactions? Third, how do we biologically or clinically interpret a complex black-box model, such as a deep neural network? To address these questions, we have developed a variety of explainable AI techniques: for decision support systems in hospitals, to better understand cancer for precision medicine, and for Alzheimer's disease, for which there are no effective drugs at all, so we first need to understand the molecular basis of the disease in order to discover therapeutic targets. In this short talk, I'm going to focus on cancer precision medicine and the general AI techniques that are important for these problems. Say there is a patient X who has AML, an aggressive blood cancer. Like other cancers, there are many anti-cancer drugs this patient can be treated with; however, standard therapy is not personalized. Our long-term goal is to build an AI system that can take the molecular profiles of a patient and predict the response to the large number of anti-cancer drugs that can be used. In our prior studies, we showed the effectiveness of jointly training the prediction model with various kinds of prior knowledge on genes' driver potential: we can identify explainable gene expression markers for each drug, and this improves both the model's performance and its ability to identify molecular markers that are likely to be biologically meaningful. In our ongoing project, which I'm going to focus on in this talk, we are interested in improving the process of selecting a combination of drugs. The reason is that single drugs are often not effective, and this makes our problem harder.
If there are 100 FDA-approved drugs, suddenly there are over 10,000 drug pairs, so when we have that many choices it would be extremely useful to know why, or why not, certain drug pairs are a good option. For that, we are collaborating with Kamila Naxerova at Harvard Medical School, and we are in the process of developing our method, EXPRESS. For each patient and a particular combination of drugs A and B, EXPRESS takes as input the patient's gene expression data and the drug combination's features, and outputs the predicted drug synergy between A and B for that particular patient's tumor. Not only that, it also explains each prediction by giving a gene-level score for how important each gene's expression is in predicting synergy or non-synergy. Using that, we can gain pathway-level insights into what processes are important for a given synergy prediction. We trained this EXPRESS model on the Beat AML data from Oregon Health & Science University, which contains tens of thousands of samples from nearly 300 patients and 133 combinations of drugs. To perform cross-validation tests, we created multiple settings in terms of how training and test samples are divided: from the easiest setting, where we simply split at random, to more difficult settings where we made sure the test set contains only novel combinations, or novel patients, or novel drugs. For each of these cross-validation settings, we compared our method with alternative methods by learning the prediction model on the training data and testing on the held-out test data. Performance is measured by 1 − MSE, so higher is better.
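To illustrate the idea behind the harder split settings, here is a minimal sketch of a group-aware train/test split, where a chosen unit (a drug combination, a patient, or a drug) never leaks from training into testing. The function and sample names are hypothetical, not from the speakers' code:

```python
import random

def split_by_group(samples, group_fn, test_frac=0.2, seed=0):
    """Split samples so no group appears in both train and test.

    group_fn extracts the unit that must not leak across the split,
    e.g. the drug combination or the patient id.
    """
    groups = sorted({group_fn(s) for s in samples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [s for s in samples if group_fn(s) not in test_groups]
    test = [s for s in samples if group_fn(s) in test_groups]
    return train, test

# Toy samples: (patient, drug_a, drug_b)
samples = [(p, a, b) for p in range(6)
           for (a, b) in [("drug1", "drug2"), ("drug1", "drug3"), ("drug4", "drug2")]]

# "Novel combo" setting: held-out combinations are never seen in training
train, test = split_by_group(samples, group_fn=lambda s: (s[1], s[2]))
```

The "novel patient" and "novel drug" settings follow by swapping in `group_fn=lambda s: s[0]` or a per-drug grouping.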
The y-axis here is the XGBoost model we ultimately decided to use, and we compared it with three alternative approaches, because we had to choose the best-performing model in terms of accuracy. As you can see in this scatterplot, XGBoost, a fast implementation of gradient boosted trees, performs better (y-axis) than the other methods (x-axis): a linear model, a neural network, and a random forest, in terms of prediction accuracy in these cross-validation tests. In particular, the predictive performance of the linear model was substantially worse than that of the complex models; its pink dots deviate from the diagonal more than those of the other methods. This is a very common situation in machine learning: the tradeoff between accuracy and interpretability. With large modern data sets, the best accuracy is often achieved by complex models that even experts struggle to interpret, while linear models are easier to interpret but often perform worse. To resolve this challenge, we have developed a series of general machine learning techniques to measure each feature's importance for a particular prediction made on a particular patient. The first method is named SHAP, and I'm going to give you a very basic idea of how it works. It is a general machine learning framework that can be applied to any model; it doesn't matter whether you use a tree ensemble, like the XGBoost model we used here, or deep neural networks. It has three desirable theoretical properties, which many other methods in this domain do not, and its ability to provide simple explanations of predictions from arbitrarily complex models eliminates the typical tradeoff between accuracy and interpretability.
So how exactly do we estimate feature importance? To give you a very high-level idea, I'm going to use an intuitive example. Say this is John, a typical bank customer. Like many customers today, when he applies for a loan, his information is sent through a predictive model. This model is designed to calculate the risk that John will have a repayment problem, which, unfortunately for John, is 55%, and the bank declines his loan application. The natural question is: why? To explain his denial, it is important to start with the base rate of loan repayment problems: what is the chance that any loan application is declined? We denote this by the expected value of the model's output. To explain John's risk of 55%, we need to explain how we got from the base rate to his rate, and that is going to give us the explanation. To do that, we start at the expected value of the model output. Say we condition on a single feature of John's, his age; since this increases John's risk by 15%, we attribute that increase to his age: that's how much age matters. Next we condition on John's occupation, which may not be the most stable, and again attribute the change in the expected model output to his occupation as a day trader. If you repeat this for all the other features, say, open accounts and capital gains, you end up conditioning on all features and arrive at 55%, the model's predicted risk for John. The lengths of these arrows, how much is attributed to each feature, indicate their importance. Importantly, in this process the order matters. For example, the impact of John being a day trader could be particularly bad if he is a young day trader.
That means the impact of day trader could be large if you already know that he's young, so if you reverse the order, age may become more important while day trader shrinks. Our method, SHAP values, considers all possible orderings and averages these attributions over all d! orderings of the d features. We showed in our paper that this is the only solution that maintains three important properties, and these properties are what make comparisons among features in terms of their importance values meaningful. But of course, this is certainly not how you want to actually compute these SHAP values; because we have to consider all possible orderings, it can be computationally very intensive, so further research is needed, and we are doing follow-up work to develop efficient algorithms to compute these SHAP values. Let me tell you about one prior study where we used the SHAP method: a preliminary version was used in our Prescience system to predict the near-term risk of hypoxemia during surgery and to explain the factors that led to that risk. We showed that explainable prediction of hypoxemia improved clinicians' ability to anticipate the event; again, for that project we used an earlier version of the SHAP method. More recently, we developed methods to efficiently compute SHAP values for particular model types. We have done this for tree ensembles, and we are now working on neural networks. First of all, why trees? Because they are extremely popular models; in particular, many Kaggle competition winners use XGBoost.
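The ordering-and-averaging idea above can be sketched in a few lines of brute-force code. This toy uses a hypothetical risk function with an age-by-occupation interaction, and substitutes baseline values for "missing" features as a crude stand-in for conditioning; it is an illustration of the Shapley averaging, not the speakers' implementation:

```python
from itertools import permutations
from statistics import mean

def shapley_values(f, x, baseline):
    """Exact Shapley attributions for f(x): average each feature's
    marginal contribution over all orderings, filling not-yet-added
    features from a baseline input."""
    d = len(x)
    contrib = {i: [] for i in range(d)}
    for order in permutations(range(d)):
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]           # reveal feature i
            new = f(current)
            contrib[i].append(new - prev)
            prev = new
    return [mean(contrib[i]) for i in range(d)]

# Hypothetical risk model: age and day-trader interact
def risk(z):
    age, day_trader, open_accounts = z
    return 0.2 + 0.15 * age + 0.1 * day_trader + 0.1 * age * day_trader + 0.05 * open_accounts

x = [1, 1, 1]       # John: young=1, day trader=1, open accounts=1
base = [0, 0, 0]
phi = shapley_values(risk, x, base)
# Efficiency property: attributions sum to f(x) - f(baseline)
assert abs(sum(phi) - (risk(x) - risk(base))) < 1e-9
```

Note how the 0.1 interaction term is split evenly between age and day trader, exactly because the method averages over orderings where each comes first.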
The results of the Kaggle data science survey show that tree models are the most widely used models in industry these days, and we designed a tree algorithm that reduces the exact computation of SHAP values from exponential to polynomial time; I will show you soon how this method was effective in identifying the explanations underlying drug sensitivity. In this TreeSHAP paper, we applied it to various clinical prediction problems in medicine, such as chronic kidney disease, mortality, and surgery duration. It is currently used by a lot of companies, and at the link at the top you can find the code to use for your own research. As part of this work, we developed a rich visual representation of the global importance of features, which we call the SHAP summary plot. We used NHANES data to predict the mortality of nearly 10,000 people who were interviewed and examined in the 1970s and followed up for 20 years. We extracted nearly 60 common measurements and demographic values as features for a Cox proportional hazards model fit with XGBoost. That way, for each individual, we can compute not only the predicted mortality but also the SHAP values, and for each feature we can visualize their distribution. Here, the x-axis is the SHAP value, and these dots represent the SHAP values of all individuals; they are plotted horizontally and stacked vertically when they run out of space, to show the density. Each dot is colored by the value of that feature; for example, for age, blue means young and red means old. You can see that, not surprisingly, age is the most important factor.
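As a small illustration of how such a summary plot's global ranking is derived, one common convention is to order features by the mean absolute attribution across individuals. The matrix below is hypothetical, purely to show the computation:

```python
# Rows = individuals, columns = features; entries are per-prediction
# attribution (SHAP-style) values. All numbers are made up.
attributions = [
    # age    sbp    sex
    [ 0.40, -0.05,  0.00],
    [-0.35,  0.10,  0.02],
    [ 0.30,  0.00, -0.01],
]
features = ["age", "sbp", "sex"]

n = len(attributions)
# Global importance = mean absolute attribution per feature
global_importance = {
    f: sum(abs(row[j]) for row in attributions) / n
    for j, f in enumerate(features)
}
ranked = sorted(features, key=lambda f: -global_importance[f])
```

A feature like systolic blood pressure can rank high globally even if, as in the long-tail pattern described below, only a minority of individuals carry large attributions for it.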
Overall, some of the features show rare but large effects, and many features show long tails reaching to the right but not to the left. For example, systolic blood pressure has a large impact only for the minority of people with high blood pressure. This general trend means that extreme values of these measurements can significantly raise your risk of death but cannot significantly lower it. In other words, there are many ways to die younger, but not many ways to live much longer. Now let's return to EXPRESS, which explains our predictions of anti-cancer drug synergy, to understand what makes a drug combination synergistic. We computed the SHAP values for each sample; again, each sample is a patient and a combination of drugs. We then created a SHAP value matrix for the gene expression features, where each row is a sample and each column corresponds to a gene. As we saw before with the SHAP summary plot, if we focus on each feature, here a gene expression level, we can generate a SHAP summary plot that shows which gene expression features are important for drug synergy. Here is the SHAP summary plot for genes with a positive trend, meaning high expression usually indicates synergy, and for genes with a negative trend. Some of these genes are very well known to be relevant in AML biology, including known driver genes. This gene list also enables us to perform a pathway enrichment analysis to identify pathways and gene sets that could provide mechanistic explanations of drug synergy.
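A standard way to run such an enrichment test is a one-sided hypergeometric tail probability (the Fisher-test style calculation): given a ranked list of top genes, how surprising is the overlap with a pathway's gene set? The gene counts below are made up for illustration:

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_pathway, n_top, n_overlap):
    """One-sided hypergeometric tail: probability of drawing at least
    n_overlap pathway genes when n_top genes are sampled without
    replacement from a universe of n_universe genes."""
    return sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_top - k)
        / comb(n_universe, n_top)
        for k in range(n_overlap, min(n_pathway, n_top) + 1)
    )

# Hypothetical numbers: 20,000 genes in the universe, a 200-gene pathway,
# 100 top genes ranked by importance, 12 of which fall in the pathway.
p = hypergeom_enrichment_p(20000, 200, 100, 12)
# By chance we would expect ~1 overlapping gene, so 12 is highly enriched.
assert p < 1e-6
```

In practice the p-values would also be corrected for testing many pathways (e.g. Benjamini-Hochberg).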
So we did that pathway analysis, and the most important genes for our model turned out to be enriched for metabolism, hematopoietic lineage, tight junctions, JAK-STAT signaling, and so on; this is the enrichment analysis result for the curated KEGG pathways. In addition, we also looked for enrichment in the MSigDB C2 gene sets, which are more general gene expression signatures beyond the canonical pathways. Many of the top genes fall into gene sets related to a hematopoietic-stem-cell-like expression signature, so-called stemness, and these results show that this hematopoietic-stem-like expression signature is relevant to drug synergy. We also performed drug- and combination-specific analyses to reveal similarities and differences across drugs. Again, there are 133 drug combinations in the training data, drawn from nearly 50 drugs. One of the results indicates that when we embed the SHAP values for each drug combination, certain drugs tend to cluster together, such as venetoclax and panobinostat, and the combinations containing one drug from each of two clusters tend to be embedded between those two clusters. When we examined combinations containing other sets of drugs, we noticed that they are not clustered as well. We have many more sets of results on these importance scores, the SHAP values for the drug combinations, in our bioRxiv preprint. Okay, so let me summarize my talk.
In this short talk I presented our new and older explainable AI techniques, focusing on the cancer precision medicine project. The high-level conclusion is that complex models are useful: they are more accurate, and they can express complex relationships in biological data; but we need new machine learning methods to make biological or clinical sense of them. I'd like to thank the students in my lab, including eleven PhD students, one undergraduate, and three MSTP students, MD-PhD students who chose computer science as their major. We also work with a large number of collaborators at UW Medicine, Harvard Medical School, and many other institutions, and these projects are funded largely by NIH, NSF, and the American Cancer Society. Thank you.

We can start the Q&A; there are a few questions that came in. One of the questions I had was this: in your discussion of feature importance to help make models more explainable, it's clear that several demographics are important, and I'm wondering if you could also comment on the importance of modifiable factors like behaviors, since that has implications for discussions between clinicians and patients. Yes, that's an excellent question. When you have features that get high importance scores, your first question is whether there is any way to modify the clinical adverse outcome we are interested in predicting, so which features are modifiable and which are not. Demographics are obviously not modifiable, unlike features such as blood pressure.
Depending on the clinical context, there are features that are modifiable, and those features are important. One more thing that matters is causality: if a feature is modifiable and has a causal influence on the outcome you want to predict, it is the most important kind of feature in a clinical context, or even for biological understanding, because you then know which gene you want to overexpress or knock out to see the effect of its impact. So this kind of importance score given to a feature is important for interpreting the results of a machine learning model, but it is especially the modifiable features, combined with those that have a causal impact, that matter most in many applications.

First, Su-In, great work. Someone has asked a question that was actually on my mind as well. If you compare your importance results from SHAP, how do they compare to, or how are they different from, what you would get from a linear-regression-type analysis where you look at the beta coefficients? Are they correlated, or is it apples and oranges that can't be compared? That's also an excellent question. It depends on the data: if the data implies that the important features and the outcome have linear relationships, then in terms of both prediction accuracy and the features identified as important, I don't think they would be very different. But in all the data sets we have been working on, which include both gene expression data and various kinds of EHR data, the two were very different from each other, and we often found a lot of interaction effects.
Say you plot the relationship between feature values on the x-axis and SHAP values on the y-axis: you will see a lot of vertical dispersion, which means individuals who have exactly the same value for a particular feature get different values for that feature's importance; it depends on their other features. We found many cases where the conclusions from the beta coefficients and from the SHAP values were very different from each other. One more very obvious difference is that SHAP values, or feature attribution methods in general, give you personalized importance: for each sample, each instance, you get a set of importance values over all features, like a personalized marker, whereas in a linear model the beta coefficients apply to the whole population of samples. Thank you. Thank you, excellent talk.

It looks like it's time to transition to our next presentation. Next up we have Dr. Sriram Sankararaman, and he'll be talking about machine learning for large-scale genomics. Hi, my name is Sriram Sankararaman; I'm a faculty member in computer science, human genetics, and computational medicine at UCLA, and I'd like to thank the organizers for this opportunity to present. My lab works on machine learning applied to genomic and clinical data, and today I'm going to focus on a problem that comes out of the field of complex trait genetics: specifically, trying to understand the genetic architecture of complex traits, which are a function of genotype and environment. Major progress in understanding genetic architecture over the past 15 years has arisen from a study design called the genome-wide association study, or GWAS.
One of the big outcomes of these genome-wide association studies is the realization that complex traits are often controlled by hundreds, often thousands, of genetic variants, or SNPs. Despite the progress coming out of GWAS, we still have a lot to learn about genetic architecture, and this has been spurred by the growth of what we call biobanks. These are data sets that collect genomic variation and trait variation across hundreds and thousands of traits, often in the context of a healthcare setting. A very prominent example is the UK Biobank, a data set that contains genetic data from about half a million individuals, as well as trait data across thousands of diverse phenotypes. This leads us to the question of how we can learn about genetic architecture from data sets that contain hundreds of thousands, and soon millions, of genomes and thousands of traits. What are the machine learning problems and challenges that arise from biobank-scale data? It turns out there are several. On the one hand, there are statistical issues: for example, how do we build machine learning models that accurately capture the aspects of genetic architecture we care about? There are computational issues, which focus on how we can scale inference in these models to millions of genomes. There are issues of interpretability: how do we make sense of the inferences that arise from these models? And finally, there are issues of data privacy and data sharing. In today's talk, I will focus primarily on the statistical and computational issues. To ground this discussion of genetic architecture, we'll focus on one aspect of it, a quantity called heritability. Heritability refers to the proportion of variation in the phenotype that can be explained by a linear model of the genotype; typically this is referred to as narrow-sense heritability.
A powerful way of estimating heritability uses a class of statistical models called variance components models. These models describe the phenotype as a linear function of the genotype plus some environmental noise. Because of the high dimensionality of the genotype data, where typically millions of SNPs are measured, we need to put some assumptions on the effect sizes of these SNPs. The assumption is that the effect size of each SNP comes from a distribution whose variance parameter is related to something called the genetic variance component. The environment is also assumed to come from an underlying distribution whose variance parameter is termed the environmental variance component. Given the genetic and the environmental variance components, we can compute the SNP heritability. Now, heritability is a single parameter; however, a number of studies have gone beyond a single parameter describing genetic architecture, looking at how heritability varies across chromosomes, across specific genomic loci, across functional annotations, and across traits. Each of these analyses has revealed interesting, important insights into genetic architecture, and all of them have been enabled by extensions of this variance components model. Broadly speaking, the extended model describes the phenotype as a linear function of genotypes where each SNP has been assigned to one of k possible components, and within each component the effect sizes of the SNPs come from the same underlying distribution. Our goal is to estimate the k genetic variance components and the environmental variance component. The typical way of estimating these parameters is by using classical approaches like maximum likelihood.
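Written out, the single-component version of the model described above is often expressed as follows (a sketch in standard notation; the speaker's papers may use different symbols):

```latex
% y: phenotype vector (N individuals); X: column-standardized genotype matrix (N x M SNPs)
y = X\beta + \epsilon, \qquad
\beta_m \sim \mathcal{N}\!\left(0, \tfrac{\sigma_g^2}{M}\right), \qquad
\epsilon \sim \mathcal{N}\!\left(0, \sigma_e^2 I_N\right), \qquad
h^2_{\mathrm{SNP}} = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}.
```

In the multi-component extension, a SNP m assigned to component k instead has effect-size variance σ_k²/M_k, where M_k is the number of SNPs in component k, and the numerator of the heritability sums the k genetic variance components.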
That is, writing down the likelihood of the model and searching for values of the parameters that maximize it. The challenge is one of computation: typical maximum likelihood estimation techniques scale cubically in the sample size, which makes them challenging to apply to biobank-scale data, with sample sizes in the hundreds of thousands, if not millions. To tackle this, we have been developing an alternate class of estimators based on method of moments estimation: estimators that try to match the theoretical moments of the model with the sample moments. It turns out method of moments estimators by themselves are still not scalable enough, so we combine the method of moments estimator with the notion of randomization. The key insight is that instead of working with the genotype matrix, which is a very large matrix, we work with a sketch of the genotypes, obtained by multiplying the genotype matrix with some number of random vectors. The resulting sketch is a smaller matrix, so it is much more efficient to work with computationally, and what is interesting is that this sketch often preserves the statistical properties needed for the problem, even for a small number of random vectors. The resulting algorithm has a computational complexity that is essentially linear in the number of SNPs, the number of individuals, and the number B of random vectors used to form the sketch; these randomized estimators are accurate for B as small as 10. We term the resulting algorithm randomized HE regression (RHE); HE (Haseman-Elston) regression is a classical method of moments estimator that has been used in quantitative genetics. We performed extensive benchmarking of this randomized HE regression estimator.
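The sketching idea can be illustrated with a randomized trace estimator: moment-matching equations involve quantities like tr(K²) for the kinship matrix K = XXᵀ/M, which can be approximated using only matrix-vector products with X, never forming K. This is a generic Hutchinson-style sketch under simulated data, not the speaker's exact RHE algorithm:

```python
import random

random.seed(0)
N, M, B = 40, 60, 200   # individuals, SNPs, random probe vectors

# Stand-in for a column-standardized genotype matrix (real data: 0/1/2 genotypes)
X = [[random.gauss(0.0, 1.0) for _ in range(M)] for _ in range(N)]

def matvec_K(z):
    """K z where K = X X^T / M, using only products with X (O(N*M) per call)."""
    u = [sum(X[i][m] * z[i] for i in range(N)) for m in range(M)]            # X^T z
    return [sum(X[i][m] * u[m] for m in range(M)) / M for i in range(N)]     # X u / M

# Exact tr(K^2) for comparison: K is symmetric, so tr(K^2) = sum of K_ij^2.
# This costs O(N^2 M) and is exactly what becomes infeasible at biobank scale.
K = [[sum(X[i][m] * X[j][m] for m in range(M)) / M for j in range(N)] for i in range(N)]
exact = sum(K[i][j] ** 2 for i in range(N) for j in range(N))

# Randomized estimate: E[z^T K^2 z] = E[||K z||^2] = tr(K^2)
# for random z with identity covariance (Hutchinson's estimator).
est = 0.0
for _ in range(B):
    z = [random.gauss(0.0, 1.0) for _ in range(N)]
    v = matvec_K(z)
    est += sum(vi * vi for vi in v)
est /= B
```

With the cubic-cost exact computation replaced by B matrix-vector products, the overall cost becomes linear in N, M, and B, matching the scaling described in the talk.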
The first thing we find is that, across a wide variety of simulated genetic architectures, the estimates from RHE tend to be quite accurate compared to other scalable approaches, which are based on summary statistics of the genotype data; the reason, we hypothesize, is that RHE preserves more of the statistical properties of the original data. Further, RHE turns out to be much more scalable than likelihood-based approaches, and as a result it can practically be run on data sets with hundreds of thousands to millions of samples. We then applied RHE to a number of traits from the UK Biobank. In our first application, we computed the heritability of traits across different classes of SNPs: common SNPs on the genotyping array, common SNPs in the imputed data, and low-frequency SNPs in the imputed data. We find that the heritability of many of these traits increases as we include the lower-frequency SNPs. We are also able to apply this method not just to estimate total SNP heritability but also to understand how SNP heritability is distributed across the genome. For example, in this analysis we partitioned heritability across different functional genomic annotations and asked whether there was an enrichment of heritability in specific annotations. We find that, across traits, there is an enrichment of heritability in conserved regions of the genome, which has been documented in previous studies; however, we also identify trait-specific heritability enrichment in annotations such as enhancers. We are also able to partition heritability across population genomic annotations such as the allele frequency and the LD pattern of a SNP. What we find is that the per-allele heritability tends to increase for SNPs that are lower in allele frequency, as well as for SNPs with lower levels of LD, consistent with the effect of negative selection acting on these traits.
We have talked here about heritability and an efficient method for estimating it, but what is interesting is that these general insights can be applied to a number of analyses that let us go beyond heritability and ask questions about the contribution of nonlinear effects, the contribution of gene-environment interactions, and how genetic effects are shared across traits; all of these are unique aspects of biobank-scale data. More recently, we extended this basic model to estimate nonlinear effects, specifically effects that arise from dominance deviation. Now we have a model that fits an additive variance component as well as a dominance variance component, and we can estimate, for a number of traits, what proportion of the variance comes from additive versus dominance effects. What we find, across about 50 quantitative traits, is that dominance heritability is substantially smaller, less than 1% of additive heritability, and across all of these traits we find no statistically significant evidence for dominance heritability. We are also able to interrogate the contribution of gene-environment interactions. The reason this is interesting is that for many of these data sets we have not only a trait of interest but also a large number of ancillary measurements of environmental variables, and we now have a method that can estimate these kinds of gene-environment interactions on large-scale data. In an application where the environmental variable was smoking status, we quantified gene-environment interactions across a number of traits and were able to document a substantial contribution of smoking in terms of its interaction with the genetic variability underlying the traits. Finally, we are also able to quantify how genetic effects are shared across traits; this is a quantity termed genetic correlation.
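For reference, genetic correlation is typically defined through a bivariate extension of the same variance components model, where each SNP carries a pair of effect sizes on the two traits (again a sketch, not necessarily the speaker's exact formulation):

```latex
\begin{pmatrix}\beta_{m,1}\\ \beta_{m,2}\end{pmatrix}
\sim \mathcal{N}\!\left(0,\;
\frac{1}{M}\begin{pmatrix}\sigma_{g,1}^2 & \sigma_{g,12}\\ \sigma_{g,12} & \sigma_{g,2}^2\end{pmatrix}\right),
\qquad
r_g = \frac{\sigma_{g,12}}{\sqrt{\sigma_{g,1}^2\,\sigma_{g,2}^2}}.
```

Estimating the genetic covariance σ_{g,12} directly from individual-level data, rather than from per-SNP summary statistics, is what gives the power advantage described next.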
Because this underlying model quantifies genetic correlation directly, it turns out we have greater power to estimate genetic correlation compared to existing summary-statistic-based methods. So in this example, applying this method to estimate genetic correlation identifies a number of novel, statistically significant pairs of traits that are correlated in terms of their genetic effects. And because in this application to the UK Biobank we also have access to blood biomarkers, estimating genetic correlation across blood biomarkers and different diseases reveals a novel genetic correlation of coronary artery disease with serum liver enzymes. That is particularly exciting because it points to a connection between liver and heart biology. So just to summarize, what I've discussed so far is a class of methods that scale machine learning models to biobank-scale data, and the key takeaway is the effectiveness of randomization applied in a principled manner. There are of course a number of other approaches that have been proposed in the field of scalable machine learning to scale inference to large data sets. For example, there is approximate inference, which allows you to give up guarantees of exact estimates in exchange for more computationally tractable estimators. There is distributed inference, which allows you to perform inference when the data is distributed, for example in the cloud. So one thing that would be of great interest is to explore further how techniques from scalable machine learning can empower and enable biobank-scale analysis. In the remaining few minutes, I'm going to talk about some promises and challenges that arise from biobank-scale data. Our big-picture goal is to connect genetic variation to different health outcomes. What is extremely interesting in the context of biobanks is the availability of not just data on the health outcome, but data from multiple modalities.
This includes imaging, gene expression or other molecular data, biomarkers, as well as sensor data. So a key challenge is to devise and develop models that allow us to integrate different modalities while still being scalable to these large data sets. A second major challenge is to get at causality. Ultimately, we are often interested in being able to say whether some kind of exposure, for example a biomarker or a drug, is causal for an outcome. The gold-standard way to do this is to conduct a randomized clinical trial. However, such trials are often expensive and sometimes unethical. So this leads to the question of whether causal statements can be made when we have observational data. In the presence of collected genetic data, genetic variants can often serve as instrumental variables that allow us to make these causal statements. This technique has become extremely powerful and has revealed novel insights in the form of Mendelian randomization. Mendelian randomization has really been influential with the availability of these biobank-scale data sets. However, Mendelian randomization does require assumptions about the relationships between genetic variant, exposure, and outcome, and violations of these assumptions can invalidate the causal inference that stems from Mendelian randomization. For example, the assumptions require that there is no underlying population structure serving as a confounder, and that there is no horizontal pleiotropy, which is another way the assumptions of Mendelian randomization can be violated. So this leads to the questions of how we can design Mendelian randomization techniques that are robust to these kinds of confounders, and how we actually apply these techniques at the scale of biobank data sets. Another important consideration is generalizability.
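The instrumental-variable logic of Mendelian randomization can be illustrated with a toy simulation. In this sketch (purely illustrative; all effect sizes are made up), a naive regression of outcome on exposure is biased by an unobserved confounder, while the Wald ratio built from the genetic instrument recovers the causal effect:

```python
import numpy as np

# Toy Mendelian randomization setup: genotype G affects exposure E,
# E causally affects outcome Y, and an unobserved confounder U
# affects both E and Y.
rng = np.random.default_rng(0)
n = 200_000
G = rng.binomial(2, 0.3, size=n)       # instrument (genotype, 0/1/2)
U = rng.standard_normal(n)             # unobserved confounder
E = 0.5 * G + U + rng.standard_normal(n)
true_causal = 0.8
Y = true_causal * E + U + rng.standard_normal(n)

def slope(x, y):
    """Simple univariate regression slope of y on x."""
    xc = x - x.mean()
    return (xc @ (y - y.mean())) / (xc @ xc)

naive = slope(E, Y)                # biased upward by the confounder U
wald = slope(G, Y) / slope(G, E)   # Wald ratio: approximately unbiased
print(naive, wald)
```

Because G is randomized at meiosis and (in this simulation) affects Y only through E, dividing the G-to-Y association by the G-to-E association cancels the confounded pathway; this is exactly the assumption that population structure or horizontal pleiotropy would break.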
What has become clear in the last few years, with the ability to predict complex traits from genetic data, is that prediction accuracy depends critically on the match between training and test data sets. For example, when you have genetic predictors that are trained primarily on populations of European ancestry, these predictors tend to have decreased accuracy when they are tested on populations of non-European ancestry. Further, this lack of generalizability goes beyond ancestry. For example, a predictor that is trained primarily on women tends to do worse when it is tested on a data set of men. So this leads to the questions of how we design techniques that generalize well across diverse settings, and how we design evaluation strategies that test the algorithms in the right kinds of settings. Especially as these kinds of genetic predictors are increasingly adopted in clinical settings, a good way of handling generalizability becomes critical. Finally, although there are growing amounts of data in these biobanks, there is always going to be the need to pool and analyze data. And this is where the constraint of privacy kicks in: often there are restrictions on how this data can be shared. So techniques from distributed and federated machine learning, which allow you to train and deploy models without having all of the data sitting in one location, are going to become increasingly important. With that, I'd like to thank my group, which has been driving a lot of this work, thank my funding sources, and thank you for your attention. Great work. Thank you. Thank you for an excellent talk. So I'll kick off the questions just to get us going, and maybe remind all the participants to please use the Q&A feature at the bottom to ask questions, because Casey and I have been receiving them there. So my question is a very simple one about the beginning.
So suppose you run the randomized HE regression to help save time on, let's say, height, and then you also do the usual analysis, spending the resources to do it the standard way. How do those two answers differ? I think you showed those graphs, and I don't know if that was simulated data or real data, but I wanted to pick one example, one phenotype, I arbitrarily picked height, where you run it both ways. I assume the answers are very close to one another, which is what got us to this point, but I wanted to hear it for just one trait. Yeah, so that is mostly the case. What we have not been able to do is run the comparison all the way to the size of half a million individuals. Instead, the experiment we are able to do is to sub-sample to where the alternative approaches are feasible to compute, and then compare our results. Our broad observation is that estimates from this randomized HE correlate very well with likelihood-based methods, which we can think of as the gold standard. Typically there is a slight increase in the standard error of these randomized methods, maybe a 5-10% increase. So you do pay a little bit of a price, but in the large-sample setting that is often not the dominant source of error we care about; we want our biases to be as low as possible. And I see that you're answering some of the questions online. Please continue to do that. It looks like it's time to move on to the next presentation, and we can come back to questions during the Q&A. So for our next speaker, it's a pleasure to introduce an old friend, Russ Altman from Stanford University. Russ is going to be speaking about deep learning to predict the impact of rare variation in drug metabolism genes. Thanks very much.
This is Russ Altman from Stanford, and I'm happy to be talking today about deep learning to predict the impact of rare variation, particularly in genes that are involved in the metabolism of drugs. To make sure I thank the people who did this work: I want to particularly thank Greg and Adam in the upper left, the graduate students who worked on this project; Rachel and Erica in the upper right, at the University of Montana at the time, who collected the experimental validation data; and of course the entire pharmacogenomics research group at Stanford. So pharmacogenetics is the variation in drug response due to genetic differences. The idea and the vision for pharmacogenetics is that every patient will be sequenced, and we'll know what variations they have in important genes so that we can figure out if a drug will work as expected, or whether we should change the dose of the drug, or be on the lookout for increased toxicity, or maybe just use a different drug. And this is rapidly becoming a reality. Pharmacogenomics is in the clinic; in fact, I have a clinic at Stanford where I see patients, we genotype or sequence them, and then we figure out what that might mean for their drug response. You can see it's cut off here, but PharmGKB.org is a resource supported by NHGRI and others. It's about 21 years old and is filled with information about drugs that interact with genes. So for example, codeine, one of our favorite pain relievers, is significantly metabolized by CYP2D6. This is an example of the data in PharmGKB, and in fact on the left you see it's level 1A, which means we have very good evidence for this interaction. And CYP2D6 is going to be the example gene that I talk about in today's talk, a little bit in the context of codeine. So, there is a spectacular three-dimensional structure of CYP2D6.
I should say that's cytochrome P450, family 2, subfamily D, member 6: a beautiful protein that sits in the liver and metabolizes xenobiotics, or foreign substances, basically making them more soluble so that they can be excreted. So codeine is shown on the left; codeine is a substrate of CYP2D6, and it is transformed into morphine. Morphine, of course, is a very powerful pain reliever; in fact, codeine has very little impact on pain and must be metabolized to morphine in order to relieve pain. The whole chemical transaction is just changing a methyl group, you can barely see it in the codeine on the left, to a hydrogen; that demethylation winds up creating this powerful medication. The important point for our discussion, however, is that as many as 23% of people in the US have a compromised ability to metabolize codeine into morphine. It goes both ways. Some people are poor metabolizers, and codeine does not work because they never turn it into morphine; they take codeine and it's like a placebo. Other people are what we call ultra metabolizers: they turn the codeine into morphine too quickly, and therefore have an unusual response where they get a high from the morphine for a few minutes, but then don't get pain relief for the next three or four hours. So this is an archetypal example of pharmacogenomics, and one where, if we're going to practice precision medicine, we're going to need to know for every patient what their CYP2D6 status is and what kind of adjustments we should make in their medications. Now, CYP2D6 has a number of alleles; in fact, 161 haplotypes have been reported in the literature. Some of you may be familiar with what is called the star nomenclature. Star one is typically the wild type. Star two is some combination of SNP alleles that leads to a variant, star three is another set of combinations, and so on for the 161, and we're actually still discovering them as the sequencing projects do more and more.
The key challenge for pharmacogenomics is to understand the role of CYP2D6 not just for the common alleles, but for all 161, or 200, or 300 haplotypes, and their pairs as they're observed in humans (we all carry two haplotypes, obviously), so we can understand what the variant a particular patient has means for their response. And this is not an insignificant question, because as many of you may know, CYP2D6 is responsible for the metabolism of hundreds of drugs; a small subset are shown here, and some of my favorites are highlighted in red. You can see some antidepressants; tamoxifen, a very important breast cancer drug; ondansetron, which at Stanford is the number one prescribed CYP2D6 substrate; and codeine, which we've just been talking about; and there are literally hundreds of others. So just knowing the activity of CYP2D6 for a patient could have a profound effect on your ability to give them the right dose of the right medicine. In fact, we and others are involved in writing the CPIC (Clinical Pharmacogenetics Implementation Consortium) guidelines, where we advise practitioners how they can use genetic information to optimize their use of, in this case, codeine. Here's the problem. These guidelines in general address the common variations of CYP2D6, and we know that throughout the population on Earth there are many, many haplotypes that are not in these guidelines, and people with those haplotypes cannot benefit from pharmacogenomics. In fact, in one study that we published recently on rare variants in the UK Biobank, we looked at eight key genes, including 2D6, and saw almost 500 predicted deleterious variants across these eight genes. About half of those were not found in gnomAD, one of the resources for population variation, so they were somewhat novel. And in fact, 6% of the individuals in the UK Biobank carry at least one novel predicted-deleterious variant.
In fact, overall, each individual has an average of 12 drugs for which they might be expected to have an unusual response because of variants seen in their genome. And of course, I must mention that the UK Biobank is not the most diverse resource, but we did see that novel variants were enriched in the non-European populations contained in the Biobank. So not only is it a question of bringing pharmacogenomics to the population at large, it is also an issue of justice to understand what these novel variants, which may be more common in non-European populations, mean for those patients. Here is a graphical depiction of the eight genes that we looked at and their allele frequencies, and on the far right you can see a huge number that simply haven't been seen before. Therefore, as pointed out at the bottom of this slide, we have no idea what they mean for CYP2D6 function, and so, in the setting of codeine, we would have no idea whether we should use this drug based on the pharmacogenomic background. So that's what I wanted to talk about today: how can we predict the function of these rare or novel haplotypes observed in population surveys, so that we can bring pharmacogenetics to all patients? We use deep learning. Many of you know a lot about it; I'll just remind you that it's a machine learning paradigm loosely based on an analogy to neural processing. For those of you who benefited from an education in biology, you will remember how the retina works: we have light, in this case coming from the bottom, hitting the pigment epithelium at the top, the little square cells. Those have the rods and cones that detect light. Those cells are interconnected with the next level of cells, labeled here as the horizontal cells, which integrate signals from the rods and cones to begin to see whether we're looking at shadows, edges, or corners.
The horizontal cells then feed into the next level, the amacrine and ganglion cells, which further integrate and combine the raw data from the rods and cones to start to see image features like edges and lines. This then goes to the visual cortex, where further processing happens, and then our brains recognize, you know, our mother, our car, our refrigerator. That analogy is what loosely inspired deep learning, and it has performed very well on image analysis. The pixels go in on the far left, on those gray nodes, each one indicating the color of a pixel in an image. Then, by analogy to the amacrine, horizontal, and ganglion cells, we have all these levels with dense connections between the neurons at one level and the neurons at the next, which start to pick out features. You first see edges and corners; in this case, because they're looking at faces, the middle levels start to recognize facial features like eyes, ears, and noses; and eventually you see faces and can say, is that Joe Biden or not. These things have become very good at this, and we want to use this for analyzing variations in DNA: we put the DNA in at the far left, and we want to get the function of that CYP2D6 out at the far right. But we don't have a ton of data, so we're going to use something called transfer learning. In transfer learning, you train on a different task for which you do have data, and then apply a small amount of data to make a special-purpose classifier for the problem you really care about. So let's say I went to the internet and found a ton of cats. I have plenty of cats to build a deep learner that recognizes what type of cat is in a picture: tabby, Siamese, etc. I can take that, and many of the image processing steps it is doing are very relevant to dogs as well.
So we're going to lock that middle section, and then use a relatively small amount of dog data just to turn it from a cat recognizer into a dog recognizer; this has been shown to be a very effective strategy. In CYP2D6, our "cat" is going to be something called an activity score. This is a very heuristic score currently used to estimate what the function of CYP2D6 might be. It's based on looking at known functional haplotypes, saying whether they are non-functional, a little functional, or fully functional, and then counting up points. It doesn't work great, but it works pretty well, and importantly, we know how to calculate that score. So we're going to use that as our cat: we train a neural network to predict the activity score, and then we give it real data to predict the actual activities of CYP2D6. Here's the overall strategy, and you can see the reference to the paper, which came out relatively recently. We generate 50,000 sequences on a natural background of gnomAD CYP2D6 sequences, then we spike in known variations, that is, variations for which we know what the activity score would be. Then we compute the activity score, and we train a model simply to learn how to go from those sequences to the activity score. It's really just reverse-engineering the addition of these activity scores. But as the fourth point shows, this forces our neural network, a CNN, to learn key features of sequence that might affect activity scores, and other features that might not. Then we bring in our very valuable but sparse experimental data, plus some database information, to refine and create a dog classifier, if you will, and then we predict. So, graphically, we take simulated data at the bottom here, put it through all those layers, and come out with activity score predictions.
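As a rough sketch of the activity-score heuristic described above: each star allele is assigned a function value, and a diplotype's score is the sum over its two haplotypes. The allele-to-value table and phenotype thresholds below are illustrative placeholders, not the authoritative CPIC assignments:

```python
# Hedged sketch of the CYP2D6 activity-score heuristic: each star
# allele gets a function value (no function = 0, decreased = 0.5,
# normal = 1), and a diplotype's score is the sum over its two
# haplotypes. The values and thresholds here are illustrative only.
ALLELE_VALUE = {
    "*1": 1.0,   # wild type: normal function (illustrative value)
    "*2": 1.0,   # normal function (illustrative value)
    "*10": 0.5,  # decreased function (illustrative value)
    "*4": 0.0,   # no function (illustrative value)
}

def activity_score(hap1, hap2):
    """Sum the function values of the two haplotypes."""
    return ALLELE_VALUE[hap1] + ALLELE_VALUE[hap2]

def phenotype(score):
    """Map a score to a coarse metabolizer category; the cutoffs
    are a simplification of published guideline thresholds."""
    if score == 0:
        return "poor metabolizer"
    if score < 1.5:
        return "intermediate metabolizer"
    return "normal metabolizer"

s = activity_score("*1", "*4")
print(s, phenotype(s))  # 1.0 intermediate metabolizer
```

The point of the talk's trick is that this score is cheap to compute for any simulated sequence, so it can label the 50,000 pretraining examples even though it is only a heuristic.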
We use the real data just to refine the green segment here, and then we use the database information from PharmGKB and other resources to make a final predictor. The green is the dog; I think you get the idea. The data we used is from Erica and Rachel: with colleagues, they took 360 liver samples, sequenced CYP2D6, and made two activity measurements per sample, and they found 161 variant sites. That is our very valuable activity data. And then of course we have databases of star alleles, several of which have been annotated as normal, decreased, or no function, and then there are 71 for which we have no annotation. Those cannot be used in training, because we don't know their function; we're going to use those to evaluate the classifier. So just to be clear, what we give at the very beginning, the equivalent of the pixels of an image, are the A's, C's, T's, and G's of the sequence, plus some annotations which I'll talk about in a second. We take them through all our layers of connections, and we come out with two scores: a score for normality and a score for no function. That's just how we did it; you could have done it as one score, but doing it as two works better, and I'll talk about that in a second. The annotations we included are these: is it in the coding region, is it a rare allele, is it deleterious per some of these algorithms that look at sequence information, is it an indel, methylation or other epigenomic markers, transcription factor binding sites, and known active sites of the protein. So that, plus the sequence, is what we give it. We do the little trick I've already described: we take our normal score and our no-function score, and we make a little decision tree that tells us, for each haplotype, whether we're predicting that it's normal (green), decreased (yellow), or no function (red).
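The final step described above, turning the network's two outputs into a green/yellow/red haplotype call, can be sketched as a simple decision rule. The thresholds here are hypothetical; the actual paper uses a fitted decision tree:

```python
def classify_haplotype(normal_score, no_function_score,
                       hi=0.5, lo=0.5):
    """Hypothetical decision rule combining the network's two
    outputs into a haplotype call. The real work fits a small
    decision tree; these thresholds are made up for illustration."""
    if no_function_score >= lo and normal_score < hi:
        return "no function"   # red
    if normal_score >= hi and no_function_score < lo:
        return "normal"        # green
    return "decreased"         # yellow: the ambiguous middle ground

print(classify_haplotype(0.9, 0.1))  # normal
print(classify_haplotype(0.1, 0.9))  # no function
print(classify_haplotype(0.4, 0.4))  # decreased
```

Keeping the two outputs separate lets the ambiguous region, where neither score dominates, fall naturally into the "decreased" category, which matches the talk's observation that two scores worked better than one.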
That's how we use the output of the neural network: we take the blue and orange outputs and put them through this decision tree to make a haplotype prediction. So, here are the results. We were lucky, because a group in Japan, shown in the lower left, published a relatively recent paper on 49 CYP2D6 alleles where they measured the activity. This was not available to us in training, so for us it was perfectly great validation data we had never seen before. What you see in the upper right are the different star alleles they measured, about 49 star alleles from this Japanese group. We have the actual activity they measured on the x-axis, and we've colored each point by our prediction: green is normal, yellow is decreased, red is no function. I misspoke: some of these data were in our training set, and they're the triangles, but the circles we had never seen before. We're very happy, because as you can see, the green ones are to the far right in general, with a couple of exceptions; the yellow ones occupy the middle region; and the red ones, the ones for which we predicted no function, are generally either very far to the left or pretty far to the left. So 71% of the variance in these measurements was explained by our labels. We took this as extremely reassuring and encouraging that these kinds of transfer learning approaches, with small amounts of data, do enable us to build these models. We also looked at what each feature brought to the dance. The full model is shown on the far left, and then we removed each of the annotations to see the impact on performance. So you can see we get a big drop in performance when we provide no annotations.
We also get a big drop if we don't do the transfer learning, and an even bigger drop if we do neither. This shows that the fact that something is a rare variant was very important, the fact that it was predicted to be deleterious is important, and the other annotations all add a little bit of information as well. Then we went back to our uncurated star alleles; some of those were curated during the time we were writing the paper, so we ran the model again. We had seen none of this data: they're all circles here, no triangles. And once again we were very encouraged to see that the greens were for the most part to the right, the yellows were in the middle with a little bleeding into the green, and the reds were predicted very nicely; in these two cases we basically nailed that they were non-functional. The final thing I want to say is that we want to provide some explanation to an interested pharmacogeneticist or biologist about why we make these predictions. So for each star allele that we made a prediction on, on the y-axis, the red, yellow, and green show what we predicted, and then we have these balls that show which observed variants drove the prediction. Just to take one: for star allele *15, which was red, the dominant feature that drove that red prediction is this big black ball here, a frameshift at position 47. So at least we can tell our biologists something about what's driving these predictions, which is of course quite important, since deep learning can be somewhat inscrutable. So let me end there. Pharmacogenomics is in the clinic, and the UK Biobank and other resources tell us that there are a lot of people who cannot benefit from pharmacogenetics, because we don't know the function of their rare or novel variants.
Deep learning methods such as the one I showed you today hold a lot of promise for producing clinically useful predictions, even when they're not perfect, and that's something we can talk about later. Thank you very much. Thank you, Russ. So I'll kick it off. I am surprised at one thing; I'm not doubting it, just surprised, and I'd like your insight. There's this rule of thumb of common variants with small effects and rare variants with large effects, with exceptions. So I'm a little surprised, unless I misunderstood, which is also possible, that by training the model on common variants, the cats, you can then apply the model to rare variants, the dogs. Or was the training done on both common and rare variants, perhaps because of the sample of livers that were used, if I understood correctly? But I'll turn it over to you for your insight. Yeah, it's a great question. Thank you, Eric. First of all, and I don't mean to be glib, variants are variants with respect to their impact on the biology; the commonness or rareness of something has to do with evolutionary history. That's the first thing to say. The second thing, which I think is quite important, is that these cytochromes are under a very weird kind of pressure: they don't cause disease, so to speak, and they're very much a response to what our ancestors ate while they were migrating around the world, so they are quite free to mutate. Unless the mutation is in some way really deleterious to some basic biological function, you're going to be okay; in fact, folks who have non-functional CYP2D6s can do fine in the world, because it's not critical for life.
So a lot of what is common and what is rare is purely a historical accident, not the result of very strong pressures: it's drift and not selection, for the most part, speaking a little bit generally. Those are the background comments; now, to answer your question: yes, there would have been some rare variation in our training set, although it would have been dominated by common variation. But we didn't tell the machine learner about frequencies, and we only gave it one example of each variation. We were trying to get it to learn biology, so to speak, and we were thrilled that it learned that frameshifts, especially early in the sequence, are very bad. Now we can give it a frameshift that it has never seen, and, not to be too anthropomorphic about it, it knows that frameshifts early in the sequence are bad. It also figured out, to some extent, that big changes in the amino acids, especially in certain areas like the binding sites for the small-molecule substrates, are very important. So we were heartened that it was able to learn these things, but our performance, as you all saw very clearly, especially for those yellow ones that can be all over the place, is not perfect yet. The next generation of these models has to look explicitly at 3D structure and has to look a little more carefully at the nature of the sequence changes. I'll stop there; it's a great question. So maybe I'll ask one other general question and then turn it over to Casey. There's a question in the chat that I think the whole group will be interested in. In pharmacogenetics, as you point out, things have moved into the clinic. And now, talking to a large group of machine learning nerds: as we produce model results that are predictive enough to move into the clinic...
...how do you think the FDA is going to look at these results, and will machine learning start to be treated as a, quote, medical device that will need to be validated in that context? My question is not about your talk specifically, but a translational question of how you think we will be interacting with regulatory agencies moving forward, when our tools are more stochastic in nature. It's a great question. I know that the FDA is very interested in and very aware of all of these developments and is considering what its appropriate role is. They have a very difficult balancing act between allowing the community to figure out ways to do this effectively, and making sure that charlatans, or even well-meaning folks who are just not adding anything other than noise, don't cause harm. So you can imagine trials, and I think it would be very reasonable to do them, especially pragmatic trials, where you compare. In my clinic I have all these patients; if I get their exome data I will see these rare variants, and you could imagine a trial where we use this not-perfect-but-pretty-good 71% machine learning to guide CYP2D6-based recommendations versus not. "Not" would mean using common variants only and ignoring the rare variants. Then it's an empirical question, the degree to which we get better outcomes. Those could be difficult trials to do in a randomized controlled way, but the danger level is not that high for many of these drugs, and so a pragmatic trial or a non-randomized trial might be cheaper and still justifiable; you'd have to go through that logic, obviously. And then the FDA will be left with deciding: do we insert ourselves to approve or not approve these machine learners, or...
...do we say this is like the doctor reading the literature or reading a book, and as long as a doctor is between the machine learner and the patient, we're going to allow this to be part of the general practice of medicine that we're not going to regulate. I can imagine arguments on both sides, so I don't know how it's going to go, and I can see the benefits of each way, but it wouldn't be surprising if they took a close look at this and said, you know what, we're going to have some guidelines that we'd like to see before you start giving doctors advice. Because a machine learning algorithm is going to be somewhat inscrutable, it could cause some problems, and you probably want at least light supervision of it, so that we don't do the whole field an injustice by having a few bad actors who generate random advice bring the whole thing down. Thank you. I think you might be muted. I'm on now. Thank you for a great talk. I think we'll transition into our discussion section, so I guess we can bring all of the panelists back into the spotlight. As a reminder, this session is about machine learning and clinical genomics. One thing that seems common across all the talks is that we're considering genetics as well as other factors that can influence what we see in the patient, and how machine learning and AI algorithms are useful in this context, with reference to clinical genomics applications like drug synergy, disease presence, and drug response. So to get the panel started off: what are some of the biggest challenges to translating the approaches you described clinically, and if you were advising NHGRI, what might be some things they could do to help with research in this area of machine learning and clinical genomics?
I know I've been talking a lot, but let me just say that the one thing I mentioned in my talk, the relative under-representation of certain ethnic groups, must be addressed immediately. In my clinic I can do a pretty good job for people of northern European descent in giving them the best pharmacogenomic advice I have, but my confidence in that advice goes down very quickly as I deal with other populations, and I don't like doing that as a physician; it doesn't seem fair or just. So I think the hard work of characterizing variation in all populations and understanding, especially for the common variants, what their impact is has to be the number one target. I'd like to second Russ's point. I think that's one of the critical challenges for us, and maybe another thing to be wary of is the principle of do no harm: we should be acutely aware that these models and algorithms can exacerbate biases. We need to always be aware of the interaction between the data that's collected and how algorithms can amplify some of the biases in that data. One more thing I want to add is that the term that most frequently appeared in the slides of the three talks, and in many other talks, was interpretability. So I think the trustworthiness of machine learning models matters, and feature attribution is one of many ways to get at it, along with identifying confounding factors. Even what Russ just talked about is a domain shift: when you train a machine learning model on certain data and test it on another population, you're experiencing domain shift, and the model can capture something that did not appear in the training data. That kind of behavior can also be caught with the right interpretability methods.
So I think the development of machine learning methods for improving interpretability or explainability can be really important. Thank you all. One thing that came up in Dr. Altman's presentation was the value of using knowledge resources and transfer learning, and one thing I'm curious about, for all of your approaches, is: as evidence and knowledge change, how does that change the performance of your approaches, or are they robust to those changes over time? Incredibly important issue. We have to assume that anything deployed clinically will be regularly re-evaluated with respect to the data it was based on. It's a very thorny problem, though, that once you deploy an AI system it will affect the data that is generated subsequently, and therefore there's a real risk that we become impoverished of truly independent new data from the clinic once the first generation of AI systems is deployed. I haven't heard great solutions for how we make sure we don't basically taint all of the data: the AI algorithm is driving us in certain directions, which means other directions go unexplored, and especially through the homogenization of advice we lose the chance for serendipitous discoveries that we're doing things wrong, or that things could get better. I think this is increasingly going to be a big issue throughout all of AI, because once you deploy it, self-driving cars, medication selection, disease diagnosis, you really can't go back to the pre-AI days of in-the-wild data. Thanks, that's great. There's another question from the audience that I think is relevant to both Dr. Lee and Dr. Sankararaman, about the performance of linear versus complex AI models; it seems like for all approaches there are considerations for performance.
In terms of feature importance, or for heritability, and I was wondering if you could comment on that. Can I understand your question better? You mentioned linear models versus complex AI models. Yes: sometimes you can start with the simplest model, get really good performance, and be done. But since this is a machine learning workshop, there are circumstances where machine learning algorithms may work better, and I was wondering whether you have had experience comparing the two and understanding scenarios where one kind of model works better than the other. That's an excellent question, and I have one story. Linear models work really well in many cases, and sometimes, when you have a larger sample size, a complex model works better, but sometimes the difference in accuracy, or in test-set error or test-set MSE, is only incremental. So we did some analysis, now published in Nature Machine Intelligence in 2020 as part of our TreeSHAP paper, where we synthetically generated data and increased the amount of nonlinearity. The powerful aspect of complex AI models is their ability to capture complex relationships between a set of features X and an outcome Y. A lot of biomedical problems are naturally like that; we call many phenotypes complex phenotypes precisely because the relationships are not really linear in most cases. We see this a lot in clinical journals, for example U-shaped curves where there's always a healthy range in the middle, which by itself is not linear.
So in our experiment we showed that when we increase the nonlinearity, the complex machine learning models, for instance a tree ensemble or a deep neural network, perform better than a linear model in terms of accuracy. But even with some amount of nonlinearity between X and Y, the linear model still predicted reasonably well while picking up the wrong features. So prediction accuracy, test-set log-likelihood, or MSE are sometimes not the right measures of whether a linear model is doing well. We did the analysis using synthetic data because that's the only way to know what the true features are. The ability to model or capture complex relationships is really important in many of the biomedical problems we are solving these days, so as long as you have enough sample size, I would still encourage using a complex nonlinear model. And this area of explainable AI, or interpretable machine learning, is growing rapidly, and there are many methods to help users understand the inner workings of black-box models, so I would encourage using these models and then applying interpretability methods in a post hoc manner. I just want to acknowledge your question, Casey, sorry, Sriram, but so many students want to use deep learning because they know there is a good job market for deep learning and it needs to be on their resume. It should be a rule in academia that you must try a linear model first, because there's so much pressure to do something overly fancy for non-scientific reasons that I do understand. But Su-In's comments just now were right on: you need to try the linear models first and only move on when there's evidence of important nonlinearities in the solution space.
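The kind of synthetic experiment described here, dialing up nonlinearity and comparing model families, can be sketched minimally. This is not the published TreeSHAP benchmark, just an illustrative toy: `make_data`, the single interaction term, and all parameters are invented stand-ins.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n=2000, p=10, nonlinearity=0.0):
    """y mixes a linear signal with an interaction term; `nonlinearity`
    in [0, 1] controls how much of the signal is non-additive."""
    X = rng.normal(size=(n, p))
    linear = X[:, 0] + X[:, 1]
    nonlinear = X[:, 0] * X[:, 1]          # pairwise interaction
    y = (1 - nonlinearity) * linear + nonlinearity * nonlinear
    return X, y + 0.1 * rng.normal(size=n)

for nl in (0.0, 0.5, 1.0):
    X, y = make_data(nonlinearity=nl)
    Xtr, Xte, ytr, yte = X[:1500], X[1500:], y[:1500], y[1500:]
    lin = LinearRegression().fit(Xtr, ytr)
    gbt = GradientBoostingRegressor(random_state=0).fit(Xtr, ytr)
    print(f"nonlinearity={nl:.1f}  "
          f"linear MSE={mean_squared_error(yte, lin.predict(Xte)):.3f}  "
          f"trees MSE={mean_squared_error(yte, gbt.predict(Xte)):.3f}")
```

As the discussion notes, the interesting case is the middle regime: the linear model's test MSE may look acceptable there even though it cannot represent the interaction that actually drives the outcome.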
Yes, to follow up on what Su-In just mentioned: I think linear models are a baseline. That's what you need to start with, showing that the nonlinear models do better, and whether they actually do better is going to depend on the ratio of the number of features to the sample size, and on the signal-to-noise ratio. In the example of complex trait genetics that we're dealing with, we are still in a high-dimensional regime: we have hundreds of thousands of samples, but the number of SNPs is in the millions and tens of millions. What happens there, and this is an open research question, is that we don't yet have good nonlinear representations of this millions-long genotype vector. I'm sure we'll get there, but we don't have that, which is why, as a community, we have been dealing mainly with linear models and with modifications to these linear models that try to introduce nonlinearities in specific, biologically plausible ways. One thing I want to add is that even these admittedly simpler models can be really challenging to work with at that scale, so there are interesting issues even with linear models that need to be tackled at the scale the data is being generated. Thank you. I'd like to have Su-In talk to us a little more about this concept of interpretability. For those of us that are new to SHAP, how should we interpret interpretability? And is it wise, or a big mistake, to have causality or mechanism in the back of our mind when we're thinking about interpretability? That's a really good question. Feature attribution methods such as SHAP, which I presented, measure the influence, the impact, of each individual feature on a certain prediction; they do not have any implication in terms of causality.
The main question they answer is how a prediction was driven. So one of the big mistakes made by SHAP users, or by users of feature attribution methods generally, is assuming the attributions somehow have causal meaning. Downstream analysis needs to be done to nail down whether something is causal. And we all know that most of the time in biomedical problems we have observational data from just one time slice: we measure gene expression or epigenetic marks at one point, and we can never answer causal questions because we don't have any interventional data. In that case there need to be other methods to answer those questions, for instance using prior knowledge or clinical knowledge, things that are already known, such as gene X transcriptionally regulating gene Y. And there need to be methods to explain the explanation. We have several follow-up papers on that. For instance, say each feature is a gene's expression and you want to predict cancer drug sensitivity: you can model each gene's SHAP value as a function of the gene's properties, such as its driver potential, how much it is mutated, or how epigenetically modified it is. To arrive at such explanations, further analysis is needed. If I can just agree with Su-In: one of the big problems with machine learning, and I know the organizers of the session said we should wind up giving some advice to NHGRI, and as Eric knows very well, I'm happy to give advice to people.
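For intuition on what SHAP-style feature attribution computes, here is a brute-force exact Shapley value sketch on a toy model. This is not the SHAP library or its efficient TreeSHAP algorithm; the `shapley_values` helper and the background-averaging treatment of "absent" features are assumptions of this sketch.

```python
import itertools
import math
import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley values for prediction f(x), brute force over all
    feature subsets. Features outside the subset are replaced by values
    from a background sample and the prediction is averaged (one common
    way to approximate a feature being 'absent')."""
    p = len(x)
    def value(subset):
        # average prediction with the features in `subset` fixed to x
        Z = background.copy()
        Z[:, list(subset)] = x[list(subset)]
        return f(Z).mean()
    phi = np.zeros(p)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        for r in range(p):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(p - r - 1) / math.factorial(p)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

# toy model: y = 2*x0 + x0*x1, with a do-nothing third feature
f = lambda Z: 2 * Z[:, 0] + Z[:, 0] * Z[:, 1]
background = np.zeros((1, 3))          # baseline: all features at 0
phi = shapley_values(f, np.array([1.0, 1.0, 0.5]), background)
print(phi)  # attributions sum to f(x) - f(baseline)
```

Note what the output does and does not say: the interaction term's credit is split between x0 and x1, and the irrelevant x2 gets zero, but nothing here tells you whether x0 *causes* the outcome, which is exactly the point made above.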
One of the big things is that deep learning has been very greedy for lots of data, and the fact is, in biology, with a few exceptions (yes, we have genomes, and we have a lot of them), in many important cases we are just not dealing with those amounts of data. So what Su-In just said about the importance of integrating background knowledge into these machine learning systems is absolutely critical. There was a question in the Q&A, and I don't know how to use the tool so I deleted it instead of answering it, about what zero-shot or one-shot learning is. In machine learning this means you get it right even though you've never seen an example before, which sounds like magic, but humans do this all the time: I have a two-and-a-half-year-old grandson who has seen one cat, and when he looks at a new cat he knows it's a cat even though he's never seen it before. So another area where biology has even more of an investment than other application areas is nonlinear machine learning in data-poor situations, where background models, background data, and prior probabilities are integrated in a very strong way. Of course the community works on those problems, but I think we have a special interest here because of what's at stake, and because even our biggest data is often not very impressive to Google or Facebook or Twitter. That would be a research program in machine learning where I think we could take a unique lead. I'll draw you out a little more on that concept, because you have so much practical experience in the EHR space, where there is a lot of sparseness. Bringing together all of these talks on translation, and thinking about interpretability and sparse data sets: what do you see as the future of machine learning in translational genomics?
For EHR data in particular, moving away from very carefully designed studies, longitudinal epidemiologic studies with very little missing data and very carefully controlled environments for measuring, say, blood pressure, you move over to the EHR space where there's a lot of missing data and the measurements are often not as carefully controlled. I think that's a great example, because why doesn't that mess up clinicians? It's because they have this background knowledge. In fact, Su-In and her colleagues just recruited an excellent new faculty member named Sheng Wang, who was in my lab. What Sheng showed is the ability to do this zero-shot learning. He didn't do it for patient medical records, although there are students in the lab trying to do that now, but he showed that when you have the structure of a field, for example an ontology, then when you look at data you can say: I've never seen one of these, but I know it exists and it's a pretty good match; I'm going to call this a liver cell, even though I've never seen a liver cell. So I do think that not all missing data is equivalent. Some of it humans impute almost implicitly: docs looking at a chart don't see certain measurements, but they are very good at guessing what those measurements might have been. I think deep machine learning should be able to do that part. The problem is when the missing data is missing, critical, and very difficult to predict from common-sense background knowledge. That's the harder case. But I am bullish on these methods eventually being able to fill in a lot of that missing background data. I'd be very interested in my colleagues' thoughts on that.
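The ontology-guided zero-shot idea can be caricatured in a few lines: classify by similarity to class description vectors, so a class with no training examples can still be assigned. The marker-attribute vectors and class names below are purely invented stand-ins for real ontology knowledge, not Sheng Wang's actual method.

```python
import numpy as np

# Each cell type is described by a vector of marker attributes
# (stand-ins for ontology knowledge). "liver cell" has no training
# examples, but we still know what its profile should look like.
class_attributes = {
    "T cell":     np.array([1.0, 0.0, 0.0]),
    "neuron":     np.array([0.0, 1.0, 0.0]),
    "liver cell": np.array([0.0, 0.0, 1.0]),   # unseen class, known profile
}

def classify(expression):
    """Nearest class prototype by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(class_attributes, key=lambda c: cos(expression, class_attributes[c]))

print(classify(np.array([0.1, 0.2, 0.9])))   # matched to the unseen class
```

The point of the toy is only that side information about a class substitutes for labeled examples of it, which is what lets the model "call it a liver cell" the first time one appears.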
Yeah, Russ gave a very good answer, and just to add: we have been talking mostly about supervised learning methods today, but there are a lot of unsupervised methods for understanding the structure of data, and I see many papers on embedding learning, learning low-dimensional embeddings from EHR data. What that can do is let you take very well controlled study data and use it to impute the missing values in data sets that are relatively less dense. So methods for learning the structure of the data, unsupervised or semi-supervised learning methods, should also be developed in this space. Yeah, I think there's a lot of value in methods that can take small numbers of carefully collected data points and extrapolate to settings where the data collected is noisy. The kinds of methods Su-In alluded to, learning common embeddings, seem to show a lot of promise for those kinds of extrapolations. There are a couple of questions in the chat that I'll summarize, but the bottom line is the ability to incorporate longitudinal data into machine learning methods. Could one of you speak to the methodological challenges of incorporating longitudinal data? I would say that you're getting a lot of silence because, in my career, whenever I've been faced with longitudinal data, the first thing you try to do is make it un-longitudinal. That's the wrong answer, but it simplifies the analysis. Let's just acknowledge that this is absolutely critical, because disease trajectories, transcriptomic trajectories, developmental trajectories are all absolutely critical, and they can make a big data set all of a sudden look very small.
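A minimal stand-in for the embedding-style imputation idea raised just above: repeatedly fit a low-rank reconstruction of the data matrix and use it to fill the missing cells. Real EHR embedding methods are far richer; the `soft_impute` helper, the rank choice, and the toy data are all assumptions of this sketch.

```python
import numpy as np

def soft_impute(X, rank=2, n_iter=50):
    """Fill NaNs by repeatedly fitting a truncated-SVD (low-rank)
    reconstruction and copying it into the missing cells."""
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)  # start with column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled[mask] = low_rank[mask]                   # update only missing cells
    return filled

# rank-1 structured data with one entry masked out
rng = np.random.default_rng(0)
u, v = rng.normal(size=(20, 1)), rng.normal(size=(1, 8))
X = u @ v
X_missing = X.copy()
X_missing[3, 5] = np.nan
X_hat = soft_impute(X_missing, rank=1)
print(abs(X_hat[3, 5] - X[3, 5]))   # recovery error on the masked entry
```

The toy works because the data really is low rank; the panel's point is that dense, well-controlled cohorts can supply that shared structure for imputing the sparser EHR records.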
So part of the answer goes back to the previous discussion about prior models, because gigabytes of data, when you map them out over a timeline, become megabytes, and then people say, oh, I can't learn anything from this. I think our silence is only, and I'll speak for myself, embarrassment that the field has not been able to make a ton of progress there. Now, there are people on Wall Street who are very interested in AI for following the temporal trajectories of stocks, so this is an area where we should probably look to Wall Street to see if they come up with anything good. But so far it's very short-term and not very causal, which makes everybody nervous. I have a similar thought: I actually think longitudinal data is not a challenge, it's an opportunity. For instance, think about the causal inference problem in longitudinal data: you have more of a hint. Doctors do something, and at future time points you see the effect of it. You have more of a hint about causal relationships than with just one time slice of observational data. I think the main challenge is that in the medical domain that kind of data is less available because of patient privacy issues; this is what we experience in general in the field of EHR data analysis. So I'm viewing that kind of data set as an opportunity, and there should be more development of methods, for instance for causal inference, and also for doing explanation better, because neighboring time points must have similar explanations. I think there is a lot of opportunity there. Right.
And then there's NHGRI, or really the whole NIH, with the All of Us effort, a million people followed over time, plus the other biobanks internationally. We really won't be able to ignore this problem, and hopefully there will be some good new methods for watching these patients over time. It also depends on the density in time. Some kinds of time series data are very dense, say continuous blood pressure monitoring, whereas if it's somebody getting a CT scan, the sampling is much sparser, so you need to be careful about how you integrate across different timescales. Yeah. So, Casey, I'll give you the last word, and then we'll turn it over to the organizers for the wrap-up. Yeah, I see that we're in our last few minutes. This last discussion was very interesting, and I'm realizing that there are opportunities for collaborations across clinical and genetic research, because you see a lot of disease data sets that are either genomics-heavy with minimal phenotypes or very phenotype-heavy with little genomics. That's an opportunity I'm seeing, to cross those two areas more deeply. Thank you all very much, wonderful presentations and a great discussion. We can turn it right back over to the organizers, and I'll bring my co-chair Mark back up here. Yeah, what a fantastic workshop it's been. I'm sitting here reflecting and making notes, and the quality of the talks has been fantastic, and the quality of the discussions after the talks has been at least as fantastic. It's also been nice to see all the appreciative feedback we've been getting over a variety of media channels, including email. At first I was sad not to have this workshop in person due to COVID; now I'm not so sure. If anything, I think what we've seen is the power of Zoom to reach very large audiences.
People might have noticed that at any given point there were between 500 and well in excess of 1,000 people listening, and that was just the ones we could see who were directly logged into Zoom. The attendance, I think, easily spanned from card-carrying machine learning biologists to the general public. What Mark and I want to do in the last few minutes here, we have about 15 minutes left, is summarize some of the themes we saw today, like we did yesterday at the end of sessions one and two. I'll try to do that now for sessions three and four. The first observation I had was that we saw lots of examples of how large NIH-sponsored data sets have contributed to machine learning science. Starting with Alexis's talk at the beginning of session three, she highlighted very nicely GTEx as a major data resource, and the arguments for and against investments in those large resources actually came up at several points in the discussion today. I think we also heard similar discussions about Roadmap and some others, so it's not just GTEx, of course. So point one is that these large data sets really do help stimulate the field. Another thing session three did was start to outline some of the central problems people are thinking about, and should be thinking about, in genomics with machine learning. It nicely highlighted the variant-to-function problem that has long been a central challenge of genomics, and now we're seeing deep learning and associated technologies really make inroads into variant-to-function, also by integrating other data types like expression, and we saw the awesome power of doing that.
There was discussion in various talks today about some desirable characteristics, as Mark and I have decided to call them, of how machine learning studies should expect data to be structured and organized: what are some characteristics of the data that make machine learning easier and more facile? One, of course, and some of these may sound obvious but they're worth stating anyway, is that the data have to be FAIR, and most importantly readily available for researchers to access and to build analytical approaches against. The second point we wanted to make, and that we heard, was the high value not just of the raw data but of processed data sets; DeepSEA was one example mentioned several times where that had been done to great effect for the research community. Also associated with this is good metadata, so not just the raw data, not just the processed data, but annotations on those data; one never knows how they could be used in future projects. Anshul Kundaje, in his talk, had a really good point about being transparent about the limits of machine learning and data: the blind spots, the biases, the pitfalls. As applied to data sets, it's always nice when data set providers are transparent about what they think the weaknesses are. So I think that's a good point, Mark, and we're about halfway through, so Mark will take over and finish the summary. Thanks, Trey. Another theme we noted, discussed a number of times today, was the emphasis on using machine learning not only to make predictions but to try to gain novel insights into biology. One implication of this that we saw in the talks was an emphasis on interpretable machine learning.
We saw a number of talks that brought up causal inference methods, since very often what we care about in biomedical applications is understanding mechanistic relationships between inputs and outputs. Anshul made a very insightful point in this context: when we're focused on gaining insights, the usual performance metrics may not really capture what we care about, and there's probably a whole different set of performance metrics that needs to be considered and applied. Another theme we saw a lot of today, tied to the last one and carrying over from yesterday, was the emphasis on interpretability. Today we saw both Anshul and Russ Altman applying DeepLIFT, we saw Alexis Battle and Greg Cooper talking about graphical models, causal graphical models in the case of Greg Cooper, and Su-In Lee talking about SHAP: a variety of methods for peering inside the black boxes of these models to try to understand why and how they're making their predictions. Related to this, we saw the interplay between machine learning and causal inference and the complementarity there. Greg Cooper talked about the TCI algorithm for causal inference, in one of the later talks Dr. Sankararaman told us about Mendelian randomization, and Greg Cooper made the very important point, which we want to keep in mind, that most machine learning methods are learning associations, not necessarily causal relationships. And of course a key criterion and key driver in many of these studies is to get at causality, as we've discussed. So I think the talks and the discussions have given all of us a lot to think about, and a lot for our colleagues at NHGRI to think about going forward in their sponsored activities.
I would just second what Trey said: it's been an outstanding workshop, and we would like to thank the speakers, whose talks were uniformly excellent over both days and all sessions. We'd like to thank our fellow members of the NHGRI Genomic Data Science Working Group, many of whom were session moderators throughout the workshop. Thanks to the audience members for your participation and for offering lots of excellent questions. We want to thank the people behind the scenes who kept Zoom and the other technology aspects of the conference running very smoothly, and also the organizing committee at NHGRI, which includes Carolyn Hutter, Valentina Di Francesco, Shurjo Sen, Kris Wetterstrand, Natalie Kutcher, and John Guerin. And thanks again, everyone, for your participation in the workshop, and have a great day.