Hello everybody. I am once again your co-host, Trey Ideker. I'm a professor of genetics at UC San Diego. Today is day two of the machine learning workshop. Yesterday was great fun, and I have no reason to expect any less of today's event. Just a couple of reminders before we kick off session three of today's workshop. First of all, we had a lot of participants online yesterday, at some point exceeding a thousand attendees. If and when we reach a thousand attendees today, anyone who joins after the thousandth will join not on Zoom, but via the live stream event on YouTube, so don't be surprised by that. In terms of yesterday's presentations, we've had a lot of requests for them. PDFs of yesterday's presentations are posted on the website today, and as we go throughout the day, PDFs will appear piecemeal as sessions go on, so please look for those. As for video recordings of all of the presenters, we are also going to release those sessions, including the Q&A, but that'll take a few days, so please be patient; they will appear. For those of you on Zoom — unfortunately not on YouTube, but on Zoom — you're more than welcome and encouraged to ask questions using the Q&A link, and I think anyone who attended the workshop yesterday saw the lively discussion that it engenders, so please do avail yourself of the Q&A link. And finally, a closed captioning link is also available and will shortly be provided in the chat if it has not been already — it has not been yet, so please look for that relatively soon. So with that, I will turn it over to my co-chair, Mark Craven. Good day, everyone. I'm Mark Craven. I'm, as Trey said, co-chair of the workshop, so welcome back to day two. I'm a professor of biostatistics and medical informatics at the University of Wisconsin, and along with Trey and others I serve on the data science working group for the NHGRI. We had a great day yesterday. Maybe it would be good to highlight the goals for the workshop, which I think are threefold. One is to identify the unique opportunities and obstacles for applying machine learning in genomics, ranging from basic science to clinical applications of genomics. The second is to identify the key scientific topics in genomics that can benefit from machine learning. And the third is to help define the unique role of the NHGRI at the confluence of genomics and machine learning. We'll keep those in mind as we go through the day, and there will be a wrap-up session at the end of the day where we'll come back and revisit how the talks have reflected on those goals. The format of the sessions today is going to be like yesterday: we will have talks by stellar speakers, and after each talk there will be five minutes of Q&A addressed specifically to the speaker who just presented. Then, after the set of talks in each session, there will be 30 minutes for a broader-ranging Q&A. Those of you on Zoom have a Q&A box at the bottom of your screen; you can use that to input questions. I know yesterday we had many more questions than could be addressed in the actual session, but please contribute those questions generously — they are all being logged and will be considered by the NHGRI. So without further ado, I think we can start our first session, and the co-moderator for that will be Christina Leslie, who will introduce our first speaker. Hi everyone.
I'm Christina Leslie. I'm a member of the Computational and Systems Biology Program at Memorial Sloan Kettering Cancer Center. I'm very excited to co-moderate this session three on data and resource needs for machine learning in genomics. We have an amazing lineup of people truly at the forefront of genomics. Our first speaker is Alexis Battle, who is an associate professor of biomedical engineering at Johns Hopkins University. Hi everyone. My name is Alexis Battle and I'm from Johns Hopkins University. Today I'm going to give you a little bit of an introduction and background into the work that my lab does more broadly before I go into a bit of a deep dive on our work on understanding rare genetic variation, and I'll end with some more general thoughts relevant to the use of machine learning in genomics. One of the major goals of my lab is to understand the effects of genetic variation, specifically on gene expression, and how that then goes on to impact higher-level phenotypes such as disease. We are a purely computational lab, so much of our work is on methods development to achieve these goals. One of the complications that we face is that there are many factors that modulate regulatory genetic effects. There's not really just one effect of a given genetic variant; it can in fact be very context specific, and different aspects of biology can modify the effects of genetic variation, including differentiation and development — leading of course to cell type specificity — as well as environmental response and sex specificity, which have effects that we observe too. Disease is affected by many of these specific contexts, and in order to address this we therefore need both tailored data to represent these different contexts and possible effects, and methods development that can use such data. Now, one data set that I will highlight in this discussion is data from the GTEx project, which our lab helped lead over the past few years. GTEx's goal was to understand the tissue-specific effects of genetic regulation of gene expression. To do so, we collected this very large data set that includes almost 1,000 individuals with RNA sequencing across over 50 different tissues of the human body, now paired with whole genome sequencing of these individuals. This very large data set allowed GTEx to run both cis and trans eQTL detection in each tissue, which ultimately provides a huge catalog of eQTLs that can reveal the effects of individual genetic variants on gene regulation in all of these different tissues, and ultimately to intersect these effects with disease to try to understand downstream impact on phenotype. One thing that I'd like to highlight about the GTEx project and other very large scale data collection efforts is that it has led to dozens of creative projects, within the consortium and outside of it, going well beyond the original goals of eQTL detection. And this is true for many other large scale data sets, some of which I've worked on, such as the Depression Genes and Networks data, but also many, many others such as ENCODE, Roadmap Epigenomics, and more that have gone on to enable really creative and interesting work well beyond their original goals.
So to highlight just a few of the diverse things that my lab is interested in right now, before I get to the deep dive: what we are really thinking about is how to combine machine learning methods development with very diverse and often context-specific transcriptomic data in order to understand the effects of genetic variation. Some current highlights I'll point out are an interest in single cell and dynamic eQTL models; context specificity more broadly, going beyond just cell type specificity; work to integrate the effects that we find on the transcriptome with GWAS and other disease studies in multiomic integrative analysis; and large scale network inference and integration with disease. But for today I'm going to go into a bit of a deeper discussion about our applications of machine learning for understanding rare genetic variation. So why are we working on this? Well, our motivation is that rare variation is very abundant in the population. For example, an individual genome is going to have on average somewhere around 50,000 rare genetic variants, where rare here is defined as below a minor allele frequency of 0.01. Furthermore, in aggregate we do know that lots of evidence suggests rare variants are enriched for deleterious properties and that they contribute significantly to both rare and complex disorders. However, evaluating the impact of any given rare genetic variant from whole genome sequencing remains challenging: of course, if they're too rare they cannot be assessed by association studies, and predicting their consequences can be very challenging. Finally, half of rare disease patients, for example, go undiagnosed with current approaches. So the goals of this project were to explore the impact of rare regulatory variation — specifically, to explore the complex effects of rare non-coding and regulatory variants, as we see them in RNA sequencing data, to provide evidence of their effects — and then ultimately to develop an integrative machine learning model to prioritize rare regulatory variants from personal genomes supplemented by RNA sequencing data. This was a project that did come out of GTEx. It was not a project that we originally planned when we joined the GTEx consortium; it was not something we proposed as part of any of the grants we submitted or anything like that. But once we could see the utility of looking at whole genome sequencing in combination with RNA sequencing in the GTEx project, it became a big focus of our work there. Here what we focused on was 714 individuals of European ancestry for whom we had both whole genome sequencing and RNA sequencing, again across multiple different tissues. So, how do we use RNA-seq to help prioritize functional rare variants? Well, the hypothesis we begin with is very simple: a functional variant, regardless of its allele frequency and regardless of whether it's coding or non-coding, will cause some sort of disruption at a tissue and cellular level, in addition to any consequence for disease. Specifically, rare regulatory variation, we think, should often result in unusual expression of genes near those variants, likely acting in cis — a very large effect rare variant, if it's regulatory, should have some large observable consequence in gene expression.
So the simple approach that we took to begin with is to identify individuals whose gene expression is very far from the population average. We defined a class of individuals that we call outliers — here I'm showing total expression outliers. To define these, we begin with a large population such as GTEx where we can estimate the normal distribution of gene expression: one specific gene at a time, we just build a distribution of what gene expression normally looks like for an individual, and we can then identify individuals who are very different from the rest of the population. Here, we're showing individuals who have a z-score that exceeds some particular threshold that we can define, and those will be our outlier individuals. We and others have used this basic idea in several analyses now, listed here on the right. Going beyond total expression, however, there's a lot more information available to us in RNA sequencing, and one thing we can look at specifically is alternative splicing. We know from many previous studies that both rare and common genetic variants that affect splicing have been implicated in disease — again, both rare and common diseases. But abnormal total gene expression was sort of simple to describe: you either go up or down, and if you go too far up or too far down, that makes you an outlier. How does that translate to quantifying who is an outlier, who is unusual, for splicing? If you have a gene with many different possible exons and many different possible splice junctions, we're talking about a multi-dimensional space where an individual may be an outlier, and how do we really define that? This is a challenge that we had been looking at, and a student from our lab, Ben Strober, developed a method called SPOT (splicing outlier detection) that can in fact address this. What he does is to again think about our population of individuals, such as the 700 people that we have from GTEx; you can build a matrix that describes how often they use different splice junctions, based on quantification here from LeafCutter. From that we can again estimate a distribution of what the population normally looks like. Here, instead of a univariate Gaussian, it is a multivariate distribution — he's using a Dirichlet-multinomial — and he can then estimate its parameters from our population of individuals in GTEx. So he can actually estimate all the parameters and learn them directly from our observed quantifications of the GTEx individuals to build this distribution of what splicing normally looks like in our population. Now if you have a new individual, you can take that learned distribution, compare them to it, and figure out how far away they are in multi-dimensional space using the Mahalanobis distance. What I'm showing at the bottom here is an example of an individual who was detected to be an outlier based on SPOT. The pink individual is an outlier; the black individual is an inlier who's very close to the population average. You can see, for example, that the outlier individual seems to retain a piece of intron that the normal individuals do not. And that would be hard to pre-define — it's really not just going up or down in expression, it's displaying an unusual pattern, and we needed to be able to quantify exactly how unusual that is.
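(To make the outlier idea concrete, here is a minimal sketch in Python; the variable names, the z-score cutoff, and the plain Gaussian stand-in for SPOT's Dirichlet-multinomial fit are all illustrative assumptions rather than the actual GTEx or SPOT code.)

```python
import numpy as np

def expression_outliers(expr, z_cutoff=3.0):
    """expr: genes x individuals matrix of normalized expression.
    Returns a boolean matrix marking individuals whose expression for a
    gene is more than z_cutoff standard deviations from the population mean."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    z = (expr - mean) / std
    return np.abs(z) > z_cutoff

def mahalanobis_outlier_score(junction_usage, new_individual):
    """junction_usage: individuals x junctions matrix of splice-junction
    usage fractions for one gene (population reference).
    new_individual: vector of junction usage for the person being scored.
    Returns the Mahalanobis distance of the new individual from the
    population mean -- a simple stand-in for SPOT, which instead fits a
    Dirichlet-multinomial and scores individuals under that model."""
    mu = junction_usage.mean(axis=0)
    cov = np.cov(junction_usage, rowvar=False)
    diff = new_individual - mu
    return float(np.sqrt(diff @ np.linalg.pinv(cov) @ diff))
```

The second function only illustrates the "how far from the population in multi-dimensional space" idea; SPOT's reference distribution over junction usage is the Dirichlet-multinomial fit described above.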
Okay, so with SPOT, and this idea of over- and under-expression outliers, we were able to investigate how often these individuals also have distinct classes of rare genetic variants that may then explain their unusual transcriptomic behavior. And now we have multiple different ways of looking at it: the four different plots here are four different categories of outliers — our over-expression (overE) outliers, our under-expression outliers, our allele-specific expression outliers, and our splicing outliers. You can see, just at a very high level, that the categories of rare genetic variants that coincide with these different outlier individuals are in fact quite different and illuminating as to their function. For example, you can see just from the red on the top left that our over-expression outliers often coincide with genetic variants that are duplications or copy number variants and things like that, whereas our under-expression outliers often have deletions or frameshift variants that may cause extreme under-expression. Our splicing outliers look very distinct from either of these: we're not seeing a lot of structural variants, but what we see most starkly is a large enrichment, especially for the extreme splicing outliers, for individuals having a nearby variant in one of the splice acceptor or donor sites, or very near those. In addition, in yellow here, there's some evident enrichment of other coding variants for these individuals. So we wanted to take these observations and actually integrate them into a machine learning framework that could be used for personal genomics. Our goal here is not new: to take an individual's whole genome sequence and all the variants identified from their whole genome sequencing, in combination with features derived from those variants, and build a model that can then predict from that information which of those variants are most likely to have functional impact for that individual. There have been many models that have attempted to do this, including models that many of you have probably used, such as CADD, and a large suite of models in ongoing development. These take advantage, again, of some of the really large scale data sets that have become available, such as Roadmap Epigenomics and ENCODE and others, and also annotations such as conservation, known transcription factor binding sites, and things like this. I do want to highlight that these data they use as features about the genome are general properties of the genome in those regions and are not personalized to the individual whose genome we're evaluating. We wanted to do something slightly different. These models build all sorts of different predictors based on this data, but what we wanted to ask was: can we supplement that with other personal functional molecular data? Again, going back to our original hypothesis, a rare variant that might impact someone's health should also have a molecular signature in that specific affected person. So how far can we get by taking personal transcriptomic data and incorporating that into our predictor to build, hopefully, a better model of which variants from their personal genome are likely to have large effects? And that's exactly what we did: Ben Strober developed a model called Watershed that integrates multiple different molecular signals together to try to make this prediction.
So the different layers of this probabilistic model that we built include G, which represents genomic features of the rare variant that we are attempting to evaluate — again, this could be conservation scores, regulatory element annotations, and things like this — and E, which is also observed, is the signal derived from the molecular phenotype such as RNA sequencing. Now, the key part of the model is of course these latent variables Z, which are actually the thing we care about: whether or not our variant in question has a regulatory effect. We assume that we don't observe this, so we are going to try to predict the values of these variables Z given our genomic information and our signals from the molecular phenotype. We have multiple of these variables because we may have multiple molecular phenotypes, such as expression and methylation — or in our case expression, splicing, and allele-specific expression — or, also in GTEx, multiple different tissues. Each one of these can be incorporated as a separate molecular signal E here, and Watershed is very directly and easily extensible to incorporate any molecular signal you may have, and other data types beyond RNA sequencing as well. Now, beyond being hopefully advantageous because it's an integrative model that includes both whole genome sequencing and RNA sequencing or other functional data, there are a few other nice properties that I want to point out. One of them — and I think this is important generally in genomics — is that it is trained in an unsupervised manner. Again, we assume that these latent variables Z, of whether a rare variant actually has impact, are completely unobserved during training and testing. Being unsupervised during training is important because it means that we don't have to go out and attempt to collect a large and unbiased set of training data that can tell us whether a given rare variant has impact or not, and this is something that has historically been quite challenging for other methods: we just don't have a huge set of variants that are known to be functional and variants that are known not to be functional. It's also very efficient to optimize and apply. We optimize model parameters here using expectation maximization. The latent layer here is actually an Ising conditional random field, and if you have a very high dimension of signals or tissues, there are approximate inference methods that we provide that will still make it efficient and applicable to your data. Then, at the end, the trained model can give us a posterior probability of impact for any rare variant you're evaluating, given whatever data you've observed in your new patient or new individual — it can give you that posterior probability given G alone, or given G and E, or whatever data you have available. Now, what we observe is that the inclusion of signal from RNA sequencing allows Watershed to achieve a very large improvement in prediction of which variants are functional over using only annotations from the genome alone. That would be comparable to our model here that I show in red, which we call GAM, or genome annotation model, and which excludes information from RNA sequencing. Blue is our Watershed model, which includes the information from RNA sequencing, and you can see that in each of these precision-recall curves we get a large boost and improvement by using the signal from RNA.
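(To recap the structure just described before moving on to the replication results, here is a simplified schematic of how the pieces fit together; this is an illustrative factorization in the spirit of the model, not the exact Watershed parameterization from the paper.)

```latex
% Illustrative sketch: K molecular signals E_1..E_K, latent functional
% indicators Z_1..Z_K for a rare variant, and genomic annotations G.
% Unary terms link G to each Z_k; pairwise (Ising-style) terms couple
% the Z_k's; each observed signal E_k depends only on its own Z_k.
P(Z \mid G, E) \;\propto\;
  \exp\!\Big( \sum_{k=1}^{K} z_k\, \beta^{\top} G
            \;+\; \sum_{k < l} \theta_{kl}\, z_k z_l \Big)
  \prod_{k=1}^{K} P(E_k \mid z_k)
```

Inference then amounts to computing the posterior over the Z variables, which is exactly the quantity reported as the posterior probability of impact for a variant.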
This is actually replicated even if we make predictions in completely independent data outside of GTEx, which we've done in a few different data sets, and several variants we then went on to validate with CRISPR-Cas9, and we saw that the Watershed predictions in general did hold up for the variants estimated by Watershed to be most likely to have a functional impact. If you take the class of variants that Watershed predicts to be most likely to have functional impact and compare it to the same number of variants from a genome-only model, or GAM, Watershed really dramatically improves identification of rare variants that have a very high absolute risk of functional impact — again, evaluated by their impact on the transcriptome in held-out individuals. So the variants that Watershed predicts to have a high impact are quite likely to actually have that impact, whereas the genome-alone model in general cannot predict with that level of accuracy. The conclusions that I'd like you to take from this are that rare genetic variants do in fact often coincide with large transcriptomic changes, giving us access to the effects of rare regulatory variants and not just rare coding variants, and that our integrative model Watershed, which uses signals from RNA-seq and is extensible to all sorts of other data types, provides a really large improvement in rare variant prioritization over using whole genome sequencing alone. If you're interested in reading more, our paper is posted here, and the software for both SPOT and Watershed is available on GitHub. I did want to end with some more general parting thoughts on enabling machine learning in genomics. I think I've highlighted already that key resources and opportunities include these large data sets that can enable diverse, creative applications beyond the original conception of the data itself. One thing I really want to highlight is making those data sets easily accessible to researchers, even outside the original consortia and things like that. We of course need a diversity of data types to highlight different sorts of genomic impact. We need flexible computational resources, with an increasing interest in cloud computing, of course. And we need tools and software — not just finalized tools, but also powerful frameworks that enable people to develop machine learning methods, from deep learning to probabilistic modeling to traditional ML; frameworks that enable development of machine learning methods are, I think, really valuable. The challenges I'd like to highlight include the impact of confounders and technical artifacts in these data — even the cleanest data sets do have some technical variability in them — so the collection and annotation of these data sets with really extensive metadata is critical, and that was really important to our work in GTEx and in Depression Genes and Networks. And some things that are a little different from just thinking about resources of data and computing: how do we train researchers in this highly interdisciplinary area, where they need to know stats and computer science and biology and genetics — how do we accomplish that?
And one thing I definitely want to highlight is the need for investment in vetting, maintaining, and producing really high quality computational tools from academia, which is very different from software development in industry — and really, how do we incentivize people to do that in a positive way and make sure that the computational tools that are available are really high quality? I want to thank all the members of my lab, particularly highlighting here the work of Ben Strober, and all of my collaborators who were involved in the work as well, and the GTEx consortium especially. And thank you very much for your time. Thanks, Alexis, for that great talk. So, we have some questions for Dr. Battle in the Q&A. I think the most important thing to address first — there are technical questions about the Watershed model itself: what's the dimension of the variable G, and what precisely are you modeling? Is every locus modeled together, or is it one effect at a time? Yeah, that's a good question. The model is applied to every variant separately, but the parameters of the model are trained jointly across all variants; when you're applying it, you're applying it to a particular locus at a time, if that makes sense. So, the parameters of the regression from genomic annotation to whether or not a variant is functional will tell you things like how important it is if the variant is, you know, a known MMD variant, or how important it is if it's in a known regulatory element or conserved. The G features are features about the genome, like conservation scores and things like that. The parameters are trained jointly across all loci, but then when you apply it, you're applying it to a particular locus at a time. Okay, there was a question about sample size requirements for the Watershed model. Yeah, so actually there aren't many parameters in the Watershed model: you can restrict G to, I don't know, about 50 relevant genomic annotations, and then the model from Z to E is only about a dozen parameters. So the parameter count is quite small, and you don't need a huge sample size — but you do need a decent sample size in order to estimate who is an outlier for a gene. You do need to build a good distribution of what normal gene expression, or normal gene splicing, or normal allelic balance looks like. In our original application of a similar method we had about 100 individuals and found that that was successful, so I think that's a reasonable sample size to build a good distribution — although now with GTEx we have about 1,000 individuals. With fewer than 100, though, I think it's getting a little questionable. So there's a more open-ended question that seems popular. The question is: I wonder how many of these rare variants are within regulatory regions that are active in development but not in adult tissue — maybe you have variants that show no functional impact in an adult tissue but actually do play a role in development — and the question goes on to ask whether you involve ATAC-seq chromatin accessibility data in your prediction. Absolutely. One of the big interests of my lab is looking at context specificity, and in our lab that does include looking at expression during development and differentiation, which is simply not accessible if you look at adult tissue.
So, this model is going to be based on whatever data is available, so if you would like to apply it to a context that is very highly specialized, you need data to do so, and that is a limitation of course. If you're basing it on the GTEx samples — or, you know, we actually base some of our reference distributions on the recount data, which has tens of thousands of RNA-seq samples — most of those are adult tissue or cell lines. If you're really interested in a very specific context, you need data that represents that, and I do think that's very important. There's also a question asking whether Watershed meets causal modeling requirements or is just a prediction tool. I guess what I would say is that what goes into it is the assumption that genetic variation is causal; it does not build that into the model, it's just assumed. Good question though. Okay — this is in the context of your talk, so I'm wondering what your perspective is on the evolving landscape of how we deal with confounding factors in large scale data. Do you have any new approaches to mitigate this, or are we still reinventing the wheel — is there viable progress toward standardization? That's a good question, and I did highlight confounding factors as one of the points that I think we always need to consider when we evaluate data like this. I think there are good tools out there: we found that applying tools that have existed for nearly a decade now, like PEER or SVA, or even just simple PCA, to these large scale data sets is quite successful in accounting for a lot of the technical variability. But I do think it's really, really important that as we collect these large data sets we annotate them with everything that we know — sequencing depth and RIN you can of course derive, but things like batch are really important to account for. I don't think it's necessarily a methods development question as much as it is that when you're actually applying these models you just need to be cognizant of it and actually make sure you do apply them. Mark, what's your call — should we move to the next talk? Or, I see a quick question, if I could address what Trey asked in the chat: he asked how models like Watershed could be used in a clinical setting. I do want to note that we are in fact applying Watershed to try to help diagnose rare disease individuals who come in with what looks like a genetic disorder and have gone undiagnosed with exome sequencing and standard tools. We're trying to use Watershed, collect RNA sequencing for these rare disease patients, and identify regulatory variants that might be causal and that are missed by exome sequencing, so that's absolutely in progress, and I think many people are interested in using RNA for rare disease diagnosis — hopefully tools like this will help. What — I'm sorry. In that context, how do you decide where the RNA expression should be collected from? Well, mostly we don't have a choice, right? You're not going to get brain tissue from most of your individuals, for example, so generally we're talking about blood, or maybe skin, so we can get fibroblasts. Long term, I think we can talk about applying innovative technologies where we can actually get iPSCs from the fibroblasts and induce different cell types, but at the moment we're largely restricted to what's available, which is usually blood. Okay, I know that the next talk is a bit long.
I think we should perhaps move on. Thanks so much, and we'll have more time to ask questions in the joint session. Right. Thanks for a great talk. The next speaker is Anshul Kundaje, who is an assistant professor of both genetics and computer science at Stanford University. Hi everyone, I want to start by thanking NHGRI for this opportunity to present to you today. I'll be talking to you about how we can use machine learning approaches for discovering novel biology, particularly in the context of genomics. Machine learning models have generally been optimized for prediction in most domains, but in the domains we are interested in, particularly genomics and biology, we would like to use these models to understand how they're able to make these interesting predictions. Maybe, if we're able to dissect these models, we can learn novel insights about genome biology, and that's what I'm going to focus my talk on today. As a case study, I'm going to focus on the problem of decoding regulatory DNA. As you know, genes are activated and repressed in a highly context specific manner, and this happens typically through a variety of different regulatory elements that are encoded in the genome. Regulatory elements are typically recognized by a variety of protein-DNA complexes, such as transcription factors, many of which have very specific preferences for DNA sequence motifs, and a lot of effort has been spent on characterizing the DNA sequence specificity of these transcription factors. However, regulatory elements have evolved to contain complex syntax and grammar, which essentially directs the precise combinations of proteins that bind each of these sites in different cellular contexts. This higher-order syntax or grammar of regulatory DNA has remained quite elusive, and so today I'm going to show you how we can use what are supposedly black box predictive models to discover interesting insights into regulatory grammar and syntax. Before I do that, I just want to clarify what I mean by regulatory grammar or syntax: I'm talking about the higher-order rules of motif composition, the affinity of each of these motifs, their specific arrangements — which includes spacing constraints and orientation — and how all of these syntax rules drive cooperative binding. That's going to be the focus of the case study in today's talk. Due to the sequencing revolution we've had over the last two decades, we've seen amazing improvements in our ability to profile the regulatory activity of entire genomes. We can now perform all kinds of experiments — ChIP-seq experiments, chromatin accessibility experiments, and so forth — which can give us really high resolution maps of various kinds of regulatory biomarkers: in this case, protein-DNA binding sites, various kinds of histone modifications, and other kinds of epigenetic marks across the entire genome in hundreds of cell types and tissues. The NIH and NHGRI have in fact funded several large consortia to accomplish this at scale. I've had the privilege of working with the ENCODE consortium and the Roadmap Epigenomics project for several years, and these projects have enabled pretty comprehensive mapping of various kinds of molecular readouts, such as gene expression, chromatin accessibility, histone modifications, protein-DNA binding maps, DNA methylation, and so forth, across the entire genome in hundreds to thousands of cell types and tissues, and now also across individuals.
So with the new revolution in single cell profiling techniques, this axis has expanded even further, to be able to do these kinds of measurements in single cells. These large scale datasets are an amazing opportunity to start discovering novel insights about how the genome encodes this diversity of function. Machine learning models are an ideal tool to do this, as I'll show you, because they can not only learn from large scale datasets, but, given the right tools, we can also interpret these models to decode regulatory DNA sequences and use them to prioritize functional genetic variants and mutations. I want to start by briefly going over how we can take these large scale datasets and convert them into a classical machine learning problem. If you perform a sequencing experiment — let's say profiling protein-DNA binding in a specific cell type — you get a beautiful readout of sequencing coverage across the entire genome. The way you can think about it is: take the whole genome, let's say the human genome, which is 3 billion base pairs; each bin in the genome — say a hundred base pair bin centered at every nucleotide — essentially gets mapped to some readout from the experiment. So we can think of this as a translation problem, where you're given a sequence from the genome and your goal is to learn a model that can take this sequence and translate it into these profiles coming out of the experimental assay. What we've recently done, in collaboration with Žiga Avsec and Julia Zeitlinger, is build a new kind of model called BPNet. It is a neural network, or deep learning model — think of it like a text-to-speech converter. What it's able to do is walk across a genome, take chunks of sequence, and learn a predictive mapping from the raw sequence to single base pair resolution profiles of any regulatory assay. The basic idea behind a neural network is that it is essentially a complex pattern detection engine: each layer of the neural network learns patterns of increasing complexity. When our inputs are DNA sequences, these neural networks end up learning sequence-motif-like patterns, and as you add more layers to the network, the network learns hierarchically more complex patterns, potentially learning interesting DNA syntax and grammar, and ultimately the model transduces those sequence patterns into profiles. It's a similar concept to the text-to-speech converters that are very popular in other domains. The idea is that these models are in fact able to make incredibly accurate predictions just from raw sequence. So I'm showing you here predictions from BPNet models trained on different kinds of data: ChIP-exo, which is a high-resolution protein-DNA binding assay; ChIP-seq for transcription factors; and DNase-seq, ATAC-seq, and even pseudo-bulk single-cell ATAC-seq datasets, which are very accessible. In each case, what I'm showing you are predictions of the models on sequences — entire chromosomes — that they have never actually seen before in training: you hold out some part of the genome, train on one part, and see that the model can generalize its predictions to sequences it has not seen before. Where you see OBS, that is the observed data from the actual experiment, and PRED is the profile predicted from sequence by the models.
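(Before looking at those example predictions, here is a minimal sketch of the kind of dilated-convolution, sequence-to-profile network just described; the layer sizes, loss, and other hyperparameters are placeholder assumptions, not the published BPNet architecture.)

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_profile_model(seq_len=1000, n_filters=64, n_dilated_layers=6):
    """A minimal BPNet-like sketch: one-hot DNA sequence in,
    base-resolution coverage profile out."""
    inp = layers.Input(shape=(seq_len, 4))              # one-hot encoded DNA
    x = layers.Conv1D(n_filters, 25, padding="same", activation="relu")(inp)
    for i in range(1, n_dilated_layers + 1):
        # Dilated convolutions grow the receptive field so distal
        # sequence can influence each base's prediction.
        conv = layers.Conv1D(n_filters, 3, padding="same",
                             dilation_rate=2 ** i, activation="relu")(x)
        x = layers.Add()([x, conv])                      # residual connection
    profile = layers.Conv1D(1, 25, padding="same")(x)    # per-base signal
    return Model(inp, profile)

model = build_profile_model()
model.compile(optimizer="adam", loss="poisson")          # counts-like loss
dummy_seq = np.eye(4)[np.random.randint(0, 4, size=(2, 1000))]  # 2 random sequences
print(model.predict(dummy_seq).shape)                    # (2, 1000, 1)
```

The published model additionally predicts total counts and uses a multinomial profile loss, but the core idea — residual dilated convolutions that enlarge the receptive field while keeping base-pair resolution — is the same.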
And you can see — this is for four transcription factors, Oct4, Sox2, Nanog, and Klf4, in mouse embryonic stem cells — the predictions of the models and the observed data are remarkably similar. When we perform these kinds of evaluations genome-wide, the models are often as accurate as the concordance between replicate experiments, so these models can really max out the prediction accuracy in terms of mapping sequence to these kinds of regulatory profiles. You'll also see instances where the models enable extremely effective denoising. Here's an example of ChIP-seq data, which is very sparse — you can see there's a lot of missing information. In fact, these happen to be two different data sets targeting the same protein with different antibodies, and the observed profiles can often look very different due to differences in data quality, batch effects, and sequencing depth. But when you use a model to make predictions from sequence, the predictions for the same region of the genome, for the same readout, are incredibly similar. So you also get the power of denoising and imputing missing information. But as I said, while we're very excited that the models can make such accurate predictions, what we're more excited about is the fact that these models learn representations de novo from raw sequence, and so they must be learning some interesting biology. Unfortunately, neural networks have typically been considered black boxes — that is, you cannot really figure out what's going on inside them — but I'll show you today that, using the right tools, we can in fact use them as discovery engines. The first question you might want to ask is: given a particular sequence in the genome for which your model has made an accurate prediction of some biomolecular event, how is the model making that prediction — which nucleotides in the sequence are predictive of the output? My student Avanti Shrikumar developed a new method called DeepLIFT, which we published in 2017, which takes the model and reverse engineers the contributions of individual neurons in each layer of the network, recursively, all the way back to individual nucleotides. So you get a nice decomposition of the prediction at the output in terms of the contributions of individual nucleotides. This approach is extremely useful for obtaining high resolution annotations of predictive nucleotides in any region of the genome in any context. Here is an example of a distal enhancer that regulates the Oct4 gene in mouse embryonic stem cells. We fitted BPNet models to ChIP-exo data for these four transcription factors, which all cooperatively bind that enhancer, and as you can see right here there are beautiful footprints in the predicted data which are very similar to the actual measured data. If we use DeepLIFT to interpret the contributions of individual nucleotides in the same sequence, in terms of how they contribute to binding of each of these four different factors, we see that the model can really highlight very specific instances of various kinds of motifs. And if you map these important nucleotides back to known binding recognition codes, we can obtain extremely high resolution annotations of these individual enhancers in the context of how they are decoded by different kinds of regulatory proteins — you can see that some motifs are very specific to particular proteins, whereas others are commonly used across the board.
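(As a rough illustration of what such per-nucleotide attributions look like computationally, here is a sketch using a simple gradient-times-input score on a Keras model like the one sketched above; DeepLIFT itself propagates contributions relative to a reference sequence with modified backpropagation rules, so this is only a stand-in for the idea, not the actual method.)

```python
import tensorflow as tf

def input_x_gradient_scores(model, onehot_seq):
    """Per-nucleotide contribution scores via gradient x input.
    onehot_seq: array or tensor of shape (1, seq_len, 4).
    Returns an array of the same shape, zero everywhere except at the
    observed base at each position."""
    seq = tf.convert_to_tensor(onehot_seq, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(seq)
        # Summarize the predicted profile into a scalar to attribute.
        out = tf.reduce_sum(model(seq))
    grads = tape.gradient(out, seq)
    return (grads * seq).numpy()
```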
It's quite useful if you want to interpret the models at individual loci, but maybe we also want to summarize the global patterns learned by the model. To do this we've developed another tool called TF-MoDISco. What this does is take a model and use DeepLIFT to infer the contributions of individual nucleotides for all the millions of sequences that are, let's say, bound by a protein of interest. DeepLIFT highlights all of these interesting, high-contribution core subsequences, so we can filter out the remaining parts of the sequences, focus on the important nucleotide subsequences, and then run a clustering algorithm — that's what TF-MoDISco does: it clusters these millions of patterns based on similarity and then collapses them into non-redundant, motif-like representations. Doing that for these proteins — let's say we want to understand what the model is learning in terms of predicting binding for the four transcription factors Oct4, Sox2, Nanog, and Klf4 — the typical hypothesis is that these proteins recognize maybe four, five, or six motifs. In vivo, what we see is that, due to their extensive cooperative binding interactions, we need a much more complex repertoire of motifs to explain binding; in this case we need about 50 distinct motifs to explain binding for just four transcription factors in one cell type. The motifs in fact, as I showed you before, have combinatorial contributions to binding of each of the four transcription factors: you can see, for example, that the Sox-Oct heterodimer motif is important for binding of Oct4, Sox2, and Nanog, whereas other motifs, such as these two right here, are very specific to Nanog. This is one of the reasons why the recognition code in vivo, in the genome, is more complex than what we typically find in vitro: these factors often cooperate, giving rise to more complex repertoires of recognition codes. And just to give you some examples, the neural networks can highlight subtly different motifs. In this case, we have three different motifs for Nanog; they have the same core TCA site but different flanking regions, and they have nice support from in vitro experiments as well as the crystal structure of Nanog bound to DNA. So we have some nice validation that these subtly different species of motifs are in fact very likely biologically meaningful. What we also see is the model's ability to highlight very subtle, low-affinity patterns, often flanking specific motifs. Here's an example of the Nanog motif, the TCA motif I showed you before. Classical position frequency matrix representations of motifs are unable to highlight anything going on in the flanking sequences around these motifs, but our neural network motifs actually highlight subtle stretches of TAT-repeat sequence whose contributions rise up and fall, and the periodicity of this low-affinity pattern in the flanks of the Nanog motif is exactly 10.5 base pairs, which happens to be the helical turn of DNA. So one interesting hypothesis we made here is that Nanog potentially binds DNA as a homodimer, such that the units of Nanog are often binding on the same side of the helix. And what's quite interesting is that homeobox proteins like Nanog do have evidence of binding as homodimers, even in vitro, when measured in the context of nucleosomal DNA.
Now, we explicitly tried to look for these kinds of patterns in the genome using the neural-network-identified motifs. If we look for specific spacing constraints between the Nanog core sites, these TCA sites — as shown right here, the x axis is the pairwise distance between pairs of Nanog sites, and the y axis is simply the frequency with which you see different distances — you see this beautiful helical periodicity pop up, which means that these enhancers have evolved to contain Nanog motifs with this sort of helical periodicity. And this is true for Nanog paired with a whole bunch of other motifs as well. This is quite interesting because we can start discovering syntax, but can we use the model in more interesting ways, to actually perform in silico experiments that can give us insight into how syntax drives cooperative binding? I'll show you two examples of how we do this. The first one is using the model as an oracle, making predictions on synthetically designed sequences with very specific properties. Here we've designed a sequence in which we've embedded a Nanog motif at this position and an Oct4-Sox2 motif at this position, and we're going to synthesize sequences that very systematically move this motif towards the other one. We then use the model to predict how binding of Nanog and binding of Oct4 change simultaneously as we change the spacing syntax between these two motifs. Here we are graphing, as a function of distance, the change in binding of the two TFs: the gold curve represents the change in binding for Nanog and the red curve the change in binding for Oct4. A very fascinating result pops up, which is that as you reduce the distance between the two motifs, Oct4 really doesn't respond at all — Oct4 is essentially acting like a classic pioneer factor that binds DNA and does its own thing. Nanog, on the other hand, has a massive cooperative increase in its binding as the Oct4-Sox2 motif moves towards it; you see this exponential rise in binding strength, and you again see this beautiful helical periodicity kick in. So what we're able to do is take the model and dissect a really interesting directional, cooperative relationship between two transcription factors as a function of sequence syntax, by using the model as a black-box oracle, designing synthetic sequences with specific properties, querying the model, and graphing the answer — a very powerful way to test hypotheses. The other approach we can take is to use the model, again as an oracle, to do in silico genome editing experiments. Instead of designing synthetic sequences, we can take real enhancer sequences, as shown right here — this is the same Oct4 enhancer I showed you before. You can use the model to derive inferences about the important motifs in each of these sequences; as you can see here, the Oct4-Sox2 motif is driving both Nanog and Oct4 binding, whereas the Nanog motif is only contributing to Nanog binding. So we do systematic deletion experiments where we delete the Oct4-Sox2 motif and use the model to predict what will happen to binding of the two factors, and as you can see the model predicts destruction of binding of both Oct4 and Nanog.
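(A minimal sketch of this kind of in silico motif deletion experiment, again assuming a trained sequence-to-profile model like the one sketched earlier; the scrambling strategy and helper names are illustrative assumptions.)

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA string into a (len, 4) array."""
    return np.array([[float(b == base) for base in BASES] for b in seq])

def predicted_binding(model, seq):
    """Total predicted signal over the region for a DNA string.
    The string length must match the model's input length."""
    return float(model.predict(one_hot(seq)[None, :, :]).sum())

def motif_deletion_effect(model, enhancer_seq, motif_start, motif_len, rng=None):
    """Compare predicted binding before and after replacing one motif
    instance with random bases (a stand-in for 'deleting' the motif)."""
    rng = rng or np.random.default_rng(0)
    scrambled = "".join(rng.choice(list(BASES), size=motif_len))
    edited = (enhancer_seq[:motif_start] + scrambled
              + enhancer_seq[motif_start + motif_len:])
    return predicted_binding(model, enhancer_seq), predicted_binding(model, edited)
```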
And we can do the inverse experiment, where we delete the Nanog motif and look at the effect — you can see it has no effect on Oct4 and only a minor effect on Nanog. If we analyze this across the whole genome, across all the enhancers that are bound by Oct4 and Nanog, we again see this beautiful asymmetric cooperative effect as a function of syntax, with a helical periodicity kicking in. So we are able to replicate the synthetic experiment through the in silico genome editing experiment, and we can then use these predictions to design real experiments to validate these discoveries. In this case, Julia Zeitlinger, my collaborator on this work — her lab designed CRISPR experiments that specifically took enhancers in the genome bound by Sox2 and Nanog, where the model predicts that mutating these two bases inside this motif would have a strong impact. The blue curve is the model's prediction for Sox2 binding with the original motif in the genome; if we then make the mutation using CRISPR, mutating the TT to an AG, the model predicts this red curve, which shows an attenuation of binding right at the motif as well as a proximal attenuation at a nearby footprint. If you perform the actual experiment, you see remarkable concordance between the predictions of the model and the actual experiment. And this is also true if we look at the model's predictions of binding after you make a disruption to the Sox2 motif. So this is just an example of how we can use a model to test millions of hypotheses in silico and then identify interesting designs which we can validate in the lab using, for example, real genome editing experiments. And lastly, in my last minute, I just want to show you how we can also use the models to prioritize functional genetic variation. One of the things we have done here is train BPNet models on binding data for the SPI1 transcription factor, on DNase-seq data, and on H3K27 acetylation ChIP-seq data. We can then use those three models to simultaneously predict the effects of an allelic change at one of these nucleotides, which happens to be a variant that is a validated binding QTL for the SPI1 protein. What the model predicts is that when you switch the C allele to a G allele you see this massive amplification of binding of the transcription factor; the same enhancer sees an amplification as predicted by the model for DNase-seq chromatin accessibility, and a corresponding amplification for the histone modification, H3K27 acetylation. Using the interpretation tools we have, we can further dissect what parts of the sequence are really driving this allelic effect, and as you can see the G allele induces a very powerful SPI1 motif that really enhances the signal across each of these different layers of regulatory activity. Last but not least, we can also use this to interpret causal variants at disease loci. This is an example of a GWAS locus in Alzheimer's disease, next to the PICALM gene. If you look at all the SNPs in that locus that are in strong LD with the tag SNP, there are about 165 candidates, and the causal variant is not well established in this locus.
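(The allele-swap predictions used here — both for the SPI1 binding QTL and for scoring candidate variants at loci like this one — can be sketched very simply; as before, the model and helper names are illustrative assumptions on top of a trained sequence-to-profile model, and predictions are assumed to be non-negative, count-like signals.)

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    return np.array([[float(b == base) for base in BASES] for b in seq])

def allele_effect(model, ref_seq, variant_pos, alt_base):
    """Predict the effect of a single-nucleotide change: run the model on
    the reference sequence and on the same sequence carrying the alternate
    allele, and report the log2 fold change in total predicted signal."""
    alt_seq = ref_seq[:variant_pos] + alt_base + ref_seq[variant_pos + 1:]
    ref_signal = max(float(model.predict(one_hot(ref_seq)[None]).sum()), 1e-6)
    alt_signal = max(float(model.predict(one_hot(alt_seq)[None]).sum()), 1e-6)
    return float(np.log2(alt_signal / ref_signal))

# Scoring candidate variants in a locus (each entry: sequence, position, alt base):
# effects = {name: allele_effect(model, seq, pos, alt)
#            for name, (seq, pos, alt) in candidate_variants.items()}
```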
So we have models trained on single-cell ATAC-seq data from postmortem human brains, and using these data we can dissect individual cell types and cell populations and train BPNet models on the pseudo-bulk ATAC-seq data in each of these cell populations. As I showed you in the previous slide, we can use the models to predict the effects of each of these 165 variants. What we find is that the model really homes in on one variant in this locus, which happens to exactly overlap a putative enhancer that is very specifically active in oligodendrocytes, and that enhancer appears to be looping to regulate the PICALM gene promoter. And when we use our interpretation tools to interpret what the allele is actually doing — the G allele versus the other allele of that predicted causal variant — you can see that the G allele in fact induces a transcription factor motif, so we can get very specific hypotheses about the potential function of these variants, as shown right here, in the context of any disease with well-matched regulatory profiling experiments in disease-relevant cell types and tissues. So just to summarize, hopefully I was able to show you today that when we take black-box predictive models and couple them with powerful interpretation frameworks, we don't have to accept a trade-off between prediction accuracy and interpretability; we can in fact have extremely predictive models that we can also use for biological discovery of causal phenomena, hypothesis generation, and optimized experimental design. One thing we have to be really careful about going forward is that models are going to be a very important commodity, as important as the data sets we generate, but it will be equally important to be transparent about the limits, blind spots, biases, and pitfalls of each of our models. Just to give an outlook of what we need going forward: I believe we need large-scale, harmonized, machine-learning-ready observational and perturbation data sets. We need decentralized, scalable, affordable compute resources to enable training of large-scale models across these data sets. We need unified ecosystems where the compute, the data, the models, and the literature sit right next to each other and interact with each other. We also need new kinds of user interfaces to models to enable interactive discovery, search, and design. And lastly, as a community we need to incentivize collaborative efforts and diverse contributions. And with that, I'll stop there and thank the members of my lab who have been instrumental in performing all of this research. Again, I want to highlight the fact that none of this is actually possible without extensive collaborations with fantastic scientists with different skill sets, and of course the funding agencies that support this work. Thank you. So thank you for a fascinating talk. We have a number of questions that have come through in the Q&A, and let me start by posing one of those to you: do you need to train a separate model for each assay in BPNet? And related to that, can you say a bit more about how BPNet takes into account the cellular context — how would it make predictions across different cell types? Yeah, that's a great question. Hope you can hear me.
So, yeah — think of a BPNet model as a computational representation of an experiment. Every time you perform an experiment, you fit a model, and then you use the model to gain insights into that experiment. So the model I presented today is an assay-specific, context-specific model. It doesn't have much use outside the context of the assay — well, it does have uses outside the context of the assay, because many of the features the models learn generalize to other assays; for example, in the paper we've shown how a BPNet model trained on binding data can actually predict trans effects of depletion of transcription factors on chromatin accessibility and on reporter experiments. So the representations learned by the model can definitely generalize, but very much in the context in which you trained the model, and there is a specific model for every transcription factor in every cell type. The model generalizes to predictions within that context but not necessarily outside it. There are certainly ways to train models that will be able to make predictions outside the context they're trained in, but that's not what I showed in this talk. Some questions about DeepLIFT and its scope — can you use it on other models, other convolutional neural networks? Maybe explain. Yes. DeepLIFT is a very general approach; it's part of a large family of approaches referred to as feature attribution methods. Su-In Lee, who's a speaker, I think, in the next session — she and one of her students built a unifying framework for how you take all of these different attribution methods and put them in the context of a very well-defined theoretical framework. DeepLIFT, DeepSHAP — all of these methods can be used with essentially any kind of neural network working on any kind of data, so it's not specific to genomic data. I will say that the effectiveness of these methods is very domain specific. Some people have used these kinds of methods on imaging data and so forth, and the kinds of attributions you get there often tend to be much noisier and less reliable. We are lucky that DNA sequences are discrete — they have four nucleotides — and so there's a lot more stability and a lot more interpretability that pops out of using these approaches. And I just want to highlight, based on that statement, that we have to be careful how we take methods developed for one domain and apply them in another, because in genomics we have inherited a lot of machine learning tools from other domains. One of our instincts typically is to take the models or the approaches and apply them directly as-is, but we always have to be quite careful and do the necessary checks to make sure the methods work well in the domains we actually apply them to. There's also a question about the sensitivity of the model to the window size that's being used. That's a very good question. You may not have noticed, but in my second-to-last slide, where I showed the effect of a variant on multiple regulatory phenotypes — binding, accessibility, and histone marks — my windows were different sizes: I had about a 1 kb window for binding, about 2 kb for accessibility, and 6 kb for histones.
And that's because these marks are different readouts with different landscape sizes, right? Some are very punctate, others are much broader, and some of them are affected much more by long-range distal effects. Christina gave a very nice talk yesterday on how extremely distal enhancers can have long-range effects on expression. So you do have to be careful about how you design the receptive field of the model, that is, how much of the sequence it actually sees. For the binding and accessibility experiments, we and others have shown that you don't get much gain beyond about a 2 kb window; you can predict most of what's happening using that. But as soon as you go to histone modifications or gene expression, you often need megabases of sequence to look at to really get accurate models. So, one quick question on your interpretation of variants at the end, in the Alzheimer's example. Did you have single-cell data from normal brain or disease, and were you training BPNet separately on pseudobulk for different clusters with just the reference genome? Just the specifics of that example. This was healthy human brain postmortem sample. Clearly, that would miss very important cell states that are disease specific, inflammatory states and so forth, and that's one of the reasons why we're not able to map every GWAS locus; we're able to map quite a few, but not all. We trained models on pseudobulk ATAC-seq from each of those cell states. And sorry, what was the last part of the question? Reference genome. Yes, we currently use the reference genome. The results certainly get better if you actually include allelic effects. We currently use allelic effects as validation, because we didn't have separate validation experiments, but we're now doing CRISPR and MPRA experiments, and so now we can incorporate the allelic information in the sequence itself before training the models. Okay, thanks so much. I think we have to move on. And thanks so much for a great talk. Our last speaker for this session is Gregory Cooper, who is a professor of biomedical informatics at the University of Pittsburgh. I'm Greg Cooper from the University of Pittsburgh. I'll be presenting a talk on personalized causal machine learning using genomic data. This is joint work with my colleague Dr. Xinghua Lu and others whom I'll mention a bit later. I'll first briefly discuss causal machine learning and then introduce a particular version of it that learns personalized models. Much of the talk will be devoted to a synthetic example and two real examples. Finally, I'll finish by mentioning some extensions, a few comments about the ideal data to use for personalized causal machine learning, and some conclusions. As we know, science is centrally concerned with the discovery of causal relationships, including of course genomic science. We want to understand mechanisms of action, such as the molecular details of transcription regulation. We'd like to predict the results of interventions, such as the cellular effects of a gene modification. Importantly, causal knowledge also provides insight about how to control events, such as how to reduce the overexpression of a genomic driver of cancer.
Traditional machine learning algorithms learn the causal relationships that exist in an entire population, such as a population of patients with a particular disease. Doing so, however, may result in learning a mixture of the causal relationships that are operating within subgroups of the population. It may sometimes also be the case that only the strongest shared relationships are learned. In contrast, learning causal relationships specific to a given instance, such as a patient, allows us to understand more precisely the causal mechanisms acting in that patient, which can help guide the maintenance of health and the curing of disease for that patient. As an example of instance-specific causal machine learning, we can look at identifying the somatic genomic alterations, such as gene mutations, of an individual tumor. Typically, there can be hundreds of such SGAs, only a relatively few of which are actually driving the cancerous behavior of the tumor. Moreover, our knowledge of all the drivers of cancer is incomplete. As an example of instance-specific causal modeling, I'll present the TCI algorithm, which stands for tumor-specific causal inference. It takes as input a large data set of omic data, such as The Cancer Genome Atlas data, also called TCGA, and omic data about an individual patient, such as SGAs and differentially expressed genes, or DEGs, and then outputs a bipartite network of causal relationships between the SGAs and the DEGs that is supported both by TCGA and by the patient's data. We used DNA data and expression data in this application. In particular, we define an SGA as being one if the gene contains any non-synonymous somatic mutation or has an abnormal degree of copy number variation, and otherwise zero. We define a DEG as being one if the gene was significantly differentially expressed relative to a baseline. In general, TCI can learn the relationships between SGAs and various types of phenotypes or endophenotypes; we use DEGs in the current example. We posit that SGAs estimated to cause abnormal gene expression in a given tumor are good candidates for drivers of cancer in that tumor. TCI uses a bipartite graph representation. It searches the graph for relationships between SGAs and DEGs. In doing so, it applies a Bayesian evaluation measure using TCGA data and patient-specific tumor data in order to score each graph in a tumor-specific manner. Although time doesn't permit explaining the scoring measure, it's described in the paper shown at the bottom of the slide. I'll now step through a simplified version of a TCI search to give a sense of the search process, which is pretty straightforward. The nodes here denote variables, and the red nodes denote variables with abnormal values, namely the SGAs and the DEGs. TCI retains only the SGAs and the DEGs among the variables. It assumes that the DEGs are being caused by the SGAs. This in and of itself renders TCI instance specific, because the effects of interest and their potential causes are both tumor specific. The SGAs are modeled as variables instantiated to the values in the current tumor being modeled. The DEGs are modeled as uninstantiated variables. For each DEG, the algorithm searches for the SGA that is most likely its cause. It scores each SGA in turn as a potential cause of a DEG. Here we're focused on DEG E1. It looks at each possible SGA as a cause, and then it assigns as the cause the most probable SGA for the DEG according to the Bayesian scoring measure.
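To make the search loop just described concrete, here is a schematic sketch of the per-DEG argmax search. This is not the published TCI implementation: `bayesian_score` is a placeholder for the tumor-specific Bayesian evaluation measure, and the normalization over candidate SGAs is only a rough stand-in for the reported arc posteriors.

```python
# Schematic sketch of the TCI-style search described above: for each DEG in a
# tumor, score every SGA present in that tumor as its potential cause and keep
# the highest-scoring one. `bayesian_score` is a hypothetical placeholder.
def tci_search(tumor_sgas, tumor_degs, training_data, bayesian_score):
    """Return a bipartite mapping DEG -> (best SGA, approximate posterior)."""
    assignments = {}
    for deg in tumor_degs:
        scores = {sga: bayesian_score(sga, deg, training_data) for sga in tumor_sgas}
        best_sga = max(scores, key=scores.get)
        total = sum(scores.values())
        # If the scores are marginal likelihoods under equal priors, normalizing
        # over candidate SGAs gives a rough posterior probability for the arc.
        assignments[deg] = (best_sga, scores[best_sga] / total if total else 0.0)
    return assignments
```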
It performs a similar search for the other DEGs, here looking for the cause of DEG E3, and it finds that in this case A1 is the most probable cause. Finally, for DEG E4, it searches and arrives at the most probable network, which it outputs along with the posterior probability of each arc. TCI makes the assumption that in a given tumor each DEG has only one driver. This assumption is based on the observation that SGAs perturbing members of a common pathway rarely co-occur in an individual tumor, a phenomenon referred to in the literature as mutual exclusivity. This assumption can be relaxed, but we've not found a compelling need to do so up to this point in TCI. As an example involving real data, consider the PI3K pathway that controls the gene MPL, for which there is evidence that upregulation promotes leukemia development. A population-wide eQTL analysis finds PIK3CA as a strong driver of MPL, and overall PIK3CA is the dominant driver. However, the eQTL analysis does not find PTEN, PIK3R1, or AKT1 as strong drivers of MPL, because they are overshadowed by PIK3CA in a population-wide analysis. The diagram on the right shows information for a single tumor. The green squares denote SGAs, one of which is PIK3CA. Of the approximately 300 SGAs in this particular tumor, PIK3CA was called by TCI as the driver of MPL with probability greater than 99%. In general, when PIK3CA is an SGA and MPL is a DEG, TCI assigns PIK3CA as a driver, similar to the eQTL analysis. However, consider the tumor data on the right here, where PIK3CA is not somatically altered but PIK3R1 is. Here TCI assigns PIK3R1 as the most probable driver of MPL, and with high probability, which is reasonable in light of the pathway shown to the left. As mentioned, an eQTL analysis does not do so. Similarly, here PTEN is given a high probability of being the driver of MPL, and here AKT1 is assigned as a driver with probability of about 90%. So in summary, in a given tumor, TCI analyzes driver gene possibilities in light of the genes that are somatically altered in that tumor. Of those possibilities, it finds the one that is the most likely driver based on a Bayesian scoring measure. More broadly, we applied TCI to over 5,000 tumors in TCGA. The x-axis here shows the genes that TCI called drivers with high probability when they were somatically altered. The y-axis shows the genes that were frequently called drivers, according to their absolute count across tumors in TCGA. The genes with names in blue are known cancer drivers, and at the top right in blue is TP53, which is not surprising. The genes with names in red were not on the consensus list of known drivers when we performed this study several years ago. CSMD3 was ranked highly by TCI as a driver, and it is somatically altered often in tumors in TCGA, yet it was not a known cancer driver at the time of this analysis. Therefore, we selected it for a cancer cell line study. In particular, we found a cancer cell line in which CSMD3 was being highly expressed. We then knocked it down using two different siRNAs and observed the resulting behavior of those cell lines compared to controls. The results for the cells that were knocked down are shown here with stripes and dots, and the cells not knocked down are shown in white. The knocked-down cells showed significantly less cell proliferation than the cells that were not, particularly at six days. Similarly, the cells that had CSMD3 knocked down showed significantly less cell migration.
These results are supportive of CSMD3 being a cancer driver gene in some situations, although additional investigation is needed. The COSMIC Cancer Gene Census now classifies CSMD3 as a Tier 2 cancer gene, meaning that there is strong emerging evidence that it has a role in cancer. This provides additional support for the gene being a cancer driver. Representative related work is shown here. There has been prior work on modeling and learning context-specific independence, such as that used by TCI. Other work has investigated learning instance-specific models, some of which are non-causal and others causal. To our knowledge, however, there has not been prior work on Bayesian methods for learning instance-specific causal Bayesian networks as in TCI. We are extending TCI to use sets of variables that yield a lower-dimensional embedding that can serve to provide confounder control. We're also further developing and applying methods such as one we call IGFCI, which can learn instance-specific causal pathways between the genomic drivers and the resulting phenotypes found by TCI. IGFCI also models the possibility of latent confounder variables. We've recently applied IGFCI to single-cell data to explore possible molecular pathways involved in immune regulation, and that work is under review. The ideal situations in which to apply instance-specific causal machine learning methods such as TCI and IGFCI include those with high-quality measurements that are tissue specific, include data about the cells in the microenvironment of the disease under study, and include multiple types of measurements within single cells. Nevertheless, the TCI and IGFCI methods do not require such data for us to apply them currently. Briefly, in conclusion, we believe that instance-specific causal machine learning is a promising approach for analyzing genomic data in many diseases, not just cancer. Additional development and evaluation are needed and are ongoing in our lab and in others. These are the two main references for the material I presented, which provide much more technical detail and results than we had time to get into today. I'd like to sincerely thank Dr. Jabbari and Dr. Chunhui Cai for slides and figures that they contributed to this talk. Also, thanks to Dr. Shyam Visweswaran, Dr. Cai, and Dr. Jabbari for their contributions to this research. And finally, I'd like to acknowledge NIH and NSF for their funding support. Thank you very much for your attention. I look forward to your comments and questions during the discussion period. Also, regarding follow-up, please feel free to contact Dr. Lu and me at the email addresses shown here. Thank you. Thank you so much for the talk. We have time for questions, and we're also inviting Dr. Xinghua Lu, Dr. Cooper's collaborator, to help answer the questions. So, I think there are a bunch of questions about the causality in the model and also some specifics on how you're encoding differentially expressed genes. One question is, can you extend the TCI model to include the direction of differential expression, which right now you're coding as just one or zero? Can you encode the expression level instead of discretizing? Can you model co-occurring somatic drivers? I see Xinghua Lu answering in the Q&A, but you can answer verbally for everybody. So, maybe I'll answer first regarding the discretization of the DEGs and what we are really interested in seeing from the genes.
You know, if there's one signal that turns on, one gene is down-regulated and another gene is up-regulated; for our model, the direction doesn't matter. It only indicates that these two genes are co-regulated by one potential signal. So, in that case, we found it is not necessary to differentiate up-regulated from down-regulated. Of course, when we process the data, we always keep track; we want to detect the signal as consistently one direction. So that's the potential. But I think, just to build on that, it can be done, and we have done it before, both in terms of coding up and down and in terms of encoding the precise level of gene expression. In terms of coding different combinations of SGAs, I don't think we've done that, and that's something that could be done. Okay, another kind of related question. Since the analysis is always conditional on SGA equals one, that is, on the somatic driver being there, could you elaborate on how you identify causal pairs without variability of the SGA? And maybe more broadly, can you explain in a bit of detail what makes the model qualify as a causal model? So, you know, I probably could have made that a little more explicit; I'm glad we have that question. Basically, because this is cancer data, we're making the assumption that we don't have any hidden confounding between the SGAs and the DEGs. The IGFCI algorithm that I mentioned can model for that, but TCI assumes that we don't have any hidden confounding: either it's not there from the beginning, or we control for it. In cancer, we think that's a reasonable assumption, but in some other diseases, which we're venturing into now, it would not be. So being able to do lower-dimensional representations of the entire genome and use those for confounding control is probably going to be pretty important. In terms of the former question, having to do with a fixed value for the SGA, roughly what's going on is that it's looking for an SGA in your training data that leads to a distribution of your DEG which is skewed towards zero or one. In other words, it has more information about what the DEG is. So it's looking through all the SGAs, looking at the training data, and finding an SGA that leads to a very skewed distribution of your DEG, which means it's giving you more information about the DEG. Can you elaborate on why you think that for this particular case of cancer it's a reasonable assumption that there are no hidden confounders, but you're cautious about carrying that assumption over to other contexts? Well, we're making the assumption, and I know it's a rough assumption, a first-order approximation, that the interventions, if you will, the changes that are happening in the DNA, are random; something is not both causing those changes and causing the DEGs to change. There's not some sort of hidden confounder of both those events. I'm sure there probably are some if you look hard enough, but as a first-order approximation we're making that assumption about the semi-randomness, if you will, of the somatic genomic alterations. Xinghua, did you want to add a last comment to that? Yeah, because in cancer these mutation events are somatic events. In a GWAS study, a SNP can have covariance with gene expression because of the genetic structure, the population structure.
But in cancer this is relatively minimized, because the mutation events are random in this particular setting. So you do not really see population-wise confounding; the likelihood that the expression of a gene and a mutation event are both confounded by something else is much lower than when you go to GWAS at the population level. Okay, thanks so much for the talk. I think we should bring all the speakers in and have our 30-minute wrap-up discussion, and we are happy to take questions for all the speakers in the Q&A. So we should start with one question as we wait for Q&A questions. You know, the session is about resource needs for machine learning, and Alexis commented on how some of the largest consortium efforts led to a lot of creative work. But what are we missing? What are the data sets that we need and don't have, if you were advising NHGRI? Anyone want to take that? I'm happy to start. Some of these questions came up during the Q&A after my talk, but I'm obviously very invested in the idea of context specificity, and I think we do need functional data, and that would include epigenetic data and expression data in many more cell types and contexts than we currently have, at a large scale, across multiple individuals. And we need that to be readily accessible to the research community, not only available to the ten PIs who are part of a consortium, but really readily available. So how do we do that? How do we access multiple conditions and multiple cell types? I have some particular experimental ideas in mind, but I'm really interested to hear what other people think too. Yeah, I can also add to that. I agree completely with Alexis. I think one interesting and important aspect is releasing data in a way that expedites machine learning development, right? The current strategies of releasing data that we have, even from large consortia I have been part of, release data in formats that are designed for consumption, for example through genome browsers, and relatively low-throughput consumption, right? Whereas, just to give an example, when I was working on the Roadmap Epigenomics project, I released some of the uniformly processed data, which was a lot of effort, making sure to try to remove confounders and normalize the data sets and so forth, almost two years before we published the paper. So the processed data was already released two years early. Then Olga Troyanskaya's group took that data and created the DeepSEA data set, which is a processed version of it, and that processed version of the data has served as a benchmark data set for many other methods built on top of it, right? Otherwise, every person who does machine learning for genomics would have to go back to the raw Roadmap data, or even the processed Roadmap data, and figure out how to create that matrix, and keep doing that again and again. It's a tremendous time investment, right? Some people can do this, but we need better ways of sharing those processed data sets. They should not be sitting on supplementary websites of papers; they should be in major portals where people can plug into them rapidly, prototype models, do comparisons, and so forth.
So I think there is also a difference between releasing high-quality processed data and making sure we recognize that different use cases might require different formats, different mechanisms for importing, and so forth. I completely agree with that, Anshul. I think there are sort of two issues. One of them is making data available early. GTEx came under some criticism for holding back data until publication time and things like that, and we could make data available earlier, and I think that we should do that. But the other thing is making processed data available. So a lab, your lab, my lab, may process the data in a new way; we should be able to make that available publicly, immediately, in some format. And I think things like AnVIL and others can enable this technically, but we need to enable it procedurally as well. So maybe you want to explain for the audience what AnVIL is. I'm not sure that I can even do it justice, but the community has been looking for a way to do cloud computing and to access multiple different data sets such as GTEx and others easily, accessibly, and reproducibly, and AnVIL is an NIH-supported framework — Johns Hopkins and my colleague Mike Schatz are leading part of it — that is one of the ways we're trying to do this. But procedurally, when we do a new study, we still need to make it available as soon as possible. And I would definitely encourage everyone to look into using AnVIL for data sets that already exist. Yeah, I also wanted to highlight another instance, which Alexis had in a slide: the recount database, which I think is a phenomenal effort of taking RNA-seq from basically the entire planet, uniformly putting it into one format, processing it, and making sure it all works out. It really targets the machine learning community and the computational biology community, who want to build on that data without spending three years reprocessing everything and figuring out how to normalize it; they can build very quickly on top of it. But that has to be done very carefully, otherwise batch effects and other confounders just propagate to that whole community, right? Right, but if one or a few labs do work on that, then we can share those results with people, which I think is a really effective way to help address this problem. So, following up on this theme of making data available and accessible for machine learning, I'm interested in hearing your comments about metadata. From past projects like GTEx or Roadmap Epigenomics, to what extent has metadata enabled machine learning, and to what extent has the metadata that's really needed not been there to facilitate it? My experience has been that this is critical, that confounding technical artifacts are pervasive in large-scale genomics, even in the best-designed experiments. Of course we should do good experimental design, where we randomize batches with respect to whatever feature we're trying to look at; of course we should do that. But even in population-scale data, the reality is that technical artifacts actually dominate biological effects in many of these data, including GTEx. Being able to demonstrate that we can identify those is really important, and metadata supports that.
I really encourage all of us to track that, and I think there's a tendency among some of our collaborators on the experimental side to think, well, my data is good, right? But what does "good" mean? That doesn't mean it's completely free of any sort of technical variation, right? So we need to track these things, and we need to be able to identify when some of the variation we're observing might be due to technical variation — in single-cell sequencing it's actually even more important than it is in bulk — and having metadata enables us to show that. I think that's just incredibly important. And Anshul, I don't know what your experience has been. Obviously I'm usually working on cross-individual, population-scale data; I don't know what you think about that with respect to more cell-type-level variation and ENCODE, so maybe you can comment on that. Yeah, you know, I'm going to give Greg a chance in case he wants to add something here. I'm happy to; I definitely have lots of thoughts on this, and I'll follow up. Oh, I mean, I agree with what's been said; I think metadata, for the reasons stated, is really quite critical. And I just wanted to briefly also answer Christina's earlier question about the types of data. This may have been said already — I don't think so — but I think having tissue-specific disease data, genomic tissue-specific data, is quite important, and as far as I know it's relatively rare. As we get into trying to understand different diseases in detail, having single-cell, disease-specific data in the tissues that are experiencing the disease, and knowing what the microenvironment looks like in that diseased tissue, for example the heart, is quite important. I realize the challenges, but I think it's quite important to pursue that goal. Yeah, so if I can just add on the metadata aspect: I think ENCODE is a great case study for this, because honestly, it is one of the most heterogeneous consortia in terms of data types, and we have to give real kudos to the ENCODE DCC, which has spent incredible amounts of effort on metadata. I think there are two types of metadata that are extremely important — Alexis, I think, also raised this. One is just description of the samples in a standardized format, which includes all kinds of information about technical details, dates, who did it and when, and what exactly the sample is, linking it to ontologies and so forth. The other aspect is the QC, and this is a very important piece of metadata that we often do not call metadata. One of the biggest problems of working with data in larger public repositories like GEO is that they have incredible data sets, but it takes you ages to align these data and figure out exactly what they are, because there is no standardized metadata, and more importantly, there are no QC metrics associated with it. So are we going to download all of GEO, reprocess it, and start ranking things manually, because you cannot automate QC that easily either, right? I mean, that is what we do, and it's really hard. It is extremely hard.
You know, maybe a few of us with sufficient resources and experience can potentially do that, but it is a real bottleneck for people who have the skill sets to analyze the data but then end up focusing, unfortunately, on mislabeled antibodies, batch effects, all those issues, right? That is one of the reasons I think consortium data sets are so popular: they have this uniformity, they have all this data. And if you could somehow use the consortia experience as a launch point to do this globally for all data, that would just be revolutionary, and it would have the highest impact compared to anything else we would ever do. So that's my view on this. I agree. And maybe a small plug: I've obviously been involved in consortia for basically my entire career, and that has come under criticism. Many of you may feel, why are we spending so much money on GTEx or some of these consortium efforts? I understand that, and I'm not on one side or the other of this argument. I do think that these large data sets enable, especially, machine learning and computational methods that are not possible if we don't have them. I also think we should reconsider exactly how they are run, and this is something that is very relevant for NIH to consider: when does the data get released, when does the larger community have access to it, do we have embargoes? These are really important questions, and I think that giving people access to data earlier and more broadly would be really beneficial for the community. But we're not going to do some of the science that we do without these data sets, and they cannot be collected in a single lab; they just can't. So how do we manage that? I think it's a really important question to consider. Okay, there are questions in the Q&A, but I think they're related to this general discussion — guidelines for what kind of metadata should be supported in GEO, for example — and this is all really important, but maybe we should also address some other questions. So, what struck me watching the talks: Alexis's talk is population scale, looking at effects across a population, and that's one axis, and Anshul's talk is a different axis, looking at the details of transcription factor regulatory logic in specific cell types. And Greg's is also at the level of individual cancers. But how do we combine the axes? There's a little bit of work now, and a lot of interest, in moving these kinds of deep learning models into population genetics. What are the barriers, what do you think has been accomplished, and what needs to be done — for anyone on the panel? I know Anshul can answer most of this, but I will say that one of the big barriers is that in natural variation the effects are just very small. And we have shown over and over again — not my lab, but many others have shown over and over again — that additive models are pretty good. The effects are so small that it's quite hard to apply more complex models to them. But yeah, Anshul and Greg, please chime in. Yeah, I mean, in my view there are two, maybe three, fundamental issues here which could lead to improvements.
I think the first is exactly what Alexis said. Let's just take the example of the deep learning models that are trained across the genome. The variation they are focusing on is large-scale variation: when you go from one bin of the genome to another, you see dramatic changes in the sequence, and you see corresponding dramatic changes in the effect sizes of that sequence. The models are fitting to very large variation in the data. So they're very good at predicting the effects of deleting motifs or changing syntax, but they have literally never seen, in this training regime, the effect of what happens when you change a single base. It's essentially an out-of-distribution prediction for a deep learning model; the fact that they even kind of work is fascinating, because in some ways they really should not, right? We take these models that are trained across the genome and apply them to individual variation, and we show — you've shown — that there is some predictive accuracy, but that's not what they're trained to do. So I think one way forward is, of course, that you can absolutely encode alleles into the inputs, but more importantly, you can train these models directly on the effect sizes that matter to you, which are the effect sizes of variants. I don't think there are many models in the literature that directly fine-tune on allelic effects, which you can get essentially for free from the data you actually measure, because you can infer the alleles from those same data. Or are better methods needed to deal with genotype data and better allele-specific analysis — is that a barrier?
I don't think it's actually a barrier. I just think the people working in this space are coming from different backgrounds, and that's the second problem I was partially mentioning: people coming from the deep learning space have a specific view of how to model these data, and their interests are often regulation rather than variant effects, while the statistical genetics folks really understand variation but often don't use the data at the fine scale that you potentially can with these neural networks. So one interesting approach — and Alexis and I have actually written grants together recently with the hope that we can work together in this space — is using the predictions of the deep learning models as priors within statistical genetics models, as sketched below. There is a lot of powerful information the deep learning models can give you, but you need the statistical models that directly focus on variation to really extract that information and make use of it. So I think there's going to be very interesting work in the next few years coupling these two; David Kelley already has interesting work in this space. And one thing I want to point out is that all of these models require data. If you look at GWAS, where we've actually been able to show extreme polygenicity and effect sizes of many, many variants on many of these complex traits, we're talking about data sets of hundreds of thousands of individuals, sometimes up to millions, whereas our largest eQTL data sets are usually in the thousands. So we aren't able to model those corresponding effect sizes as accurately from eQTL data as we are for GWAS, and we do need more data to do that. Yeah, maybe I'll add some comments to that. In addition to needing more data, there is the question of how you slice the data. For example, I was really excited by Alexis's talk about how you can concentrate on the individual rare variants, and yesterday Dr. Topol also mentioned the concept of digital twins: basically, you find a subset of patients, or a subset of the population, and by zooming in on that particular subset you may find something that would be ignored at the level of the whole population. Why? Because population structure and related factors may contribute to that. I think Greg already showed this on one slide, with the PI3K pathway: if you rank at the population level, PIK3R1 is ranked relatively low, but once you zoom into the subpopulation of tumors that only have PIK3R1 mutations and look at the impact of those particular events, you may find a much stronger signal. So that is the motivation: in addition to increasing the population size, dissecting the population in a certain way can strengthen the signal. We recently also tried migrating the TCI method over to GWAS studies, and we found that when you dissect the population and zoom in on subpopulations, you can significantly increase the capability of detecting rare events with impact. So it's not only about increasing the size, but also about how you dissect the data; that might be a new direction worth pursuing.
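To illustrate the idea raised above of using deep-learning-derived predictions as priors within statistical genetics models, here is a very rough sketch, not any published fine-mapping method: per-variant association evidence (approximate Bayes factors) is reweighted by normalized functional scores from a sequence model. All names and numbers here are hypothetical.

```python
# Illustrative sketch, assuming per-variant Bayes factors from an association
# analysis and functional scores from a sequence model; real fine-mapping
# approaches are considerably more involved.
import numpy as np

def reweighted_posteriors(bayes_factors: np.ndarray, functional_scores: np.ndarray) -> np.ndarray:
    """Combine per-variant Bayes factors with model-derived prior weights."""
    priors = functional_scores / functional_scores.sum()   # scores -> prior weights
    unnormalized = priors * bayes_factors
    return unnormalized / unnormalized.sum()

# Toy example: three candidate variants with similar association evidence but
# very different predicted regulatory impact.
bf = np.array([10.0, 11.0, 9.5])
scores = np.array([0.05, 0.9, 0.05])
print(reweighted_posteriors(bf, scores))  # the functionally scored variant dominates
```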
One more comment, on the metadata and related issues: we've noticed that no matter how well you document things, a lot of the time the experiments themselves contribute confounding effects, basically batch effects. Methodology-wise, utilizing the biological structure to normalize the data is another direction to go, because even when you know the data were processed in the same way, we can still see differences in the results. So knowing the methods is not enough; you need to develop methods such that the biological structure, for example the covariance structure of gene expression, can be utilized to normalize away different batch effects. That's where we have found potential. Yeah. I wanted to say something quickly about this theme, and I had a slide that alluded to it: there's a bit of work going on in this kind of instance-specific, or patient-specific, or tissue-specific modeling. And I think that, as in a lot of areas, there's an expansion phase where people try different things. We presented our method, which has some advantages; other methods have advantages too. And then there's a synthesis phase. So I think it's pretty exciting to look at these various methods for doing instance-specific biological modeling and figure out ways to combine them into something that's more powerful than any one of them alone. I don't think we've quite gotten to that, but it's something I find very exciting. And thinking of ways to do instance-specific modeling with deep neural networks is a very interesting area; it's obviously challenging, but I have a sense that there's something very useful in that research space. One theme that came through in all three of the talks this morning was the application of machine learning not just to make predictions but to gain insight into the biology or disease processes. So I'm interested in hearing your thoughts about how you approach a machine learning problem differently when that is a key goal — things like taking interpretability into account, of course, but what else do you need to take into account? So, one thing I'd like to comment on is that the traditional performance metrics that we report on models are often completely orthogonal to many of the use cases we just discussed. For example, we and many others have shown that you can get incredibly accurate models that are incredibly biased: they achieve high accuracy by focusing on completely trivial things like GC content or assay bias. So just looking at cross-validation, or even independent-validation-set performance metrics for prediction, is only half the story. Even if you establish that your model is not learning biases, the second important point is the stability of inferences drawn from it. I showed you some examples of these DeepLIFT scores; it has taken us years to figure out how to stabilize them, because if you change random seeds you can often get very different results.
We now have methods that allow you to do that, but training a single model on some fold and then using it for all kinds of inferences is very dangerous, because you don't understand the variance in the interpretation, even though your stability on accuracy looks very high. So I think as a community we need standards for also reporting other kinds of performance metrics on the interpretation side of things, to show that our models are in fact stable, and to use best practices that help make sure the inferences drawn are robust. Yeah. I want to raise two points with respect to this; I think what Anshul described is really important. I do want to point out that sometimes technical artifacts are very consistent between studies. For example, if a result is due to mapping error, you're likely to have the same mapping error in an external study, and so you may see replication when it's really just a complete technical artifact. The other thing I want to raise is a recent focus in our field on selective inference. This is a problem in statistics that arises when you do something like cluster your data and then ask a question about differences between the clusters. You're doing post-selection inference: you've inferred something based on your data, and then you're specifying your statistical hypothesis in a post-selection manner based on what you've inferred from your data. This is a problem we're facing in single-cell analysis, but it's broadly applicable to lots of parts of genomics, and it's something I want to call people's attention to. I think it's getting increased attention, and it deserves that increased attention. So both technical artifacts and post-selection inference are things that, as a community, we're going to have to face. And can you just spell out the consequences of post-selection inference? Yeah, the consequences are basically that you're going to have very inflated type I error: you're going to infer that you have differences when you don't, a lot of the time. That's a very simple description of what you'll see, but it is something we're going to have to take into account, something the network inference community maybe wasn't paying attention to for a long time, and something the single-cell community is now grappling with. It's going to be an issue. If I could comment quickly on your question: I think that causal modeling provides insight, if it's correct. So that's the key. And it makes me think of something that I think would be very useful and fits the theme of this session. We have a lot of observational data, and we make inferences about what's causal, but then we don't have tests of that to know if it's correct. It's not like machine learning prediction or classification, where you can just do cross-validation. You have a hypothesis about what's causal, and you need to test it. So if NIH were to provide or sponsor data sets that were observational and then had follow-up experiments, researchers could use those — if they were on the same tissues — to test whether their causal discovery methods are actually working well. Great, great point. Thanks, Greg.
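To make the post-selection inference issue raised above concrete, here is a small toy simulation, not drawn from the session: it clusters pure-noise data into two groups and then tests for a difference between the clusters, showing a "significant" rate far above the nominal 5% level.

```python
# Toy simulation of post-selection inference: clustering noise and then
# t-testing the clusters produces wildly inflated type I error.
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_trials = 200
false_positives = 0

for _ in range(n_trials):
    x = rng.normal(size=(100, 1))                                  # pure noise, no subgroups
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)
    _, p = stats.ttest_ind(x[labels == 0, 0], x[labels == 1, 0])   # test the clusters we just chose
    false_positives += p < 0.05

print(false_positives / n_trials)   # far above the nominal 0.05 level
```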
So I think we've reached the end of our time, but thank you all for terrific talks and a really fascinating discussion. The workshop now has a break for about an hour, and then we'll be back for another session this afternoon — this afternoon for many of us, anyway. Thanks, everyone.