It is a pleasure to be here, and I'd like to thank the organizers for inviting me. I think this is a really important and interesting topic to address, because I think ENCODE has been huge. So the topic that we're supposed to be addressing has to do with how we're using genome data on function to understand how genome variation is affecting disease. And really, in the common disease space there's been a lot of progress and we just have a few questions left. What are the genes? What are the mechanisms? What are the directions of effect? Where are those rare variants with big effects? Because that's part of what Eric was just talking about as well. The effect sizes, it's like ten-trial learning. We're always surprised when the effect sizes are lower than we hope. And the first generation of studies with new technologies are always underpowered relative to where it turns out the effect sizes are. So I think going forward we will see this again: effect sizes are always lower than we think. Because I assure you, when we started out a long time ago, when I was practically a child, we thought we were headed to Chicago, with some genes with really big effects for every phenotype. And yes, oligogenes, and yes, in the distance, that polygenic background. And for many common diseases at least, we are in Levittown. We have mostly things that look very much alike, where we don't even really know what genes are involved. We've got lots of variants with small effects where we're struggling really hard to understand the driving biology. And that was the purpose. We want to get to that driving biology. And clearly a lot of the common variation is acting in a regulatory way. This is just one of several papers that were out at the end of last year showing that much of this common variation really is driving regulation at some level. You can pick it up certainly in ENCODE annotations. So this is showing across 11 common diseases.
So this was a really cool conceptual experiment where they said, okay, this will only work if the same kinds of regulatory variants are driving most of what is going on in common disease. We're going to put all these different diseases together and look at what DNA variation in different kinds of annotations concentrates heritability. So if there wasn't some overriding signal, you really wouldn't see anything. But there is. And certainly, as Aravinda said, when you're using just the directly genotyped SNPs, you get really only a modest enrichment in DNase I hypersensitivity sites. But it is accounting for an appreciable part of the heritability that we measure with all SNPs. So close to 40% of the heritability measured with all SNPs. But if you go to the full imputation, you're really capturing a substantial fraction of all of the heritability that we measure with the common variation that's interrogated in GWAS for these more or less representative 11 common diseases. And yes, coding variants are substantially enriched, so four-fold or 13-fold, depending on whether you impute them out or not. But the magnitude of the heritability attributable to that part of the variation that we can interrogate in, again, GWAS-level, common variant interrogations is modest relative to the heritability that we're understanding from these variants lying in DNase I hypersensitivity sites, which turn out to map mostly to enhancer elements. We see a similar kind of thing in GTEx when we look at just the eQTLs discovered in any given tissue. So we look at type 1 diabetes or Crohn's disease, with overall heritabilities estimated in the range of 50%. So that's what we get with all SNPs from the GWAS.
But you capture, again, on the order of 30, 35, 40% of that in just a few thousand eQTLs discovered from adipose or any of the other tissues. And yes, this is characteristic of one of the things we've learned in GTEx, that there is a lot of shared architecture in the regulation of genes that are expressed across many tissues. Whereas the story with Crohn's disease is different: you concentrate heritability exclusively in whole blood, of these tissues at least, and not so much in the others. So some phenotypes will have a more tissue-specific architecture and others a quite cross-tissue regulatory architecture that is driving much of that heritability. So this raises an interesting question: how should we think more about the experiments that we need to do in ENCODE? And I think this comes back to a classic problem in genetics, a missing data problem. We wish we had more direct information on protein levels, for example. It's transcription, translation, and proteins, and most of what we're trying to understand are things that ultimately reach through to affect protein levels. So if it's not mutant proteins, it's the amount of protein ultimately produced, when it's produced, where it's produced, you know, all of the things it has to do. So if a substantial fraction of the genome variation affecting risk of common disease is regulatory, why not alter the focus of analysis to the endophenotypes that are more direct measures of what we really want, the genetically determined part of what we can measure as protein levels? And of course, we're not there yet with the proteins, and that's part of what I am trying to drive with this talk. I think we need to get there faster, and this is actually a way of getting there. It should be a virtuous cycle. There's a lot more today that we can do for transcript levels, and I'll talk a little bit about that.
But the basic idea is that instead of, or in addition to, testing individual variants, where if we find something, we're beating our heads against the wall to try to figure out what's actually causal and what it's actually doing, we have enough information now to aggregate variants into SNP-based predictors of transcript levels, an endophenotype that we know impacts common disease. Impacts rare disease. Definitely impacts Mendelian disease phenotypes as well. And the ultimate goal is to get not just to the transcript levels, but the protein levels, and test those genetically predicted endophenotypes directly for association with disease. Because then you've got mechanism and direction of effect by design in the methodology. So we take a step back and look at this at a simpler level with the transcriptome. If we think about the dissection of gene expression, we measure through RNA-seq sort of the entirety of gene expression. And part of that is absolutely the genetically determined regulatory program for each individual. But of course, a ton of other environmental exposures, lots of noise in the measurement as well, and exposures to all kinds of non-genetic factors affect what we measure as gene expression. And what happens in our bodies over a lifetime impacts gene expression as well. So as we develop disease, that impacts the gene expression that we measure. We know, for example, when we measure a bunch of kids with asthma and controls, there are thousands of genes highly differentially expressed in the kids with disease versus the kids without. And most of that is a consequence of disease, not a cause. But we can get at this, and that's what we've been using GTEx to do: to get at this genetically regulated part of gene expression and look at the association of that with traits. So this is a great idea of a young faculty member at the University of Chicago who I hope will soon be at Vanderbilt.
And it's summarized in a paper submitted to Nature Genetics as a GTEx companion paper. So the first wave of GTEx papers will be out soon. And we got some good reviews and are making revisions to the paper. But this is like a down payment on what I think we need to do more at the protein level. And so what this does is very similar to imputation, which was also a missing data problem in human genetics. You want to interrogate the whole genome. We only could afford so much, and the first generation of GWAS products had a few hundred thousand SNPs on them. And so we learned the correlations between genome variation in a fully sequenced sample and then could use that information to impute the genotypes with just a few hundred thousand SNPs. Here, what we do is learn the relationship of genome variation to transcript levels in a reference sample like GTEx, store the weights from these prediction equations in a big database, and then you can apply that to any data set where you have genome interrogation. So it can be GWAS-level data, but of course, it can be whole genome sequence data, too, where you get the common variation as part of the sequencing exercise. This is all set up for a cloud-based thing where people can port their data up into the cloud and pull this out, or pull the whole thing down if they want to do it locally. And the idea would be that we serve the predictors with the most recent data from GTEx that we're able to serve. So the general idea is very freeing in the sense that you really just start with the genetic data in a reference sample. We use the whole genome to predict the expression level of a gene, then correlate the predicted expression with phenotype. So we're looking at transcript-level information, but only that part of the transcript-level information that can be predicted by the genome. And for understanding the genetics of common disease, that's exactly what you want. So only genotype and phenotype data are needed.
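The three steps just described (learn weights in a reference sample, apply them to new genotypes, correlate predicted expression with phenotype) can be sketched for a single gene as follows. This is only an illustration on simulated data, not the method as implemented in the paper: the ridge-regression fit, the function names, the sample sizes, and the effect sizes are all assumptions chosen for brevity, and the talk does not say which prediction model is used.

```python
import numpy as np

def fit_weights(genotypes, expression, penalty=1.0):
    """Learn SNP weights for one gene in a reference sample.

    Ridge regression is an assumption made here for brevity; the talk
    does not specify the prediction model.
    """
    X = genotypes - genotypes.mean(axis=0)   # center allele dosages
    y = expression - expression.mean()       # center measured expression
    n_snps = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_snps), X.T @ y)

def predict_expression(genotypes, weights):
    """Apply stored weights to any data set that has genotypes."""
    return (genotypes - genotypes.mean(axis=0)) @ weights

def association(predicted, phenotype):
    """Correlation of predicted expression with phenotype; the sign gives direction."""
    return float(np.corrcoef(predicted, phenotype)[0, 1])

# Simulated data (all values hypothetical).
rng = np.random.default_rng(0)
n_ref, n_target, n_snps = 300, 2000, 20
true_effects = rng.normal(0.0, 0.5, n_snps)

# Step 1: reference sample with both genotypes and measured expression.
ref_geno = rng.binomial(2, 0.3, (n_ref, n_snps)).astype(float)
ref_expr = ref_geno @ true_effects + rng.normal(0.0, 1.0, n_ref)
weights = fit_weights(ref_geno, ref_expr)

# Step 2: target sample with genotypes and phenotype only.
target_geno = rng.binomial(2, 0.3, (n_target, n_snps)).astype(float)
unobserved_expr = target_geno @ true_effects  # used only to simulate the trait
phenotype = 0.5 * unobserved_expr + rng.normal(0.0, 2.0, n_target)

# Step 3: test the genetically predicted expression against the phenotype.
predicted = predict_expression(target_geno, weights)
r = association(predicted, phenotype)
print(f"correlation of predicted expression with phenotype: {r:.3f}")
```

In this toy setup the target sample never has expression measured, which is the point of the approach: the weights learned once in the reference sample are all that needs to be stored and served.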
There's no reverse causality, in the sense that this genotype to endophenotype to disease goes in one direction. And of course, it's a gene-based test. So you have a substantially reduced multiple testing burden, but you are also using an endophenotype that we know is indeed related to common disease. So you end up with a gene-based test that is mechanistic by design and comes with direction. And there's a whole bunch of ways to validate your findings. You get reasonably good prediction performance. And back to the question around RET that was raised about balancing selection, there really are sets of genes with a very high correlation between genetically predicted expression and measured expression. And that is a really interesting set of genes where there's population variation in regulation that seems to be very important in the functioning of that gene. So I think there are interesting biological questions to ask as well. What we find in terms of the quality of the prediction, the significance of the correlation between the predicted and directly measured expression levels, is that the q-values are less than 0.05 for 40 to 50% of genes and less than 0.1 for 60 to 70%. And that's in some sense pretty decent, considering we're predicting only the genetically determined part of expression. And we're looking at out-of-sample quality of prediction. So we're contrasting what we predict from just the genes to what we measure in RNA-seq experiments, where all of those other components are affecting gene expression as well. So there's clearly enough there for us even today to get something out of it. When you apply this in the context of just a single GWAS study, yes, you get many genes meeting genome-wide criteria for significance. So this is the Wellcome Trust Case Control Consortium data only, for rheumatoid arthritis. And yes, a lot of this is driven by the HLA region associations.
But the fact that there are so many genes whose predicted expression differs suggests there could be additional biology that we haven't fully appreciated in the context of the variants that show association in the HLA region. And we're looking at this a lot now with Simon Mallal at Vanderbilt. But here's a new gene that hadn't been implicated before. And we can replicate this finding in a lot of different ways. So we have replicated this in an independent rheumatoid arthritis data set, so classic replication. So the predicted expression of PSME1 in a new rheumatoid arthritis data set is also significant. We can replicate the SNP predictors as cis and trans eQTLs in additional transcriptome data sets, and we have. You can look in GEO at data collected in rheumatoid arthritis patients and controls. And indeed, measured expression of this gene is different in cases and controls, which is a weak test. Because as I already told you, thousands of other genes are also differentially expressed between rheumatoid arthritis cases and controls. But it's also part of a 20-gene signature that was developed for predicting osteoarthritis. So it is among a small set of genes whose measured expression levels are significantly different in osteoarthritis cases and controls. Whether that's because they had some contamination with rheumatoid arthritis patients or some shared etiology, I don't have any particular opinion. But there are a lot of ways of validating these kinds of findings. So, advantages of this kind of framework. It is really very familiar in genetics, this idea of imputing what we wish we had when we have enough data to start imputing it, and then using that in our interrogations. I think it is an idea that we have almost enough data to really be moving on. It's got a huge number of advantages as a framework for what we're talking about today, because I think we can iteratively use more and more of what we know to figure out what we most want to learn.
And that's true whether what we most want to learn is about transcript and protein fundamental biology or about the genetics of common disease. This also gives us key readouts, so genes whose expression levels drive disease etiology, but ultimately proteins that do. And having those readouts should allow us to do much more informative studies to understand what environmental factors may contribute to disease risk through these intermediate endophenotypes. So we'd have some that we would know we wanted to use as readouts. And I think this sets up the opportunity for a more natural framework for the analysis of whole genome sequence data, because now you can see how you could combine common variants into sort of more functional kinds of analyses around genes, which is where we derive the rare variant studies already, because these are basically orthogonal pieces of information, right? We're looking at protein-encoding mutations that, okay, this is the last slide. And so, a natural framework for that. Our GTEx team and the people on GTEx, it's a really cool group to be able to work in. But there are also really a lot of opportunities for interacting with ENCODE.