So, I think we're ready to move on to our last working group. Mark, you have the cleanup slot here.

Oh, terrific. Since you had the pleasure of leading off the meeting, Dawn is going to present the slides on behalf of our group.

So, we're going to use the term "integrated approach" to describe any situation in which we use multiple data types to assess the functional role of a gene or genetic variant in phenotypic variation. The types of data that one might use in an integrated approach are quite diverse, and here's a list of some things that have already been used in the past: lists of causal genes or known associations; data from genome screens done in an unbiased fashion, such as whole-genome expression data or protein-protein interaction data; and genome annotation data from things like the ENCODE project, transcription factor binding sites, and epigenetic marks. We could also consider things like semi-structured literature, such as PubMed abstracts, that can be used as fodder for text-mining applications, and model organism data, such as the very rich mouse phenotype data curated by the Mouse Genome Informatics group at the Jackson Laboratory.

There are quite a number of challenges one would face in taking the integrated approach, and our working group decided to highlight one particular challenge in this forum, which is a challenge, or perhaps even a danger depending on how you see it: the new opportunities for storytelling that arise when you start considering extremely high-dimensional data. One could imagine a scenario where you take a list of candidate causal variants for a disease, in this case Crohn's disease, which might have equivalent levels of statistical evidence just on the basis of the genetic data, align those with rich, high-dimensional genome annotation data such as the ENCODE data, and, using a strictly algorithmic approach, identify the SNP that has the most overlap with a relevant set of annotations and anoint that as the causal variant (a toy sketch of this naive procedure follows below). I don't know if this kind of analysis has been presented yet, but if not, one can imagine it happening in the near future. Now, I'm not disputing, we're not disputing, the fact that functional annotation is relevant for prioritizing and identifying causal variants. Indeed, there's unequivocally an enrichment of functional annotations such as DNase hypersensitivity sites among disease-associated or phenotype-associated SNPs, but this is a statistical observation, and we need to use rigorous statistical approaches when integrating those observations into our genetic analysis. It's not a black-and-white scenario.
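To make that concrete, here is the toy sketch referred to above: a purely illustrative version of the naive "most annotation overlaps wins" procedure being cautioned against, not a recommended analysis. The SNP names and annotation lists are made up.

```python
"""Purely illustrative sketch of the naive 'most annotation overlaps wins'
procedure cautioned against above. SNP names and annotation lists are
hypothetical; this is the approach being warned about, not endorsed."""

# Candidate Crohn's disease SNPs with equivalent genetic evidence, each tagged
# with the hypothetical regulatory annotations it overlaps.
candidates = {
    "rs_candidate_1": ["DNase_HS", "enhancer_mark", "TF_binding", "conserved"],
    "rs_candidate_2": ["TF_binding"],
    "rs_candidate_3": ["DNase_HS", "conserved"],
}

# The naive rule: anoint whichever SNP overlaps the most annotations as "causal",
# with no statistical model for how informative each annotation actually is.
anointed = max(candidates, key=lambda snp: len(candidates[snp]))
print(anointed, candidates[anointed])
```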
So for the purposes of this talk, we're going to frame the discussion by presenting three use cases in which groups have used the integrated approach to identify new genes or genetic variants causally implicated in phenotypic variation. These are not meant to be an exhaustive set of examples; in fact, they're probably not even the best set of examples one could imagine, but they are examples that we are familiar with. The other thing that distinguishes them is that they attempt to do data integration in a model-based framework, which we highly applaud and would like to promote for people who want to use the integrated approach in the future.

So our first use case involves using networks to implicate a gene in a complex and highly multigenic disease. The use of networks for doing this kind of analysis has been around for certainly five or ten years. It's a rich and prolific area with many contributors, and we're just using one example today. The paper we'll present describes an approach called DAPPLE, and it came out of Mark J. Daly's lab. The idea is pretty straightforward: we start off with a set of linkage intervals, or, sorry, association intervals, identified in a genome-wide association study or a set of GWAS. We look underneath those association intervals and pull out all the genes that exist in those intervals. We then turn to publicly available protein-protein interaction data and build gene networks on the basis of these PPIs, including not only the genes that are in the association intervals but perhaps also closely linked genes that fall outside of them. Of course, you give me a list of genes and I'll give you back a network; the fact that we have a network doesn't mean that it's necessarily appropriate for learning anything new about a disease. So what we can do is test for an unusual enrichment of connectivity between genes in different association intervals by comparing the connectivity of our observed network to a null distribution generated by something like a permutation strategy in which we generate random networks. If you observe a statistically significant enrichment of connectivity in the observed network, that gives you confidence that there's actually useful information there that can be used to learn new things about disease pathogenesis and the genes that might be involved. Having done that, you can then nominate candidate genes based on where they fall within the network. Those could be, again, genes that are in a large association interval or genes that fall outside of the association intervals. And we can further narrow down that candidate gene list by cross-referencing their expression patterns in tissues of relevance to the disease that you're interested in.

So in this paper, they apply DAPPLE to a couple of immune-related diseases, rheumatoid arthritis and Crohn's disease, and we see that the networks that fall out of these GWAS catalogs are highly enriched for connectivity compared to random networks with similar properties. You can define candidate genes based on those disease networks and see that they're highly enriched for expression in the immune tissues of interest for these two diseases. But the important thing is that this integrated approach actually has something useful to say about implicating new genes that weren't previously known to be involved in a complex disease. So 293 genes were identified in the Crohn's disease network and also expressed in the relevant tissues; 10 of those 293 were not associated with Crohn's disease at the time of the analysis but were later confirmed by GWAS meta-analysis as being Crohn's disease associated. So we actually learned something new about genes involved in Crohn's disease using this integrated approach that we couldn't have learned strictly from the genetic data.
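As a rough illustration of the connectivity-enrichment idea described above, here is a minimal sketch. It is not the DAPPLE algorithm itself (DAPPLE's null model and statistics are more sophisticated), and the intervals, gene names, and PPI edges are made-up assumptions.

```python
"""Minimal sketch of a connectivity-enrichment test in the spirit of the
DAPPLE-style analysis described above. This is NOT the published algorithm
(DAPPLE uses a more sophisticated, degree-matched null); the intervals,
gene names, and PPI edges here are made-up illustrative assumptions."""
import random
from itertools import combinations

# Hypothetical inputs: genes grouped by association interval, and PPI edges.
intervals = {
    "chr1_locus": {"GENE_A", "GENE_B"},
    "chr5_locus": {"GENE_C"},
    "chr16_locus": {"GENE_D", "GENE_E"},
}
ppi_edges = {frozenset(e) for e in [("GENE_A", "GENE_C"), ("GENE_B", "GENE_D"),
                                    ("GENE_C", "GENE_E"), ("GENE_A", "GENE_F")]}

def cross_interval_connectivity(interval_map, edges):
    """Count PPI edges linking genes that sit in different association intervals."""
    count = 0
    for (_, set1), (_, set2) in combinations(interval_map.items(), 2):
        for g1 in set1:
            for g2 in set2:
                if frozenset((g1, g2)) in edges:
                    count += 1
    return count

def permutation_p_value(interval_map, edges, all_genes, n_perm=1000, seed=0):
    """Compare observed cross-interval connectivity to a simple null built by
    re-drawing gene sets of the same sizes from the full gene universe."""
    rng = random.Random(seed)
    observed = cross_interval_connectivity(interval_map, edges)
    sizes = [len(s) for s in interval_map.values()]
    hits = 0
    for _ in range(n_perm):
        pool = list(all_genes)
        rng.shuffle(pool)
        permuted, start = {}, 0
        for key, size in zip(interval_map, sizes):
            permuted[key] = set(pool[start:start + size])
            start += size
        if cross_interval_connectivity(permuted, edges) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

gene_universe = {g for e in ppi_edges for g in e} | {"GENE_G", "GENE_H"}
print(permutation_p_value(intervals, ppi_edges, gene_universe))
```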
The second use case involves using non-coding annotation to establish causality for non-coding DNA variation. Just a quick overview on non-coding annotation: it's a rapidly growing, large body of data, as we witnessed in the massive publications of the ENCODE project over the last few weeks. Importantly, disruptions of these annotations are getting to the point that they're interpretable from a sequence perspective. And a very important point is that these annotations allow integration of germline genetic variation with the biology of somatic cell types and tissues. This is, I think, an underappreciated point among statistical geneticists.

So the paper we'll discuss came out of the labs of Jonathan Pritchard and Matthew Stephens at the University of Chicago. This is nominally a genome-wide cis-eQTL analysis in which they took whole-genome expression data from 210 samples from the 1000 Genomes Project, linked that up to the 13 million SNPs that have been identified in those samples, and mapped gene expression phenotypes to those 13 million SNPs. However, we're presenting it today as a key example of how non-coding annotation can be used in a Bayesian statistical framework to form priors on causality, to actually make statistical arguments about which SNPs might be causally associated with a gene expression phenotype. This modeling framework, which I'll call the hierarchical model, was first formulated in response to what you might call the LD problem: the curious situation in which the population genetic property of linkage disequilibrium, which enabled the massive success of GWAS, also confounds our ability to fine-map the causal variant in a long haplotype context. A classic example is the NEGR1 locus. This is a haplotype that's associated with obesity and that, in certain populations, contains a dizzying array of types of variation in complete linkage disequilibrium, including a large deletion and many nearby SNPs. It's impossible on the basis of population genetic samples alone to disentangle the functional roles of different highly linked variants.

The intuition behind the hierarchical model works as follows. Imagine that you've done single-point association analyses on a number of SNPs in the vicinity of a gene for which you've measured expression values; this might be a one-megabase window centered on your gene. You've unequivocally established that there is an eQTL in the vicinity of this gene, but the extensive LD in the region confounds our ability to pinpoint which of the many variants is driving the association signal. However, we're doing this not in a bubble but in a genome-wide context. There are many, many other genes for which we've done the same analysis, some of which might have more informative haplotype structures in which we can unambiguously pinpoint a single causal variant. Now, if we assume that there are some shared properties of bona fide cis-eQTLs that we can measure in genome annotations, a reasonable strategy for arbitrating these more complex, uninformative haplotype structures is to combine our set of bona fide SNP eQTLs and identify which genomic annotations are predictive of eQTL-ness. So this is where the hierarchical aspect of the modeling comes in. We have our first-pass analysis, gene by gene, pulling out individual cis-eQTLs. We then lump together the bona fide cis-eQTLs and estimate weights on different genome annotations, which we might call lambdas; these weights tell us about the relative importance of different genome annotations in whether or not a SNP ends up being established as a cis-eQTL. The fact that we're assuming these lambdas apply across the entire genome is what makes it hierarchical. That's the top level of our model.
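To make the structure of that two-level model a bit more concrete, here is a schematic of the kind of annotation-weighted prior such hierarchical models place on each SNP. This is the general form rather than the exact parameterization of the paper under discussion; the notation (a_jk for the value of annotation k at SNP j, lambda_k for the genome-wide weights) is introduced here purely for illustration.

```latex
% Schematic prior for the hierarchical model: a_{jk} is the value of
% annotation k at SNP j, and the genome-wide weights \lambda_k are shared
% across all genes (the "hierarchical" part). General form only, not the
% paper's exact parameterization.
\[
\pi_j \;=\; \Pr(\text{SNP } j \text{ is the causal cis-eQTL})
      \;=\; \frac{\exp\!\left(\lambda_0 + \sum_k \lambda_k\, a_{jk}\right)}
                 {\sum_{j'} \exp\!\left(\lambda_0 + \sum_k \lambda_k\, a_{j'k}\right)},
\qquad
\Pr(j \mid \text{data}) \;\propto\; \pi_j \,\mathrm{BF}_j .
\]
```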
So once we've fit these lambdas, we go back to the original single-point statistics, which we might represent as a p-value or a Bayes factor, and we treat the lambdas as priors on those single-point statistics, reweighting them with this new prior information. I'll say this again with a slightly different visualization to hopefully drive home the rationale. Here's a gene in which we have two SNPs linked to expression with roughly the same, say identical, amount of statistical evidence. We fit the hierarchical model considering genes across the entire genome, some of which have more informative haplotype structures, and we learn about several annotations which are predictive of eQTL-ness. We can now see that SNP 1 sits in three annotations which have strong effects on the probability of being an eQTL, whereas SNP 2 sits in only a single annotation. Using this prior information, we can reweight our confidence in which of these SNPs is causal and find that SNP 1 is now much more likely to be the variant driving gene expression variation at this gene.

This analysis was done with an earlier set of ENCODE data. They tested over 50 ENCODE annotations and found a handful of things to be predictive of eQTL-ness, including proximity to the gene, a handful of histone marks, DNase I hypersensitivity sites, core promoter motifs, and transcription factor binding sites. This slide is meant to drive home the fact that we're not just dealing with a simple algorithm that takes a set of SNPs and pulls out the one that has the most ENCODE annotations overlapping it; we're dealing with a proper statistical model which has predictive power. What this shows is a validation of the model using the following kind of approach. The authors took 100 genes from the genome in which a single SNP could be unequivocally anointed as the causal SNP driving gene expression variation. They removed those 100 genes and fit the hierarchical model to the remaining data. They then went back to each of these 100 genes and asked, based just on the prior fit from the hierarchical model, what is the rank of the known bona fide eQTL out of all possible candidates in the region? Since we're dealing with a one-megabase region, there are about 1,200 candidates per gene. This histogram captures the distribution of those ranks. Let's just focus on the red histogram, which is the fully specified model. Remarkably, for 45% of those 100 genes, the true causal eQTL was predicted to be in the top 15 of the roughly 1,200 candidate SNPs just on the basis of the prior. So it has high predictive power, and you can compare that to how you would do if you were just randomly picking SNPs without a model, which is indicated with this solid black line at about 2% or 3%.

So I would call this kind of strategy building evidence by analogy. What we're doing is taking a set of variants that we have very high confidence in as being eQTLs, finding what properties those variants share in common, and then identifying analogies between them and variants of unknown significance in order to prioritize the latter on the basis of these analogies.
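A minimal sketch of the reweighting step described above, in Python: the lambda weights, annotation overlaps, and Bayes factors are made-up illustrative numbers rather than values from the published analysis.

```python
"""Minimal sketch of annotation-based prior reweighting, in the spirit of the
hierarchical-model analysis described above. The lambda weights, annotation
matrix, and Bayes factors are all made-up illustrative numbers, not values
from the published study."""
import math

# Hypothetical annotation effects (log scale) learned genome-wide in the
# "top level" of the hierarchical model.
lambdas = {"DNase_HS": 1.2, "promoter_motif": 0.8, "TF_binding": 0.6}

# Two SNPs at one gene with identical single-point evidence (equal Bayes factors)
# but different annotation overlaps: the SNP 1 versus SNP 2 scenario from the talk.
snps = {
    "SNP1": {"bayes_factor": 50.0,
             "annotations": ["DNase_HS", "promoter_motif", "TF_binding"]},
    "SNP2": {"bayes_factor": 50.0,
             "annotations": ["TF_binding"]},
}

def annotation_prior(snp_table, weights):
    """Prior on each SNP being the causal eQTL, proportional to exp(sum of lambdas)."""
    scores = {name: math.exp(sum(weights.get(a, 0.0) for a in info["annotations"]))
              for name, info in snp_table.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

def posterior(snp_table, weights):
    """Reweight single-point Bayes factors by the annotation prior and renormalise."""
    prior = annotation_prior(snp_table, weights)
    unnorm = {name: prior[name] * info["bayes_factor"] for name, info in snp_table.items()}
    total = sum(unnorm.values())
    return {name: u / total for name, u in unnorm.items()}

print(posterior(snps, lambdas))  # SNP1 ends up with most of the posterior mass
```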
So our third use case takes this paradigm of evidence by analogy and moves it into the rare genetic variation, rare disease setting. This is a paper from Han Brunner and Joris Veltman's group in which they did whole-exome sequencing on 10 trios with sporadic cases of idiopathic mental retardation. The group identified 9 de novo mutations in these 10 samples, and they claim that 6 of these are likely to be pathogenic based on gene function, evolutionary conservation, and mutation impact. This claim was made on the basis of a statistical model. They're taking unique mutations that have perhaps never been seen before in humanity and using a statistical model such that, although I'm not sure the p-values are reported here, you could potentially come up with a p-value on whether or not these things should be considered pathogenic. The modeling framework is pretty straightforward and easy to follow. They take two summary statistics of pathogenicity, a Grantham score and a PhyloP score, and they use training data: dbSNP as a representation of benign variation and HGMD as a representation of disease variation. They then fit what you could basically consider a classifier model, a two-dimensional Gaussian mixture model, using these two summary statistics to separate benign from pathogenic variation. You can then take the observed Grantham scores and PhyloP scores for the de novo mutations and assign them to one of these two spaces, pathogenic or benign, with some sort of statistical confidence.
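As a rough illustration of that kind of two-class model, here is a minimal sketch that fits one Gaussian per class over (Grantham, PhyloP) scores and computes a posterior probability of pathogenicity. The training points standing in for dbSNP and HGMD are fabricated toy numbers, and this per-class Gaussian fit is a simplification rather than the paper's exact mixture-model formulation.

```python
"""Minimal sketch of a two-class Gaussian classifier over (Grantham, PhyloP)
scores, in the spirit of the analysis described above. The training points
standing in for dbSNP (benign) and HGMD (pathogenic) are fabricated toy
numbers, and this is a per-class Gaussian fit rather than the paper's exact
mixture-model formulation."""
import numpy as np

# Toy training data: columns are (Grantham score, PhyloP score).
benign = np.array([[5, -0.5], [30, 0.2], [15, 0.0], [60, 0.4], [10, -1.0]], float)
pathogenic = np.array([[110, 2.5], [180, 4.0], [95, 3.1], [150, 2.2], [125, 3.6]], float)

def fit_gaussian(points):
    """Fit a mean vector and covariance matrix for one class."""
    return points.mean(axis=0), np.cov(points, rowvar=False)

def gaussian_pdf(x, mean, cov):
    """Density of a 2D Gaussian at point x."""
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ inv @ diff)

def posterior_pathogenic(x, prior_path=0.5):
    """Posterior probability that the variant falls in the pathogenic component."""
    mean_b, cov_b = fit_gaussian(benign)
    mean_p, cov_p = fit_gaussian(pathogenic)
    like_b = gaussian_pdf(x, mean_b, cov_b) * (1 - prior_path)
    like_p = gaussian_pdf(x, mean_p, cov_p) * prior_path
    return like_p / (like_b + like_p)

# A hypothetical de novo missense variant with Grantham 130 and PhyloP 3.0.
print(posterior_pathogenic(np.array([130.0, 3.0])))
```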
Now, I'm not saying that I necessarily approve of the content of the model, but I do approve of the form of the model, and I think it's laudable that we try to start coming up with statistical frameworks in which we can say something interesting and statistically rigorous about the n equals one scenario.

So the take-home points from this whirlwind tour of the data integration world are that model-based approaches are essential for integrating secondary data; that networks are a key data structure for integration, which I think we should all get more comfortable with going forward; that there's a wealth of non-coding annotation available for integration; and, a potentially controversial point, that n equals one may be tractable with data integration using large natural variation data sets like dbSNP and HGMD. I'll toss up a long list of potential discussion questions; there are a lot of things to talk about. Thank you very much.

All right, comments? David.

I guess one challenge with the, and I think you kind of alluded to this, one challenge with the de novo paradigm paper is that the characteristics of the mutations tell you whatever they tell you, and then you have to consider that alongside the fact that, if those are mostly causal mutations, we are actually missing the other mutations we would expect to see in controls. So how do you weigh that in? The fact that there are in fact fewer mutations than we would expect if there were no causal mutations in the data set at all surely ought to be something pointed out when you're carrying out an analysis like that.

Of course, that's integral to this, and I would say we discussed last night rather pointedly that some of us didn't actually think that was a particularly valid example to use, primarily because you absolutely cannot compare the properties of de novo variation to the properties of inherited variation; they're quite different. The de novo variants in perfectly healthy people are very different from what's in dbSNP, and there's no way around that. That, I think, leads to the false impression that more of those variants are relevant than probably are.

I was just going to say, given our focus today on finding variants that we feel confident implicating, do you think that we've implicated those six variants there? It doesn't seem like it's the kind of thing...

No, not those six variants in that paper. I think what we think is that it is conceptually possible, at least conceptually valuable, to think in a very rigorous way about how one might integrate other sources of data into that. So if we're faced with de novo variants where we have a twofold excess of de novo loss-of-function variants, then certainly other types of information about what we know about those genes, for example using models to predict which ones are more likely to be haploinsufficient genes versus recessive knockout genes and so forth, might be of value in addition to the straight statistics, if we can integrate them in a rigorous statistical fashion. Now, what we have to be very clear on, and I think what Dawn was emphasizing, are examples where you can see a path to integrating other data sets in a principled way to enhance the statistical analysis, not to replace the statistical analysis, and not to just present the data and then say, oh, and these ones are interesting because of such and such over here, and try to cloud or distract from the statistics, but actually to supplement them. And I think it comes back to one of the things we were talking about last night and again this morning: the statistics with small sample numbers may be challenging, but we can't abandon them simply because it's hard to reach thresholds. Similarly, integrating other types of functional data into those statistics is not easy and has to be done very, very carefully, because there are biases in our genetic data in terms of what we cover in our data sets and what we don't, and there are biases in the protein-protein interaction data that's available on the web, and in expression data, but we need to figure out ways of overcoming those biases and bringing those data together in a statistical framework.

Mark Gerstein: I'd just like to amplify what the last two speakers said and also respond. I think one of the key things about the last case study is that if we have really well worked out catalogs of natural variation, then we can go to an individual case and think about it in a statistical way. The idea is not to just think of n equals one as n equals one, but to think of n equals one next to this very large catalog of what we expect variation to be, so we can better appreciate that one individual, as long as you compare apples to apples when you do that. And, just specifically, there is not strong evidence in that paper, because there's actually very compelling contrary evidence, because there are too few de novo mutations; in fact, it's a good example of what not to do, really.

But if you are going to make the comparison, sorry, there are no authors of that paper here, sorry, but in general, if you are going to try to make an argument, and I agree with you, by comparing the properties of mutations that you see in a single patient to others that you've seen, you do have to make it comparable, and one limitation there is that we do not actually have a big collection of de novo mutations from individuals without a specific diagnosis to work from right now.
We don't know.

We'll just shout. Steve, come on, we can hear you.

On dbSNP and the thing we were just talking about, integration and comparing to the corpus of natural variation: I think it's important to realize that dbSNP does include variants with clinical significance, as was mentioned earlier this morning, but we now do an export, a subset of dbSNP, which is just the variants with no known clinical significance. I sent an e-mail out to a couple of you earlier; this is a new product dbSNP is doing in combination with ClinVar. Wherever we have any kind of pathogenicity, that feeds back into the database and is flagged, and so we're doing a weekly dump that separates the variants of clinical significance from those without. So if you want to do that kind of screening, that's the proper subset to use, not all of dbSNP. This seems like the group that can probably best use that.

So, did we want to talk a little bit about some of these discussion questions? There are quite a number of them. Were there some that you guys were really stumped on, Dawn and Mark, or really split on?

We took the tack that we didn't necessarily want to present ourselves as having made firm conclusions about all these things, but rather to have an open discussion about whatever topics in this area were of interest to people. If we presume too much going in, then people have a tendency to end up defending the positions they pre-stated rather than openly and constructively engaging.

Well, one issue that people around this table would probably know well and could discuss is the key untapped data resources. I think we've heard one of them, which is all the data in clinical labs that are not being captured. I don't know if that's the key one, it may be, but perhaps Heidi or David could comment on that.

Sorry, the question is, can we get this...? No, no, they're asking what are the key untapped data resources today, and it sounded as though one of the key ones is what's in clinical labs, and my question is whether it is the key untapped data resource. So is that the biggie, and everything else pales in comparison, or are there other things we should consider?

Well, it's hard to put things into relative perspective, but it's certainly a key untapped resource, I guess I would say, whether or not it's the best. And I think one of the limitations of clinical labs is phenotypic information, so we will get interpretations, but there will be a paucity of clinical data associated with them, in terms of how this data is collected today, and I think we can all agree that clinical data would be a huge advantage in these efforts.

We were kind of hoping as well, from the integrated approach group, that this is the time to start integrating everything we talked about today. So, for example, at the end we're talking about an integrated framework, and this gets into what you guys talked about with the levels of evidence, et cetera. Do we think that we can come out of here today making recommendations about integration, about some kind of framework for how we can weight different types of evidence? Is that too far? Are we willing to go out on that limb and actually assign levels?

I just want to say we did talk yesterday a bit about aspects of integration, and I think there are two points that are relevant to raise here. The first is that just to integrate data, especially from very disparate databases, you really have to think about how the databases are structured. Can you bring them together? Are they
on common cell lines? Are they on some type of thing that can fit together? That really does require some communication. And then the other thing we talked a bit about was that, once you can do that, the whole point is doing the integration in a statistically rigorous way and trying to do more than just simply overlapping things.

I guess maybe I'll throw out a scenario similar to one that was raised earlier. Let's say that you had done a rigorous integration in a research discovery effort, and there was a gene which, just from the genetic data alone, didn't reach the standards of causality, but with all of the other integrative evidence it was the most likely gene, where there was clear evidence of an excess of de novo mutations or whatever. And then you have an exome sequence for a patient with the same sort of phenotype, who ends up with either a very rare missense variant or a de novo missense variant in that same gene for which there was good, strong evidence. What should the clinician who ordered that exome sequence tell the patient? Is that a variant contributing strongly to that patient's phenotype or not, or should they not report anything back?

At what level does that clinician have access to that information? Has it been published?

There was a paper that came out, and it says ABCD3 is the most likely, here are 10 genes that we think are very strong candidates for causing autism or whatever, based on their presentations and the fact that they fit into the network, et cetera, et cetera. And now they're doing exome sequencing, and there's a very rare, or de novo, take your pick, variant in ABCD3 in a patient with autism.

Let me shift the question a little bit. I think it's well posed, but I think the answer is that right now there's a standard for the type of gene that we report back, that we instruct clinicians might be useful to report back to patients, and so forth. So the question may be, given the enormous wealth of research of that flavor that's going on, how do we advance genes, what's the standard we set as a community for advancing a gene into that set? So if the original study had had five de novo hits in this gene, and none were ever seen in the 20,000 other exomes that we looked at, and the significance of that was 10 to the minus 20, then that is something we would probably say we need to think about as a community, a way to advance those into the pantheon of clinically interpretable things. Perhaps still not in the way that we have a set of really clear, 100% penetrant things that we know about, but that is a good thing I think this group could wrestle with, though we may not be able to wrestle with it this afternoon as we start to fly.

But I think we heard from Heidi a very cogent example last evening of where this is incredibly important, where in your example of the prenatal testing, they were looking for a cardiac hypertrophy variant and had found one in the child who died, and didn't find it in the prenatal testing, and the family ended up having a severely affected child. So clearly it was not that variant; it must have been something else. So does that information get captured somehow, that this wasn't the causal variant, even though it really looked like it should be?

We certainly capture that in the subsequent interpretation we write for that variant in the next case we see, where each time we add on our newest knowledge, and some of that is supportive and some of that is not supportive, and so we continue to keep that story going, and it sometimes oscillates back and forth as to what we think
about a variant. But that's a check against it; it doesn't rule it out. I actually think that that variant is still implicated in disease, probably in a mild way, but it's not the only piece of the story in that family.

And I think that's actually one really critical piece that sometimes doesn't get talked about: we ask whether the variant is pathogenic or not, but it's a separate question whether that variant is causing disease in that patient. An example of this is that there are mild variants. When I do hearing loss testing and I find homozygosity for the M34T variant in connexin 26 and the patient is profoundly deaf, I say I don't think that variant is actually causal for their deafness. But if I find homozygosity for that variant in a mildly hearing-impaired patient, I say I think that variant is causative. They both have hearing loss, the same phenotype, but one's mild and one's profound, and so I interpret that pathogenic variant two different ways. I think that's a subtlety that is sometimes very challenging to deal with.

Okay. I don't think anybody would mind if we ended a bit early, so I think what we would like to do is take about a ten-minute break, just so that Daniel and I can figure out what we're going to say. If you would come back in ten minutes, so twenty of four, then we should finish off a bit early. Great.