So I think our next talk is Greg Cooper, and this is the estimating-impact group. Weren't you the ones called the functional variant group? Right, you not only rejected your charge but your name too. That's right. So you can see the group members up there on the slide. We were charged with functional annotation, but I originally objected to that, and we had to come to consensus on what we really meant by it. So we came up with this term "impact," which is maybe less precise in a lot of ways, but it's actually more to the point of what we're really interested in. And I think it's worth differentiating the categories of terms we're talking about when we ask what "impact" means. One way to frame this is in terms of damage: PolyPhen will tell you that something is damaging to a protein, and basically we interpret that as a molecular statement, that some variant has an effect on a particular molecular function, whether or not that molecular function actually has an effect at the phenotype level, at some more overt phenotype. Then something can be deleterious. A lot of the information we get is actually from evolution, and the term deleterious means that a variant would result in reduced fitness, so essentially reduced survival and reproductive success. And then of course there are the terms pathogenic, or causal, the disease-related terms, for when a variant causally contributes to a specific illness. These are highly correlated terms, and a lot of the reason we rely on annotation is that we're leveraging the fact that variants that are molecularly damaging and evolutionarily deleterious are heavily enriched among disease-causal variants. But they're not the same thing, and it's worth always keeping that distinction in mind. Most annotation tools that people use essentially estimate one or both of the first two: either they estimate molecular effects or they estimate evolutionary effects. Generally speaking, they don't estimate the last one, which is what we're most interested in, disease relevance. And lots of models, which we'll talk about, are hybrids that combine both: they make damaging predictions from biochemistry, and they consider evolution as well. That being said, annotations are obviously extremely important to this whole process, simply because genetic information alone is just not going to be sufficient, for reasons that we've gone over repeatedly. The basic idea is that annotations allow us to get around the ideal genetic assumption, namely that we can just treat all variants the same and come up with a purely genetic argument that something is important.
We're just not going to be able to find a lot of true biology that way. So we can use annotations to the extent that they truly differ between causal variants and non-causal variants. The big distinction here, and it's a very tough question to answer, is that we have to differentiate hunches from evidence. A typical candidate-gene just-so story is not a well-validated, robust predictor of causality, but we do in fact have annotations that we believe are systematically and quantitatively relevant. We talked about this earlier: the goal should be quantification. What is the actual change in prior probability for variant X versus variant Y, based on what we know about it? Related to that, a lot of the time we think in terms of multiple testing and correcting for multiple tests, but I find it more useful to think about hypothesis quality: not so much how many hypotheses you're testing, but how good the hypothesis is to start with, in other words, what the likelihood is that something positive will come out. Asking whether a randomly selected variant X is associated with disease is very different from asking whether a randomly selected nonsense variant is associated with disease; those are two very different hypotheses (a toy version of this arithmetic is sketched below). For example, we all assume that protein-disrupting variants are enriched for disease-causal variants, and this is in fact an annotation-driven assumption about the genome that has enabled lots of discovery. Here's an example we've already seen once today, from Jay's lab a few years ago, where essentially in the top row you have both genetic information and genomic information being blended together, drilling down to the bottom right, which is the single causal gene for this particular Mendelian disease. It's again predicated on annotation: they eliminate all synonymous variants, for example. That's a functional assumption, that we're interested in nonsynonymous variants because we believe disease variants are enriched in that set, and in this case it works. So there's anecdotal evidence that this is a strategy that works, and there are lots of studies like this that leverage this assumption. That being said, how do we really know that annotations are useful? The short answer is that we won't get the clear, quantitative, unbiased estimates we would normally like. For that we would need large collections of truth: large collections of things that are really causal, things that are not causal, and everything in between, and then we could actually quantify what properties differentiate them. We can't really do that. So instead, and we'll see this as we show some of the data available to support annotations, we end up with indirect ways to measure the utility of annotations for inferring disease-relevant properties. One is allele frequency: if you can predict variants that are deleterious, they should on average show lower diversity and lower frequencies in human populations because of selection. We can look at retrospective analyses of previously known variants, either through databases or study by study, through anecdotes. And then, as I'm sure we'll hear about later, there are comparisons to experimental measures of function, especially for predicting damage.
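To make the "hypothesis quality" point concrete, here is a minimal sketch of the underlying Bayes arithmetic. All numbers are invented for illustration, not estimates from the talk; the idea is just that an annotation acting as an enrichment factor (likelihood ratio) shifts the prior probability that a randomly chosen variant is causal.

```python
# Toy Bayes arithmetic: how an annotation-based enrichment factor changes
# the prior probability that a variant is disease causal.
# All numbers below are illustrative placeholders only.

def annotated_prior(base_rate, enrichment):
    """Prior probability of causality for a variant carrying an annotation
    that acts as an `enrichment`-fold likelihood ratio."""
    odds = base_rate / (1.0 - base_rate)   # prior odds for a random variant
    post_odds = odds * enrichment          # annotation shifts the odds
    return post_odds / (1.0 + post_odds)

base = 1e-5  # hypothetical chance a random variant is causal for a given disease
for name, lr in [("random variant", 1), ("random nonsense variant", 100)]:
    print(f"{name}: prior ~ {annotated_prior(base, lr):.2e}")
```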
So now I'll talk about the different types of information that we leverage. The first is comparative sequence analysis, and this type of information is really crucial: essentially most methods of variant annotation and prediction leverage this basic source of data in some way. When we sequence lots and lots of genomes from different other species and compare them to the human sequence, we find regions like this where you have a high degree of similarity throughout evolution. And it's not just a matter of similarity; it's comparing that similarity to a model of neutral evolution. The point is that we can say fairly confidently that there's something about this particular stretch of our genome that has a function. We don't really know what that function is, but we believe that when these bases are mutated, the mutations are deleterious, because otherwise we would see a lot more diversity in this particular region. One advantage of this kind of approach is that it's agnostic with respect to any molecular function. In fact, we don't really know what this region does at a molecular level, but we believe it does something important, and that means mutations there very well might cause disease. As I said, one way to test this is to ask whether conservation correlates with levels of diversity. In fact, if we quantify what we just saw in that picture at the individual-base-pair level, we can see that sites with higher conservation scores, moving along the x-axis here, tend to show a reduction in derived allele frequency. This is from a paper a few years ago, but we see it over and over again; it's a very reproducible trend that variants at highly conserved positions tend to be at lower allele frequency. So we can infer that purifying selection is operating here. An important thing to note is that this is a site-specific correlation: if you take the conservation score one base to the left or one base to the right, you essentially obliterate the correlation, suggesting again that it is selection acting on individual sites that matters. We also see this trend, albeit heavily reduced, at the non-coding level. This tells you something about the effect of exomes: the fact that a variant is in an exome carries a lot of information, and there is more signal-to-noise when considering conservation measures at the per-base level in coding sequence than in non-coding sequence. But you can still see a very highly significant negative correlation between past conservation and modern diversity.
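A minimal sketch of this kind of check, assuming you already have per-site conservation scores and derived allele frequencies in parallel arrays (the random inputs below are placeholders for real data; `scipy` provides the rank correlation). The one-base shift reproduces the control described above: a genuinely site-specific signal should collapse when scores are offset.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 100_000
# Hypothetical inputs: per-site conservation scores and derived allele
# frequencies at the same sites (replace with real data).
cons = rng.normal(size=n)
daf = rng.uniform(0, 0.5, size=n)

rho, p = spearmanr(cons, daf)
print(f"site-matched correlation:  rho={rho:.3f}, p={p:.2g}")

# Control: offset the scores by one base; a site-specific signal vanishes.
rho_shift, _ = spearmanr(cons[:-1], daf[1:])
print(f"one-base-shifted control:  rho={rho_shift:.3f}")
```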
However, there's the question of whether we can use conservation to predict disease, and I'll just show this one slide: a look back at the example from Jay's lab in the Freeman-Sheldon case. Along the x-axis here, these bars represent the number of candidate genes that shake out of the analysis as you move from models where the disease is the result of a different gene in every patient to a monogenic model where all the families share the same causal gene. They get down to something like 10 or 20 genes by eliminating everything that's synonymous, and then, after they eliminate dbSNP variants, only MYH3 remains. One way to look at this is to ask: could we do this without knowing anything about function, and instead just use conservation scores? In fact, we get a pattern that looks like this. If we set some arbitrary thresholds on the conservation scores, at different levels of stringency relative to the genome, we do very similarly, and keep in mind we don't impose any functional assumption: there are synonymous variants contributing to this as well. The other advantage, and this is a general feature of quantitative versus qualitative features, is that we can step away from arbitrary thresholds and start talking about things like ranks (a toy ranking sketch follows below). In this case, for example, without using dbSNP at all, the conservation metrics would have told you that the causal gene was the top-ranked candidate. So in some ways we're doing more with less with this sort of quantitative information, which is just the conservation of each individual nucleotide, each individual variant position. And we've now seen this repeatedly. Again, it's all anecdotal, but a large number of anecdotes are contributing to these kinds of observations.
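A sketch of the rank-based alternative described above, under the assumption that you have each candidate gene's case variants with per-variant conservation scores. The gene names, scores, and the max-score summary are all hypothetical choices for illustration, not the pipeline actually used in that study.

```python
# Rank candidate genes by the conservation of their variants instead of
# applying a hard synonymous/nonsynonymous filter. Data are hypothetical.
candidates = {
    "MYH3":  [5.2, 4.8, 6.1],   # conservation scores of variants seen in cases
    "GENE2": [0.3, 1.1],
    "GENE3": [2.0, 0.5, 0.9],
}

ranked = sorted(candidates.items(),
                key=lambda kv: max(kv[1]),   # score a gene by its most conserved variant
                reverse=True)
for rank, (gene, scores) in enumerate(ranked, start=1):
    print(rank, gene, max(scores))
```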
Of course, the other major flavor of annotation-dependent information is molecular function. A lot of this is driven by gene models, so we can build an extensive list of the different kinds of mutation events and how they impact a gene or a transcript; many of these are listed here. Obviously, loss-of-function variants, stop codons, frameshifts, and that sort of thing are of special relevance, but within the realm of missense variants we can also leverage lots of information, because we know a lot about the biochemistry of proteins. This is a figure from the paper describing the most recent version of PolyPhen, where you have several different sources of information: annotations of protein features, structural information about the protein's three-dimensional and secondary structure, and in fact PolyPhen also uses evolutionary information. There's a multiple sequence alignment, and beyond just measures of sequence conservation, which are important, you can also look at what types of mutation events are tolerated. Are they all hydrophobic residues? Are they all small amino acids? That sort of information is also present in multiple sequence alignments and tells you a lot that simple structural predictions can't. In fact, when you merge these sources of information into a single probabilistic classifier like PolyPhen, you can show that, driven by databases of known variants, you get reasonably good accuracy at differentiating the variants that we believe, based on the known databases, are causal from those that are not. So there is some evidence supporting the utility of these scores for predicting disease relevance. Of course, most of the genome is not protein-coding, and we'd like to think about moving to whole genomes. It's more difficult, but projects like ENCODE are really dramatically reshaping our ability to annotate molecular function in the rest of the genome. This is just a screenshot from the paper last year, where you can see the GWAS catalog up on top, and you can very quickly link all of those GWAS hits to a whole variety of molecular function annotations, including transcripts, obviously, but also things like DNase I hypersensitive sites and transcription factor binding sites. In fact, in the most recent ENCODE paper, and this is work from John Stamatoyannopoulos and others, when you look at rates of overlap between SNPs and hypersensitive sites for genome-wide association study hits, there is a very non-random overlap. That tells you these annotations are enriching for functional variants that are likely contributing directly to these phenotypes, contributing directly to disease risk. So there is some utility in these molecular annotations, and the good news is that both the ENCODE-style annotations and the comparative genomic annotations will only get better, because those experiments are getting cheaper: more cell types, more factors, more genomes, all of that will make these annotations better. That being said, there are a variety of caveats and concerns that have to be taken into account. First and foremost, annotations are neither necessary nor sufficient for causality. Genetics is the first and best source of information; annotations really play a supporting role. Another issue is that the positive predictive value of annotations is low for any given variant. If you just say "nonsynonymous," that applies to lots of variants that don't do anything; the same goes for saying a variant is highly conserved, since lots of variants that don't do anything are quote-unquote highly conserved. So this is nowhere near a definitive single bit of information, but used in an aggregate sense it can become powerful. And there are limitations to both the comparative genomic and the molecular-function kinds of annotation. For example, in comparative sequence analysis, an assumption that's often overlooked is that the function of those bases is the same in all the species you're comparing. If a zebrafish protein is in your alignment, you assume that protein does the same thing in fish as it does in human beings; basically, you're assuming that orthology of function is consistent. Sequence alignment itself is an underappreciated problem: it's very difficult, quality can be an issue, and it really should not be taken for granted in these kinds of scores.
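Returning to the merged-classifier idea above, here is a hedged sketch of what combining molecular and evolutionary features into one probabilistic score can look like. This is a generic logistic-regression stand-in, not PolyPhen's actual model (PolyPhen-2 uses a naive Bayes classifier), and the features and labels below are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Placeholder feature matrix per variant: [conservation score,
# structural score, alignment-derived substitution score], plus
# known damaging/benign labels from a hypothetical training database.
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.5, 0.8, 1.0]) + rng.normal(size=500)) > 0

clf = LogisticRegression().fit(X, y)
# Probability that a new variant is damaging, merging all three sources.
print(clf.predict_proba([[2.0, 1.0, 1.5]])[0, 1])
```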
And of course, in molecular annotations there are lots of errors and oversimplifications; I'll show a few examples. There's also the irony that the most interesting variants from a biological point of view are substantially enriched for errors, since errors don't have to undergo purifying selection. Here's one example, a slide from Shamil, showing that evolutionary sequence conservation has limitations with respect to predicting the selective consequences of a mutation at a site. The focus here should be on the bottom right: there's a whole range of selective pressure that is perfectly consistent with so-called complete conservation. If you take the human genome and align it to 40 other mammals, you can get a base that's perfectly conserved, and there are lots of those, yet that is compatible with a very wide range of actual deleteriousness, from extremes like embryonic lethality to much milder levels of selection. Then there's the matter of gene models. There's a multiplicity of gene models at multiple levels: there are different databases, the databases change over time, and within each database there are varying numbers of transcripts associated with any given gene. This is true both for protein-coding genes and for non-coding RNA genes. Here's an example for lincRNAs, with different views of where lincRNAs are and what their transcripts look like on the genome. Part of this is that gene models as static features just aren't fully capable of capturing the dynamic nature of transcripts in human cells. Then there are context-dependent effects that can alter your interpretation of an annotation. Here's an example at the top, from Daniel MacArthur's paper on loss of function: a large fraction of what appear to be loss-of-function mutation events that knock out a protein in fact only affect one particular splice variant of that protein. So if you don't know the proportion of that splice variant, and its proportion in the cell type most relevant to the disease, you can't really make a direct inference from loss of function to disease (a toy transcript-level check is sketched below). And here's another example of context dependence: there are compensatory mutations that can rescue. Here's a stop codon early in a protein that is effectively rescued by a new start codon, producing a protein that's nearly identical. So there are these kinds of context-dependent effects that can result in misinterpretation of an annotation.
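A minimal sketch of the transcript-context check implied by the splice-variant example above: before treating a stop-gain as a full knockout, ask what fraction of a gene's annotated transcripts it actually hits. The transcript IDs, exon coordinates, and variant position are all hypothetical.

```python
# Fraction of a gene's transcripts affected by a putative LoF variant.
# Transcript models are hypothetical (exons as (start, end) intervals).
transcripts = {
    "ENST_A": [(100, 200), (300, 400), (500, 600)],
    "ENST_B": [(100, 200), (500, 600)],   # skips the middle exon
}

def affected(variant_pos, exons):
    return any(start <= variant_pos < end for start, end in exons)

pos = 350   # stop-gain landing in the middle exon
hit = [t for t, exons in transcripts.items() if affected(pos, exons)]
print(f"{len(hit)}/{len(transcripts)} transcripts affected: {hit}")
```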
And I mentioned this last point already: this is just a slide showing that, because purifying selection reduces variation at the more biologically interesting sites, while errors are roughly uniformly distributed across the genome, the interesting categories end up enriched for errors. So you really should be worried that stop codons, in particular, are going to be enriched for simple sequencing errors, even when the overall quality of the data is extremely high. You can have very, very low false-positive rates, and errors will still be heavily enriched among the most interesting sets of variants. Okay, so where do we see things going? Where we'd really like to be is unified quantitative estimates that consider both molecular information and evolutionary information. Beyond that, we should be thinking about how to complement variant-level data, where we look at a particular coordinate in the genome and its two alleles, with higher-order information that's also capturable: what is the overall level of conservation of the whole protein? Is the gene that carries the variant associated with an eQTL in liver tissue, when you're looking at a liver-relevant trait, or whatever the case may be? Similar logic applies to other higher-order grouping strategies, and this is really going to pose a problem for evaluating the impact on false discovery rates. This is the sort of thing where permutation and simulation are going to be crucial, to evaluate what the analysis looks like under some reasonable null model (a toy permutation sketch follows below). But an important goal should be quantitative, integrated measures that capture all of the assumptions, explicit or implicit, that we're putting on our analysis of the data. Basically, we're not there yet, but we're pushing towards it. I think it's a realistic goal to get these kinds of unified scores; getting them well calibrated is going to be a real challenge, but there's progress to be made. Here's a figure showing the upstream region and first exon of beta-globin. We know lots of disease variants in this locus and we have lots of information: in the middle row there are conservation scores; on the bottom left, experimental measures of function from a mutagenesis assay of promoter function; on the bottom right, a measure of protein biochemistry; and at the top, gene models and motif annotations. So we really can start to combine these bits of data. They're all correlated to a certain extent, but they each tend to bring some independent information. It is realistic to start thinking about layering these different types of annotation into better predictors of impact.
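A sketch of the permutation idea mentioned above for putting an empirical null on annotation-driven filters: shuffle the labels, re-apply the same filter, and see how many variants pass by chance. The scores, labels, threshold, and the simple FDR summary are all illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
scores = rng.normal(size=10_000)        # hypothetical annotation scores
is_case = rng.random(10_000) < 0.1      # hypothetical case-variant labels
threshold = 2.0

observed = np.sum(scores[is_case] > threshold)
# Re-apply the identical filter under shuffled labels to build a null.
null = [np.sum(scores[rng.permutation(is_case)] > threshold)
        for _ in range(1000)]
print(f"observed hits: {observed}, "
      f"empirical FDR ~ {np.mean(null) / max(observed, 1):.2f}")
```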
So I'll conclude on this last slide with some general comments. First, there's not really likely to be a single answer. Everybody would like to be told "run this annotator and use this score," but it's just unrealistic to think that's going to happen in the foreseeable future. It's also going to be true that there will be combinations of annotations we'll have to use, and the only way to deal with that is to be very plain and transparent about how we're using them and about any assumptions they depend on. And we need to be very empirical: how many variants meet whatever criteria you're setting? How many are there in 1000 Genomes? How many in any one genome, in any one gene? That way you can start to evaluate how things rank and what the empirical FDRs might look like. To the extent that conservation and other evolutionary information is used, phylogenies and alignment quality really should be described. If you're using a PolyPhen score, it's worth considering what range of species was in the alignment that PolyPhen used to make that prediction, things like that. There's the general point that quantitative measures have advantages over qualitative ones, which we don't need to rehash. It should also be clear when annotations are defined up front: you're saying "I believe nonsynonymous variants are enriched for causality to some extent," whether that's stated before or after you look at the data. So really make clear what the biological rationale for any given annotation is, and try to estimate what the enrichment factor might be. And we have to be careful with dynamic and hierarchical use of annotations, especially after you start playing with the data; this can implicitly expand your search space and make it much easier to identify a non-reproducible observation or conclusion. Again, it's probably worth considering all the steps and all the assumptions that affect your interpretation of prior probability. To the extent that we do exome sequencing, for example, it might be worth saying explicitly: we believe there's some enrichment factor within the exome, and we're quantifying it, which places the study in the greater genomic context. We know we're sacrificing some sensitivity, but we believe the specificity we achieve by sequencing just the exome is worth it. It's worth being explicit about that and trying to put numbers on it to the extent that's possible. And with that I'll conclude, and we can discuss. All right, so lots of issues covered. We did have a discussion of functional annotation going on in the previous session, and we can carry some of that back into here. One key question that might be good to discuss is how close we are, or what the pathway is, to the point where we're actually treating non-coding variation with the same degree of rigor that we can for coding variation, where we can actually say that this variant is up in the top tier of variants expected to be potentially disease-causing. I don't know.
I mean, I guess the people here who've worked on the ENCODE data might be able to comment on how close we are to that point and what the pathway is to actually get there. Since you're looking at me, I can. Some of this may be recapitulated in our session this afternoon, but I think the point is that we are now able to look at individual variants in the context of a native genome and make decisions about whether that variant is actually, for example, affecting the binding of a protein, that is, having a measurable functional consequence. Of course, the challenge then is to connect that with other features that eventually get you to something clinically meaningful. Those types of data exist, but they don't exist comprehensively, meaning that right now projects like ENCODE or Roadmap have essentially sampled genomes opportunistically. Part of the difficulty, though it was not a difficulty by design, is that the emphasis has been on trying to capture the breadth of regulatory phenomena in these genomes, and therefore you want to look at a large number of different cell types. What that has necessitated is that every single cell type is pretty much coming from a different individual, so all of the variation is present effectively by accident. That's one level at which we don't have a lot. But at least for looking at, say, GWAS data, most of it is common variants, so many of the individuals carry the relevant alleles, and that feature has been exploited to say some things. So that's one caveat. The second is that the emergence of those properties within the annotations created so far depends, in many cases, on the literal sequencing depth of these assays. For example, to see clear evidence of abrogation of, let's say, protein binding at a particular variant in vivo, you need quite deep data, whether it's ChIP sequencing or DNA footprinting. Those data have emerged to the point where you can make these definitive calls for thousands of variants, but there are obviously huge numbers more. I don't know how or whether the strategy is going to shift, but the strategy currently doesn't really try to capture that variation systematically. There are ongoing projects, like the GTEx project, where there is going to be systematic sampling of tissues from the same individual, but one is not, at least in the current conception of the project, going to have that level of information from all those tissues, though one could imagine it coming in the future. And as we'll cover later, one of the other points is that directed assays do exist if there is a specific spot one is interested in. So that's one perspective on the current state of those projects. As a follow-up: to me, the near future will probably see a lot more connections to cellular phenotypes like eQTLs, being able to map a variant that disrupts a motif for a transcription factor, in a promoter and a hypersensitive site, and then measure that it actually does have an effect on transcription.
Those sorts of phenotypes are probably going to fall relatively quickly in a lot of ways. Being able to map eQTLs down to functional variants will happen relatively soon, but connecting that to disease is obviously much more difficult. So, in relation to the non-coding question, one thing that you raised, Greg, which maybe isn't fully appreciated, is the importance of very good alignments to, say, chimp and so forth. Within coding regions you might not worry about that issue, but when you go into non-coding regions, people don't fully realize how grotty some of these alignments, and the genome assemblies, are. Conversely, what has been very clear from listening to the discussion is how valuable conservation is: everyone would say conservation is a useful thing, and yet just trying to assess conservation in non-coding regions is actually remarkably tricky. And it shouldn't be, because with greater emphasis on the assemblies of these other primates and whatnot, we could probably do a lot better. I was just going to add that the nature of mutation in coding versus non-coding sequence is also likely to be different. The base-level resolution in coding versus non-coding is very, very different, and if that's true, and it could be an artifact of alignment but probably isn't, then it may be that in non-coding sequence you need bigger events, or a multitude of single-nucleotide changes acting as a group, to cause an effect on a non-coding function. So it's also going to need a different scoring scheme from what we're talking about for coding sequence and codon use. On that depleted signal in non-coding versus coding: you can actually do much better. If you look at the allele frequency versus conservation score relationship in just transcription factor binding motifs, for example, you can really steepen that slope. You never approach the information content of an exome, but you can substantially enrich for information content: instead of looking at all non-coding variants, look at non-coding variants that are in hypersensitive sites, for example, and it gets better. So, I have three cautionary notes about non-coding variants. First, the main reason we believe in the functionality of non-coding variants, and several people in this room contributed to it, is exactly this observation that conservation between species correlates with shifts in allele frequency. However, we've recently discovered a lot more complexity, with biased gene conversion being important in allele frequency shifts, and potentially background selection being important as well. Even though I do believe that what we see is a signal of direct purifying selection, the complexity of the picture is much greater than was appreciated in the beginning. The second caveat concerns how we think about conservation. The intuition is that more species will saturate the phylogenetic tree and conservation will then provide complete information. However, there are two caveats in using conservation information, and these are conceptual, theoretical points in evolutionary genetics that should be addressed. One is about what we're doing in conservation analysis.
We assume a constant fitness landscape. We think that a mutation which is bad for a human gene is also equally bad for the fish gene, the Xenopus gene, the mouse gene, and so forth, and it's not just that we don't live in the water like fish. What's important is that there is epistasis: there are compensatory changes, multiple compensatory changes, and all methods look at conservation at a single site. We looked, together with Nico Katsanis and his set of rescue experiments, at variants shown to be pathogenic in his zebrafish assays, meaning the human allele doesn't rescue the zebrafish phenotype, where that exact human mutation is the wild-type allele in the zebrafish gene, which works perfectly. We see about 8% of human mutations like this, and I just don't know what to do about it; it doesn't help to have a very large phylogenetic tree for conservation. The last cautionary note is about the use of intermediate phenotypes. We can talk about association with DNase sites, dsQTLs, and eQTLs. What surprises me, and I think this is more a question to the community: when we think about LDL as an intermediate phenotype for myocardial infarction, almost all LDL GWAS peaks have an influence on infarction. That works beautifully. But when I look at the eQTL data, we have thousands of peaks, a lot of signals obtained on small data sets, and it looks like a very, very small fraction of those signals is realized in downstream phenotypes. We all think that QTLs and intermediate phenotypes are useful, but we're somehow in a different reality here compared to LDL and the usual biomarkers. I was going to say that I completely agree there are intrinsic limitations to what measures like sequence conservation can tell us, but at the same time there's a lot of room for growth. I would love to see a high-quality assembly of every primate species, and we're in a position where that's a very realistic goal; we're talking about hundreds, not millions, of genome assemblies. So we're nowhere near saturating the amount of information we can get from these kinds of comparisons, and I think that will be really, really useful for interpreting mutation events. The other thing people should realize, too, in terms of the assemblies, is that people could do functional genomics on other primates as well, and I think that really helps align things and put things together. There are a lot of simple things you could do to do a lot better in this area.
So I'm just going to make a couple of points. The first: we've talked a lot about creating a database of true causality, and that's what we really need to test these prediction equations; we don't quite have it yet, and I think Greg has made that point. The other is about conservation. We're essentially doing an experiment with an N of 5 to 15, and we have to remember that every one of these other species has its own genetic variation, and we've picked one individual from each species to be representative. While looking across multiple species does give us multiple individuals, they're in different species, so we're really looking at very small numbers, and each one will have its own private, rare, and common variants, and who knows what we're seeing at any given spot. There's a technical comment on that, but it's actually kind of important when you're looking at conservation scores: you should actually strip out the human sequence when scoring (a toy illustration follows after this exchange). What happens is that when the human position is polymorphic, the reference assembly captures one of the polymorphic alleles, which tends to introduce what looks like a substitution, and that carries a very large penalty: from the human-chimp ancestor to human it looks like a fixed substitution event, essentially one change on a very tiny branch. So it really deflates the conservation score, and you get a tautological correlation between conservation and allele frequency. So this notion of diversity within each of those species is important; it can have a real effect on the scores. I think one important thing that needs to be conveyed is that there are many people right now in the community who completely equate conservation with function. But conservation is a way to infer function; it is not function itself. Function in biology, at least as it is practiced now, comes out of physiological models where you can break things and watch them act. So that point has to somehow be encapsulated. I think Greg's division of things into those three different categories is important, to really explicitly spell out the difference between damaging, deleterious, and pathogenic. Making that point as explicit as possible would help get it out. Do we have the tools now to distinguish among those three categories? I would posit the answer is no, and then the question is how we get those resources. I don't think we have tools to distinguish them, but we have anecdotal evidence, and I think the clearest examples where deleterious and damaging go in opposite directions are our examples of advantageous pseudogenization. We have very convincing stories of pseudogenization events being supported by positive selection, being potentially beneficial for fitness, so you wouldn't call them deleterious; but being loss-of-function events, they are clearly damaging.
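Going back to the technical comment above about stripping the human sequence out of the alignment, here is a toy illustration. The alignment, species set, and the crude "fraction of the majority base" conservation measure are all invented for the example; real pipelines use phylogenetic models, but the deflation effect is the same.

```python
# Toy per-site conservation: fraction of the most common base in a column,
# computed with and without the human row. Alignment is hypothetical.
from collections import Counter

alignment = {
    "human": "ACGTA",   # reference happens to carry a polymorphic allele at site 0
    "chimp": "GCGTA",
    "mouse": "GCGTA",
    "dog":   "GCGTA",
}

def conservation(rows, site):
    col = [seq[site] for seq in rows]
    return Counter(col).most_common(1)[0][1] / len(col)

rows_all = list(alignment.values())
rows_nohuman = [s for k, s in alignment.items() if k != "human"]
print("with human:   ", conservation(rows_all, 0))      # deflated by the human allele
print("without human:", conservation(rows_nohuman, 0))  # scores the site itself
```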
So we have a variety of stories where, and I totally agree with Greg that in most cases these three notions correlate and coincide, various studies show examples where damaging is not deleterious, deleterious is not damaging, pathogenic is not deleterious, and so forth. And it sounded like there was consensus on that point, that everyone agreed they are not the same. I think the question is how we then go forth and figure out which is which, and it may differ by context: what is damaging and deleterious in one environment, for example, may be damaging and not deleterious in another. In the current paradigm of molecular evolution, that's extraordinarily difficult, because what we're doing is relying on a neutral standard: we look at parameters of variation that we think are neutral, or close to neutral, and compare them to what we think is functional. So the primary method of inferring selection relies on functional evidence, on something being damaging or functionally significant. I think we need completely new approaches to address these issues. In the well-known knockout experiments where you have complete conservation in an ultraconserved element, you do the genetic experiment and detect no phenotype; this exemplifies that making the inference purely on evolutionary grounds is extraordinarily difficult. So where do you think conservation-based methods are right now, if you were going to throw out a ballpark number for predictive accuracy, and where could they get to, given the gaps in knowledge? In terms of predicting whether something is deleterious in coding regions? Let's take the example of coding regions, at present. Not great. For example, we can go through and find lots of nonsynonymous variants at highly conserved positions at high allele frequencies. I couldn't tell you what the positive predictive value truly is at a given site, but it's not high; there are lots of exceptions, and I really don't know what the percentages would be. I can probably quantify it with one simple number. You can plot a pie chart of human common nonsynonymous SNPs, variants you believe are benign, and compare it with what we think is disease-causing, and ask in how many cases you see exactly the same amino acid variant not just in a single vertebrate, but in, say, two or three species at diverse points on the tree. The difference is dramatic: among changes you observe three times at different points of the tree, the fraction that is disease-causing is, I think, below two percent; it's very small. Can you put up number 14 on the screen? We've gone through and taken our pathogenic and benign variants, classified by independent methods, and used them to look at some of these different predictive algorithms. Just to give people a sense, this was done with variants classified through our clinical laboratory. So this is looking at Align-GVGD, with the variants we've pre-classified as benign in blue, and those classified as pathogenic, again through independent methods of classification; you can see they actually work a little better for predicting benign.
For Align-GVGD, the C0 class is equivalent to benign and C65 to pathogenic, but the variants are all over the map. This is PolyPhen-2, where again the benign predictions are a little better than those at the other end; there are two different classifiers for PolyPhen-2, the other one being HumVar. This is SIFT, tolerated versus deleterious: on deleterious it's about a 50-50 shot, flip a coin, but on tolerated it's a little more accurate. This is the Grantham difference, and these are BLOSUM scores, BLOSUM62 and BLOSUM80. We've gone through and done this sort of comparison against our database for a lot of these different tools, and I think we've all come to the same perspective: these approaches are good for filtering data and putting some priority on things, but individually they're not great. Now, I will say that we did work with Shamil on a very specific project to develop a sarcomere PolyPhen for eight specific genes. Shamil's group did a lot of work to build better alignments, and Shamil can explain it better than I can, to train the algorithm to do better. And I said there has to be a zone in the middle of no call: don't try to call everything benign or pathogenic; allow some things to be not called (a toy version of this three-way calling is sketched below). With that, we were able to get much better accuracy on what was being predicted as benign and what was being predicted as pathogenic. So I think you can take these tools and make them better, but I think the tools that are accurate are going to be different for each gene, because the types of mutations that are dysfunctional or deleterious are different types of things; you really have to do some significant training. I just put that out there. Yeah, that's a perfect lead-in to what I wanted to say, which is that the tools themselves are probably fine, but it's the choice of when to apply them. I think I'm stating the obvious, but the utility of a conservation-based method for making these kinds of predictions is going to rest entirely on whether the species you're looking at share the phenotype of interest. The reason it works well for basic metabolism is that we all need to do glycolysis; all species do glycolysis and the TCA cycle, mammals have to glycosylate, and so on. And it's totally inappropriate to look at conservation if the phenotype is not shared across those species. My favorite example is autism: dogs have a trillion autistic features, so using dogs in your conservation set when you're searching for autism variants is just a big mistake, because they almost certainly carry causative variants as part of their background genome. So I think the key point is that you really have to define your conservation background to be relevant to the biology in question, and we could probably get even the existing methods to perform a lot better if we just applied them in a more sensible way, to an appropriate background set of species.
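A sketch of the "no-call middle zone" idea from the sarcomere PolyPhen discussion above: only call benign or pathogenic when the classifier probability is confidently on one side, and abstain otherwise. The probability cutoffs here are illustrative, not the thresholds actually used in that project.

```python
def call(prob_pathogenic, lo=0.2, hi=0.8):
    """Three-way call with an abstention zone; cutoffs are illustrative."""
    if prob_pathogenic <= lo:
        return "benign"
    if prob_pathogenic >= hi:
        return "pathogenic"
    return "no call"   # sacrifice coverage to gain accuracy on the calls made

for p in (0.05, 0.5, 0.95):
    print(p, "->", call(p))
```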
Yeah, I was going to follow up on that. This is where being more transparent about the biological rationale for any given annotation is important, right? If I'm doing a pharmacogenomics study, selection is probably not going to be all that useful, because I'm interested in how we respond to some new molecule that people have never been exposed to before. Whereas if I'm studying, say, a serious limb defect, chances are all mammals use the same molecular process to produce that limb, so there's a much higher chance of conservation being relevant to that kind of problem than to pharmacogenomics. But again, it always has to be clear that none of these, not conservation, not function, is a one-to-one predictor; that's just not going to happen for a very long time. It always has to be taken in context: conservation is one bit of data that is very useful on average but subject to a lot of variation at the individual-site level. Yeah, and I think what we learned in this project with Heidi is that, and of course it's very important to use gene-specific features and a specific training data set, we were not able to come up with a binary pathogenic-or-benign classifier with any level of accuracy that a reasonable clinical lab would use; hence this middle category. It is possible to sacrifice coverage and improve accuracy, but you have to sacrifice a lot of coverage. And we had a discussion with Sharon Plon at Baylor, who said that they counsel patients on treatment options based on genetic diagnostics and use conservation as one of their criteria. She said it's very important to have a big middle category, because if we counsel one way, and the next day there's an alligator genome sequenced and the conservation score changes, then I have to meet with the patient and say, "because there is an alligator genome..." It's an impossible conversation. And I was going to say that one of the categories we hate the most is primary immunodeficiency. The immune system is not conserved, even across primates, so you can have a hundred primates and it's still not going to help you figure out an immunodeficiency gene; the only way you can do it is by functionally looking at the effects. Whereas when I'm looking at my mitochondrial complex II and complex III proteins, they're conserved all the way down to yeast, so it's pretty easy. So again, there's this concept that the context in which you use a conservation score is very important. It sounds good to try to think about the context in which you're using conservation, but I don't think it's really doable all that readily. Yes, sure, autism is not really the same kind of trait in dogs, or else maybe it's selected for and all that, but the genes where those mutations land are of course not "genes for autism"; they're genes for something else that are under strong selection for a variety of reasons. So you're looking at autism, but that's not really what natural selection at the gene level is about, and it's actually quite difficult to account for that. Despite that, Shamil, do you know if anybody has looked at the relationship between the tendency of Mendelian mutations to be at conserved sites as a function of the type of disease they influence, and what kinds of variation we see among classes of disease, to try to get at this a little more systematically? I don't know of studies beyond dominant versus recessive; for specific phenotypes I can't recall any such study. But I completely second your opinion: it is very likely that a lot of selection is pleiotropic.
To relate a selection signal directly to a biological effect on a specific phenotype may not be the right way to think about it. Just one comment: another place where conservation might have a disconnect is that there could be relatively quickly evolving sites in a protein that look weak by any measure of conservation, but if you put a stop codon there it's bad, because it's in the middle of the protein. That site is resistant to stop codons but nothing else, in which case the conservation score misleads you. So again, it's always going to be this hybrid of molecular function, evolution, and everything else you can think of to use for interpretation. I think there's one more aspect: as Greg mentioned, the conservation story is by no means complete, and there are other aspects of it that are not systematically incorporated at the moment. I'll give you a concrete example. If you look at, say, the recognition site of a DNA-binding protein, what you typically find is that some nucleotides are conserved and some are not, where by "conserved" we mean seeing effectively the same letter show up across species. But if you take one of those non-conserved nucleotides and remove it, the site stops functioning. So there is an element of spatial conservation, relevant for things like indels and all kinds of events that show up in genomes, that is not being captured at all, and that's a major area that could potentially be systematically incorporated. Right, and just as a secondary comment: something where we're not there yet with conservation, but could eventually get closer, is more allele-specific scores. There are some motifs where you can have an A or a G but you can't have a T or a C, and we don't really capture that at all with simple labels like "highly conserved" versus "not conserved." With deeper evolutionary data you could start to imagine doing that; we do it with proteins, for example, where you can say a position is all valines and isoleucines but nothing else. You can start to see patterns of substitution that are informative beyond a simple measure of the rate of evolution, and we could conceivably get there with non-coding sites if we had more genomes and more motifs (a toy allele-specific scoring sketch follows below). But we're a long way from that. I have a general question, which is this: if, as seems likely, conservation is going to be applied regionally in the genome, if not gene by gene, how is that going to work? Right now it seems there are a lot of people out there using it because it's a simple concept to understand and there's data you can look up. But how are more sophisticated tools, or more sophisticated ways of using conservation, going to be pushed out to the community?
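To illustrate the allele-specific scoring point above (motifs tolerating A or G but not T or C), here is a minimal sketch that scores each possible allele from one column of a multiple alignment, rather than assigning a single conservation value to the site. The column, pseudocount, and uniform background are illustrative choices.

```python
import math
from collections import Counter

def allele_scores(column, pseudocount=0.5):
    """Per-allele log-odds vs a uniform background, from one alignment column."""
    counts = Counter(column)
    total = len(column) + 4 * pseudocount
    return {a: math.log2((counts.get(a, 0) + pseudocount) / total / 0.25)
            for a in "ACGT"}

# Hypothetical column from 20 aligned species: A and G tolerated, T and C not.
print(allele_scores("AAAAAAAAAAGGGGGGGGGG"))
```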
I mean, I do think, and I just want to bring up the thread I raised earlier about structure, that if you want to get to the next level, whether in looking at conservation in a DNA-binding motif or at what's going on in gene-level conservation, the next step is to actually think about the molecule, to think about it as a three-dimensional structure. What I think is remarkable is that a tremendous amount of work has gone into developing structures of things, and it's actually amazing how uninformative, as was pointed out, that has been so far. That's the reality of things, but that would be the next level: you start to think about base substitutions in terms of base pairing, or binding, or what's actually physically going on. So maybe this is something that you, Shamil, and Heidi are already working on with respect to this example, but this seems to have the seed of something which can be brought together more holistically with some of the things we were discussing before the break. When I see this and think about the very constructive discussions we had earlier in the morning, I now think of it this way: a patient is going to present with a certain phenotype and you're going to perform exome sequencing. Based on the history of patients presenting with that phenotype, we're going to have a view, and you probably already do, of what fraction of patients actually end up having a mutation in this gene, or at least a candidate variant in one of these genes. It's also important to know what fraction of the general population, who aren't presenting with HCM, have rare variants in those genes. Then, for the variants in the subset of individuals in both of those categories who do carry them, how often do they break down into these groupings? You can imagine getting, from the starting point of a whole exome, with this type of annotation and the other types of annotation we can bring in, to an ultimate degree of confidence in specific variants. That may be a community outcome: while there won't be a one-size-fits-all answer, there are many clinical areas in which we have 10, 15, 20, 25 genes that we customarily interrogate, and which we expect to explain, give or take, half of the cases that present, and I think we can move as a community towards treating that in a consistent and very productive way. I wanted to comment simultaneously on John's question and Mark's question about where we're going, and this is one possibility; I don't know whether it's a realistic one. Many people know that different genes, different phenotypes, and different functional categories behave differently, so you could largely customize: develop the method specifically for each phenotype and gene context. And to the point about structure, in this particular case there is a unique property of some of those proteins: we have, structurally, an active conformation and a non-active conformation, and you can track the movement of each amino acid between the active and inactive conformations. That happens to be a very useful feature. It's a structural feature, it's very useful, and there's no way you can generalize it, right?
It's specific to sarcomere proteins, and we were lucky to have the structural pairs. So this is a potential way forward, but maybe it's overambitious to think that this is where we should go on a global genome scale. So, the number of individuals working on conservation is relatively small, probably, compared to the number of clinical labs out there dealing with the data, and it's also more organized. One of the key themes that keeps coming up over and over again is the value of centralization, and it would seem that, since so many people are interested in applying conservation, and we agree there's value and room to grow and expand, it would perhaps be a nice model to try to build a centralized resource that could be used systematically, so that at least all the clinical labs are using the same set of scores. How feasible is that? I would think it's quite feasible, given that the community is already organized, at least the people working on conservation. Yeah, I think it's fairly practical, especially since ENCODE has already essentially gathered most of those people under one umbrella anyway, so unifying those kinds of measures is a very conceivable goal. There are also organizations, CAGI, the HGVS, ISCB I think, groups which try to get people together and come up with standards; I'm not sure that's going to work, but there's some activity. And you could imagine not only organizing the community but also, for example, having a placeholder symposium or whatever built around this topic at the annual ASHG meetings, to increase awareness and so on. So, maybe for the epidemiologists at the table: could you very carefully respecify what it is that you need, specifically? So wait, what is it that you're talking about building? Is it a standard way of applying conservation scores, or of estimating them, or some different sort of description of conservation, so that it's not just the allele but something else? What is it that you'd like to articulate there? Well, my thought in proposing it is that there are continuous developments going on, and Sharon Plon doesn't need to align the alligator genome herself. There should be some central resource that does that; in other words, there would be great value in having one place that always has the most up-to-date information, agreed upon by the community that's generating it, for the annotation. But it sounded like you were talking a little more about conservation being a blunt tool right now, the way it's applied, and wanting some more specific way of applying that tool. Did I hear that right? Well, there you're talking about the gene-specific piece. If there were a centralized resource, that would be the ultimate place to start pushing out things like gene-specific annotations.
That's also where there could be feedback from the community. Obviously it requires work — it requires patient information and interactions — but potentially there are many more genes that could be annotated in the same way, and centralization is the key to doing that.

Maybe also unified benchmarks of accuracy in specific contexts — if the community can develop that type of standard, it may be useful.

So one of the challenges I've had is that there are a lot of tools out there. A colleague of mine had gathered, I want to say, 18 different tools, and was trying to do a similar validation to the one we had done with PolyPhen. He took a set of known mutations — benign and pathogenic — in a few different genes to validate against, and he used a computational approach to combine the tools three at a time: bring in all the tools, and figure out which set of three methods combined together gave the best predictions. He did this computationally across thousands of combinations and selected the three tools that were optimal for each of the three different genes. They actually came out different for each gene, which underscores, I think, what we were talking about: each gene may have its own characteristics that drive that. But at the end of the day, we don't understand the accuracy of any of these tools, and so it's been quite arbitrary what people have actually chosen to use — oh, I use PolyPhen, or I use SIFT. Right now, because we use Alamut as a way to get access to some of this data more efficiently, we end up using the three that Alamut, as a software package, has embedded in it, which are PolyPhen, SIFT, and Align GVGD. But it's highly arbitrary today what labs use, and there's just no capacity to use all 18 of them.

And then I also think these tools — whether it's 18 or whatever the number — get treated as more or less independent of each other. So there's a tendency to say, well, if I use more of them, and the 10 I'm using all say the same thing, that's better accuracy than having just one tool say one thing — which may or may not be true, depending on the independence of the underlying methodologies each of these tools uses. So I think it would be great to have some sort of comparative process to say: these five tools use very different underlying information, so if you use three of them you'll have independent assessments, and if they all agree you have a higher probability of being correct.
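The combination search described above is easy to sketch: try every trio of tools as a majority vote against a labeled benign/pathogenic set and keep the best trio per gene. The tool names, predictions, and labels here are hypothetical, and simple accuracy is only one possible scoring metric.

```python
# Sketch of an exhaustive per-gene search over 3-tool majority votes.
# Predictions and labels are hypothetical strings: "pathogenic" or "benign".
from itertools import combinations

def majority_vote(predictions, tools):
    """Call 'pathogenic' if at least 2 of the 3 selected tools do."""
    votes = sum(predictions[t] == "pathogenic" for t in tools)
    return "pathogenic" if votes >= 2 else "benign"

def best_tool_trio(variants, all_tools):
    """variants: list of (per-tool prediction dict, true label) pairs."""
    best = None
    for trio in combinations(all_tools, 3):
        correct = sum(majority_vote(preds, trio) == label
                      for preds, label in variants)
        accuracy = correct / len(variants)
        if best is None or accuracy > best[1]:
            best = (trio, accuracy)
    return best  # (winning trio, its accuracy on this gene's variants)
```

Run per gene, this is exactly the situation where different genes can end up with different optimal trios; and if two tools share the same underlying methodology, their agreement adds little, which is the independence caveat raised above.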
But anyway, long story short: I think it's still challenging. Even if we had a centralized place, what are the methods that should be used to do this? And the other thing is that, for homology and conservation, being able to visually inspect the alignments is critical, because there are always errors in those alignments. I want to see not only that the position is conserved, but that it's conserved on the background of a conserved region — because if the whole region is non-conserved, then it probably does something different and I shouldn't be trusting that data. That's why we always make our fellows take a screenshot of the alignment and paste it in for the geneticist to review, rather than relying on an automated numerical conservation score alone.

I just want to follow up on that point — maybe Shamil was alluding to this — but one nice way of organizing this would be to have a competition, like CASP, where people try to predict deleterious mutations; people in these communities would hold some data back, and then you would reveal it at the event. I think that would be quite useful. You maybe have something to say on that?

John Moult and Steven Brenner run CAGI, the Critical Assessment of Genome Interpretation. I don't know how many people here have participated in the challenges. I think there will be a challenge this year — smaller than last year's. People are still trying to find the right way to run the competition, the blind assessment of the tools. It also includes complex phenotypes: last year some participants were able to predict whether somebody has Crohn's disease with 90% accuracy, which is way above what heritability would suggest is achievable. So there are certain things still to be worked out in those competitions, but the initiative is out there — please check it out.

I think one thing that should be said is that there is an existing paradigm for resolving these things which has not really been employed. We think of clinical medicine as, oh, they're just off seeing patients, but we're not the only ones with all the toys. There are lots of rules and scores in clinical medicine — in fact, there are thousands of these risk prediction rules. Somebody does a study — I've got 500 patients showing up with epistaxis; what are the causes, how can I predict their outcomes, and so on — and these things show up all the time. Inevitably what happens is that 20 different risk prediction scores emerge, and they don't just duke it out at meetings. Somebody organizes a prospective study, because everything we're doing so far is retrospective: take the models, put them forward, find the resources, and put together a prospective evaluation. At that point, out of the 20, there's usually a winner — or maybe a couple in certain circumstances — and it clears up the field dramatically. I think that is something that can clearly be fostered by NIH. Other comments? Great.
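On the earlier point in this exchange about wanting a position to be conserved against a conserved regional background, rather than trusting a single site-level number, a small sketch of that check might look like the following; the window size is arbitrary and the scores are whatever per-position conservation values a lab already has in hand.

```python
# Sketch: compare a position's conservation score to the average over its
# surrounding window. Window size and thresholds are illustrative only.
from statistics import mean

def site_vs_background(scores, pos, flank=25):
    """Return (site score, mean score of the surrounding window, excluding the site)."""
    lo, hi = max(0, pos - flank), min(len(scores), pos + flank + 1)
    background = [s for i, s in enumerate(scores[lo:hi], start=lo) if i != pos]
    return scores[pos], mean(background)
```

A high score sitting on a low-scoring background (or vice versa) is exactly the case the speaker wants a human to eyeball in the raw alignment rather than accept from an automated number.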
Well, we are just about at lunch. We had actually scheduled lunch for 12:40, and we do respect the fact that people like to get out a little early when they can, but when we need to continue a discussion we certainly will. It's hard to get people back at five minutes before the hour, but if you could come back at five minutes before the hour, we'll plan on starting the experimental data talk and discussion. There are lunches for those who ordered them just outside here. You only have 20 minutes, so it's going to be very tough to get anything anywhere else — but run if you can, and we'll see you back in 20 minutes.