We will move on now to Eric Boerwinkle, who will speak for 15 minutes on finding rare variants contributing to complex disease risk. Thank you, Eric. The floor is yours.

Thanks. I'm going to speak quickly for 15 minutes. I've divided this into a series of questions. First: rare variants, who cares? I have colleagues, particularly clinical and epidemiologic colleagues, who basically say, "Well, I've never seen such-and-such disease, and I treat hypertensives every day." I think that definitely misses the point, and we've beaten this pony pretty hard. One point about PCSK9, though, that hasn't been made yet: it's not enough just to show that it reduces LDL cholesterol. These are loss-of-function mutations. For those of you who don't know this story, these are African American ARIC study participants, and you can see that there are loss-of-function variants that lower LDL cholesterol, but more importantly, they lower cardiovascular disease risk. So you're not just lowering the quantitative trait; it actually translates into a clinical outcome, and I think that closes the story. We're following these individuals very closely; we've brought them back in on multiple occasions and phenotyped them more carefully, so we can say publicly that they don't have obvious adverse other traits, though we're still looking at them very carefully. And not only that, pharma has jumped all over this to identify novel therapeutics. This is actually a poster from a meeting, I believe in October at the Natcher Center. A number of people are now looking for inhibitors of PCSK9 because, one, the variants lower LDL cholesterol; two, they lower cardiovascular disease risk; and three, the individuals that carry these mutations seem to be phenotypically and clinically within the normal range. Putting all three of those things together, not any one of them, makes a very nice druggable target.

The other question is: how many are there? We get asked this a lot. This is a slide from the ESP exome data set that I like. There are a number of metrics on the x-axis of total diversity in a sample, or in an individual averaged over individuals in the sample. What you can do is partition that metric, whatever metric you decide on, into the contribution from various frequency bins. You'll notice the bulge at the top: the contribution of low-frequency and rare variants to the total diversity in the sample is pretty high. Even though it's fairly flat in the middle range, a large amount of the total genetic diversity in the sample is attributable to these rare and low-frequency variants. And what's still amazing to me, having done this now for several years, is that the more individuals you sequence, the more novel rare variants you identify. These are two different ways of looking at it, either a fairly tight target in very many individuals or a broader target, and the curve looks the same even for entire exomes: the accumulated number of variants keeps increasing, and I'm not sure it would ever plateau, except when every nucleotide in the genome is variable in some individual.
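To make the binning idea above concrete, here is a minimal sketch, not the ESP analysis itself, that partitions one common diversity metric, expected heterozygosity summed over sites, into the contribution from rare, low-frequency, and common bins. The bin boundaries and the simulated allele frequencies are invented for illustration.

```python
import numpy as np

def diversity_by_maf_bin(mafs, bins=(0.0, 0.005, 0.05, 0.5)):
    """Partition total expected heterozygosity (sum of 2p(1-p) over sites)
    into the contribution from each minor-allele-frequency bin."""
    mafs = np.asarray(mafs)
    het = 2.0 * mafs * (1.0 - mafs)          # per-site expected heterozygosity
    total = het.sum()
    labels = np.digitize(mafs, bins[1:-1])   # 0 = rare, 1 = low-frequency, 2 = common
    names = ["rare (<0.5%)", "low-frequency (0.5-5%)", "common (>5%)"]
    for k, name in enumerate(names):
        share = het[labels == k].sum() / total
        print(f"{name:>25}: {share:6.1%} of total diversity")

# Illustrative site-frequency spectrum: many rare sites, few common ones.
rng = np.random.default_rng(0)
mafs = rng.beta(0.2, 5.0, size=50_000) * 0.5
diversity_by_maf_bin(mafs)
```

Even with most sites sitting at very low frequency, the rare bin contributes a sizable share of the total, which is the "bulge" being described.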
So there is an argument for sequencing very large numbers of individuals, because at least within the ranges we're working with today, which are very large numbers, by the way, we are still identifying novel rare variants that we had not seen before.

The other reason we're interested in low-frequency and rare variants is that they tend to be deleterious. This graph shows the proportion of variants in each bin that are predicted, by computational prediction algorithms, to be deleterious, comparing those that are low-frequency with those that are fairly common. You can see that the proportion predicted to be deleterious in the rare and low-frequency bin is quite high.

Where did they come from? Obviously they came from mutation, but as you know, it's a little more complicated than that: why are there so many of them? This is a nice cartoon, or caricature, of human population expansion from a paper by Boyko, and there are several versions of it. Modern humans have grown super-exponentially in recent time and have expanded as a species throughout the globe in a very short timeframe. What that means, at least intuitively (so the population geneticists can bite their lower lips), is that through this rapid population expansion we have in some way overcome the role of purifying selection, so as a collection of individuals we carry a large number of these deleterious variants that are still in the sample. The way I look at this, when you have a sample of individuals, is to ask where these people come from. We all know about de novo mutations, down here, and many of us are looking at the role of new de novo mutations, particularly in dominant disease, in finding disease genes. Going back into fairly recent history, you can think of this as a very large pedigree carrying these low-frequency and rare variants; the Brownings and Bruce Weir have leveraged this very nicely, using identity by descent to identify regions of the genome that contain rare variants contributing to disease. And up here, where you can consider us to be unrelated individuals, you have common variants with large effects. At least intuitively, this is how I look at the human family and the role of rare and common variants.

The other thing that's kind of interesting: because of the history of human genetics, everybody thinks these rare variants are only in sick people, that obviously they're not going to be present in these cohorts, that they're only in sick people ascertained through a medical genetics clinic. That is definitely not the case. There are a couple of ways of looking at this. Again from the ESP paper that was in Nature recently, it's interesting to look at the number of functional, or again predicted-to-be-functional, variants per individual: it's on the order of a thousand, a little more than a thousand. So each one of us carries about a thousand amino acid substitutions that prediction algorithms say should be, quote, "functional," whatever that means.
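A minimal sketch of the tabulation behind the deleteriousness graph, with invented data: given per-variant minor allele frequencies and a binary "predicted deleterious" flag (from a PolyPhen- or SIFT-style algorithm), compute the proportion flagged in each frequency class. The enrichment level and the 0.5% cutoff are assumptions for illustration.

```python
import numpy as np

def deleterious_fraction(mafs, deleterious, cutoff=0.005):
    """Fraction of variants predicted deleterious among rare vs. common sites."""
    mafs, deleterious = np.asarray(mafs), np.asarray(deleterious, bool)
    rare = mafs < cutoff
    print(f"rare   (MAF < {cutoff:.1%}):  {deleterious[rare].mean():.1%} predicted deleterious")
    print(f"common (MAF >= {cutoff:.1%}): {deleterious[~rare].mean():.1%} predicted deleterious")

# Invented example in which deleterious calls are enriched among rare variants.
rng = np.random.default_rng(1)
mafs = rng.beta(0.2, 5.0, size=10_000) * 0.5
p_del = np.where(mafs < 0.005, 0.45, 0.10)   # toy enrichment, not real data
deleterious_fraction(mafs, rng.random(10_000) < p_del)
```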
This goes back to what I was asking in the previous session. In a sample of ARIC (Atherosclerosis Risk in Communities) participants, we've sequenced several thousand individuals, and in this sample you find individuals carrying mutations that, according to OMIM, should produce various phenotypes. I've shown just two here. What's shown is the histogram of fibrinogen in the entire sample of 16,000 individuals, with the individuals that carry these variants marked on it. In the cases shown here, the carriers actually sit in the tails. But you can find carriers, I'm not showing it, for HDL cholesterol, for example, who, although high, sit within the tail of the overall distribution and would not be considered clinically extreme, even though according to OMIM they should have all of these phenotypes. They're perfectly fine, except that they have very high or very low fibrinogen or HDL cholesterol, for example.

How do you analyze rare variants? This is probably well known to many of you now. The bottom line, and I've selected one simple example for those of you who don't do this, is that you have to come up with some metric that collapses the variants. I think Peter said it very well last night: this is the way we're doing it today, and if we're still doing it this simply five years from now, we've probably got a problem. We have to figure out more clever ways of combining information across low-frequency and rare variants in a sample than the simple counts we're currently using. Counts are great, but there's more in the data than counts, and we need to think of more sophisticated ways of doing this.

One of the things in the write-up for this meeting was sample size and power. These are two slides from Suzanne Leal that I like, which make the following point: people across the country and around the world have published numerous power calculations for sample size, and I'm not going to go through them. The bottom line is that when the data fit the assumptions of the power calculation, all of these methods do fairly well, and when they don't, there's a lot of variation. So I think we have to be very careful about taking very naive power calculations and using them to dictate that we need a sample size of so many tens of thousands, because we don't actually know the underlying genetic architecture, and we don't yet know how we're violating those assumptions. I would take some of this with a grain of salt.

Finally, closing up: in thinking about large-scale sequencing, we should not be limited to exomes. We need to figure out when we're going to make the transition from exomes to whole genomes, and then, frankly, how we're going to analyze whole genomes. In the CHARGE consortium we've now sequenced 2,500 whole genomes, and by the end of the summer our sample size will be 4,000; that's our target. I'm obviously not going to go through the numbers on this slide, but it shows that when you sequence whole genomes there is a tremendous amount of variation, and the goal, for both study design and data analysis, is how to reduce that, or increase the signal-to-noise ratio, and pull out phenotypically important candidate variation underlying disease.
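For readers who don't do this, here is a minimal sketch of the count-based collapsing ("burden") test described a moment ago, in the spirit of CAST: an individual is a carrier if they hold any rare allele in the gene, and carrier counts are compared between cases and controls. The genotypes and case labels are invented, and real pipelines differ in many details.

```python
import numpy as np
from scipy.stats import fisher_exact

def burden_test(genotypes, is_case):
    """Simplest collapsing test: flag each individual as a 'carrier' if they
    hold at least one rare allele anywhere in the gene, then compare carrier
    counts in cases vs. controls with Fisher's exact test.

    genotypes : (n_individuals, n_rare_sites) array of 0/1/2 allele counts
    is_case   : (n_individuals,) boolean
    """
    carrier = np.asarray(genotypes).sum(axis=1) > 0
    is_case = np.asarray(is_case, bool)
    table = [[int((carrier & is_case).sum()),  int((~carrier & is_case).sum())],
             [int((carrier & ~is_case).sum()), int((~carrier & ~is_case).sum())]]
    odds_ratio, p_value = fisher_exact(table)
    return odds_ratio, p_value

# Toy example: 8 rare sites in one gene, 2,000 individuals.
rng = np.random.default_rng(2)
geno = rng.binomial(2, 0.003, size=(2000, 8))
case = rng.random(2000) < 0.5
print(burden_test(geno, case))
```

Note how much information this throws away: every site counts the same regardless of frequency, annotation, or direction of effect, which is exactly the "counts are great, but there's more in the data" complaint.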
We're doing this in a number of ways, and I'm not going to sell you on any one. Shown at the bottom are some metrics about coverage and numbers of variants; this is just a snapshot. First, we're running burden tests in annotated regions, and this happens to be a gene. Second, we're using ENCODE and ORegAnno to annotate non-coding regions, and then doing burden tests across the genome in those non-coding regions. Third is what we call a "super GWAS": every site here is a variant that is not rare in frequency, and we can test those individual variants. And finally, we're running a sliding window, a burden test in a sliding window across the entire human genome, looking for variable sites. We have a few positive controls, so we know this kind of strategy works. Whether this is the strategy, well, I can tell you it is a strategy for addressing whole human genomes; whether it's a very good strategy or not, history will tell, and certainly we all need to be thinking about how we're going to analyze whole genomes in the future.

So that's the end, staying within time. The bottom line is that the way I look at rare variants is through these questions, and as we think about sequencing in large-scale cohorts, in arguing to sequence and not just genotype (I forget who made this point earlier), we need to think about the role of novel rare variants and how we're going to use that information. Thank you.

Thank you. Two minutes under; that was brilliant. Okay, good. Just to pick up on one of the points you raised, Eric, let me ask you to comment on the issue of surveying rare variants. We're always going to see many, many rare variants, and the question is which ones we pull out. Do you think we're at a point where we can rely on statistics, even with weighted tests, or are we at a position where we have to consider a lot of ancillary information: corollary clinical information, coded data, laboratory evaluation, and the like? Whatever we choose, I want to put two points on the table. One is that we have to think about validating anything we think is really real and important, simply because there are technical issues of false positives and false negatives. The second is that validation means showing the relationship somewhere other than the discovery set, which allows us to confirm that it is really robust, and not a unique relationship in one individual that would never replicate and that we could never act upon.

Well, there are better statisticians in the audience than I am, but I like statistics, and I think we will rely on statistical analyses. However, unlike what we're doing today, I think we're going to need to incorporate, whichever word you used, this ancillary information and bring it into the analysis. We're currently talking about weighting by predicted function, but not very many people are doing it. So I think we need to figure out how to look at the entirety of the data and not just focus on a simple burden test of cases and controls. The advantage of sequencing deeply phenotyped individuals is that we do have a lot of ancillary information.
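As one illustration of "weighting by predicted function": a sketch of a weighted burden score that combines Madsen-Browning-style frequency weights with an annotation-derived weight. The weighting scheme here is one plausible choice for bringing ancillary information into the analysis, not the CHARGE pipeline, and all inputs are invented.

```python
import numpy as np

def weighted_burden_scores(genotypes, func_weight=None):
    """Per-individual weighted burden score for one region (gene or window).

    Madsen-Browning-style weights up-weight rarer sites, w_j = 1/sqrt(p_j(1-p_j))
    with p_j the allele frequency at site j, optionally multiplied by an
    annotation-derived weight (e.g., a deleteriousness score in [0, 1]).
    The resulting score replaces a raw carrier count as the predictor in a
    regression on phenotype.
    """
    g = np.asarray(genotypes, float)             # (n, sites), 0/1/2 allele counts
    n = g.shape[0]
    p = (g.sum(axis=0) + 1.0) / (2.0 * n + 2.0)  # smoothed allele frequencies
    w = 1.0 / np.sqrt(p * (1.0 - p))
    if func_weight is not None:
        w = w * np.asarray(func_weight, float)
    return g @ w                                 # one score per individual

rng = np.random.default_rng(3)
geno = rng.binomial(2, 0.004, size=(1000, 10))
func = rng.random(10)        # stand-in for a predicted-function score per site
scores = weighted_burden_scores(geno, func)
```

The same score computed in overlapping windows is one way to implement the sliding-window scan mentioned above.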
I think the other point, though it's not in the scope of this meeting, is that we really need to think about better ways of developing functional pipelines, whether in the mouse, the fly, or the zebrafish, to complement our human studies. Then there's the issue of validation. Rick can talk about technical validation, and the other part of validation is replication. On technical validation: it would kill us if we had a huge sample set and tried to technically validate every variant we saw. We couldn't afford to do that; it would cost more than the parent study. But on the other hand, before drawing obvious conclusions we would need to replicate and technically validate, and I think we have the infrastructure to do both. Thank you.

It's still a little bit early days in the analysis of rare variants, but do you have any data on this? For GWAS analysis of common variants we have certain approaches for correcting for population substructure. Do we have any sense at this point whether that degree of stratification correction is sufficient, or whether we need super-fine population substructure analyses for these rare variants?

My opinion is that we're going to have to be very careful with substructure, because rare variants tend to be population-specific, or even, I'm not sure what the right word is, clade-specific. But I don't have good data on controlling for it. Lynn, do you?

Well, I'll talk a little bit about that this afternoon and show you at least some data, although I think we're really still in the early stages of assessing it. But my intuition, and I think that of a lot of others, is that because rare variants are more likely to be population-specific, we're likely to need more finely tuned population data for our background samples.

Just looking at the 1000 Genomes data, as you look at the population-private variation from group to group, it becomes such a high proportion of the variation. Some of the studies in Norway, where they've been looking at BRCA mutations and the like, are very interesting: they can map who's on the north side, the south side, and the west side of a relatively small country just from the additional rare variants along the haplotypes of some of the, quote-unquote, more commonly known BRCA1 mutations. So how we control for that comes back to this question: can we use the discovery set for action, or, as we get rarer and rarer, do we face the very difficult question of needing independence, of going somewhere else to interpret and confirm what we see, so that we can take population genetics history into account and use various ways of organizing information to confirm whether what we're looking at is really a driver as opposed to a passenger.
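To illustrate why population-private variation complicates stratification control, here is a toy sketch that counts, per population, how many variants are observed only in that population. The population labels, sample sizes, and allele frequencies are hypothetical.

```python
import numpy as np

def private_variant_counts(genotypes, pop_labels):
    """Count variants whose minor allele is observed in exactly one population.

    genotypes  : (n_individuals, n_sites) array of 0/1/2 allele counts
    pop_labels : (n_individuals,) population label per individual
    """
    g = np.asarray(genotypes)
    pops = np.unique(pop_labels)
    # For each site, which populations carry at least one copy of the allele?
    carriers = np.stack([(g[pop_labels == p] > 0).any(axis=0) for p in pops])
    private_to_one = carriers.sum(axis=0) == 1
    return {p: int((carriers[i] & private_to_one).sum())
            for i, p in enumerate(pops)}

rng = np.random.default_rng(4)
labels = np.repeat(["popA", "popB", "popC"], 300)   # hypothetical populations
geno = rng.binomial(2, 0.002, size=(900, 5000))
print(private_variant_counts(geno, labels))
```

At these frequencies, most observed variants end up private to a single group even with no true population differentiation in the simulation, which is why a rare-variant signal can track ancestry so easily.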
This is a question I meant to ask Peter last night, but Eric, you raised it again: the power question, which I can't answer because I'm not a statistician. When we were putting together the ESP project, and the other projects you listed, the diabetes project and so on, there were some very, very smart people who made power calculations to estimate the sample size needed to detect a meaningful effect from a low-frequency allele. As you pointed out, while the projects are still going, I don't think we're seeing the treasure trove of findings those calculations suggested would be coming out by now. We now have a relatively large number of whole exomes when you piece all of those projects together, plus the exome chip being genotyped in well over a hundred thousand individuals through several different ICs; within the NHLBI alone I could probably count a number of cohorts that together would add up to fifty to seventy thousand individuals. So in the next six months we will probably have data with which we could make power calculations based on actual data rather than on hypotheticals. I would say it may well be a good idea to think about how we could collaboratively, across the different institutes, gather those power estimates with real data, so that the true sample size for this cohort sequencing could actually be based on data.

I completely agree. I think we all know that we should be a bit skeptical of power calculations in grant applications. Power calculations require assumptions about how big the effects are, and, as you said, in the early days of sequencing we were a bit more optimistic about the sizes of effects there might be for low-frequency and rare variants. There's a lot of empirical evidence that we were overly optimistic and that there aren't effects of the size we thought. Take Eric's two examples with the quantitative traits, where you picked things that OMIM said were interesting and they were in the tails: if you tried to do that experiment the other way around, would you learn from this data that they were interesting? I'm pretty sure the answer is no with this kind of sample size. There's a general point here. One of the things that has come out of the sequencing studies is that there are lots of rare variants. Some people find that surprising and exciting; I think it's not so surprising. But what we don't know yet is how big a contribution they make, and how big their effects are, for common diseases, except that we have partial information from all these large studies that those effects aren't as large as we might have hoped when we did the power calculations for the studies we're currently doing. And I think the empirical data is now pretty strong that we need large sample sizes.

Eric, when you said the problem with power calculations for the various statistical methods is that when the assumptions a method is designed around hold, the power calculations look good, and in general that won't be the case, and then you said maybe we should be a bit skeptical about power calculations, so we say we need large sample sizes: is that what you're saying?

Sorry, it was more than just sample sizes. It was also the many, many other assumptions that go into those power calculations that we don't know are true: for example, that the direction of effect is always the same, when in the same gene you can have variants that send you in both directions. Things like that concerned me in addition to the sample size.

I completely agree with all those concerns, but the consequence of them is that we'll need larger studies than even those power calculations might suggest. Absolutely, yeah.

Let me just continue on that. Given the data we have today, say those 65,000 samples, we need those data not to make us prophets but to point us, when these power calculation questions come up, toward what sample sizes to recommend. My guess is that when such a study is going to be launched, we're going to need those numbers, and it would be much nicer to have numbers based on today's data than numbers based on what we did two years ago.

So: we need to know how big the effects of the rare variants contributing to the diseases we're interested in are. We can either hypothesize about that, which is what we've been doing, and I think we've got it wrong, or we can find rare variants that contribute and use real data. But we haven't found many rare variants that contribute yet, so I'm not quite sure what you mean beyond the hypothesis testing we've relied on.

Just take the point estimates in the samples we have today. Not only those that have a p-value beyond 10 to the minus pick-a-number, because then I think we get that same bias back into our calculations; rather, take the point estimates independent of p-values, and use those point estimates to generate the effects of particular variants, or sets of variants, within a gene.

What do you mean by point estimates of the burden of a set of variants within a gene? Because that's not well defined.

Well, you could do it several ways. The whole point is trying to discover what that space is going to look like. We know there is pleiotropy on the phenotype side as well, but the idea would be to have very large numbers, like the 65,000 or so, and ask questions about something just about everybody would have, like BMI or something related to it, where you would at least be able to look and see what you're able to discover, at what frequency and at what effect size, at least what shows up in the agnostic analysis. Now, whether there are a lot of false positives or not as you go further down, you're going to need additional data sets to really establish that, because again it's this question of discovery and validation; but at least we would know what we really would see with very, very large numbers.

The way I think about it, we can use current data to put bounds on what the effects could be, by arguing that if the effects were bigger than this, we should have seen them, and we haven't. The other good thing to do would be to find real things, so that we'd at least have some examples of the kinds of things we're looking for and what their effect sizes are; we haven't got very far with doing that. And one worry with point estimates: I completely agree with the earlier discussion that population structure is likely to be a much bigger issue with sequencing data than it has been with GWAS data, and that will confound point estimates. So if you do point estimates in the presence of confounding, you don't quite know whether you're actually estimating the disease effect or estimating something else. It's a bit scary.
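Since this exchange turns on how fragile power calculations are to their assumptions, here is a minimal simulation-based power sketch for a simple carrier-count burden test, with all parameters invented. The `frac_protective` knob illustrates the specific violation raised above: variants in the same gene acting in opposite directions, which erodes the power of a test that simply pools all carriers.

```python
import numpy as np
from scipy.stats import fisher_exact

def burden_power(n_per_arm, n_sites, maf, odds_ratio, frac_protective=0.0,
                 alpha=2.5e-6, n_sim=500, seed=5):
    """Estimate power of a carrier-count burden test by simulation.

    Risk sites scale the per-site carrier probability up by `odds_ratio` in
    cases (a rare-variant approximation); a fraction of sites can instead be
    protective, i.e., scaled down, pulling the pooled signal toward zero.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        direction = np.where(rng.random(n_sites) < frac_protective, -1.0, 1.0)
        p_ctrl = np.full(n_sites, 2 * maf)               # per-site carrier prob.
        p_case = np.clip(p_ctrl * odds_ratio ** direction, 0, 1)
        carrier_ctrl = (rng.random((n_per_arm, n_sites)) < p_ctrl).any(axis=1)
        carrier_case = (rng.random((n_per_arm, n_sites)) < p_case).any(axis=1)
        table = [[int(carrier_case.sum()), n_per_arm - int(carrier_case.sum())],
                 [int(carrier_ctrl.sum()), n_per_arm - int(carrier_ctrl.sum())]]
        if fisher_exact(table)[1] < alpha:
            hits += 1
    return hits / n_sim

# Same aggregate settings, but mixed effect directions cost real power:
print(burden_power(5000, 20, 0.001, 1.5))                       # all risk sites
print(burden_power(5000, 20, 0.001, 1.5, frac_protective=0.3))  # 30% protective
```

Under these toy settings the second call returns a small fraction of the first, which is the point: the published calculations assumed consistent directions of effect, and the data need not cooperate.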
A couple of comments from a totally non-statistical perspective. We have a ton of exome sequencing and a ton of exome chips, but do you have a good handle on the phenotypes? That's the other part of this. If you're going to do this million-person cohort, it seems to me there are two issues: one is the genotyping, the other is how good the phenotypes are, and I'll leave that open because I know we're going to talk about it this afternoon. The other question, or comment, for Eric: what is rare? Is rare a one-off, or is rare around 0.1%? Because with one-offs I think you're going to have real trouble figuring out what the phenotypes are unless they're really exotic and obvious, but if you start to look at rare meaning 1% or 0.1%, you might actually be able to get to a phenotype, if you have that phenotype annotated somehow. And the third comment is that this business of binning really irritates me, because we find rare variants in the same gene, some of which are hypomorphic and some of which are hypermorphic, and the idea of binning all that together because they're all variants is, at best, a first approximation. Just comments.

Well, I think that just reinforces the point: we're using counts today, and counts are good, but we need to use other information to improve how we do it going forward, and I think everybody would agree with that. To address the phenotype question: I only know the NHLBI space for the most part, but within that space we have a number of cohorts with pretty high-resolution quantitative trait phenotyping and some disease traits; myocardial infarction is one that's receiving attention. So I think it's well phenotyped, and probably represents the highest-resolution phenotyping that would be available. But if those patients have some exotic form of malignancy, you might or might not find it, if it isn't being measured. Right, correct.

So I guess there's a potential elephant in the room as we start talking about cohorts with very deep or broad phenotype data, and that's the multiple testing burden it imposes; a back-of-envelope sketch follows below. The power calculations that Peter presented, which are depressing in and of themselves, relate to what you see with a single discrete phenotype. If you're measuring a thousand different phenotypes across the cohort, those power calculations move from depressing to incredibly depressing, and with a million phenotypes it obviously gets much worse. So there is a danger that by measuring too many phenotypes within a cohort, in a gene-discovery context, you actually reduce your power to find association for any one of those phenotypes. I don't know that there's an easy solution, but I think it's something we need to consider in this context.

I wouldn't be quite so depressed, although it has always been argued that the number of genetic tests dwarfs the number of phenotypes, even if you have a thousand phenotypes. Because again, the question is what this first study is: it's discovery. It's not giving you the completely wrapped, finished picture that you're going to take to a patient the next day on the basis of that one sequencing study.
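To put rough numbers on the multiple-testing point above (purely illustrative): a Bonferroni-style correction divides the significance level by the number of tests, so each added phenotype multiplies the number of gene-level tests and pushes the threshold down.

```python
# Back-of-envelope Bonferroni thresholds; the gene and phenotype counts
# are illustrative round numbers, not figures from the discussion.
alpha, n_genes = 0.05, 20_000
for n_phenotypes in (1, 10, 1000):
    threshold = alpha / (n_genes * n_phenotypes)
    print(f"{n_phenotypes:>5} phenotypes -> significance at p < {threshold:.1e}")
```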
So false positives, they're our best friends, so to speak; we wish they were in the room. As we've gone through linkage, GWAS, and now sequencing, we always carry false positives, and it's the second and third analyses that winnow them out, where we say, hmm, that was a false positive, and these are the true variants that are really worth pursuing or putting into the next paradigm. So again, this is where numbers and thresholds become very important in this rare-variant sequencing world: being able to go look elsewhere and try to establish those things over a series of studies, because no one sequencing study establishes them by itself. Unless we sequenced all 300 million Americans and had perfect phenotypes for them, and then we could just start to play in that field, but that's not going to happen.

But given the topic of Eric's talk, which was finding rare variants: if it's just a one-off, how do you know? The way we've tried to deal with that in the past is to look at all the existing databases and say, well, if it's been seen before, it probably isn't related to the disease, and that's probably not the best approach, particularly for more subtle things or formes frustes. It seems as though the 65,000 people we've been batting around here would be a fabulous reference population, if only we could put them all together, with all the recommendations we have for doing that, to be able to go in and query and say, okay, do these people with this PCSK9 variant, for instance, all have low-ish LDL cholesterol, that sort of thing. That would inform some of this. Then, similarly, we'd want to go back and re-phenotype those people, which involves a whole host of complexities, but if you had a group that was actually large enough and had agreed to be re-phenotyped, like UK10K or other such groups, you could indeed go back, find those people based on their genotype, and then phenotype them. That might solve this problem, wouldn't it, or at least help address it?

Yeah. So on that hopeful note... okay, one comment. Oh yeah, no, I completely agree with Teri. I think it means there's tremendous power in these cohorts from a validation perspective; it just means we need to be careful about how we think about them from a discovery perspective.

So, three hopeful notes. Cashell Jaquish, NHLBI. Two issues have been brought up just now: one is huge sample sizes and power, and the other is what if it's a one-off. We've also talked about family studies, but nobody's integrating this, and I'm wondering: do we need these huge sample sizes if we're looking at families for rare variants? Because rare variants cluster in families, we're looking at transmission rather than association, and families are a good way to show that it's not just a de novo mutation. People keep bringing up family studies, but they're not really talking about how they would be brought into such a study design.

Hold that thought for the general discussion; we have to move on. Magnus will now talk to us about genetic modifiers in complex disease.