Okay. Thanks, Daniel and Terry. Daniel, I was expecting you to talk for longer so I could finish arguing with my colleagues. I'm presenting for the study design working group about key points. So I'm going to show a combination of some things which hopefully are not too controversial, which we thought should go into the manuscript that comes out of this meeting, and also some things that we thought were more controversial discussion points. And I've tried to remember to highlight all of the things which might be controversial by putting the title in red, mainly just to make the point that I don't necessarily believe everything that is put on those slides. It's phrased in a way to try to create discussion. I should also point out that, other than the argument I just had with Nancy and Ben, the rest of the study design working group hasn't actually seen these slides, so any errors are my own. I think I've managed to pull together all of the ideas that the four of us discussed. Nancy, Ben and I were talking yesterday after the last evening session about how the topics that I'm going to present, and that we spent time discussing, are maybe slightly different from some of the things from last night. And I think there are multiple purposes to this meeting, and people come from quite a wide variety of viewpoints. I think that's a good thing. I was talking to Daniel about this meeting and he mentioned that there was an original plan to focus exclusively on assigning causality in Mendelian or more clinical genetics cases, but that the organizing group wanted to expand it to include complex disease, too. And it became apparent that there's quite a different set of questions in thinking about a research setting versus a clinical setting. And most of what I'm going to describe is probably more relevant to research than clinical settings.
That's partially because I think study design as a topic is more natural to think about in a research study than in running a clinical lab. It's also probably reflective of the backgrounds of the people in this group. That's not to say that the clinical issues aren't important; in fact they may be more important, so hopefully if there are things that we missed, people will bring them up. The other caveat is that there are going to be discussions all through the day about combining different types of data to try to assign causality. You might imagine designing a study which includes functional genomics, and obviously there will be important aspects to making that study well powered to help implicate causality. But for the purpose of this discussion we focused mainly on the genetic aspects of the study: how do you design the study to collect the best genetic data to implicate causality, which might then be assisted by other kinds of data. And if anyone wants to interrupt me as I go along, feel free. So we broke down issues in study design into three broad topics: the analytical questions of how you are going to design and carry out your study, the appropriate number and type of samples you might want to include, and the different technologies that are available for doing these kinds of studies. I'll take each of those in turn. The first point which we thought was important to make is that when designing a study, with or without the express goal of implicating causality genetically, it's important to use the existing body of knowledge, both for your phenotype and more generally. So this is a cumulative fraction of the hits in the NHGRI GWAS catalog by odds ratio: basically, from left to right, what fraction of hits have an odds ratio of just over one, two, three, four, et cetera.
And what you can see, which is no surprise, is that the vast majority of all of the existing catalogued, typically common-allele associations have odds ratios less than, say, one and a half. I've filtered this a bit, removing things like the HLA and age-related macular degeneration, which have really huge effects. And in fact, if you continue to look at, say, the 10% that have an odds ratio above 1.6 or 1.7, those are often studies that are a little bit atypical. So there are traits where there's perhaps less selection, like the biggest male pattern baldness locus, which has a really big effect size. There are some diseases which I have never heard of, where there's a study in 74 cases and 80 controls and they find a genome-wide significant hit with an odds ratio of like six. And that may well be right; the sample size makes one a little bit suspicious, but that disease is not really likely to be the same kind of complex genetic disease that most of these things come from. And so my main point is that in moving from, say, GWAS to sequencing, obviously the sample sizes are much smaller initially for reasons of cost, but it's probably not a good thing if you start out your study design by saying, I'm well powered to capture an allele of 1% frequency with an odds ratio of 2.5. Because if you start by postulating an effect which almost certainly doesn't exist, in most common complex diseases at least, then you're unlikely to be successful. That's a general point. If you think about specific cases, these are four papers from this year, all looking essentially at de novo variants in trios with an autism-affected proband: three in Nature, which Mark mentioned, and one in Neuron. And the main point I wanted to make here is that if you look at the titles, there are no genes in them. None of these say gene XYZ is implicated via de novo mutations in autism.
If you read the abstracts, there are a few genes mentioned, but none of them say confidently that this is an autism gene. They say we observe so many mutations in this gene, and that's unlikely to happen by chance, and similar kinds of weaselly words. And that's not a criticism: all of these papers take the absolutely hard line that they haven't yet implicated any gene beyond a shadow of a doubt via this route. And there's something like a total of 1,000 families in these studies. I'm absolutely certain that as these studies get combined, and as the samples grow (we're doing some similar work in the UK), they will implicate specific genes. But if you decide to start your study by saying, I'm going to sequence 100 trios and I will definitely find the causal de novo variant in some of them, then you're probably deluding yourself, because there's no good reason to expect that these four groups which have gone before somehow magically failed to find, in nearly a thousand samples, the statistically convincing evidence that you'll find in 100. Another thing we thought about in terms of analytical questions is that it's worth questioning the assumptions in your design. One thing that people have started trying to do, again because sequencing is still relatively expensive, is to go back, for instance, to large complex disease pedigrees which didn't show any easy-to-interpret linkage in and of themselves; with sequencing, you can try to follow these up. This is an IBD pedigree that I'm working on along with Adam Levine and Tony Siegel at UCL. And this isn't different families: this is one giant family, where these are sibs, including a pair of identical twins, and these are the different branches of that family. In total it's something like 800 individuals, 40 of whom have IBD. If any sort of Mendelian version of a complex disease might work, we would have thought it was this.
And we've thrown the kitchen sink at this. We've looked at the within-family linkage peaks, we've sequenced exomes and then whole genomes, and we're in the process of trying to follow up a variety of candidates. But as I think David and others mentioned yesterday, it seems like there must be a big Mendelian-type gene in here, but it's not necessarily going to be straightforward. And it's important to really think about: can the variant you're looking for exist? One example: think about, say, a 10% penetrance variant, which in a big family like this you might say isn't really that much. And indeed, a single 10% penetrance dominant variant couldn't explain this family by itself. But if that variant had even, say, a 1% frequency in the general population, it would only need an r-squared of 0.01 to some SNP in our latest IBD meta-analysis to have definitely been found at genome-wide significance. So there's a relatively narrow set of possible variants that could be hiding in this family; I don't think it's an empty set, but it's relatively narrow. And before undertaking a study like this, it's worth actually writing down what that set is, instead of just saying, well, maybe we'll find something. The first controversial point in terms of analyses is that one might ask, do we actually care which variant is causal? This is perhaps more relevant in complex than in Mendelian disease. If we know the gene through some means, and we know the mode of biological action, by which I mean that reducing the function of the gene increases the risk of disease, do we actually care what variant is responsible?
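The "could this variant exist?" arithmetic can be sketched numerically. The 1% frequency, 10% penetrance, and r-squared of 0.01 come from the talk; the IBD prevalence, the dominant model, and the meta-analysis sample sizes below are illustrative assumptions, not the study's actual figures.

```python
import math

# Illustrative assumptions (only p, penetrance, and r^2 come from the talk).
p = 0.01            # population allele frequency of the hypothetical variant
penetrance = 0.10   # risk of disease in carriers, assuming a dominant model
prevalence = 0.003  # assumed population prevalence of IBD

carrier_freq = 1 - (1 - p) ** 2  # roughly 2% of people carry at least one copy
risk_noncarrier = (prevalence - carrier_freq * penetrance) / (1 - carrier_freq)
relative_risk = penetrance / risk_noncarrier  # ~100x: 10% penetrance is a huge effect

# Non-centrality of an allelic association test at a tag SNP with r^2 = 0.01
# to the causal variant; n_cases/n_controls are assumed meta-analysis sizes.
n_cases, n_controls = 25_000, 35_000
beta = math.log(relative_risk)  # log effect size (OR ~ RR for a rare disease)
n_harm = n_cases * n_controls / (n_cases + n_controls)
ncp_tag = 2 * p * (1 - p) * beta ** 2 * n_harm * 0.01  # r^2 = 0.01
# ncp_tag comes out around 60, comfortably above the ~40 or so needed for
# high power at genome-wide significance, so such a variant could not hide.
```

Under these assumptions the variant's tag signal alone would have surfaced in the meta-analysis, which is the talk's point: writing the numbers down shrinks the set of variants that could plausibly still be hiding.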
By way of example, HNF1-alpha is one of the most common MODY genes, that is, Mendelian diabetes genes, and it was found, and here I'll quote Ben back to himself, in 2010 to have common alleles which provide susceptibility to the more complex type 2 diabetes. They state: at some loci, such as HNF1-alpha, HMGA2 and KLF14, existing biology coupled with phenotypic and expression data highlighted the named genes as prime candidates for mediating the susceptibility effect. And I think that's a relatively cautious version of the statement that, as Ben said yesterday, everyone would be surprised if it turned out that OASL is the causal gene in this region. And in fact, if you know HNF1-alpha is a Mendelian MODY gene, and you have some understanding that the Mendelian variants reduce the function of the gene to cause disease, then perhaps in this case we know everything we need to know about this region: that there are also common alleles which affect the same gene. Turning from analyses to samples: I've mentioned this already and I'll come back to it again; it's a bit of a hobby horse of mine that sample size in sequencing is still king. It's, if anything, more important than it was in GWAS, because the types of variants that we're pursuing require even larger samples. This is a figure adapted from a paper of Gonzalo's from last year. Essentially it says: imagine you have a fixed total amount of sequencing, and you want to do a study aimed at finding a 1% allele with an odds ratio of, say, between 1.5 and 2. So something that might have been missed by GWAS and might well exist; there are a few examples in the literature. And you could spend that fixed sequencing capacity either on really deep genomes in, say, 400 individuals, or on really, really sparse genomes in 3,000 individuals.
So if you look at this, each line corresponds to a fixed sample size while adding more and more sequence, and the lines do increase: if you add more sequence, your power does go up. But it's nothing compared to the difference between focusing your sequencing depth on a relatively small number of samples versus doing relatively less deep sequencing on a larger number of samples. Now, of course, the caveat is that we're talking about a very specific class of variant: shared variation that's common enough that you will see it, and, using this sort of low coverage genotyping plus imputation, you get relatively good genotypes. If you want to find very rare private things, this design will, of course, not work at all. But the main point is that if you decide, well, I'm going to pick my absolute best 400 cases and sequence the hell out of them, then for these kinds of variants you're still going to have no power. I'm quoting everybody else in the room's papers today. We think that larger sample sizes in Mendelian sequencing are also useful, and that's, I guess, kind of an obvious statement, but there are a couple of reasons, which I'll come to, why that might be. This is one of the first versions of using exomes to do Mendelian gene mapping, from Jay's lab, and I think, by and large, the approach is still quite similar. You essentially take all of the types of variants, say non-synonymous SNPs, that might explain your phenotype, you pass them through a bunch of filters related to their frequency in general populations, their predicted function, and so forth. And if you're lucky enough to have several different families, you can narrow in on the relevant gene.
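That fixed-budget trade-off can be illustrated with a toy calculation. The saturating imputation-accuracy curve below is a made-up stand-in (real curves come from imputation benchmarks, not this formula), but it captures the qualitative point that spreading a fixed amount of sequence over more samples wins for common-enough shared variation.

```python
import math

def imputation_r2(depth_per_sample):
    """Toy saturating accuracy curve: r^2 rises quickly with coverage and
    then plateaus. The functional form is an assumption for illustration,
    not taken from any real imputation experiment."""
    return depth_per_sample / (depth_per_sample + 2.0)

def ncp_at_fixed_budget(n_samples, total_genome_depth, p=0.01, odds_ratio=1.8):
    """Non-centrality of an allelic test when a fixed sequencing budget
    (in genome-depth units) is split evenly across n_samples, with equal
    numbers of cases and controls."""
    beta = math.log(odds_ratio)
    r2 = imputation_r2(total_genome_depth / n_samples)
    n_harm = n_samples / 4.0  # n1*n0/(n1+n0) with n1 = n0 = n_samples/2
    return 2.0 * p * (1.0 - p) * beta * beta * n_harm * r2

budget = 12_000  # e.g. 400 samples at 30x, or 3,000 samples at 4x
deep_few = ncp_at_fixed_budget(400, budget)       # few deep genomes
shallow_many = ncp_at_fixed_budget(3_000, budget)  # many sparse genomes
# shallow_many comes out several-fold larger than deep_few, even though
# both are far below what genome-wide significance would require here,
# echoing the "400 best cases still have no power" point.
```

The per-sample genotypes get worse as coverage drops, but nowhere near fast enough to offset the gain in sample size under this model.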
There are, I think, a couple of points here. Obviously, if you can increase the number of families that you have for any particular disease, that's going to provide extremely useful genetic power, in addition to, say, any information from the functional side. And I'll come back to that point in a minute. I've mentioned power several times; as I said, it's something that I'm slightly obsessed with. This is a sort of approximation for the expectation of a typical single marker association test, and your power is proportional to only a few things; it's worth thinking about each one of those things every time you design a study. One is N, the total sample size you have, and that's, in some cases, an easy thing to increase; in other cases, if the disease is very rare, it's a very difficult thing to increase. This gamma is your relative risk, the size of the effect that you expect to find. And that goes back to my earlier point about using the existing body of knowledge to ask yourself a realistic and honest question about how big you think the effect might be. P is the frequency of the allele. And again, I mentioned it before, but what is the kind of allele you're looking for? You might say, well, I'd be well powered to find a 1% allele of this effect, and we think of 1% as relatively low frequency, but in a huge study, say Joel's latest height GWAS with 200,000 people, I think even things at 1% that are really badly tagged will generally still be seen if they have a reasonably moderate effect. And r-squared is: whatever technology you're using, how well does it capture the variant that's causal? And all of these things are still true.
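The four factors listed here (N, gamma, p, r-squared) can be made explicit in a small function. This is a textbook-style normal approximation to a 1-df allelic chi-square test, a sketch rather than the exact expression from the slide:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_quantile(u):
    """Invert the normal CDF by bisection (plenty accurate for a sketch)."""
    lo, hi = -40.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if normal_cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def single_marker_power(n_cases, n_controls, p, odds_ratio, r2, alpha=5e-8):
    """Approximate power of a 1-df allelic association test at a marker
    with correlation r2 to the causal allele. The non-centrality term
    shows the talk's four factors: sample size N, effect size gamma
    (here an odds ratio), allele frequency p, and tagging r^2."""
    beta = math.log(odds_ratio)
    n_harm = n_cases * n_controls / (n_cases + n_controls)
    ncp = 2.0 * p * (1.0 - p) * beta * beta * n_harm * r2
    z = normal_quantile(1.0 - alpha / 2.0)  # two-sided significance threshold
    s = math.sqrt(ncp)
    return (1.0 - normal_cdf(z - s)) + normal_cdf(-z - s)
```

With toy numbers, 20,000 cases and 20,000 controls are essentially guaranteed to detect a 30% allele with odds ratio 1.2 at genome-wide significance, while 500 of each almost never will; halving r-squared cuts power just as surely as halving the sample size, which is the point of writing the factors down together.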
All of these factors are still, of course, critically important in sequence-based designs, but there are other factors too: if you look in the literature, there are literally now hundreds of papers on statistical tests for rare variant association, and I think we'll probably hear more about those shortly. This is, I think, a kind of cool figure from PLOS Genetics, either this year or last year. They basically made a variety of different assumptions about underlying models, expressed in these different boxes, and they tested several of the more commonly published statistical association methods. The points I'd like you to take home are that if you look in any of these bins, which correspond to different potential models of underlying risk, the power goes from zero to essentially 100%, depending on the assumptions. And the methods are basically not in the same order under the different assumptions: one test is better than another depending on the exact underlying truth. Which is to say that even if you go through the points I just mentioned, thinking about the plausible set of effect sizes, the plausible set of frequencies, the plausible ratio of true causal alleles to hitch-hiking rare neutral alleles, it's almost impossible right now to say which is the best-powered test to use. And so it really leaves us with something of a conundrum: how can we design our study to have the best chance of success? And the point I was making about Mendelian sequencing studies is that if you consider a single family, I mentioned that it might not be well powered to actually detect the underlying causal allele, but it also has the problem that it's very difficult to make a meaningful negative statement. And again, this is focusing more on a research context.
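As a concrete example of the simplest family of rare-variant tests, here is a minimal collapsing (burden) test: rare variants in a gene are collapsed to a per-person carrier status and the resulting 2x2 table gets a 1-df chi-square. It's a sketch of the general idea, not any specific published method, and all the counts below are invented.

```python
import math

def burden_test(case_carriers, n_cases, control_carriers, n_controls):
    """Collapse rare variants in a gene to a single carrier yes/no status
    per person, then run a 1-df chi-square on the 2x2 carrier table."""
    a, b = case_carriers, n_cases - case_carriers
    c, d = control_carriers, n_controls - control_carriers
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square(1 df) tail probability
    return chi2, p_value

# Invented counts: 30 of 1,000 cases vs 10 of 1,000 controls carry a rare
# disruptive allele in the gene.
chi2_clean, p_clean = burden_test(30, 1000, 10, 1000)

# Dilute the same signal with 50 neutral rare carriers in each group (the
# "hitch-hiking rare neutral" alleles): the statistic shrinks, which is why
# the ratio of causal to neutral alleles drives the power of every such test.
chi2_diluted, p_diluted = burden_test(80, 1000, 60, 1000)
```

The same excess of carriers in cases goes from clearly significant to unremarkable once neutral carriers are mixed in, which is one way to see why no single collapsing scheme is best under every underlying model.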
Obviously, if you're presented with one family in a clinical setting, you have to sequence that family and go through the extremely laborious process that Heidi described to try to understand something for that family. But if you can zoom out and look at, say, 20 families with the same disease together, then if you don't find anything, you can at least make some statements about the underlying architecture of that disease: namely that it's perhaps not coding variation, if you've been sequencing exomes, or that there are many different genes, such that only 20 families aren't enough. And when you zoom out, of course, what looks like a mess at first starts to make more sense as you put more of the pieces of the picture together. So finally, technologies. We thought of a couple of what we think of as technological holes, that is, studies that we'd really like to do but that aren't really feasible given the technologies available today. For instance, it's still pretty difficult to use next-gen technology to sequence one gene, or just a couple, in thousands of individuals. And things are changing, in the sense that if you look at targeted sequencing, the actual cost of running the sequencing machines is like a rounding error in the total cost of the experiment. And for a while the cost was in the capture array, targeting just a section of the genome, whether the exome or some candidate genes; that step is becoming cheaper by way of multiplexing. And now we're getting to the point where, if you want to look at this in thousands or even tens of thousands of individuals, just making the DNA libraries is essentially the entire cost of the experiment, and the capture and the sequencing are more or less free.
And that's a real problem, because for many of these things, to get to the statistical level that, as Mark said yesterday, is absolutely essential, the sequencing is impossible: you know the gene you want to look at, but you can't do it at the sample size that you need. Another hole, and I'll mention an example of this in a moment, is that genotyping still has a variety of advantages over sequencing: it's still faster, and the analysis requires less computational time. Sequencing often prioritizes large numbers of variants that you want to follow up in a huge cohort, and ideally we'd like to be able to genotype, say, between 10,000 and a million variants really, really cheaply, so that it can be done in a million samples. And again, this betrays my bias towards complex rather than Mendelian disease, but there are plenty of situations where, to take a set of variants that could possibly be causal from a sequencing experiment and really dissect them to implicate specific variation, you need to move up to huge sample sizes that can only be reached with a very cost-effective design. Just by way of one example, Mark mentioned our IBD Immunochip experiment, which is a sort of version of that. We had this Immunochip with 200,000 SNPs, and we ran it on something in the order of 70,000 samples in IBD. In terms of being able to implicate causality, it greatly lengthened the list of IBD loci. You might say, well, okay, but that's not necessarily telling you which genes are involved; and you might also say, well, most of those loci admittedly have an odds ratio of like 1.03 or something, so they don't tell you much about prediction of disease. But one thing that I think is cool is that we can do some network analyses; this one happens to be with GRAIL, a text mining tool from Mark's group.
So our previous network had these light blue circles, and there was quite a bit of connection among the previously identified loci. And the gold circles are new genes that add to the network. That's, again, maybe not that surprising: we find many new loci, and we can use the network to suck in genes from those new loci. So that's reassuring. But the kind of cool thing is these dark blue circles, which are genes in loci that we already knew were associated to IBD, but which didn't previously connect to the network. So we had loci where we had no idea what gene might be involved, and it's only by adding the new stuff that we expand the network and start adding these other things. That instantly prioritizes those genes as being at least candidates for the causal functional unit in that locus. Some of these are certainly not right, but they're prioritized as candidates, and we couldn't have done that without having that long list of tiny effects. It's the sum of the knowledge that allows us to better interpret the underlying functional units in the individual pieces. And my last controversial point is that perhaps exomes are already obsolete, in that I just talked about the ratio of costs changing relatively quickly, and obviously the publication of ENCODE just last week has underlined a point that I think everyone already knew, which is that coding variation is only one part of the wider story. This is especially true in complex disease, but I would argue it's likely true in many Mendelian diseases as well. It's just that exomes are a place we can look where there is a process that is already defined for trying to interpret the data. And that's fine, but perhaps it's time to move on to looking at the entire genome. So I will finish there, and again apologize for any inaccuracies, which are mine alone and not the working group's. Thanks. Great. Thank you, Jeff.
So, other comments from the working group on all those inaccuracies? Good. All right. So we have a little over half an hour for discussion. I know it's early in the morning, but I think we do want to discuss study design, it obviously being a critical issue. So, comments? David? I'll just make one. I noticed that you raised the question: does it really matter if we know the causative variant, if we are confident of the gene and we've tagged the gene somehow? But you followed that with: plus, if we already know what the mechanism, the path of physiologic perturbation, is. For many of these genes we don't, until we find those causative variants. So it would be nice if we were in that circumstance, but more often than not, I would guess, we aren't. Yeah, I mean, I think that's certainly true. And I guess that was the discussion we had: you might imagine that there's causality at the level of the functional unit, whether it's a gene or something else, then at the biological mechanism, which, as you say, is frequently not clear, and then at the actual variant level, that this specific variant does this, which then causes the downstream chain. And I'm not even sure I agree with the statement I made, but certainly I would say that if you don't have a sense of the functional perturbation and its connection to disease risk, then you're probably not there yet. Well, also, I mean, it's not just the risk, it's: what are we going to do about it? That's why we need that. Yes. And of course, for example, for evolutionary studies, it can be very important to know the causal variant.
Yes, the context was more the situation where you've got three or four polymorphisms physically close together in near-perfect linkage disequilibrium, all of which are very strong cis-eQTLs for the nearby gene, which is a candidate from Mendelian subtypes, and you know the directionality. How much more are you going to get by knowing which of those three in near-perfect LD is actually causal? And perhaps all of them are. So given how much discovery work has to be done to get to, for example, the kind of network Jeff showed for IBD, you have to weigh the cost of going after absolute causality against the investment in getting more genes into those networks, so that we actually understand and illuminate more of the contributing biology, when you know there are going to be many, many contributing factors. And so there is definitely a difference between the more Mendelian and the complex diseases, where you have so many genes contributing to those networks. I think Jeff did a great job of covering study design particularly in the complex disease space, but I was wondering, it would be good, I think, to get some comment on the Mendelian study design aspects. So I'm wondering if maybe David Goldstein or someone else here could talk through some of the key issues that we really need to consider that are specific to the Mendelian space. Sorry to put you on the spot, David. If anyone else wants to step in at that stage, that's absolutely fine. Well, I guess I sort of agree with Jeff that a lot of the same considerations apply. I mean, I see it kind of on a spectrum. If the locus heterogeneity is low, then looking at a relatively small number of cases means that you can see something going on in the same gene across those cases, even if the number of cases is low.
And we need to think about exactly how we filter the variants, but actually it doesn't look like it's all that sensitive to how you do the filtering. And I guess I would say that whether in the Mendelian or the complex world, it's the same consideration: what do you think the variants are that might be contributing, and how do you therefore filter? And what is the locus heterogeneity? That determines the necessary sample size. So actually, I don't know that I view it as all that different. One place where I do think it is different is not when you're looking at multiple individuals that you have decided have the same condition, but when you're looking at a single individual and you don't know what's happening. There, I would say that by and large, what you have to do is use existing knowledge about genes that influence that kind of condition. And that's a little bit of a separate effort, because there, what you have to ask is: look at all the mutations in that individual that look interesting, for example de novo mutations, and ask, are they in genes that are already fairly securely connected to the phenotypes that that individual presents with? And that can be a difficult clinical judgment, especially for a research geneticist. Daniel, can I? There was one point we discussed that I didn't mention, which I think could benefit from some discussion, and it has been raised before in meetings: what is the actual success rate? Given a suspected Mendelian disease individual, or a handful of them, what is the success rate that, if you exome sequence them, you then confidently find the mutation? And I think it's just a classic issue of publication bias that obviously the failures don't get published, or if they do, it's without as much fanfare as the successes.
So I was wondering if people could comment on that. So, you see numbers like 30% and 50% bandied around, and I think those are probably not too far off the mark. But I think the problem there is that a lot of those failures can be chalked up to, I mean, kind of poor study design, right? Inadequate power, assumptions about heterogeneity being wrong, and things like that. So it's really hard to say, right? But that kind of gets to a broader point, which I think is what you're saying. So, thinking about our somewhat negative discussion yesterday about the history of genetics and all the false positives: in Mendelian disease specifically, most of the quote-unquote false positives tend to be the misassignment of variants as causal, right? Not necessarily the misassignment of genes. By and large, most of these things from 30 years back, where they found something and weren't thinking about the stats or whatever, those findings have nonetheless held up for the most part, just because the genetics are simpler. But there's a tendency now, as we're moving to exome and genome, for people to still not be very careful: people don't do power calculations or anything like that, and they just sequence what they have, because that's what they have, right? So I do think it would be reasonable to propose that there be more thought about study design, including what you were saying: if you have the ability to do so, because there are enough samples, and there won't always be, design things in such a way that you can make a negative statement about your assumptions. State your assumptions before your sequencing, and then state your positive or negative result. It would be nice to have more studies like that. David? I was just going to say, I mean, a lot of the old things hold up because, I actually would say, we knew how to do the statistics back then and we don't know how to do them now.
Because, in reality, it used to be the case that the way it worked, by and large, is that you have a family and you have significant genome-wide linkage. And so you say, look, it's definitive: there is something happening in a genomic region. And then you look for a mutation that makes sense. On that part of it, you're absolutely right: sometimes you look for a mutation that makes sense, and it makes sense to you, but you got it wrong. But at least you're grounded by the fact that there's linkage there. Right now in the literature, the way it's happened is that we have thrown that step away. So now you don't need to have genome-wide linkage in a family; all you need to have is a stop mutation. I mean, the one thing that I find really offensive: you look at a family, there are a few affecteds, and, wow, there's a stop mutation. That's what we're doing right now. So I would actually say the problem that we are facing is that we've thrown away the grounding that we used to have, which gave us all of these secure genes. We need to establish a new grounding. I think also, it's hard to say what the success rate is, because you have to think of the initial study design. There's probably a much different success rate when you have large families than when you're just grabbing a single individual from a family and you have absolutely no information. With the families, you probably have a much higher success rate. And I think maybe we're also putting a little bit too much credence in the idea that most of the variants are going to fall in the exonic regions, because the earlier studies were looking at exonic regions; we also had that bias before, when we were using the linkage approach and then looking at candidate genes within a region. So you have Mark and then Mark. You can turn off your microphones. You want me to talk? Maybe that's a message.
So Mark and then David. So David makes a really excellent point. I agree totally that part of one of our challenges is that there has been backsliding from what were traditionally very strict standards. And you can even retrospectively apply the way we think about the problem today, in the presence of whole genome sequence and reference databases and so forth. And you go back and you realize that by requiring initially convincing evidence, a LOD score of three, we get down to focusing on about 1% of the genome. Then we always traditionally required for publication the identification of mutations in independent families. And so if you confine yourself to less than 1% of the genome, search for rare variants, but require that two of two, or two of a very small number of families, have a mutation, you now are really in statistically sound territory, even the way we can approach the data today on the whole genome level. So I think we should think about that, just as David suggests. The other point I want to make is partly similar to Suzanne's point. I think the idea that there's a success rate in this activity is possibly one of those less than helpful constructs. Because right now we, as many of you guys do, sit down with a large number of clinical colleagues on a routine basis and discuss instances in which these technologies might be applied. I think there would be little disagreement that when faced with a severe pediatric phenotype in multiple offspring of a mating, particularly one from a more isolated or consanguineous background, we think we have a very, very high chance of success. Well in excess of 50%; I'd peg it more at 80% to 90% ultimately if we apply all the genomic tools. At the same time, there are a huge number of extremely severe presentations in an individual case.
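As a back-of-the-envelope sketch of why that traditional two-step design is statistically sound: apart from the LOD-3 region size mentioned above (about 1% of the genome), the numbers below are illustrative assumptions, not figures from the discussion.

```python
# Illustrative sketch: why "linkage first, then an independent second family"
# keeps false positives under control. All numbers except the 1% linkage
# fraction are assumptions for the purpose of the example.

n_genes = 20_000          # approximate count of protein-coding genes
linkage_fraction = 0.01   # fraction of genome retained after LOD >= 3 linkage
p_chance_hit = 1e-3       # assumed chance a random gene carries a "plausible"
                          # rare variant in one family, just by coincidence

# Step 1: linkage confines the search space.
genes_in_region = n_genes * linkage_fraction   # ~200 genes left to inspect

# Expected number of genes in that region that look "plausible" by chance:
expected_false_in_region = genes_in_region * p_chance_hit   # ~0.2

# Step 2: requiring an independent second family to hit the SAME candidate
# gene multiplies in another factor of p_chance_hit, shrinking the residual
# false-positive expectation by roughly a thousandfold.
expected_false_after_replication = expected_false_in_region * p_chance_hit

print(genes_in_region, expected_false_in_region, expected_false_after_replication)
```

The point of the sketch is only the multiplication: each independent requirement (linkage region, replication family) multiplies down the chance that a coincidental variant survives, which is the grounding David and Mark describe.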
We may decide that, even knowing we have a very low chance of success, that there might be environmental causes, infectious causes, cardiovascular events that might have led to the phenotype, it's still, because the investment is fairly low right now, worth exploring exome sequencing in that family, even though we don't expect to have success, quote unquote. And I think that, you know, none of that is necessarily bad study design. It's applying the tools and evaluating in what circumstances it's worth trying to apply the tools, but being realistic about the chance of success. So actually we have David Dimmick, Gonzalo, and Jeff Barrett, and Joel. So I'm going to offer a contrary point of view to increasing the sample size, which is to increase the power. I think one of the mistakes we've made over the last 10 years is we've increased sample size at the cost of not adequately phenotyping the patients, so we end up having more heterogeneous groups of patients. And I would argue that we perhaps should focus more effort on improving the power, the chances of actually finding something, by increasing the homogeneity of the groups, so that we actually have a bigger effect size for variants, and so reducing the problems with multiple genes being involved in a less well-defined phenotype. And coming as a rare disease researcher, I kind of sometimes have the opinion that most common diseases are actually a collection of rare diseases, so I'll put that bias out on the table. But I think one of the ways we can improve study design is actually by improving the power, by being more accurate with the phenotyping and the collection of patients in the first place. And it's kind of a converse argument to the increasing the numbers argument. Gonzalo? Yeah, so even though I'm a sample size guy, I always argue for larger sample sizes.
I do agree that there are many examples where picking the right sub-phenotype, even for a trait that appears very complex, say like macular degeneration, can lead you to a finding. No, I think that's all fair, as long as you use the same criteria for evidence. Because I picked a very interesting subset of patients, and I can use a lower threshold. But I think you basically need to try a variety of things. Some of these traits are extremely hard to sub-classify. If you think about type two diabetes or something like that, it's extremely hard to figure out how to subdivide. I think the other thing is, the history of the field is that we do know what the really high quality Mendelian disease study is. So think about, say, what Ed Stone and Val Sheffield used to do. They used to clone a few Mendelian eye disorders every year, and they basically would start with a family. They'd find a variant that probably implicates this gene, and then they'd have a panel of rare eye disease cases that Ed Stone collected over many years, and they'd say, oh, let's screen that gene in our panel. And typically you'd end up saying, well, we actually found examples of additional mutations and didn't find any in controls, and you'd say, ah, that's pretty convincing. Or you would say, ah, actually we found there's a bunch of random things in the panel and also in controls, and you'd give up. But I think that, you know, as we say, oh, you know, let's go just with a finding in a single family, let's skip the step of looking at additional cases or families before we say it's final and definitive, we lower the bar and we let a bunch of random things slip by. And I guess the nature of the thing is it's much easier to make a mistake than to get the right thing. You have 19,999 chances versus one, or something like that. Although I think it's true that the success rate increases if you have a better pedigree or you have evidence for inbreeding. Jeff showed us a great example for IBD.
Most of us, probably, looking at that pedigree would say that should be mappable. This is hundreds of times higher than the population incidence of IBD, and so far it's been pretty hard. I think there's lots of examples like that out there. So I think the 30 to 50% is probably actually the right ballpark, but I'm not an expert in that field. Jeff? So I wanted to comment on a couple of things that were mentioned in the last few minutes. So one was that I think Mark's point was a very good one, that asking, oh, what's the rate of success, lumping all of exome Mendelian sequencing into one thing, is obviously probably not that helpful. And in fact, one thing that might be useful to come out of this would be to propose a set of different kinds of study designs that actually are, generally speaking, better powered, and ways of evaluating whether the particular number and type of families you have is a sensible thing to do. Because in some sense, if we could encourage people to sort of critically think about picking the winners in advance, it probably would improve that success rate, and also it would be maybe useful to have a sort of assembled catalog of the different types of designs and whether they tend to actually succeed in identifying a mutation or not. And as Mark said, it doesn't necessarily mean you shouldn't do the things that are a bit less likely to work, but just that at least you have a sense going into it of where your proportion of success might lie. The other thing, I mentioned this idea of questioning the design assumptions with that IBD pedigree, of whether it is really a pseudo-Mendelian version of IBD. And another one that I think is potentially interesting is in designs such as looking at population isolates or likely consanguineous marriages, or things where you have a very clear a priori expectation of what the genetic model is: how often are you actually right?
So you might, Mark, you said you can often find the causal mutation if you go in with, say, a family that presents as extremely likely to be a sort of recessive model, but presumably some fraction of the time that's just coincidence, and in fact there's a different model at work. And I think it's useful to try to have an idea of, given those different assumptions, those different designs, how often does it actually turn out to be what you expected it to be versus something else? So then we have Joel next. Yeah, so I think in terms of judging from what's happened and what's been successful, there's some good lessons to be drawn and then some maybe dangerous lessons as well. So the good lessons are, if you can try to figure out what characteristics of the families or clinical phenotypes led to success, that can be, I think, very helpful. But one example from the parallel history of association studies: the initial associations that were discovered were of course the easiest ones to find. So things like HLA and even ApoE and things like that. And many study designs were based on the idea that they were gonna find the HLA for type two diabetes and that sort of thing. And of course for some diseases that worked very well, and for other diseases there was nothing like that to be found. And I think the same thing is probably happening for Mendelian diseases. Up until now, one of the limitations has been also that the people with the families which were highly likely for success may not have had access to the technology that they needed to do the gene discovery, but that's really rapidly becoming accessible to everybody. And so what's going to be left are the things that are by definition going to be much harder. And I think that the study design really probably needs to take that into account moving forward. So past success may not be, in this case, the best predictor of future success. We may need to work harder.
And then in terms of sample size and study design, again as Mark said, there's potential exploratory value in looking at the one patient where you think you may not be able to definitively find something, but you may be able to implicate things. But, and hopefully there's a theme that'll come up again, there may be somebody else halfway across the world with essentially the same disease in a family. And if neither person looks at that one family, then that discovery will never get made. But if there's some way of both groups looking at that same family and knowing about each other, and some mechanism for making those results available, then that I think will greatly increase the chance of discovery for maybe very rare diseases, where there aren't these multiple large families. And then I guess the phenotyping. Is it better to increase sample size or be more specific about phenotyping? I think that's actually a very interesting question, and the answer is probably disease and phenotype specific. So we did talk last evening about approaches, as Joel had mentioned, for sharing this kind of information, and obviously sharing comes up at every meeting that we hold, and many others. I'm wondering, are any of the later groups, maybe known variants or experimental data, or somebody, going to tackle this issue? Because at some point we really need to. Is this on anybody's? I can throw a comment out there. What you want is not to share the variant. You kind of want to share the genome with some annotation. If Joel has an interesting case or Mark has an interesting case and they sequence it and they have a hypothesis, they want to say, in previous cases that had similar features, or actually in any other case of Mendelian disorders, where do variants in this gene get seen, right? And I also want to throw out actually a different design, maybe.
So one thing that's starting to become possible is this idea that, if we want to find out what's the role of rare variants, one way to go about it is to find a rare disease case and sequence it and figure out, we think it's probably this variant, and after you collect a series of such cases you'll have a hypothesis. But it's also possible now to think about studies where you might sequence even hundreds of thousands of people and then say, I'm interested in the role of rare variants in this gene particularly, so I might be able to advise someone when I find a new variant there that I haven't seen before. And there you could actually say, hey, let me pull out all the individuals out of those 100,000 that happen to carry rare variants in a particular gene, and think about bringing them back for some phenotyping. I think that for many kinds of questions where we want to say what your variant means long term, when you don't have a clear clinical phenotype, you kind of need that kind of prospective thing, where you collect the individuals with the mutation and then phenotype them and figure out what you can learn about the phenotype. Right, and so we have David, actually David Valle, David Goldstein and Mark Daly, but I might just, and Heidi, I'm sorry to forget. David Valle, David Goldstein, who was the other? Mark and Heidi. Mark and Heidi, thank you. But pursuant to your point about identifying hundreds of thousands of people who have a variant, that was the topic of an NHGRI workshop earlier this summer, and so we are trying to do that and pursuing it. I think that the question that was raised was more an issue of, all of the things we're currently sequencing clinically, can we get those data into a database? And one of the group. Yeah, and so I think one of the things we might want to try to tackle today is how we might do that. But with that, then we had David Valle. So I just wanted to amplify something that Gonzalo said, the value of a second family.
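Mechanically, the genotype-first recall described here (pull everyone carrying a rare variant in a given gene out of an already sequenced cohort, then phenotype them) is just a filter over a variant store. A minimal sketch in Python; the in-memory data layout, sample IDs, and the 0.1% frequency cutoff are all hypothetical choices for illustration:

```python
# Hypothetical variant store: sample_id -> {gene: [population allele frequencies
# of that sample's variants in the gene]}. A real study would query a VCF or
# database; this toy dict just shows the recall logic.
cohort = {
    "S001": {"GENE_X": [0.00004], "GENE_Y": [0.15]},
    "S002": {"GENE_Z": [0.2]},
    "S003": {"GENE_X": [0.00001, 0.3]},
}

def carriers_of_rare_variants(cohort, gene, max_freq=0.001):
    """Return sample IDs carrying at least one variant in `gene` below the
    frequency threshold (the genotype-first 'call back for phenotyping' set)."""
    return sorted(
        sid for sid, genes in cohort.items()
        if any(f < max_freq for f in genes.get(gene, []))
    )

print(carriers_of_rare_variants(cohort, "GENE_X"))  # ['S001', 'S003']
```

The design choice worth noting is the direction of inference: instead of ascertaining on phenotype and asking which variants the cases share, you ascertain on genotype and ask what phenotypes the carriers turn out to have.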
So in some instances you may be able to share your variant call files, but there's a second sort of situation, where you have found something in a single family and you need to find another family, unrelated, with a similar phenotype. And very often that family has not been studied at a molecular level, but as Mark said, having two unrelated families is a key component of burden of proof. So the Centers for Mendelian Genomics have looked at ways to advertise: I'm studying this disorder, does anybody in the world have another case of such and such? And I think facilitating that would be very powerful in terms of gene identification for the very rare disorders. Great, thank you. So David Goldstein. I had a very similar comment to what Dave just made, but what I would add to it is that it's striking, when you're sequencing, as we're doing, and I think many people are now, when you're sequencing children with undiagnosed genetic conditions and the clinical geneticists can't match the child up to anything that they already know about, it's striking the rate at which, even in small collections, you appear to find things going on in the same gene, suggesting that even centralizing a small number of cases that have been sequenced like that would permit internal discovery. And what's happening right now is really pretty unacceptable, because at Duke we're doing a little bit of sequencing, and so when we do it, if we see some clinical similarities amongst individuals, we can go back and say, are there any genomic similarities? But even at Duke, there's lots and lots of sequencing going on where a sequence is ordered from Baylor or UCLA or wherever, and all that the clinician gets back is something's been found or something's not been found, and so there's no capacity to go back and ask what are the genomic similarities amongst patients that you later decide have phenotypic similarities you want to interrogate. You want to say something immediately?
I was just going to say that the other thing that happens in that equation is that the phenotype that we know is really only part of the phenotype, and so this connection through the genes actually expands and fills out the phenotype, which is a very important component of the growth of knowledge. So that was like the second point I wanted to make, to amplify on Gonzalo's point. So if you think about what we've done so far in terms of understanding the kinds of phenotypes that mutations in a given gene can produce, what's basically happened is that you start from clinical similarity, and so what that means is that you're only looking at individuals where mutation carriers are more similar to one another than they are to wild type. But if you think about what we know from animal models, it's very, very frequent that mutations in a given gene result in some of the animals being more similar to one another and some of the animals being more similar to wild type than they are to one another. We've had no systematic capacity to look for those kinds of phenotypes, but we do now, if we go genome first and we actually ask, okay, here's a gene we already know about that's connected to the following phenotypes; find the mutations in those genes and then get new phenotypes. And I actually do think that that's gonna result in an explosion of varied phenotypes for genes we already know about. We should give Daniel a little. Okay, just a quick, okay, a quick point on that front. I totally agree, and one of the conceptual issues we sometimes run up against, and I suspect other people here run up against, in presenting results from exome sequencing back to the initial clinicians is the response that that doesn't fit with what we know about the phenotype for that gene.
I mean, it comes up again and again, and it is because of that ascertainment bias that David talks about, where there is a sense in which, because there's been such strong clinical ascertainment for sequencing that specific gene, of course we only know about that phenotype for that gene. And I mean, there's a profound conceptual shift that I think will occur, as David talks about, on that front. And so I think in that vein, perhaps what we need is to think about different ways of data sharing: not so much simply everybody depositing genotypes, or depositing a clinical report of what are the rare variants, but how to form interactive networks between clinical sites working on similar severe rare disorders, so that we can go back and forth between the genome and back to the phenotype of those individual patients, and often then engage in collecting whatever additional biospecimens are necessary. Because I agree completely, I think our view of many, many genes is skewed: because we looked at this in this particular phenotype, we have this idea that there's a very direct relationship with the precise presentation, and I think that's far less the case. I mean, in perhaps some arenas, as David suggested, this might be the case. I think we learn more and more about many different areas in which it's absolutely clearly not the case. And so the mutations that we find in autism sequencing, so there's large recurrent CNVs, we find those in patients throughout the spectrum, all different cognitive abilities, all different affectednesses, and even unaffected individuals as well. And if we make the presumption, or continue to make the presumption, that we know in advance what that presentation is going to look like for this gene, then I think we're going to consistently be mistaken. And that was, oh sorry, it was fine.
So I just wanted to comment. One of my concerns, as clinical sequencing is now much more accessible to patients, is that there'll be more of a movement for a lot of these Mendelian cases to go directly to clinical sequencing, only analyzed by clinical labs that aren't really focused on discovery and looking at novel etiologies. And then they're going to get a negative result and that case stops there, and those cases which are already sequenced, with this data set sitting on some clinical lab's, you know, hard disk, aren't going to be accessible. And one of the things we put into a recent grant was the proposal to work with all the clinical labs doing exome and genome sequencing to co-consent their patients for depositing those sequences into either dbGaP or some accessible place, and also to put in a contract with Patient Crossroads, which is a group that creates patient registries, so that we can collect phenotypic information from the physicians submitting those cases, and actually even allow a portal for the patients to put in clinical information. Which, you know, there are studies on the accuracy of that information, but I think more data is coming out that patients actually do contribute beneficially to the phenotyping process. And then enable a portal for researchers to search through that registry to find similar phenotypes, and then know that these cases have a genome or an exome sitting with them that is accessible, and enable a researcher to be able to go and pull out a given set of phenotypes, whether they're similar or variable. And so we're hoping that we might have funding to be able to create that resource. But in some way, shape or form, that resource should be created, so that we don't lose an incredible opportunity for all these patients that are going to be sequenced through clinical efforts but not necessarily fully interpreted, in my mind. Yes, you.
And so while I would entirely accept the idea of the burden of proof of having to have two families, and I think that that could be facilitated by all the things that we've talked about, we are talking ultimately about rare diseases that are quite variable in phenotype. And I think about, for example, left ventricular hypertrophy quite a lot, due to hypertrophic cardiomyopathy. If you look in the basic science literature at the pathways that have been implicated in left ventricular hypertrophy, there are hundreds and hundreds of genes that could potentially lead to a phenotype of left ventricular hypertrophy. And I wonder, if we make the burden of proof having to see a gene occur twice in two families, and we're not as connected as we're discussing we might be, then those things may not be published, or they may not be put in. We need to find a way to put them in a domain so that when there is another family it can be connected. And so I don't want to make the bar too high, such that we end up leaving things in databases and in clinical labs and other places. And I think a forum other than maybe the literature, perhaps, that's structured and where the phenotype is classified in a way that can be searchable, that could bear significant fruit.
Yeah, that's an excellent point. I think we actually did talk about that concern, that researchers will pick up their 20 families and then, maybe if they're lucky, they publish a study, a positive result, on one of them, and then leave 19 on the table that nobody else hears about. So it would be really useful for the community to have a way of doing exactly that: making accessible the study that was successful, if so, or the negative results that do get released, tagged sort of in a useful way, as you suggest, with the genes that were discovered, the genes that were sequenced, how well they were captured, what variants were found, what interpretations were made, so that then others can query that resource and maybe actually lead to discoveries, because you actually start building a nexus of information. Because again, how many times could we be missing things where there's mutations that lead to clinically variable phenotypes? Like WFS1 is a great example, where you've got multiple mutations of the gene that lead to a huge distribution of different outcomes. We could just be missing those things because we sort of have a dogmatic prior bias about what phenotypes we're actually investigating. If we expand, sort of in the spirit of adding evidence, we might actually be gaining in that way. So I think that was one thing, practically, we really thought we should do.
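To make "phenotypes classified in a way that can be searchable" concrete: once clinical presentations are coded as controlled-vocabulary term sets rather than free-text essays, finding similar families becomes a simple set computation. A toy sketch; the HPO-style term IDs, the family data, and the use of Jaccard similarity are all illustrative assumptions, not anything proposed in the discussion:

```python
# Toy phenotype registry: family -> set of controlled-vocabulary term IDs
# (HPO-style identifiers here are placeholders, not curated annotations).
families = {
    "family_A": {"HP:0000001", "HP:0000002", "HP:0000003"},
    "family_B": {"HP:0000001", "HP:0000002"},          # overlaps family_A
    "family_C": {"HP:0000010", "HP:0000011"},          # unrelated presentation
}

def jaccard(a, b):
    """Jaccard similarity between two phenotype term sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# "Does anybody have another case like mine?" as a ranked similarity query.
query = families["family_A"]
ranked = sorted(
    ((name, jaccard(query, terms))
     for name, terms in families.items() if name != "family_A"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[0][0])  # family_B is the closest phenotypic match
```

None of this works on free text, which is the force of the controlled-vocabulary point: the set operations require that two clinics coding the same finding produce the same term ID.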
Could I just ask a quick question here about the importance of a controlled vocabulary for phenotypes? I mean, Heidi talked about the ability to share phenotypes, and one of the challenges we found in looking at the phenotype data, even from a single clinic, is there's a big difference between one family which might be described as EDMD-like and another family where there's basically an essay on the clinical presentation of that particular family. I just wondered, in the context of the database that we're discussing here, how we might think about coming up with a controlled vocabulary that actually allows you to do formal clustering across phenotypes and look for similar phenotypes. So, there's a number of efforts underway. I know Ada Hamosh of OMIM, working with the Baylor Hopkins sequencing grant, created a system, and NJ and others can probably talk about this too, to better capture phenotypes entered by physicians using the standardized vocabulary from the OMIM terms. So I know that that's one effort underway to try and capture things in a more standardized way and facilitate an easy way for physicians to enter information on their patients. Peter Robinson in the UK, or in Germany where he's from, yes, thanks, has developed the HPO, the Human Phenotype Ontology, and is trying to get traction on the use of that terminology. And I know Donna and others at NCBI have been trying to match SNOMED codes and other things. So I don't think there's right now one system, but I think there's some movement towards that. There's a meeting before ASHG this year all on phenotyping, to try and work on this problem, which is, as we all know, a huge problem. I'll just amplify slightly on what Heidi mentioned, in terms of a tool that Ada Hamosh and her team have developed, which we call, for the Mendelian centers, PhenoDB, and it's designed to make entry of data relatively easy.
The instructions say it takes less than two minutes once you know your way around it, and it captures a lot of data, including image data, genetic data, other kinds of phenotypic data, family history. It can be surveyed; it collates the data so that someone who wants to look at phenotypes can see the standard representation over and over again, and you can search it, query it, and we're adding modules to it right now. There's a manuscript submitted describing it. It's freely available from the McKusick-Nathans Institute of Genetic Medicine for anybody who wants to kick the tires, and I can give you a website where you can go look at it if you want. Great, we probably should wind up this discussion, but not quite yet. We started a little bit early, so maybe a few more minutes and then we'll go on. Okay, maybe I'll just make this very brief. In thinking about this sort of database and resource, I think one other thing that might be worth thinking about too is a way to quantify information. So it would be really valuable to have not only sort of very granular details, like if a clinician wants to go to a gene and look for families, or other researchers who have families with similar phenotypes, so you really dig into a specific gene; that's clearly crucial and would be of great benefit. Now imagine another researcher who wants to design a brand new study on a new phenotype that they think is really important and interesting. It would be really great for them to be able to go to this resource and actually ask a really quantitative question: has this phenotype been studied? What's the evidence for the model and the question that I have, for the hypothesized phenotype, the mode of action? Maybe even a prior bias on the specific genes that I think are implicated.
In a very, very powerfully concrete way, so that you can actually say: the model that I'm gonna write a grant on, that I'm gonna ask for funding for, given all the available resources that we know about, this model hasn't been tested, and so I actually have a powered strategy to go after it. Whereas, in contrast, if they do hypothesize that and there is data to actually exclude the model that they're proposing, maybe that's not a good investment of resources, or maybe they can rephrase their question in a way that actually adds information. Another way: you wanna, like, build confidence sets. We're in the game of hypothesis rejection rather than testing, so we want to build an area where we can say these models have been rejected, and if we add information we can reject or accept these particular models, and thus narrow down the area by which, whether it's a complex disease or a Mendelian disease, we can actually make traction to figure out what parts of the genome matter, and even moreover what variants matter. So having that kind of resource, I think, in that context would be really valuable. Well, and given what we heard last night in the discussion we had about all the false positives out there, and how people continue to pursue them even though there's evidence against them, it would be really neat for reviewers or others to be able to query, you know, is there evidence against this, or how strong is the evidence for it. So. Oh yes. So that sounds kind of like the ClinicalTrials.gov database, right, in a way. And do we think, in general, in the room, would people be willing to put in negative findings? Right, I mean, that's part of that database: you have to register studies in advance and then you have to put in the findings. Is that realistic, do we think? Uh oh. Blank looks. Does anyone think that's realistic?
Sure, I mean, how could it not be? I mean, and I think, you know, we've been mentioning the reality of the single family sequencing: it's not that you can say it's negative or positive; we can't yet draw a conclusion based on what we know right now and based on this family alone. But other families will come along, more information about how to interpret the genome will come along, and, you know, success or failure is not a fixed, one-time shot. So we need to have creative ways of making the data available so that further inference can be made, you know, weeks, months or years later. So the ClinicalTrials.gov model is an interesting one, and I can hear the NLM people in the room going, oh my god. But in addition, keep in mind that the major impetus to that was journals saying we won't publish your clinical trial, and, remember, many clinical trials are done for drug discovery and drug approval, and so FDA won't accept your data, you know, or accept your evidence, unless it's in ClinicalTrials.gov. There may be less of a, you know, a carrot here, but it'd be something to think about, I think. Maybe just one last comment, Joel. Just along those lines, at Kim Molecular Genetics we've been discussing a forum for publishing kind of that single, you know, just-short-of-burden-of-proof kind of study, especially if there's supporting evidence that, you know, implicates, but not in a way that we would say exceeds the burden of proof, to sort of highlight that gene for others. So we're still working that out, but that may be one forum that's available. That would not solve the full exome deposition, although one could imagine requiring the full exome to be made available somewhere as part of those sorts of publications. Great, okay, why don't we go on to the statistical analysis group, which is Suzanne. If we could maybe change the timer back to 25 minutes, and Suzanne, take it.