Okay, thanks Daniel, thanks Terry, thanks everyone for the invitation to talk. I guess my task was to amplify with examples Daniel's initial charge to the group, which is to explain why this is important, or why this matters. I think it really does matter, and I hope that through examples we can come to believe that that's actually true, and I guess that's why we're here. So about six years ago, was it six years ago or seven years ago, we had a meeting at the end of the HapMap era, the beginning of the GWAS era, to explore questions of what constitutes evidence of genetic association between genotype and disease. I gave a brief presentation at that meeting which highlighted one example that we had been pursuing, which was the case of dysbindin and schizophrenia. There was a report in 2002, so 10 years ago now, of an association between variation in and around the dysbindin gene, not coding variation, mind you, just SNP markers in and around the dysbindin gene, and schizophrenia. And as you can see here, actually is there a pointer, if you can read the text from the abstract of that paper, you can see the level of statistical evidence in favor of association to schizophrenia. Single SNPs were significant at slightly less than .01, and there was some mysterious three-marker haplotype analysis involving a restricted condition that was a little bit better looking, but simply not something that we would think holds up to the standards of today. Better, that one, better, okay, great, because it's green. I like the red one actually.
So in any event, at that time this paper came out and it was rapidly followed by a large slew of replication studies which also claimed to see very modest association to the dysbindin gene. We utilized the nascent HapMap, and this is a screenshot from Jeff Barrett's Haploview program, to explore the consistency between those reports of association, because at the time the standard in our field was that everybody just picked a few SNPs, developed assays for them, and tested association in the region. Unfortunately nobody ever seemed to pick the same SNPs, and so it was very difficult to line all these things up. So what we did is we took the HapMap and built a very straightforward phylogeny, because as you can see, if you remember ever using Haploview, this entire gene is in a very large and significant block of linkage disequilibrium. So it was very easy to reconstruct a phylogeny and then to begin to assign the associated variants from each of these papers to see whether or not there was a consistent story, the ultimate intent being to take 10 different papers, find out that they actually were talking about the same variation, and claim that we had actually found something with respect to schizophrenia. What we found was actually quite different. The second study here was done by the same group as the first, with overlapping sets of families, but each subsequent study that was published and claimed replication was actually demonstrating replication to very different variation. So this first replication study by a German group came out and they claimed that this variation pattern, these two haplotypes, which have some common SNPs tagging them, was associated, which would mean by extension, of course, that these other variants were likely protective.
And so the slew of papers continued, claiming association to different variation patterns in this gene with no consistency, ultimately capped off by a paper which actually claimed that everything but the original study's finding was associated. So in fact the original finding would have had to have been protective in this case. Of course that's not the way these papers were written. They were all written in a way that said we found some association at less than 0.05. I received such positive feedback on presenting this as a cautionary tale that we actually fleshed it out and published a paper, from which those last slides are taken as a figure, and as blunt as one could put it at the time, we described the evidence of association as at present equivocal and unsatisfactory. I think that was about as polite as one could put it. What's happened since that time is why I bring this back to everyone's attention. It's not that many of us really involved in schizophrenia GWAS in the last five years think that this is a legitimate association that we're excited about. In fact we have many, many other associations that don't involve this gene. But there are in fact 238 PubMed entries when you search for dysbindin and schizophrenia. Of course the first of those is that 2002 association study. That is the origin of all of this effort. And even last year at the World Congress of Psychiatric Genetics there was an entire session, a real symposium, devoted to the role of dysbindin in schizophrenia and psychiatric disorders, even though I've highlighted for you the sum total of the evidence in favor of that gene. And in fact, if you look at the largest psychiatric GWAS consortium analysis, presented at the same meeting and published shortly thereafter in Nature Genetics with more than 9,000 cases, there is not yet any substantive evidence at an unbiased genome-wide level that one would be interested in.
So at best it is still equivocal and highly unsatisfactory. But the main reason to bring this example back to the fore is that it's not the only example. In fact there are myriad examples I could have chosen. I just happen to have slides on this one and was most familiar with it. But there are huge numbers of genes that still get a tremendous amount of attention, and in fact still a tremendous number of publications in 2011 and 2012 focused on models of this disease and on which neuronal cell types they must be exerting their action in, all motivated by that original association report in schizophrenia. We fail as a field when we accept this as the standard, and this was the standard for a very long period of time, not to pick on this example in particular. But this is a collective community failure, because you can imagine, with this many papers, how many years of graduate students and postdocs have been dedicated to this, and how much funding has been dedicated to efforts to study the relationship between dysbindin and schizophrenia. We can't accept that, and happily we didn't accept it, and we moved on with respect to common associations. With these experiences in mind, it was certainly endorsed at that meeting five, six years ago, and embraced by the field, that there was one and only one way of getting past our history of failure, and that was to embrace extraordinarily strict thresholds for declaring association. We had been through this process 10 years earlier with linkage analysis, where again it was an unpopular decision because everyone thought, boy, it's going to be too hard to reach a LOD score of three in our linkage studies for schizophrenia or bipolar. So it was. And it was thought that it would be too hard to reach P values of five times ten to the minus eight in GWAS.
But we persevered through early, very small GWASs, and it's to the credit of the field that this has been embraced uniformly as a hard statistical threshold. Just to cite the example of Crohn's disease, from Barrett et al. 2008, which was one of the first landmark meta-analyses across the ocean of multiple GWAS, there were 32 genome-wide significant hits defined in that paper. We come back now four years later, and the number of associations in inflammatory bowel disease now sits at 163, in a paper that's just about to come out, the first authors being Luke from Jeff's lab and Stefan in our shop in Boston. And if we look back to 2008, when we were at 32 loci, all 32 of those genome-wide significant hits are still fully confirmed in these latest analyses and are much more significant than they were in 2008. So we did create a culture of replicability, simply by setting extremely high standards and then going out and meeting them, whatever it took. Moreover, the statistics that have been used in GWAS have been very, very accurately calibrated. If we go back to the 2008 paper, we described in that paper that there were eight additional loci that had nominal replication but didn't quite reach genome-wide significance. For that set of eight, if we evaluated the distribution of test statistics across the genome, we would have expected two or three to be false positives, or a list two or three long to arise by chance. And in fact four of those eight have now been confirmed well past genome-wide significance. Two, as predicted, have been confidently rejected, and there are two that are still hanging around at that intermediate level, short of genome-wide significance, that are yet to be confirmed or rejected convincingly. So the statistics were calibrated.
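As a rough illustration of the arithmetic behind the threshold and the calibration check described above, here is a minimal sketch. The counts for the eight sub-threshold loci are the ones given in the talk. The figure of one million effectively independent common-variant tests is a conventional assumption that the talk does not state, and the function and variable names are hypothetical.

```python
# Assumption (not stated in the talk): roughly one million effectively
# independent common-variant tests across the genome.
FAMILY_WISE_ALPHA = 0.05
ASSUMED_INDEPENDENT_TESTS = 1_000_000


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test P value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


genome_wide_threshold = bonferroni_threshold(FAMILY_WISE_ALPHA, ASSUMED_INDEPENDENT_TESTS)

# Outcomes reported in the talk for the eight loci that had nominal
# replication in 2008 but fell short of genome-wide significance.
suggestive_loci = {"confirmed": 4, "rejected": 2, "unresolved": 2}

# The talk's expectation: two or three of the eight would be false positives.
expected_false_positives = (2, 3)


def false_positive_bounds(outcomes: dict) -> tuple:
    """Rejected loci are false positives; unresolved loci could still go either way."""
    low = outcomes["rejected"]
    high = outcomes["rejected"] + outcomes["unresolved"]
    return low, high


low, high = false_positive_bounds(suggestive_loci)
# The observed range overlaps the expected range if neither lies wholly outside the other.
consistent = low <= expected_false_positives[1] and high >= expected_false_positives[0]

print(f"genome-wide threshold: {genome_wide_threshold:.0e}")
print(f"false positives so far: between {low} and {high}; consistent with expectation: {consistent}")
```

Under that assumed test count, the division reproduces the five times ten to the minus eight figure, and the observed outcomes for the eight loci fall within the range the talk says was expected.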
The GWAS era has been a considerable success inasmuch as we have gotten past the history in complex diseases, and even in schizophrenia now, of defining associations that were not replicable, not reproducible, or that were reproducible in creative ways, such as the wrong alleles being associated in the next study. But now we move into a sequencing era, and to some extent we've gone back in time again, as we did when we first started doing GWAS, back to the early discussions around linkage studies. There is a broad conception that rare variation is just fundamentally different. In particular, if we're doing an exome-focused study, we're just looking at coding variation in the exome, so why do we worry about these genome-wide things? We know the variants we're looking at are going to be functional variants, and this is all going to be very happy. So the fact is that we have lost some of the steam in terms of statistical rigor in high-profile publications in high-profile journals with respect to rare variation, in large part because we use additional parameters to substitute for statistical significance: that the variants are protein-coding variants, so they obviously must be important; that we have mouse models that say these genes are important; that we have functional assays that say the variants are functional. One example, again just to pick some examples, not because these are the only ones, is the case of SIAE. This is a great candidate gene. There are wonderful mouse models that describe just how such a gene would be involved in autoimmunity. And there was a report a few years ago that described rare protein-coding variation in this gene, which we can confirm as deleterious to the gene's function, as associated with a variety of autoimmune diseases. And here's your statistical threshold.
It looks pretty good, but of course one of the other lessons that we learned from GWAS is that you have to have extremely careful matching of cases to controls in order to believe the report of any association of any gene, along with lots of other quality control characteristics that made it very difficult in the past to do single candidate gene studies accurately and make it very difficult now to do single-gene sequencing studies carefully. And so it was that some two years later we find out that, in fact, the rare and functional SIAE variants are actually not associated with any autoimmune disease whatsoever. They are still functional variants. It's still a fine gene, these are still coding variants, and they are in many cases confirmed as defective, but there is absolutely no evidence of association. Here's a particular rare variant for which, as you can see, there's really no evidence of association in a very, very large and compelling study, but it's confirmed absolutely as a defective variant. And even if you go to super rare variants that are complete loss of function of this gene, there's no compelling evidence of association. So that shows, all right, it's going to be difficult to define new genes for diseases based simply on sequencing without taking account of a lot of the principles of quality control, of population matching, and of strict statistical thresholds that we learned in GWAS. But certainly, one might think, if we already know that the gene, or maybe a closely related gene, is involved in the phenotype, there can't be much left to be concerned about in defining associations based on the common sense of seeing a functional variant in a gene we already know is related. So here are some examples. And there are many. I'm not going to give you many, because you've probably already grown tired of me harping on and on with this.
So we've known for quite some time, and there's compelling evidence, that if you knock out CETP, or you have a pure loss-of-function mutation, this is reliably associated with significant increases in serum HDL. Here's a report, from a long time ago, but it's a perfectly sensible report that's consistent with that. Here's a missense mutation in the CETP gene. And we found it, not we, but this group found it, by looking at two individuals who had extraordinarily high HDL levels. I don't remember how many sigmas. It was a lot of sigmas. And they found what was at that time an undocumented coding variant in CETP. Common sense would say, wow, this is about as good as it gets for a sequencing study. Unfortunately, if we now look at some of the very large population studies of HDL levels and CETP variation using next-generation sequencing, it's rather unconvincing, in that the only two people we found with this variant are pretty much as close to the middle of the HDL distribution as you can get, and certainly not seven-sigma outliers or whatever that original finding was. And there's, of course, more. Even when we get into severe Mendelian disorders, there's this report here from 1998 that identifies three different novel mutations in the E2 gene of the branched-chain alpha-keto acid dehydrogenase complex. In the estimation of anyone reading the article, these would very likely be the causal mutations, because we're looking at three maple syrup urine disease patients. In fact, in the fullness of time, we've now seen that one of the mutations that they described isn't, in fact, a mutation. It's actually a 90% population variant. So it's actually the reference allele in our current instance of the genome. This was, of course, done before we had the completed genome, before we had reference databases. So it was understandable how this could come to pass.
And it's still understandable how these things can come to pass in a commonplace way, because most variants, neutral variants even, are not 90% variants. They're less than 1% variants in the population, and so we're encountering these variants all the time. These two variants, and many, many others like them, are still listed as functional disease-causing or phenotype-altering variants in HGMD. So we've got a long way to go before we can start confidently rolling out sequencing for aggressive screening, prenatal screening, clinical screening of various flavors, because while the sequencing itself is getting extraordinarily accurate, and I trust it as much as anyone, our ability to interpret it is still quite limited by the lack of adequate resources, adequate databases that tell us which variants are functional and which variants are not. And here are just some example papers. Every author who has looked at this question, from a different angle, with respect to any database of mutations, comes to the same conclusion: that there are a huge number of often quite rare coding variants in these databases that are actually not likely to be disease causing. And the 1000 Genomes Project just confined itself to the extraordinarily conservative threshold that says if HGMD says something causes a severe disease and it's more than 10%, that's probably not true. You know, if it's more than half a percent, it's probably not true. But in fact, there were a large number, even with that strict threshold. So this is a problem, because while we can cringe and perhaps laugh a little bit about the dysbindin example, it is painful to think about how much money and public trust and student hours are wasted on those things.
We're now getting into a space where people are making active health decisions about themselves, reproductive decisions for themselves, their children, their families. And we can't afford to have mistakes of any flavor going forward if this endeavor is going to succeed in the way that it has to. We're, of course, heartened by the fact that we did manage to get past this in GWAS space, so I believe we can get past it here. Part of getting past it is really just embracing the fact that each individual exome that we study is actually an extraordinarily diverse place. There are many extremely rare missense variants, variants that are not in the databases that you're looking at. There are rare loss-of-function variants in each of our genomes that are not associated in any direct Mendelian way with severe disease. And so when faced with a patient with a severe disease, one such observation quite obviously cannot by itself be evidence. There are some things which we find much more convincing, not absolute slam dunks, but if we confine ourselves to de novo loss-of-function variants, or an autosomal two-hit knockout of a gene that's otherwise extremely well preserved in the population, we get close to being confident that this might be meaningful, though we still see those types of things even in the general population. So a question that's often posed to me, which puts me in the awkward position of trying to explain why research I'm actively engaged in isn't as interesting as everyone thinks it is, is: but this can't possibly apply to all those de novo mutations you're talking about in autism. Those must be causal, right? Well, no, in fact, by and large they're not. If you combine the four recent publications of significant trio sequencing in autism, you can see here that there is a highly significant excess of de novo loss-of-function variants in affected individuals.
And whether we compare what we observe in cases to what we expect from a very carefully calibrated mutational model, or whether we compare what we observe in cases to controls, we see that there is in fact a reliable statistical excess here. The overwhelming majority of missense variants, however, no matter which way you want to do that comparison, are quite clearly not related to autism risk, though they're often taken as such when people perform these sequencing studies. In fact, we don't yet have any compelling statistical evidence that any de novo missense variants are, though one has to believe this modest excess of 0.03 per individual is probably ultimately going to be significant. At the end of the day, what we have in these studies is evidence that there is some signal, and this is certainly a reason to press forward, but we cannot put our finger on individual genes at this point in time with super high confidence when the signal is confined to 15.5% compared to 8.5% in controls or unaffected individuals. And even when we take the best-case scenarios, the five genes across those 945 sequenced trios that have been hit twice with loss-of-function mutations, it's still the case that if we fairly estimate the significance of those observations, it doesn't really reach what we would probably define for this type of study as genome-wide significance, based on the number of genes we're studying and the number of different tests we might run. Though it's getting there, and we certainly believe that several of these genes, in particular CHD8, are very likely to be confirmed as autism risk factors. So all that is meant to point out that I think this is actually a really important activity that we're engaged in.
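To make the point about recurrent hits concrete, here is a minimal sketch of one way such a significance estimate could be set up. It is not the analysis used in the studies discussed. Only the 945 trios and the two hits per gene come from the talk. The per-gene de novo loss-of-function rate and the number of genes tested are hypothetical values chosen for illustration, and the Poisson null model is itself a simplifying assumption.

```python
import math

# Figure taken from the talk
N_TRIOS = 945

# Hypothetical values for illustration only (not from the talk)
ASSUMED_LOF_RATE_PER_GENE = 5e-6  # de novo loss-of-function events per gene per child
ASSUMED_GENES_TESTED = 20_000     # rough count of protein-coding genes
FAMILY_WISE_ALPHA = 0.05


def prob_at_least_two(expected: float) -> float:
    """P(X >= 2) for X ~ Poisson(expected), written to stay accurate for tiny means."""
    # 1 - e^-m - m * e^-m, using expm1 to avoid cancellation when m is small
    return -math.expm1(-expected) - expected * math.exp(-expected)


# Expected number of de novo loss-of-function hits in one gene across all trios
expected_hits = N_TRIOS * ASSUMED_LOF_RATE_PER_GENE

# Chance that one specific gene is hit two or more times under the null model
p_single_gene = prob_at_least_two(expected_hits)

# Bonferroni correction for testing every gene
exome_wide_threshold = FAMILY_WISE_ALPHA / ASSUMED_GENES_TESTED

reaches_threshold = p_single_gene < exome_wide_threshold

print(f"P(two or more hits in a given gene): {p_single_gene:.2e}")
print(f"corrected threshold:                 {exome_wide_threshold:.2e}")
print(f"reaches corrected threshold:         {reaches_threshold}")
```

With these assumed inputs a double hit falls short of the corrected threshold, but the answer is sensitive to the mutation rate assigned to each gene, so a fair estimate would need gene-specific rates of the kind the calibrated mutational model mentioned above would supply.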
As it was when the GWAS era was beginning, the stakes are considerably higher in the types of things that we're doing with next-generation sequencing, and we cannot afford to relive the mistakes of the past by any stretch of the imagination. And I think I've talked long enough and hopefully bugged everybody enough, so I'll be glad to move on to the next presentation. Excellent. So we definitely have time for questions for Mark. Does anyone want to raise any discussion points at this time? I just thought the SIAE example is very important as well, because even though a refutation was subsequently published, it's the same story you're talking about, with dysbindin still being cited with schizophrenia. And the thing is that if you look at that paper, it's been cited something like 50-odd times, including many of them in 2012 after David Van Heel's paper. And the refutation paper has only been cited twice. That's because, if you think about it, what's the paper you write where you're going to cite a refutation? You never start by saying such and such a finding was refuted and therefore I conducted this experiment. And it means that those papers, which in some sense are the more important of the two, in that they're saying this is not actually true in a genetic sense regardless of the functional data, never make it into the collective knowledge of the field, I think, in the way that the big initial headline-grabbing paper does. So even if you do manage to disprove those things, there's still the possibility that they can last for a very long time. And it's certainly still true that SIAE is cited reasonably widely as the canonical example of rare variants that cause disease, which is pretty shocking. Great. Maybe worth noting as well that this is not just an issue for genetics.
The HDL example you used is mirrored in mainstream cardiology, where we've believed for many, many years that HDL was causal for the most common cause of death in our society. And increasing evidence, including Mendelian randomization, is pointing towards that perhaps not being the case. No, it's a very good point. It's a very good point. Russ. Yeah, thanks. Thanks very much. That was very, very helpful. I don't want to impute something that's not true. But you made a very good case that in the GWAS era, we sat back, we sat down, Neil and others thought about statistics and said, okay, we have to be much more strict. And this is a statistical discipline issue. But I want to ask if you think that there is going to be a similar statistical-discipline type of answer here, because I worry that there isn't. In my world, I define rare as something that nobody would ever fund a study to learn about. So by definition, the statistics are not available. And as soon as the statistics are available, it's not rare anymore. So tell me your thoughts about that. So I have two thoughts. One is, I guess I'm not prepared to say that the statistics aren't valid in this case. I think we use them to great effect in interpreting what we find in the de novo point mutations in autism. I have been convinced by findings in a single family that we have very clearly identified a gene for a Mendelian disorder, because we're seeing homozygosity of a variant that has a less than two in 10,000 carrier rate and the gene fits. And then we found a second case, of course, after that. But statistics can still be employed in these cases. It's just challenging. But, you know, we thought for a time it was going to be very, very challenging to achieve five times ten to the minus eight as well.
Now, the second half of the answer, and I'm glad you asked, is that I certainly do believe that we can utilize other information, particularly in this problem, much more so than in GWAS. And the reason I waited until someone was going to ask me is that the subtlety here is that I think statistics have an integral role in that process as well. So whether it's that we pre-specify a certain set of genes, or we utilize protein-protein data to find an excess of relationships between known genes and new genes and so forth, I think we have to be very rigorous about those analyses in a statistical sense, and not use those things as a "well, we found this and then it was in this list" hand-waving thing, but really think critically about how we're using that information and bring it in in the same rigorous way. So we can modify the statistics, not abandon them, not supplant them, but integrate this information in. So I think we had David and, yeah, so David. So I guess one way to look at the problem is, how much is out there that has been claimed to be pathogenic that isn't, and you've made a nice review of that, and that's pretty discouraging. But the other thing that one could look at is arguments that are being used in papers right now that are clearly wrong. And I think it's probably something that we need to figure out a way to solve, because in almost any journal you pick up, you'll find arguments such as, here's a stop mutation that we identified and we didn't see any other stop mutations, and that's part of the argumentation that the variant is causal, when there's no reason to believe in the first place that it had to be a stop that's actually causing the condition. There's no proof of that. And it's acting as if the sequence data is comprehensive.
The other thing that you see all the time is a family that's multiplex, and you'll see that the variant has a little bit of co-segregation, right? And it's perfectly clear that there's no significant genome-wide linkage in the family, right? But yet there's some degree of co-segregation, so there's a claim. If you look around, you see that everywhere, and nothing is really done about it. And if that stays as it is, then it's an insoluble problem. So I guess part of what we need to do is have a mechanism for doing something when this stuff appears in papers. So do you have any view on that? No, I agree 100 percent. I mean, one of the reasons I'm so motivated by working as a group on this particular topic area is because I, like you, read a lot of these papers, and it seems we are just repeating all of the things that got us into trouble when we were doing candidate gene association studies 10 years ago, just assuming that everything will come out nicely at the end because it sort of makes sense, and we know it simply won't. I mean, the most aggravating thing is the many papers that don't come right out and say it, but basically are written in a way that you're supposed to think that every one of these de novo mutations that has been found must be important to the disease or must be important to something. And it just is not the case. So. Other comments? Great. All right. Well, thank you very much, Mark. I might note that for the first time in genome history, we have a live streaming webcast of this meeting. Apparently everybody on this side of the table except I can get it. But anyway, if you want to look at it, if you're having trouble with the slides or whatever, it's at www.genome.gov. And then there's an eight-digit number, and that is 275-499-59. So 275-499-59.
If you'd like to see this live. Okay. Great. Why don't we, should we go on then to Heidi and