Thanks very much. Can people hear me at the back? Is this working okay? Just a little bit up. That's right. Okay, that's much better. All right, thank you.

So it's a great pleasure to be here today. I wanted to spend my presentation talking about the analysis we've been doing in individuals who, as far as we know, do not actually suffer from severe diseases. So this is not a rare disease sequencing effort. This is really about trying to collate information from as many people as possible who don't have severe diseases, and seeing what we can learn from that effort. And this is an effort that falls under the general umbrella of the Exome Aggregation Consortium, or ExAC.

The fundamental idea behind everything that we do is that in order to make sense of the variants we discover in any one patient's genome, we need to be able to interpret those variants within a population context. Ideally, what we'd like to be able to do is, for instance, in my lab, we study patients with rare neuromuscular diseases. We'd like to be able to take every variant in one of those patients, look it up in everyone who's ever been sequenced anywhere in the world, and ask: has that variant ever been seen before? If so, how common is it in a variety of different populations? And do the carriers of that variant have any specific phenotype that resembles what we see in our patients?

In many ways we live in a golden age for these types of studies. We know that worldwide somewhere north of a million genomes and exomes have been sequenced, or at least that's our best estimate of the current numbers out there. So this is a phenomenal amount of data about the patterns of human genetic variation that, at least in theory, we could access. Now in practice, of course, everyone here knows that most of these data are largely inaccessible for a whole variety of reasons. There are ethical constraints, there are political reasons why people don't want to give up their data, for academic or commercial advantage, and of course there are also very mundane technical reasons: it's hard to move large amounts of data around, and, critically, each of the projects generating these data tends to use its own slightly idiosyncratic pipeline for processing and calling the variants, so it's difficult to merge these data together.

So since about 2013 I've been leading a consortium, the Exome Aggregation Consortium, or ExAC, with the goal of trying to overcome some of these barriers and pull together a large and harmonised collection of exome sequencing data for the benefit of the broader rare disease community. Our first release, which was made public back in October 2014, started with a collection of raw sequencing data from 92,000 human exomes. As you can see in this table here, these come from a whole range of different projects, with the lion's share of the data coming from large case-control studies of common complex diseases like type 2 diabetes, heart disease, and neuropsychiatric conditions. There's also data from TCGA here; I should emphasise these are germline exomes only, not tumor exomes. We took the roughly one petabyte, that's about 1,000 terabytes, of raw sequencing data from these 92,000 samples, pushed all of it, using computing resources at the Broad Institute, through the same processing pipeline, and then did joint variant calling across all of those samples, and that resulted in a single unified call set.
We then proceeded to remove about a third of those samples and release a publicly accessible call set of just over 60,000 samples that is clean, in the sense that we have restricted it to high-quality samples, to individuals who are unrelated to one another, to those who have consent for public data sharing, and also to individuals who, as far as we can tell, do not have severe pediatric disease and are not the first-degree relatives of those with severe pediatric disease, although we don't know that for everyone, as I'll mention later. So this reference data set, although we can't for consent reasons share the individual-level genotypes for these 60,000 people, is a data set where we can share frequency data for all of the variants that we discover.

This is a massive data set; it's by far the largest collection of sequenced individuals that's been made available to date. Here, in this slightly grandiose slide, I'm comparing the size of ExAC to the two previously publicly available data sets of genetic variation in terms of the number of individuals: the 1000 Genomes Project and the NHLBI's Exome Sequencing Project (ESP). In ExAC, we were able to assemble nearly 10 times the number of samples compared to ESP, and although we do have a very European-centric component to ExAC, that's this big blue chunk here, we also have representation from thousands of individuals from many other continental groups. So there's a substantial amount of genetic diversity here.

Overall, the project has generated by far the largest catalogue of human protein-coding genetic variation: over 10 million variants in total, which corresponds to approximately one variant every six base pairs throughout the coding regions and their flanking sequences. So it's an incredibly high-resolution view of genetic variation. The vast majority of these variants are novel; they've never been seen before. We believe that almost all of them are real, based on a whole series of QC analyses that I'm not going to go into here. And most of them are extremely rare. In fact, if you look at this plot here, you can see that more than 50% of the variants we discover in ExAC are seen only once in our dataset, so that's an allele frequency of less than 1 in 100,000, and more than 75% are seen with an allele frequency of less than 1 in 10,000. So this gives us an unprecedented insight into the really low end of the frequency spectrum, which is where we expect most severe disease-causing mutations to be found.

The goal of producing this dataset was to create something for the broader community. So in October 2014, we launched a website, which hopefully many of you have already looked at, exac.broadinstitute.org, where you can access coverage information and variant information for your favorite gene. You can look up any variant that you've seen in your patient and see whether we've seen it or not, and how common it is across different populations. You can also visualize, for many sites, the raw read data, which gives you a sense of how confident we are that we've actually called it correctly. This has been very successful: sometime this week, we will tick over 4 million page views for the ExAC browser, and it's now become one of the default go-to reference datasets for clinical genomics.

Now, of course, there are a number of major caveats to the ExAC sample collection. The first one is that we've been entirely opportunistic here.
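The same per-population lookup the browser provides can also be done programmatically against the downloadable ExAC sites VCF. Here is a minimal sketch, assuming a local, tabix-indexed copy of the sites file and per-population AC_*/AN_* INFO fields following the ExAC release conventions; the exact file name and field names are assumptions and should be checked against the header of the release you use:

```python
# Minimal sketch: per-population allele frequency lookup in a local
# copy of the ExAC sites VCF, using pysam. File name and INFO field
# names are assumptions; verify them against the VCF header.
import pysam

POPULATIONS = ["AFR", "AMR", "EAS", "FIN", "NFE", "OTH", "SAS"]

def lookup(vcf_path, chrom, pos, ref, alt):
    vcf = pysam.VariantFile(vcf_path)
    # fetch() uses 0-based, half-open coordinates
    for rec in vcf.fetch(chrom, pos - 1, pos):
        if rec.ref != ref or alt not in rec.alts:
            continue
        i = rec.alts.index(alt)  # multi-allelic sites carry one AC per alt
        for pop in POPULATIONS:
            ac = rec.info["AC_" + pop][i]  # allele count in this population
            an = rec.info["AN_" + pop]     # total called alleles
            af = ac / an if an else 0.0
            print("%s: AC=%d AN=%d AF=%.2e" % (pop, ac, an, af))
        return
    print("variant not found")

# Hypothetical usage (invented coordinates):
# lookup("ExAC.r0.3.1.sites.vep.vcf.gz", "2", 21229160, "G", "A")
```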
We simply took samples from the largest exome sequencing projects that we could get our hands on, where the PIs were amenable to sharing their data. That means that most of the samples have extremely limited phenotype data, and almost none of them are actually consented for recontact. We think maybe a quarter to a third of the samples are consented for possible recontact; for many of those, we haven't tried it yet. Also, although we've tried to remove severe pediatric disease cases, or severe Mendelian disease cases in general, from this dataset, some will remain. We have some projects where we really have almost no phenotype data, so it's possible that some of these people do actually have severe disease.

Nonetheless, those caveats aside, and I'll return to them later, ExAC improves the analysis of variants of uncertain significance in a number of important ways, and I'm going to step through three of them in this talk. The first, and really obvious, one is that it improves our ability to filter out variants seen in a patient that are simply too common in the general population to be causal for a particular disease; I'll show some data on that. The second is that, if we're fortunate enough to have large collections of cases for a particular disease, we can do a case-control comparison that allows us to assess the pathogenicity and the penetrance of a whole range of variants discovered within that disease, and I'll use the example of prion disease to illustrate that. And then finally, and this is work that we're still very actively pursuing, it's possible to use these data, as we discussed this morning, to identify genes or regions of genes that are extremely depleted for variation in the general population, and therefore represent regions where we believe disease-causing variation is much more likely to be found.

So first, just to illustrate the power of ExAC for rare disease filtering: in this very simple analysis, what we've done is to take 100 independent individuals from each of five different continental populations, and then simply ask, if we were to apply a 0.1%, that is a one in a thousand, frequency filter, using either the Exome Sequencing Project or ExAC, how many variants are we left with in that particular exome? This is a reasonable filter we might apply if we were looking, for instance, at a dominant disease. And the short answer is, if we use ESP as our reference dataset, we're left with somewhere between 600 and 1,000 variants based on frequency filtering alone. ExAC, because it's much bigger, and importantly, because it's much more diverse, reduces that number much further: we're left with somewhere between 100 and 200 variants that we can then work through using segregation, functional information, and other approaches. So both size and ancestral diversity are extremely important in improving these filtering capabilities.

Now we can of course apply this to our own patients or to anyone else's patients, but we can also go back to the literature and to databases of reported pathogenic variants and see how many of those are actually present at frequencies that are too common to be consistent with disease causation.
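To make the filtering step concrete, here is a minimal sketch of the dominant-model frequency filter just described; the variant identifiers and frequencies are invented for illustration:

```python
# Minimal sketch of the rare-disease frequency filter described above:
# for a dominant disease, discard any candidate variant whose allele
# frequency in ANY reference population exceeds the chosen cutoff.

DOMINANT_AF_CUTOFF = 0.001  # 0.1%, i.e. one in a thousand

def passes_filter(pop_freqs, cutoff=DOMINANT_AF_CUTOFF):
    # Keep the variant only if it is rare in EVERY population. Using the
    # maximum across populations (not a global average) matters: a variant
    # common in one ancestry can be diluted to apparent rarity when pooled
    # into a mostly European reference panel.
    return max(pop_freqs.values(), default=0.0) < cutoff

# Hypothetical patient variants with invented per-population frequencies:
patient_variants = {
    "1-12345-G-A":  {"AFR": 0.004, "NFE": 0.00001},  # common in AFR: filtered
    "2-67890-C-T":  {"NFE": 0.00002},                # rare everywhere: kept
    "X-424242-T-C": {},                              # absent from reference: kept
}

candidates = [v for v, freqs in patient_variants.items() if passes_filter(freqs)]
print(candidates)  # ['2-67890-C-T', 'X-424242-T-C']
```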
And the answer, as I suspect many of you have guessed, is that there is a depressingly large number of variants in basically all databases of alleged disease-causing mutations that are clearly not pathogenic, that have resulted from an incorrect assertion of pathogenicity. Just to give you perhaps the most obvious examples: here we took a set of variants that had been reported as pathogenic in either ClinVar or HGMD, but that had a frequency of greater than 1% in at least one of the ExAC populations. Anne O'Donnell, a medical student in my lab, then went through and manually curated the evidence supporting pathogenicity for each of those variants. And exactly as you would expect, basically none of these turned out to actually be pathogenic, with a handful of exceptions, such as the CFTR ΔF508 allele and variants for a handful of other mild Mendelian diseases. Most of them are clear, frank errors in the literature, and some of them are database errors.

But even once we remove these really obvious errors, we're left with many hundreds to thousands of variants that are confidently asserted in the literature as being causal, that are very rare in the general population, and that are nonetheless present in ExAC individuals who we believe probably don't have severe disease. So the question is, what is going on with these particular variants? There are multiple possible explanations. Firstly, I think the default explanation is that many of these are, in fact, false assertions of pathogenicity. It's also possible that some of them are present in undiagnosed cases of disease within ExAC. We have also identified some examples of somatic mosaicism, where there is clearly a variant present in that person's blood, but at a low allele balance, suggesting that it has simply increased in frequency within their blood cells and is not present in other tissues of their body. And finally, and I think importantly, there's the possibility that the variant is indeed disease-causing and is indeed carried by that individual, but there is incomplete penetrance: for whatever reason, that person is not actually suffering from the disease.

To pick apart that last case, in work that was just published a couple of months ago in Science Translational Medicine, we chose to use prion disease as a model for studying these particular mutations. We chose to pursue this for a number of reasons. Prion diseases are severe, invariably fatal, adult-onset neurodegenerative diseases. They're rare: they have a lifetime frequency of about one in 10,000. We know that approximately 15% of cases are genetic; the rest appear to be caused by sporadic misfolding of prion protein within the brain. Those 15% of genetic cases, as far as we know, are all due to dominant gain-of-function mutations in the PRNP gene, which encodes the prion protein. One of the benefits of studying prion disease is that virtually every case that occurs in the industrialized world is reported to a surveillance center and the PRNP gene is sequenced. And the reason there's such close surveillance of these cases is, of course, the notorious 1% of prion disease cases that are infectious in origin. So we have PRNP sequence data for almost every individual reported to have prion disease in pretty much every industrialized country.
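The case-control comparison described next boils down to simple arithmetic. Under the approach taken in the Science Translational Medicine paper, as I read it, the lifetime risk conferred by a rare dominant variant can be approximated as the baseline lifetime risk of the disease scaled by the ratio of the variant's allele frequency in cases to its frequency in population controls. A back-of-envelope sketch, with invented counts:

```python
# Back-of-envelope sketch of the lifetime-risk arithmetic behind the
# prion case-control comparison (after Minikel et al. 2016, as I read
# it): lifetime risk ~= baseline risk * (AF in cases / AF in controls).
# All counts below are invented for illustration.

def lifetime_risk(ac_cases, an_cases, ac_controls, an_controls,
                  baseline_risk=1e-4):  # prion disease: ~1 in 10,000
    af_cases = ac_cases / an_cases           # allele frequency in cases
    af_controls = ac_controls / an_controls  # allele frequency in controls
    return baseline_risk * (af_cases / af_controls)

# A variant seen in 10 of 10,500 cases and 3 of 60,000 controls;
# allele numbers are 2x sample counts for a diploid autosomal site.
risk = lifetime_risk(10, 2 * 10500, 3, 2 * 60000)
print("estimated lifetime risk: %.2f%%" % (100 * risk))  # ~0.19%
```

At the extremes this behaves sensibly: a variant far rarer in controls than in cases pushes the estimate toward full penetrance, while one equally frequent in both gives back the baseline population risk.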
So if we take the 60 or so reported dominant gain-of-function mutations in PRNP and look them up in ExAC, what we find is an overall frequency of people carrying these mutations that is 30 times higher than what we would expect given the frequency of the disease in the general population. So clearly something is going on here. It cannot be the case that we have 30 times more people who will die of prion disease than we expect; something is making these variants much more common than they should be. And the answer comes when we compare the frequency of these variants in cases versus controls.

Here, what Eric Minikel in my lab managed to do, through various feats of political heroism, was to assemble 10,500 sequenced cases of prion disease from surveillance centers in the US, UK, Australia, and Japan, and to then compare the frequency of those variants in the cases, here on the X-axis, against the frequency of the same variants in the ExAC dataset, our 60,000 population controls, shown here on the Y-axis. And what we find is that our alleged prion disease variants cluster into three broad categories. Sitting along the X-axis, we have variants that are present in cases but basically absent in controls. These are all variants with very strong evidence for pathogenicity, and they all appear to be completely penetrant as far as we can tell. Along the Y-axis, we have variants that are seen rarely if at all in cases but are present at some reasonable frequency in controls. We believe almost all of these are actually false assertions of pathogenicity; there's no evidence here that they are pathogenic. They almost certainly just happened to turn up in an individual who had a sporadic case of prion disease, and therefore there was a false assertion of pathogenicity. And then there are three very interesting variants clustering in the middle here, which are too common in cases to be completely benign but also too common in controls to be fully penetrant. We can show statistically that all three of these variants have evidence for genuine but incompletely penetrant effects on disease risk, with wildly differing effects on lifetime risk, ranging from less than one in a thousand up to nearly 10% lifetime risk of suffering from these catastrophic diseases. So this is the first time we've been able to put quantitative estimates, using these very large sample sizes, on the lifetime risk of this type of disease. And obviously this has major implications for genetic counseling for the families affected by these particular diseases. So that's how we can use large data sets to quantitatively assess penetrance.

I wanted to finish by talking about how we can study not just the variants that are present in these data sets, but also the variants that are missing from them, in particular using a mutational model that Cricket mentioned this morning, developed by Kaitlin Samocha in Mark Daly's group. We can predict, for every gene in the genome, how many variants we should expect to see under a random mutational model, then look up those same genes in ExAC and see how many variants we actually observe. The difference between those two numbers tells us how strongly that particular gene is depleted for that particular class of variation.
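In code, the core of that observed-versus-expected comparison is tiny. A minimal sketch with invented numbers follows; in the real analysis the expected counts come from the Samocha sequence-context mutation model, and the signed Z score below is just one simple way of expressing depletion (the published loss-of-function metric is a more involved probabilistic model):

```python
# Minimal sketch: per-gene constraint as observed vs. expected variant
# counts. Expected counts are hypothetical stand-ins for the mutational
# model's predictions; positive Z means fewer variants than expected,
# i.e. a constrained gene.
import math

def constraint_z(observed, expected):
    return (expected - observed) / math.sqrt(expected)

genes = {
    # gene: (observed LoF count, expected LoF count) -- invented numbers
    "GENE_A": (1, 30.0),   # near-total depletion: candidate haploinsufficient
    "GENE_B": (27, 29.0),  # roughly as expected: tolerant of LoF
}

for name, (obs, exp) in genes.items():
    z = constraint_z(obs, exp)
    print("%s: obs/exp = %d/%.0f = %.2f, Z = %+.1f"
          % (name, obs, exp, obs / exp, z))
```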
So here, for instance, you can see for synonymous variants that Kaitlin's model, on the X-axis, provides an extremely good fit to the ExAC data on the Y-axis; that's for silent mutations. If we now look at loss-of-function mutations, where every dot here is a gene, you can see that most genes in the genome have substantially fewer loss-of-function mutations than we would expect by chance, and that's absolutely no surprise. All it tells us is that most loss-of-function variants are deleterious and are acted against by natural selection. What's cool about these data is that with 60,000 samples, for the first time we can actually see how far below the line each gene falls, and so we can infer how strong the natural selection is that acts against loss-of-function variants in each gene.

Let me give you two very quick examples. The first is a gene where we know exactly what happens if you have a heterozygous missense or loss-of-function mutation: it causes catastrophic neurodevelopmental disease. This gene, DYNC1H1, is a well-known de novo cause of a whole variety of different neurodevelopmental diseases. In this gene, Kaitlin's model predicts almost perfectly the number of synonymous variants in the population; there is a two-thirds deficit of missense mutations; and there is an almost complete absence of protein-truncating variants, consistent with the fact that most of these variants actually do result in very severe phenotypes and are therefore removed from the population and not observed in ExAC.

So that's a gene where we know what happens when you inactivate it. Here is a gene where we actually have no idea what disease is associated with it. We know that this gene, UBR5, is somehow involved in the ubiquitin pathway, and we know that knocking it out in mice is embryonic lethal. But to date, as far as I know, there's no known human loss-of-function phenotype, and yet this gene has a profile of missense depletion and, most strikingly, of loss-of-function depletion that is completely consistent with a very nasty haploinsufficient disease gene. You can see here a nearly complete absence of loss-of-function mutations in our population. That suggests that although we've never observed an individual with a disease caused by these mutations, even heterozygous loss of function in this gene probably has some quite severe effect on human phenotypes. You may be wondering what's happening with this one individual here; we think in that particular case the variant is not actually loss of function, as it's close to the end of the gene.

Overall, using this approach, we can identify over 2,500 genes that statistically have a high probability of loss-of-function intolerance, meaning they have far fewer loss-of-function variants in ExAC than we would expect by chance. This category contains almost all of the known haploinsufficient disease genes, but for more than three quarters of them, we have absolutely no idea what they do when they're inactivated. So this set is now a very high-priority set to assess for causal mutations in severe diseases, and it has already proved successful in autism and schizophrenia. I'm going to fly through this last data slide, which shows some early results from Kaitlin demonstrating that we can also apply the same approach to missense constraint in particular regions of genes.
And here, in the CDKL5 gene, we show that the missense-constrained region of the gene is in fact exactly where almost all of the known dominant missense mutations that cause this disease are found. So even if we had no idea where disease mutations were located, we could have predicted that this N-terminal region was the most constrained region of the gene.

So what's next for ExAC? Firstly, we're building up our sample sizes; the next release will have 120,000 exomes. We're also very interested in expanding this to include non-coding variants, so we have completed a test run on five and a half thousand whole genomes, and we're hoping to release a data set spanning over 20,000 genomes later this year. And finally, we're starting to test genotype-based recall: taking individuals in ExAC who have very extreme or unusual genotypes, like homozygous loss-of-function or dominant disease-causing mutations, and recontacting the very small percentage of them where we actually have the ability to do so, to learn whether or not they actually have that disease and, if so, what other information we can gain from that.

I wanted to finish with a slide that I think summarizes some of the key challenges that we've learned about in building ExAC, and that may provoke some discussion later on this afternoon. The first is that, for almost all of the analyses I described today, bigger samples will make an enormous difference to our power. So having more samples, making sure that the data are harmonized and centralized, and ensuring the variants can be linked ethically to phenotypes is absolutely critical moving forward. That will require regulatory support, both for data aggregation and for reuse, which often falls into a very difficult ethical gray area at the moment; we need much more clarity in that space. That would be extremely useful, for instance, for the ability to reuse the many tens of thousands or hundreds of thousands of samples sequenced as part of the common disease centers as reference data sets for rare disease. We need an increased focus on sequencing samples that are consented upfront for recontact, deeper phenotyping, and data sharing; that's no surprise to anyone in this room, and of course it's already a focus for the PMI and other projects. And if we want to be able to assess penetrance uniformly, we also need to be collecting large, well-ascertained collections of cases for a whole variety of diseases, not just common diseases but also rare ones, to really understand the full spectrum of variation in those genes.

So with that, I'll finish by thanking everyone involved in putting the ExAC data set together: the analysts in my lab, as well as Kaitlin and Mark's lab, all of the principal investigators, and huge thanks to the Broad Genomics and Data Sciences Platforms, which provided all of the resources required to make this project possible. Thanks very much.

Great, thank you. We have time for a couple of questions. Eric? Daniel?

Two questions. I'm just trying to get my head around a couple of numbers, if you know them. One number is, you were talking at the very beginning about something like over a million genomes sequenced, the great majority of those being exomes. Do you know what that ratio is, or do you have an estimate of what it might be? Exomes versus whole genomes?

So there are a number of difficult things about this calculation.
I mean, that number includes whole genomes, both low coverage and high coverage; there are obviously many more low-coverage genomes out there than high-coverage ones. I think Richard Durbin told me there are probably somewhere on the order of 50,000 to 100,000 whole genomes that have been sequenced to date, and the rest of the numbers we think are mostly in the exome space.

Okay, and that's consistent with the number I've heard from others. So maybe just barely at the threshold of 100,000 for whole genomes. That's what you said, right?

That's right, 100,000, yeah.

Second question: based on all of the collected variants that you've now put across the exome, what fraction of exome bases have you detected a variant at?

It's about one in six exome bases at which we have detected a variant. And there are some classes of base that we've almost completely saturated. For instance, if we look at CpG sites, we observe 80% of all possible CpG mutations at least once in ExAC. So it depends on how high the mutation rate is, but we're hitting a big chunk of it. But overall, yes, about one in six. Thanks.

I didn't see any other hands, so, oh, I'm sorry, Stephen.

Just a quick, very technical question. One thing that we've learned fairly recently is that most variant calling algorithms over-correct, removing novel ultra-rare variants. Could you speak to that in terms of your variant calling algorithms?

Yeah, we spent a lot of time tweaking the GATK sensitivity parameters to overcome that, because you're absolutely right: with large-scale joint calling, there is a bias against rare variants. We took a number of metrics, and I'm happy to talk to you about it afterwards, that correlate with singleton sensitivity, and basically tweaked those until we hit the point where we were capturing the vast majority of singletons without, surprisingly, substantially increasing our false positive rate for common variants. So it actually does seem to work quite well; you just have to spend some time tweaking the parameters.

And just one follow-on question, if I'm allowed. You've got some very powerful assets there in terms of genes that are likely to be associated with a phenotype, and probably also genes that we can eliminate as candidates for novel phenotypes. Are those available?

Sure, yeah. All of the data at a per-variant level is fully, freely available, and anyone can download it. The constraint data is also available through the ExAC FTP site. The second thing you asked about, genes that we can rule out as being causal, is actually a little bit more subtle. There are many genes that contain at least one homozygous loss-of-function variant in the general population. The challenge is that homozygous loss-of-function variants in ExAC are enriched for false positives: relatively few sequencing errors, but lots of annotation false positives. So you have to be extremely careful about ruling out a gene on the basis of observing, for instance, loss-of-function variants in ExAC. But I'm happy to talk to you more about that offline.

I think we're going to move on, so save your question for the open discussion. Next up will be Douglas Fowler.