So I will. Did everyone do the reading? You did? You read what I gave you? Good. OK, because I circulated about eight or nine papers with an annotated paragraph on each. Well, I'm going to assume you've done the reading. There you go.

So thank you for the invitation to come and talk. Our goal was really to talk about the scientific opportunities. The goal is not to talk about centers or large-scale sequencing centers, but rather to talk about the scientific goals over the next five or ten years for which large-scale activities really are critical. That's the key thing: what should this institute be doing? But it happens to be deeply informed by what's been going on over the past several years, including some of the things that have come out of the large-scale sequencing program.

So if we were to put up the mantra that I wish the institute would adopt, it is that the central goal of human genetics, and I think by implication of NHGRI, is to discover the genomic basis of human disease and physiology and make that knowledge actionable to improve human health. I think that's basically the goal. The question is, how are we doing toward that goal? Are we doing well at understanding the genomic basis of human disease, for example? I'll focus on that, and I'll turn over to Rick and Richard, who'll talk about the other aspects as well.

With respect to rare Mendelian disease, we're doing pretty well, mostly on target. The majority of the Mendelian Inheritance in Man catalog is now associated with individual genes, and many of those projects can be done at smaller scale. With regard to common inherited disease, while there's been a tremendous amount of progress in identifying common variants, it's still the case that perhaps only about a third of the heritability has been explained so far. And in both common inherited disease and in cancer, I'm going to argue, based on data that have emerged over the past several years, much of it over the past year or two, that we now understand where we are and what needs to be done to finish the job.

So I'm going to turn first to common inherited disease, then to cancer. For common inherited disease, the class of disease that accounts for the vast majority of morbidity and mortality, two basic methods have been used: common variant association studies and rare variant association studies. Needless to say, one wants to look at all possible alleles, but the technology made it possible to look at common alleles before rare ones. The common variant studies are pretty straightforward: you test an individual allele and see if it's at a higher frequency in cases versus controls. They've now identified more than 3,000 loci associated with more than 150 different diseases, pointing to many novel pathways. But they still explain only about a third of the heritability in the diseases that have been studied deeply, and the rest has got to be explained before these results are really clinically useful. And since most of these effects, while biologically very important, are small for the individual patient, 10 or 20% changes in risk, we're particularly interested in finding the stronger alleles.
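Before turning to rare variants, here is a minimal sketch of the common variant test just described, comparing allele counts in cases versus controls in a 2x2 table. This is an illustration, not anyone's production pipeline: the counts are made up, and real analyses also correct for population stratification, which this sketch ignores.

```python
# A minimal sketch of the per-allele test described above: compare allele
# counts in cases versus controls with a chi-square test on a 2x2 table.
from scipy.stats import chi2_contingency

def common_variant_test(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test for an allele-frequency difference between groups."""
    table = [[case_alt, case_ref],
             [control_alt, control_ref]]
    chi2, p, dof, expected = chi2_contingency(table)
    return p

# Hypothetical SNP in 3,000 cases and 3,000 controls (6,000 alleles each).
p = common_variant_test(720, 5280, 600, 5400)
print(f"p = {p:.2e}")  # compare against the genome-wide threshold of 5e-8
```

Note how even a visible frequency difference at this sample size falls far short of the 5e-8 genome-wide threshold, which is exactly why the early small studies came up empty.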
That's where rare variant association studies come in, and one of the papers I sent out to you is about the design of rare variant association studies. Here it's a bit more complicated to do. And as I'll explain, small studies over the past year have begun to show us that there is a goldmine there, but that we are pathetically underpowered with the studies undertaken so far. If we're serious about understanding the genomic basis of common disease, scale is going to matter.

So just to emphasize that scale matters, I'll cite an example from common variant association studies: the analysis of the genetic basis of schizophrenia. This is a project that some of my colleagues, Pamela Sklar, Steve Hyman, and Steve McCarroll, have been very much involved in. The important thing to say is that the initial study of about 3,000 cases and 3,000 controls, a genome-wide association study, or what I would call a common variant association study, came up with nothing whatsoever: no significant loci. On the strength of that, many people would say, well, we're not really learning anything, let's stop. But if you look at the data closely, although nothing is significant, the near-misses are biased toward significance, so you know there's more there. A study expanded to about 10,000 cases and 10,000 controls identified five genome-wide significant loci. The study was then expanded to roughly 25,000 cases and 25,000 controls, and 62 significant loci were identified. At 40,000 to 50,000, 93 loci were identified, and these aren't just any old loci. For example, there's the L-type calcium channel: it has four subunits, and all four genes come up. In fact, more broadly there's an enrichment, as I'll tell you in a moment, for a variety of calcium channels. Certain sets associate with postsynaptic densities, and with RNAs bound by the fragile X mental retardation protein. This is the sort of thing you only begin to see when you saturate a pathway. Had one given up at a few thousand samples, or said, oh, five loci is enough, let's understand those five before we go on, one would not have seen any of this. So scale has mattered tremendously for common variant association studies. It has mattered for inflammatory bowel disease, now with about 150 identified loci in 11 pathways, et cetera.

So the question is, what about rare variant association studies? How are we doing on rare variants? Well, as I said, the common variants are pretty darn easy: you pick a SNP and test it in cases and controls, and you can do it with a chip if you decide all your SNPs in advance. Rare variants are different. You pick a gene, you collect all the mutations in the gene, and you decide which ones you want to aggregate. Just the coding region, or other regions? Non-synonymous mutations, missense, disruptive? You've got to make choices. Should you filter by type, by frequency, by predicted function? And once you've decided what you're aggregating, you test it in cases and controls. There are a lot of important choices.
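Here is a minimal sketch of that aggregate-and-test procedure, a simple burden test. The variant classes kept and the frequency cutoff are illustrative choices, exactly the kind of choices just described, not a recommendation.

```python
# A minimal sketch of a rare-variant burden test: pick a gene, keep the
# qualifying variants, count carriers, and Fisher-test cases vs. controls.
from scipy.stats import fisher_exact

def burden_test(variants, n_cases, n_controls,
                keep=("stop_gained", "splice_site", "frameshift"),
                max_freq=0.001):
    """`variants`: list of dicts for one gene, with keys 'class', 'freq',
    'case_carriers', 'control_carriers'."""
    case_c = sum(v["case_carriers"] for v in variants
                 if v["class"] in keep and v["freq"] < max_freq)
    ctrl_c = sum(v["control_carriers"] for v in variants
                 if v["class"] in keep and v["freq"] < max_freq)
    table = [[case_c, n_cases - case_c],
             [ctrl_c, n_controls - ctrl_c]]
    odds_ratio, p = fisher_exact(table, alternative="greater")
    return odds_ratio, p
```

Every argument with a default here is one of the choices mentioned above; changing any of them changes what the test can see, which is why study design matters so much for rare variants.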
Well, what are we learning? Just in the past six months, important things have emerged from the studies that have gone on: in schizophrenia, a paper just out in Nature; in early-onset myocardial infarction and coronary heart disease, papers I believe in press; in type 2 diabetes, a paper just out in Nature from studies that have been supported here. The recent schizophrenia paper in Nature was a sequencing study, not a common variant study. And by sequencing, no single gene reached significance, because 2,500 cases and 2,500 controls was still too small. But with the help of the common variant association studies, you know what gene sets to look at. Looking here on the right, darned if you don't find, when you aggregate all the voltage-gated calcium channels, a 13-to-1 ratio of disruptive mutations, stop codons and splice-site mutations, in cases versus controls. No one gene can ring the bell on its own, but this is clearly an important set. And if the study were 10 times bigger, the genes would be individually significant. Similarly with regard to the postsynaptic density genes. So we know there is a lot under the hood that we're not getting, because of scale.

Early-onset myocardial infarction: the LDL receptor. It's no surprise that it plays a role in heart attack, but it comes out of a rare variant association study, and you pick it up at an odds ratio of 17 to 1. ApoA5 you pick up with disruptive alleles at an odds ratio of 4.5 to 1. Beautiful work from Sek Kathiresan on triglycerides and coronary heart disease: ApoC3. This one is really important, because disruptive mutations, stop codons, in ApoC3 are protective against coronary heart disease, meaning this is a potentially interesting drug target. Inhibiting ApoC3 offers protection, at least if you carry the loss from birth in the heterozygous state, and it also decreases your triglycerides. A beautiful paper out in the past couple of weeks on type 2 diabetes: a particular zinc transporter that came up from common variant studies. Rare variant association studies have now told us that stop codons in that gene are threefold protective against diabetes. Inhibiting a zinc transporter, which by the way is a druggable target, offers protection against type 2 diabetes. And I haven't mentioned other known examples: LDL cholesterol and PCSK9, where the null allele offers significant protection against high LDL and coronary heart disease; and in Alzheimer's, a beautiful paper from deCODE last year showing that a particular missense allele in APP that prevents cleavage of the protein offers protection against cognitive decline and Alzheimer's disease.

So these are all really important things that are emerging, but we are pathetically underpowered. This is a power curve, or rather a lack-of-power curve. I've indicated with a little arrow that if you were looking for a 10-fold effect, you wouldn't have any power at all for alleles under a selection coefficient of about 10^-2. Trust me, that's about the right range to be thinking about right now. The only things you really have power to detect with 2,000 cases and 2,000 controls are effects in the neighborhood of 20- or 30-fold. It's a wonderment that we're finding anything; what we're finding is consistent with having 1 or 2% power, which is roughly what it seems we have right now. You really need 25,000 cases and 25,000 controls to have the power to do serious genetics here. Now, can you get all those cases? Well, yes: they're sitting in our freezers, left over from the common variant association studies. I just surveyed our own freezers at the Broad, and other people's freezers hold more. There are more than 400,000 samples already consented for sequencing in these diseases, where we could actually complete the job and get at the genomic basis of the disease. So it would seem to me we want to do it.
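To put rough numbers behind that 25,000-versus-2,000 claim, here is a back-of-the-envelope power calculation using a normal approximation for comparing carrier proportions. The carrier frequency, odds ratio, and exome-wide significance level are illustrative assumptions, not the parameters of any particular study.

```python
# A back-of-the-envelope version of the power argument above.
from math import sqrt
from scipy.stats import norm

def burden_power(n_cases, n_controls, carrier_freq, odds_ratio,
                 alpha=2.5e-6):  # ~0.05 / 20,000 genes, one-sided
    p0 = carrier_freq
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))  # implied case freq
    se = sqrt(p1 * (1 - p1) / n_cases + p0 * (1 - p0) / n_controls)
    z = (p1 - p0) / se
    return norm.cdf(z - norm.ppf(1 - alpha))

for n in (2_000, 25_000):
    print(n, round(burden_power(n, n, carrier_freq=0.001, odds_ratio=3), 3))
# ~0.001 power at 2,000 vs. 2,000; ~0.67 at 25,000 vs. 25,000.
```

Under these assumptions a threefold-risk gene is essentially invisible at 2,000 versus 2,000 but readily detectable at 25,000 versus 25,000, which is the shape of the argument being made here.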
But of course it isn't just case-control studies. Case-control studies are incredibly important, because they're the only place you'll make discoveries. Let me just be blunt: you need a case-control study because you need a lot of cases, and if you go off and do a population survey, you don't get enough cases to do discovery. But you do need a population survey if you want to attach the right risk to a variant, because that is really important. As is becoming clear in many instances, when you look in the general population you find carriers of genes we know are strong MODY genes, and a lot of those people don't have MODY, for example. So if you want to properly estimate risks, you need cohort studies. To translate this to the practice of medicine, we need both case-control studies for discovery and cohort studies to attach proper risks to mutant alleles. And if you want to study the physiology of these alleles, you had better have either clinical studies or recall cohorts, where you can get those patients back and look at them. So it's not a question of which one of these is right: NHGRI's portfolio is going to need a collection of each of these things. But without power calculations and worked examples, you're shooting in the dark. What's really emerged over the past couple of years is that we now have successful examples and rigorous mathematical theory, and you can put them together and design studies to address this question of how we find the genomic basis of inherited disease.

For cancer, the story is very similar, and I'll rip through it quickly. When The Cancer Genome Atlas project was begun, there was a lot of thought that, well, everybody knows the genes for cancer, why are we wasting our time? The answer was, as usually happens, that unbiased genomic approaches show us we don't know all the answers by any means. Many new genes, and many whole new classes of genes, have emerged: genes involved in metabolism, chromatin regulation, lineage specification, RNA splicing, all very different from the traditional receptor tyrosine kinase signaling pathways. That's what has emerged from the TCGA-like work so far.

So are we done yet? Can we declare victory? The answer is no, and we know that the answer is no. Here I cite a paper, which I happen to be a co-author on, with Gad Getz as the senior author, toward a complete catalog of cancer genes. If you do an analysis of 5,000 tumors across 21 different cancer types, you can ask: how much do you discover in this very large dataset? To make a long story short, you can look for an excess of non-synonymous to synonymous changes, you can look for clustering of mutations, and you can look for mutations at conserved sites. When you look visually, the genes you know are important leap out. NRAS is significant from the back of the room. PIK3CA is significant from the back of the room. RB has a very different pattern but is highly significant, with these red stop codons. APC has lots of stop codons in the first 60% of the protein, and just in colon cancer; the patterns tell you a lot. So you can analyze each of the 21 tumor types and identify the significant genes, and I'll leave out all the rigorous mathematical work you have to do to make sure you don't fool yourself, et cetera. But you're going to address key questions. First, can this genomic analysis reproduce the discoveries that have been made before across the 21 types? Second, can it find new cancer genes, ones that are biologically meaningful? And third, how far are we from being done?
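Before the answers, here is a minimal sketch of the first of those signals, the excess of non-synonymous over synonymous changes in a gene. The neutral expectation of roughly 0.75 non-synonymous is an illustrative assumption; real methods such as MutSigCV model context- and gene-specific background mutation rates rather than using a single fixed fraction.

```python
# A minimal sketch of testing one gene for a driver-like excess of
# non-synonymous over synonymous mutations, against a neutral expectation.
from scipy.stats import binomtest

def nonsyn_excess(n_nonsyn, n_syn, neutral_nonsyn_frac=0.75):
    """One-sided binomial test for excess non-synonymous mutations."""
    res = binomtest(n_nonsyn, n_nonsyn + n_syn,
                    neutral_nonsyn_frac, alternative="greater")
    return res.pvalue

# Hypothetical gene: 48 non-synonymous vs. 4 synonymous mutations observed.
print(f"p = {nonsyn_excess(48, 4):.1e}")
```

The clustering and conserved-site tests mentioned above follow the same logic: compare the observed pattern against what the background mutation model predicts, and flag genes where the excess is mathematically significant.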
The short answer to the first question is yes. I can give you a longer answer, it's in the paper, but basically the answer is yes: essentially everything previously known can be found that way. I won't say more, except that it should be yes with an asterisk, and you should read the paper for the asterisk. Can we find new cancer genes? Yes. At least 33 new genes emerge from that analysis; by the way, that's a 25% increase over the number of genes previously known. And these are just the ones where the mutation patterns are as you would expect for genes of those functions, and the functions make sense: proliferation, apoptosis, genome stability, chromatin regulation, immune evasion, RNA processing, protein homeostasis, all in genes that had not previously been implicated in any cancer. So this suggests we're not saturating yet; there's a lot still there. And these are only the ones where we can tell biologically that the function fits the hallmarks of cancer and the mutations go in the right direction, losses or gains consistent with what you'd expect for the function.

The examples are compelling. A small GTPase with five exactly identical amino acid substitutions at precisely the same base in the effector domain. Another small GTPase with six amino acid substitutions, five of them identical, all occurring in the effector domain; that's no accident. RAD21, with stop codons enriched in AML, and two of its physical partners also known to be sites of mutation in AML. A particular protein that blocks translation by binding to poly-C regions, where you find a pile-up of mutations at two leucines, leucine 100 and leucine 102, in the regions that mediate dimerization of its KH domains. These are all mathematically significant, nothing made the list if it wasn't mathematically significant, and biologically significant. So can we find new genes? Yes.

But how far are we from being done? The short answer is: far. You can tell that just by making a saturation plot. This was a study of 5,000 tumors. Rerun the analysis with 4,000, 3,000, 2,000, 1,000, and look at the curve. The curve is still going up. It's going up within each tumor type, across tumor types, and with patient numbers. If you sort by frequency, you find the curve is flattening out only for alleles present at a frequency greater than 20% in some cancer; at those highest frequencies, there aren't many more to discover. For every other class, including the 10-to-20% range, which by the way is where EGFR in lung cancer, ALK in lung cancer, and many other important druggable targets sit, we're not close to saturated yet. You can calculate how many samples you need: it's between about 700 and 7,000 per tumor type, depending on the background mutation rate. You can vary the rate and look at the saturation; we're far from it. If we wanted to saturate across 50 cancer types at the median requirement, it would be about 50 times 2,000, or about 100,000 samples. And that can be done.
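That saturation logic is simple enough to sketch: repeatedly downsample the cohort, rerun discovery at each size, and watch whether the curve is still climbing. In the sketch below, `find_significant_genes` is a placeholder for whatever discovery pipeline is used, such as the excess and clustering tests described above; it is not a real API.

```python
# A minimal sketch of the downsampling behind a saturation plot.
import random

def saturation_curve(tumors, find_significant_genes,
                     sizes=(1000, 2000, 3000, 4000, 5000),
                     n_reps=10, seed=0):
    """Mean number of significant genes discovered at each cohort size."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        hits = [len(find_significant_genes(rng.sample(tumors, n)))
                for _ in range(n_reps)]
        curve[n] = sum(hits) / n_reps
    return curve

# If curve[5000] is still well above curve[4000], discovery has not
# saturated, and more samples will keep yielding new cancer genes.
```

The same curve, stratified by mutation frequency, is what shows that only the greater-than-20% alleles have flattened out while everything rarer is still rising.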
So anyway, that was a lightning tour, but you have the reading and the homework, and we can discuss it later. The first part of the message is that a goal for this next five-year period has got to be to discover the genomic basis of human disease, meaning common disease and cancer. The tools and technologies are there, and the samples are there, but it's going to take serious studies. For each disease, 25,000 cases and 25,000 controls, plus replication samples, is where we should be thinking. I'm going to stop there and turn over to my colleague Rick.