I'm delighted to introduce the next speaker, Eric Lander of the Broad Institute, who is going to tell us about the human genome at 10, an overview. Or, it looks like now, a decade later, the title might have changed. No? I'll have a drink then, if it's okay.

It's really exciting to be here to celebrate. It has been quite a ride. I want to make my own personal thank you to three people. Jim Watson, who at a lunch in 1989 said, you need to get involved in the genome project. He was very directive, and he went around to many young people at that time and said, you know, sort of like Uncle Sam: I want you, you have to be involved. And for many of us, that was a really important moment, a call to service to science, and it certainly changed my life. Thank you, Jim. To Francis, who signed up a few years later to lead the project through the bulk of the sequencing of the human genome, and through many times, high and low, stressful and exciting — tremendous, just tremendous leadership; we're grateful to you in so many ways. And to Eric Green, who has agreed to lead us into the future. It's great to have all three of you here and to be here celebrating with NHGRI.

So somehow I agreed to write a review on everything that had happened in the last decade. This must have been a moment of weakness, I think. But once you sign up for these things, you say, wow, it gives you an opportunity to go back and read the literature, and that's what I've been doing. So what I'd like to do is share a bit of an overview of what happened in the last decade, and what the view was like from the year 2000. We've got many young students here in the audience, some of whom can't appreciate what it was like in the year 2000, or a bit before, and what we were looking at. The value of doing that is not to pat ourselves on the back about what happened in the last decade. It's to indicate what the derivative of the function is — what the rate of change can be. It's to give us some real inspiration about how we can do even more remarkably in the decade ahead. That's really the point of looking back. So I'm going to do that. It's, of course, going to be a light overview, to be followed by many talks that will go into much more detail on many of the topics I'll touch on lightly. So if you would, it's queuing up the speakers to come.

So the Human Genome Project declared victory on multiple occasions. It was completed in June of 2000, in February of 2001, April of 2003, and October of 2004. The genome was finished on all of those occasions. I tend to think, if you're having a good time celebrating, you might as well keep going. I'm old fashioned and sort of like the idea of a published scientific paper as an event, and so I think what we're doing today is the real celebration of the human genome, in that it was the publication of about 90% of the sequence of the human genome — and quite a remarkable process leading to the data and to the publication, involving hundreds of authors and many, many, many months of trying to sift through these data and analyze them.

What's happened since then? This one map of the human genome — this physical, structural, sequence kind of map — has provided a scaffold to do so many things. Once you've got a sequence of the genome, you can take fragmentary information of all sorts and start layering it on top of that map, and make more maps and more maps and more maps.
And so the original genetic maps that could be used to trace inheritance, and the physical maps of overlapping pieces of DNA, and the sequence maps, have piled on top of them gene maps and evolutionary conservation maps and chromatin state maps and inherited variation maps and disease association maps and evolutionary selection maps and cancer gene maps. They have also — not just by giving you a map on which you can lay things, but by giving you completeness, catalogs — allowed you to recognize things based on reduced signatures. If I know that I have the complete list of nucleotides and I see a little stretch of nucleotides, I can look it up and say there's only one thing this could have been: it's from this gene. So I can build arrays for genes that let me detect all the 20-odd-thousand protein coding genes in the genome. I can fly a peptide on the mass spec, get its sequence, and look it up and say there's only one place that could have come from. And so maps and complete catalogs have really been a major change — something that couldn't happen without a human genome project as a scaffold.

Now, what has it told us in terms of science? What have we learned? What's changed? What have we learned since about genome sequencing, about the functional elements in the genome, about the evolution of genomes, about the basis of inherited disease and of cancer, and about human history? I'm going to touch lightly on each of these topics and try to put you back in the year 2000, bring you forward to today, and then think a little bit about where that's going.

So let's start with genome sequencing. Well, the obligatory statements have been made and slides have been shown. I'll do them again, because it is so remarkable, this new world of DNA sequencing — but first, the old world. The view from the year 2000, just before the genome paper was published: there had been four eukaryotic genomes published — yeast, fly, worm and Arabidopsis — and 38 prokaryotes, and the whole thing together was less than 500 megabases of DNA. Today, more than 250 eukaryotes have been sequenced, totaling 120 gigabases, and 4,000 bacteria and viruses, about 5 gigabases. These numbers are probably out of date already. Metagenomic samples have been done from multiple different places in the natural world and on our own bodies. More than 500 human genomes have been resequenced — probably a lot more than that; it's very hard to get a meaningful number, but I'm pretty sure it's over 1,000 by now. And there are certainly discussions about sequencing 10,000 vertebrate genomes.

All of this is possible because the fundamental technology for sequencing has changed in important ways. It's gone from capillary machines, in which you could only have 100 lanes or so, to two-dimensional optical imaging, where you watch DNA strands being synthesized in every separate spot, and you can watch a billion of them at a time. And so our own experience at the Broad Institute looked like this. We used to be extremely proud of this graph: one billion bases produced in 1999 as part of the Human Genome Project, getting all the way to 70 billion bases by the year 2006. And as the new technologies came in, I add two years to this graph, and it looked like that, going up to 1,700 billion bases. I'll add another year to that: up to 20,000 billion bases. And now we're past the first of the year, so I can add the 2010 numbers, and they go way off the slide, to 125,000 billion bases, just produced at our center. This is the experience for us.
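A quick sanity check on what those throughput figures imply — purely illustrative arithmetic on the numbers quoted above:

```python
# Illustrative arithmetic only: the per-year throughput figures quoted
# above (bases sequenced at one center), and the compound annual
# growth they imply.

throughput = {
    1999: 1e9,      # 1 billion bases
    2006: 7e10,     # 70 billion
    2008: 1.7e12,   # 1,700 billion
    2009: 2e13,     # 20,000 billion
    2010: 1.25e14,  # 125,000 billion
}

years = sorted(throughput)
for y0, y1 in zip(years, years[1:]):
    fold = throughput[y1] / throughput[y0]
    annual = fold ** (1 / (y1 - y0))   # compound annual growth factor
    print(f"{y0}->{y1}: {fold:>9,.0f}x overall, {annual:.1f}x per year")

overall = (throughput[2010] / throughput[1999]) ** (1 / 11)
print(f"1999->2010: 125,000x overall, about {overall:.1f}x per year")
```

That works out to roughly a tripling every year — far faster than a Moore's-law doubling every two years.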
And our own cost curves — we actually have slightly different numbers than Francis gave, and I go back a couple more years, 1999 to about 2010 — show a drop of about 100,000-fold over this period. There ain't nothing that's ever dropped 100,000-fold in cost that I can find. It is totally remarkable. And there's probably another factor of 10 to be had before I think we're going to hit some leveling off of this technology, although there are other disruptive technologies coming behind it.

So what does it all mean? Well, sequencing is becoming a general tool for anything. If you can do it by sequencing, you should do it by sequencing, because it'll be the cheapest way to do it. You can use it, of course, not just to sequence genomes, but as a trick for reading out different kinds of molecular readouts, looking at populations in different ways. And as it gets cheaper and cheaper — and, frankly, it's got to get better and better engineered for the clinic — it will be routinely useful for medicine: for germline studies, for cancer studies, as people figure out how to correlate immune repertoires, read out by VDJ recombinations and such, with different immunological responses, and for microbiomes. The costs still need to drop to the neighborhood of about $1,000, and then perhaps a few hundred dollars, but I think all those things are conceivable over the course of the next decade.

Now, what have we learned about what's in the genome — understanding the genome? Well, happily, in the year 2000 we already knew lots of things. We knew that the human genome had between 35,000 and 120,000 genes. In fact, we knew this very well, because there were two papers back to back in Nature Genetics in June of 2000: one reported that the human genome had 35,000 genes, and the subsequent paper reported that the human genome had 120,000 genes — which together indicated that we had no clue how many genes the human genome had. What we did know — I taught it in freshman biology — was that in the human genome most of the information was in protein coding sequences, with a little promoter and some enhancer sequences; that there were some non-coding RNAs, but not a lot of them; and that the transposons were this big burden on the genome, largely junk, about 50% of the genome. What we now know is: all that's wrong.

How do we know these things? Well, we know this first from things like studies of evolutionary conservation. We began with the human; within a year or so we had a sequence of the mouse. We then wanted the rat and the dog, and we got them. Work has continued: more than 29 mammals and more than 40 vertebrate genomes have now been sequenced, and when you line them up, you see some remarkable things. For starters, the human protein coding gene count is much smaller than we ever expected. In the 90s, we said it was 100,000 protein coding genes. In the human genome paper that we celebrate today, we said 30,000 to 40,000. Truth be told, we couldn't see 30,000 to 40,000, but we were very uncomfortable about saying less, and so we covered ourselves by saying 30,000 to 40,000 because, you know, we had to make some estimate for what we weren't seeing. In fact, even that was high.
As the sequence got better and as we compared it, the estimates fell, and today, by careful evolutionary comparison, it's only in the neighborhood of 21,000 protein coding genes that can be found in the genome. There is a lot of careful work that has to be done as a footnote to that statement — to be sure the evolutionary comparison is meaningful, that a whole bunch of new genes weren't invented last week or something — but that can all be done, and it's somewhere in this neighborhood.

But in the course of this came the surprise that there was an awful lot more evolutionarily conserved information in the human genome than we had been anticipating: 5 to 6 percent of the human genome was conserved lovingly by evolution over the course of 100 million years, and yet only 1.2 percent was protein coding. So the vast majority of what evolution cared about was not protein coding sequence. It was non-coding sequence. With just four mammals, one could pick out about half a million of those elements, and only the biggest and very best conserved ones. With 29 mammals now, and a paper that I think will come out soon, about 3 million of those non-coding elements can be picked out, covering about 4.7 percent of the genome, and you can begin to pick out elements down to about 10 or 12 base pairs or so.

What do we know about these things? Well, those very highly conserved non-coding elements — if you look at the most highly conserved ones — are not randomly distributed across the genome. They pile up in gene-poor regions, around 200 regions of early developmentally important genes. There's one example here, SATB1, with a tremendous amount of highly conserved non-coding sequence around it, a relatively small bit of protein coding sequence, and a tremendous amount of regulation to get this early developmentally important gene controlled just right. By comparing sequences across placental mammals and marsupial mammals, one can even begin to estimate the clocks of how quickly this stuff is being invented. And you can see that very little new protein coding sequence has been invented, say, in the time from the divergence from marsupials to the placental mammals. Not a lot of innovation of new protein coding sequence. Most of the innovation has been in non-coding, regulatory sequence. We differ from other mammals, we can infer, primarily not in having different proteins, but in having different regulation of those proteins.

How do you invent all this regulation? It's not easy to think about how you're going to invent a new regulatory circuit — say I have 37 genes that all have to be co-regulated in some way. How do I invent that? Well, an interesting way to do it is to invent it once and distribute it around the genome. That may be vastly more efficient than co-evolving it in 37 locations. And sure enough, when you look closely, genome comparison has shown us that at least 18% of the newly invented regulatory stuff lives in transposons. It's probably a lot higher than that, because the transposon sequence has degenerated in many cases. So, in fact, it's kind of obvious in retrospect that there's a tremendous amount of invention that goes on somewhere and then gets distributed around the genome and reused.

So those are some of the things that emerge about the genome from looking at sequence. Now, other ways of looking at functional elements in the genome come from epigenomics: the modifications that sit on top of the genome, often modifying the chromatin wrapping the DNA sequence.
Here, massively parallel sequencing technology has been very, very helpful. If you have a particular modification you want to study — oh, let's say trimethylation of lysine-4 on histone-3 — get yourself an antibody to it. Bind it to that chromatin, which has previously been cross-linked to the DNA, and use it to pull down all the DNA that has that modification on its chromatin. Throw it in the sequencer, see where it comes from. Make a map. Pick any chromatin modification, any transcription factor: if you can get an antibody to it, you can pull down and see where it's localized in the genome and build these maps (a toy sketch of the counting logic appears below). All sorts of functional regions in the genome can be identified this way. For example, an actively transcribed gene has this green mark — it's actually not green in reality; it's lysine-4 trimethylation — at its promoter, and a blue mark, lysine-36 trimethylation, across the transcribed region. If you see one of those, you say: that's an actively transcribed gene.

So this was of great interest to us, because while originally we only thought about 20,000 genes in the genome, a lot of evidence began to pile up that there was lots and lots more transcription going on, and in the ENCODE work supported by NHGRI, there was evidence of transcription everywhere in the genome. But this was very controversial, because some of this transcription was very low-level stuff — two orders of magnitude, three orders of magnitude below typical transcription. Is it real? Is it not? So there was lots of controversy about the ubiquitous transcription and what it means, et cetera. On the other hand, there really were only a limited number of bona fide, honest-to-goodness, proven functional non-coding RNAs in the genome. So these epigenomic marks were very helpful in zooming in on many new ones. If you look across the genome, you say: looks like a gene, looks like a gene, looks like a gene. You start looking them up, comparing them to protein coding genes, and you say: protein coding, protein coding, new. By doing this, something like 4,000 functional, large intergenic non-coding RNAs have been identified. These things show evolutionary conservation — not as strong as proteins, but way above background. Their expression patterns implicate them in lots of cellular processes. Many of them now appear to be involved in gene repression through interactions with chromatin-modifying proteins, and there are some suggestions that they may act as flexible scaffolds for assembling proteins into complexes. There's a tremendous amount still to be learned about these non-coding RNAs, but the idea that there are so many of them and that they play important roles in cellular processes and development — a student in my lab, Mitch Guttman, has been demonstrating this for ES cells — really is quite surprising.
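The ChIP-seq logic just described — pull down the DNA carrying a mark, sequence it, and see where reads pile up — is at heart a counting exercise. A minimal sketch with invented read positions and a naive fold-enrichment threshold (real peak callers such as MACS model the background far more carefully):

```python
from collections import Counter

# Toy ChIP-seq enrichment: bin the aligned read positions and flag bins
# where the pulled-down (ChIP) sample piles up relative to an input
# control. Invented data; a naive fold threshold stands in for real
# background modeling.

BIN = 200  # bin width in base pairs

def bin_counts(read_positions):
    """Number of reads landing in each genomic bin."""
    return Counter(pos // BIN for pos in read_positions)

def enriched_bins(chip_reads, input_reads, min_fold=3.0):
    """Bins where ChIP coverage exceeds the input control by min_fold."""
    chip, ctrl = bin_counts(chip_reads), bin_counts(input_reads)
    peaks = []
    for b, n in chip.items():
        background = ctrl.get(b, 0) + 1      # pseudocount, avoids /0
        if n / background >= min_fold:
            peaks.append((b * BIN, (b + 1) * BIN, n))
    return sorted(peaks)

# Reads clustered near position 10,000 in the ChIP sample, as if an
# H3K4me3-marked promoter lived there; sparse input elsewhere.
chip = [10_050, 10_060, 10_100, 10_120, 10_180, 10_190, 3_000, 45_000]
ctrl = [3_000, 10_110, 27_000, 45_000]

for start, end, n in enriched_bins(chip, ctrl):
    print(f"candidate peak {start}-{end}: {n} ChIP reads")
```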
Then there are things like the three-dimensional structure of the genome. You really couldn't figure out what part of the genome was near what part of the genome in the 1990s. Indeed, in the year 2000, what we mostly knew about the three-dimensional structure of the genome was very general stuff — how the DNA was wrapped around histones, and then how those bits were wrapped, and chromatin was wrapped in bigger fibers and bigger fibers — but it was generic. Knowledge of what locus A might be near what locus B in the genome, we didn't have. You could do it by fluorescence in situ hybridization, but that was not a very satisfying process: you really didn't get good resolution, and you couldn't do very many of them.

Job Dekker, in 2002, began to say: if I know where two things are in the genome, I could cross-link the genome and try to PCR between them. He developed the method called 3C, chromosome conformation capture, and began to work out the three-dimensional structure of the genome. Then, more recently, a student of mine, Erez Lieberman, working with Job Dekker, developed a way to generalize this to the whole genome. Just cross-link the whole genome, cut it with a restriction enzyme, and ligate — and if you do it in the right way, where you can mark those junctions, you can pull them all down, massively sequence them, and see what bits have been glued to what bits. You can build a lookup table of who's near who (a toy version is sketched below), and you can see, to a first-order approximation, that the genome falls into two compartments — an open chromatin compartment and a closed chromatin compartment — and the sequences in each compartment are near each other, but further away from the other. And you can build models and show that the classical model of how we thought the genome folded, something called an equilibrium globule, isn't right; in fact it folds into a fractal globule, and you can pick that up from various properties of the data.
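The "lookup table of who's near who" is a contact matrix: every sequenced ligation junction adds a count to the cell for its pair of genomic bins. A toy sketch with invented junctions (real Hi-C analysis adds normalization; the fractal-globule inference comes from how contact probability decays with genomic distance):

```python
import numpy as np

# Toy Hi-C-style contact matrix: each ligation junction is a pair of
# genomic positions that were cross-linked together in the nucleus;
# binning the pairs yields a symmetric who's-near-who lookup table.
# Junction coordinates are invented.

BIN = 1_000_000                    # 1 Mb bins
GENOME_LEN = 10_000_000            # toy 10 Mb chromosome
n_bins = GENOME_LEN // BIN

junctions = [
    (1_200_000, 1_900_000),        # short-range contact
    (1_300_000, 8_400_000),        # long-range contact
    (8_100_000, 8_600_000),
    (2_500_000, 2_700_000),
]

contacts = np.zeros((n_bins, n_bins), dtype=int)
for a, b in junctions:
    i, j = a // BIN, b // BIN
    contacts[i, j] += 1
    contacts[j, i] += 1            # contacts are symmetric

print(contacts)
# Real analyses then ask how contact frequency falls off with genomic
# distance s: a ~1/s decay, rather than the ~1/s**1.5 expected of an
# equilibrium globule, is one signature of the fractal globule.
```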
What do we have to do? We're not done. It's going to take another decade to collect all the information we need to collect about the structure of the genome. We need — and now have the tools — to collect all the transcripts in all the circumstances, all the long-range genomic interactions, all the epigenomic modifications. I think you'll hear from Brad Bernstein about how, by correlating epigenomic modifications across many cell states, one can begin to figure out what an element is doing and who it's talking to — and really to document all the interactions between proteins, DNA and RNA. Beyond that, intellectually, we have to figure out not just where these things are, but how they act as processes: how a chunk of DNA, an enhancer to which eight things bind, acts as a processor that integrates information. And for that, no amount of looking is going to suffice. It can't just be done observationally. We're going to have to get into, in a big way, writing our own DNA — synthetic enhancers — and getting good at it. Only when we actually know how to write these things and get the desired behavior will we really know that we understand them.

So that's a picture of what's in the genome. Now, let me turn next to disease. Understanding the basis of disease: that was actually a major, major goal of the Human Genome Project. I would say in the 1980s, the primary reason we all wanted the sequence of the human genome was because the thought of having students schlepping along the chromosome, spending five years cloning cystic fibrosis or ten years cloning Huntington's disease, was unbelievably boring. It was. You guys can't imagine it. You know, people talked about, oh, the Human Genome Project is going to be so boring, so mindless. They weren't thinking clearly about what was going on in human genetics labs and what was boring and mindless there. And smart students wouldn't stay, in the long run — they'd sign up with Francis to do cystic fibrosis once, but they weren't going to be there for the 100th gene like that. We had to make this thing easy. We had to make it so that a smart young student wasn't limited by the technical tools in getting a gene out, but could get it out and begin to apply their creativity to the biology.

That, if anything, was the motive force behind the Human Genome Project in the late 1980s. Well, how did it do? For Mendelian traits, not bad. When the Human Genome Project was first launched, about two decades ago, my count is that about 70 Mendelian disease genes were known at a molecular level, give or take. By the time of the Human Genome paper, that number had grown to about 1,300 Mendelian disease genes positionally cloned. By today, my count is about 2,850 positionally cloned. It's a different table on OMIM, Francis — Francis quoted you a higher number; it's a different table from that same website. He's screwing up his face, but that's why I chose it here, because this is actually the one I believe. This is the right thing. We'll talk later. The road ahead: my count is another 1,800 Mendelian disorders to go, which is similar to what Francis said. Then, of course, there are all these new Mendelian disorders that we don't know about yet but will be finding out about, and Francis gave you a great example of one. There is some wild optimism that you just take one patient, sequence the patient, and you're done. I think not. It doesn't tend to work like that, except maybe in a rare recessive where you can get homozygosity or something. But with a smallish number of patients and all that, one stands a pretty good chance of being able to accomplish the identification of disease genes. I think the majority of these 1,800 unknowns will certainly fall.

But that's only a small part of medicine. There are, of course, all these common diseases — the inherited components to so many common diseases in the population: diabetes and Alzheimer's disease and inherited risk of cancer and such. How are we doing on that? Well, in 1990, before the Human Genome Project launched, the number of loci that had been really definitively shown to be involved in common human genetic diseases was, give or take, one: HLA. It was involved in a bunch of autoimmune diseases, mind you — there were a lot of different associations — but it was kind of just HLA, and in a way it was cheating, because you just looked at HLA. Nobody tried to look systematically across the genome. By the year 2000, when the genome paper came out, it wasn't an awful lot more. It was about two dozen — ApoE and Alzheimer's, for example. But there was no systematic way. It was mostly candidate genes, and the scientific literature was littered with papers on individual candidate genes for which association studies had been performed, and these association studies found a p-value that was a little less than .05 — you know, .03 — and you'd write about it. And these things rarely held up, for the obvious reason (the little simulation below shows why).

Well, in the mid-1990s, several of us proposed the nutty idea that we might be able to make some progress by looking at all the common genetic variation in the population. You know, there's only 10, 15, 20 million common genetic variants in the population. Collect them all and correlate them with the risk of disease, in patients with diabetes and without. The only problem with this notion was that it was nuts. It would require that you knew all 10 or 20 million common genetic variants, and that you could genotype them in thousands of patients, which would be tens of billions of genotypes. The students at the time objected, because genotyping was done one marker at a time, and the thought of collecting 10 billion genotypes seemed of some concern to many of the graduate students. And so a lot would have to change to make this practical.
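Why candidate-gene associations at p just under .05 "rarely held up" is simple multiple-testing arithmetic: test enough truly null genes and .05-level hits appear for free. A toy simulation, with all effects set to zero by construction:

```python
import random
from math import erfc, sqrt

# If hundreds of labs each test one truly null candidate gene at
# alpha = 0.05, roughly 5% of them get a publishable "association"
# by chance. Case and control allele counts are drawn from the SAME
# frequency, then tested with a crude two-proportion z-test.

def two_prop_p(hits1, n1, hits2, n2):
    """Two-sided normal-approximation p-value for a proportion diff."""
    p1, p2 = hits1 / n1, hits2 / n2
    p = (hits1 + hits2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    return erfc(abs(p1 - p2) / se / sqrt(2))

random.seed(1)
N = 200        # alleles per group, deliberately small like old studies
FREQ = 0.3     # true allele frequency, identical in cases and controls

false_hits = sum(
    two_prop_p(sum(random.random() < FREQ for _ in range(N)), N,
               sum(random.random() < FREQ for _ in range(N)), N) < 0.05
    for _ in range(500)
)
print(f"{false_hits} 'significant' null associations out of 500 studies")
# Expect ~25 -- each one a p < .05 result that will not replicate.
```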
But all of that did change. The catalogs of common genetic variants grew from 1,000 or so, to the 1.4 million SNPs in a companion paper to that human genome paper we're talking about today, to the more than 20 million SNPs and a million short indels known today by virtue of the 1000 Genomes Project. And it turns out you don't have to look at them all individually, because work around 2001 showed that the genetic markers in a region are locally correlated with each other. If you knew that correlation structure, you could get away with using some of them to proxy for the others (sketched below). And so the International Haplotype Map Project worked out the correlation structure, meaning that you didn't have to look at all of them, but maybe half a million or a million of them. And then genotyping went from one at a time to 10 at a time to a million at a time — now I've seen chips that do 3 million at a time, and there'll be 5 million at a time. And so you can actually do this thing.

So what did that mean? Well, Francis has shown his version of this graph; my version is this. We toodled along, finding about one common genetic variant related in some way to common disease per year, until 2005, 2006, 2007. That's when it really kicked in. And today, there are more than 1,100 such loci that have been associated with more than 165 common genetic traits and diseases.

What's important is not just the count. What's important is that it's pointing us to biological pathways. A favorite example of mine is inflammatory bowel disease, where more than 100 loci have been implicated — 71 for Crohn's disease, 50 for ulcerative colitis. This is the work of Mark Daly and Ramnik Xavier and the IBD consortium. Some of them fall clearly into pathways like autophagy and innate immunity, and these are very specific to Crohn's disease, where others, like IL-23 signaling, are also seen in ulcerative colitis and multiple sclerosis and psoriasis. And if you take the IBD loci, they fall into a variety of other pathways — Paneth cells, ER stress, cell migration — and you really see the biology being laid out for you. It's the same as when you do a mutant hunt in flies or yeast: you collect lots of mutants, you begin to arrange them in pathways, and you begin to see. This is happening for many things. 95 loci for lipids. Macular degeneration is now known to be a disease of the complement pathway. In diabetes, all the loci that are being found seem to have nothing to do with fasting glucose levels, which is where all the candidate genes were. The mystery of how fetal hemoglobin is regulated seems to have been cracked open by genome-wide association studies. In autoimmune disease, about half the loci that are popping up are specific to the individual autoimmune disease, and about half are shared across diseases. And onward like this.

Remarkably, some surprises are emerging, like a locus for type 2 diabetes and a locus for heart disease that live right next to each other. Very, very close — not the same place, a few tens of kb apart. And when you look closely in that region, it's not just those two: around this gene that encodes p16, you've got associations to breast cancer, to glioma, to leukemia, to endometriosis, to early myocardial infarction, to melanoma, and they're laid out in slightly different places, as if — but this requires proof — there are individual enhancers of this gene sitting in different places that affect when this cell cycle gene is expressed. I think this is all cool.
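The proxying idea can be sketched as a greedy cover: keep picking the marker that tags, at some r-squared threshold, the most markers not yet covered. A toy version with an invented correlation matrix (real tag-SNP selection, as done in the HapMap era, is more elaborate):

```python
import numpy as np

# Greedy tag-SNP sketch: given pairwise r^2 between the SNPs in a
# region, repeatedly pick the SNP that covers (r^2 >= threshold) the
# most not-yet-covered SNPs. The correlation matrix is invented.

def pick_tags(r2, threshold=0.8):
    n = r2.shape[0]
    uncovered, tags = set(range(n)), []
    while uncovered:
        # the SNP proxying the most uncovered SNPs (it covers itself)
        best = max(range(n), key=lambda s: sum(
            r2[s, t] >= threshold for t in uncovered))
        tags.append(best)
        uncovered -= {t for t in uncovered if r2[best, t] >= threshold}
    return tags

# Six SNPs in two tight LD blocks (0-2 and 3-5), weak LD between them.
r2 = np.array([
    [1.00, 0.90, 0.85, 0.10, 0.10, 0.10],
    [0.90, 1.00, 0.95, 0.10, 0.10, 0.10],
    [0.85, 0.95, 1.00, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.88, 0.82],
    [0.10, 0.10, 0.10, 0.88, 1.00, 0.90],
    [0.10, 0.10, 0.10, 0.82, 0.90, 1.00],
])
print("tag SNPs:", pick_tags(r2))   # two tags suffice for six SNPs
```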
But at the same time, there's been a bunch of hand-wringing. People say: wait a second, we're very disappointed, because the effect sizes are very small. It's only a 10% increase in risk, or a 20% increase in risk. My response to that is: you're missing the point. The point is biology. You want to find the pathway. If you know the pathway, you can learn a lot. My best example is, of course, Brown and Goldstein's beautiful example, HMG-CoA reductase. There is a SNP in the HMG-CoA reductase gene. It has a pretty modest effect on LDL cholesterol, about three milligrams per deciliter. But it is the target of the statins that tens of millions of people take, because even though the naturally occurring SNP has a small effect, a drug against the protein can have a very big effect. The same is true of fetal hemoglobin, where there's a SNP of modest effect in the BCL11A gene, but if you actually knock that gene down by RNAi, you can get 50% of all the hemoglobin in the cell to be fetal hemoglobin. So don't be confused into thinking that the modest effect of a variant walking around in the population has any bearing on what the pharmaceutical impact might be.

The other thing that makes people worry a great deal is this business of missing heritability. The initial studies estimated the, quote, heritability associated with the loci that had been found. That is: I find 27 loci associated with a disease, I add up their effects in a certain bookkeeping way that says how much of the heritability they explain, and they were explaining, say, 5%. And people were saying, oh my god, the common variants only explain 5% — it must be due to what's left, rare variants. Now, it may be due to rare variants, but the argument doesn't follow quite like that. In fact, we now have a much more nuanced idea of what may be going on.

First, over the past several years, it's become clear there's tons more of this heritability being explained by common variants. My counts — and I went back through the literature for a bunch of these diseases: type 1 diabetes, about 60% of the heritability explained; fetal hemoglobin, about 50%; macular degeneration, about 50%; Crohn's, 20 to 25; type 2 diabetes, 20 to 25 — that one involves a re-estimate of lambda-S, Francis; based on the literature, I think the quoted lambda-S is wrong. I knew he'd be worried. HDL and LDL cholesterol, about 25% of the heritability explained; height, about 12%. And what's also true is that there are now some very clever analyses, from Peter Visscher and Peter Park and others, showing that there are a lot more loci that fall below significance but can collectively be shown to be there, and they explain a lot more of the heritability.

Now, rare variants surely do play an important role. There's no doubt there will be rare variants of large effect — we want there to be. But nobody's done really systematic studies. There have been a few really important and influential studies; Rick Lifton will presumably talk about some beautiful work he has done showing that rare variants of large effect can play an important role. But we need — and many people are doing, but haven't yet published — genome-wide studies to begin to collect them. Don't imagine that you can do it with 50 people. If you do a power calculation, and you want to find rare variants that have a two-fold effect, it's 6,000 cases and 6,000 controls that you have to sequence, for the exome (roughly the arithmetic sketched below). That's what we're into, and that's okay. I don't mind that, but we have to do it. And how much of the heritability rare variants will explain, we don't yet know.
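The 6,000-cases figure is the output of power arithmetic of roughly this shape — a standard two-proportion normal approximation, with illustrative parameter choices (allele frequency, significance threshold) that are assumptions here, not the actual calculation behind the quoted number:

```python
from statistics import NormalDist

# Rough case/control sample-size arithmetic: how many cases (and as
# many controls) to detect an allele of frequency f with a two-fold
# odds ratio, at exome-wide significance with 80% power? Standard
# two-proportion normal approximation; every parameter is illustrative.

Z = NormalDist().inv_cdf

def cases_needed(f_ctrl, odds_ratio=2.0, alpha=0.05 / 20_000, power=0.80):
    """Individuals per group for a per-allele (2N alleles) test."""
    f_case = odds_ratio * f_ctrl / (1 + (odds_ratio - 1) * f_ctrl)
    z_a, z_b = Z(1 - alpha / 2), Z(power)
    p_bar = (f_case + f_ctrl) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (f_case * (1 - f_case) + f_ctrl * (1 - f_ctrl)) ** 0.5)
    alleles_per_group = (num / (f_case - f_ctrl)) ** 2
    return alleles_per_group / 2     # two alleles per person

for f in (0.02, 0.01, 0.005):
    print(f"control allele freq {f:.3f}: ~{cases_needed(f):,.0f} cases")
# At allele frequencies of 0.5-1%, this lands in the mid-thousands --
# the neighborhood of the 6,000-case figure quoted above.
```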
And then finally, I'll mention, technically, for those who are technical about these things: epistasis. Hang on to your seats a second; let me try to explain the point. This business of estimating the heritability of disease — you do it by getting epidemiological data on risk to relatives. You get a prevalence, you grind the calculation, and it tells you the heritability. What they don't tell you is that the heritability calculation assumes the genes interact additively. Have you ever wondered what would happen to that calculation if the genes didn't interact additively? One can show that if the genes don't interact additively, that calculation can produce nonsense. You can write a fairly simple model that has three pathways instead of one — within each pathway the genes are additive, but the pathways interact non-additively — and you calculate a heritability number. But then, if you sequence a zillion people and find all the genes — and since you made up the model, you know it's all the genes — you can show that all the genes together explain only 45% of the heritability. The 55% of so-called missing heritability isn't missing heritability; it's phantom heritability. It was never there in the first place. Anyway, we're writing a paper about this, because I have a bee in my bonnet about the fact that one should know what we're dealing with.

All of this together will push us along, I think. But my take-home message is that the idea that we should map loci, explain the heritability, and then understand the biology may be backwards. It may be that what we have to do is map loci, understand the biology, and then we'll be able to explain the heritability of these things.
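A toy simulation of the point, under invented parameters: a trait that is the minimum of three additive pathways. The twin-based additive bookkeeping reports far more "heritability" than all the causal genes, summed, actually explain:

```python
import numpy as np

# Phantom-heritability sketch (a toy version of the limiting-pathway
# idea above; all parameters invented). The trait is the MINIMUM of
# three additive pathways, plus noise, so genes interact non-additively
# across pathways. Compare:
#   1. "heritability" from twin correlations under the usual additive
#      bookkeeping (Falconer: h2 = 2 * (r_MZ - r_DZ)), versus
#   2. the additive variance actually explained by every causal gene.

rng = np.random.default_rng(0)
N = 200_000                        # twin pairs
GENES_PER_PATHWAY, PATHWAYS = 10, 3
NOISE = 0.6

def trait(genos):
    """genos: (N, PATHWAYS, GENES) -> min over additive pathways."""
    pathways = genos.sum(axis=2) / np.sqrt(GENES_PER_PATHWAY)
    return pathways.min(axis=1) + NOISE * rng.standard_normal(len(genos))

shape = (N, PATHWAYS, GENES_PER_PATHWAY)
g1 = rng.standard_normal(shape)

# MZ twins share all genes; DZ twins share each gene with corr 0.5.
p_mz_a, p_mz_b = trait(g1), trait(g1)
g2 = 0.5 * g1 + np.sqrt(0.75) * rng.standard_normal(shape)
p_dz_a, p_dz_b = trait(g1), trait(g2)

r_mz = np.corrcoef(p_mz_a, p_mz_b)[0, 1]
r_dz = np.corrcoef(p_dz_a, p_dz_b)[0, 1]
falconer_h2 = 2 * (r_mz - r_dz)

# True additive contribution: regress the trait on each gene, sum up.
flat = g1.reshape(N, -1)
y = p_mz_a - p_mz_a.mean()
betas = flat.T @ y / N             # genes are iid standard normal
additive_h2 = (betas ** 2).sum() / y.var()

print(f"Falconer twin estimate:      h2 ~ {falconer_h2:.2f}")
print(f"all causal genes, summed up:      {additive_h2:.2f}")
# The gap is 'phantom': bookkeeping that assumed additivity, not
# heritability that rare variants must be found to supply.
```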
Let me turn to cancer. Cancer, too, was an important motivation for the Human Genome Project. Renato Dulbecco wrote an editorial in 1986 in which he said: look, it's clear that cancer is a disease of the genome. Either we're going to do this piecemeal, one gene at a time, or we're going to do this wholesale — get us whole genomes and start looking at whole genomes to find things. And indeed, that's what's happened. At the time the Human Genome Project started, in 1990, there were about 12 genes that had been identified that were known to be associated with solid tumors. By 2000, it was up to about 80. By today, it's about 240. And there have been impacts in the last decade on therapeutics: BRAF therapeutics, for example; EGFR in lung cancer; the exciting discovery of IDH1 in glioma, which has not yet produced a therapeutic but will — it's quite recent. New kinds of oncogenes, such as transcription factors involved in lineage survival. Translocations in solid cancers, which weren't supposed to be there — they were only supposed to be in the blood cancers, but no, they're in the solid cancers, et cetera.

This has led to such things as The Cancer Genome Atlas, which began with ancient sequencing — capillary sequencing — but has now really incorporated these massively parallel sequencers, and we're beginning to turn out information like this, where we can say: the difference between an individual's genome and the reference sequence is about one in 1,000; between an individual's genome and their tumor, it might be one in a million. So that's a modest number. You could find them — you could collect about 30 or 40 coding differences — and you ought to be able to see, if you look at enough patients, what genes are getting hit again and again.

I'll give you an example of a story that's coming out in just a couple of weeks in Nature, from Todd Golub and Gad Getz and others at the Broad, just because it's a nice example — but there are many such examples — of multiple myeloma. Multiple myeloma is a blood cancer of plasma cells that affects about 20,000 patients a year. We sequenced 39 tumor-normal pairs — whole genomes, whole exomes. It's about three mutations per megabase. And, to make a very long story short, what do we see? 40% of the patients have mutations in protein homeostasis-related genes. Many of them are in two particular genes, one of known function, one of totally unknown function but whose correlation with ribosomal proteins is perfect. There are a bunch of other individual cases here — singleton cases, two cases, et cetera — about 42%, 43% of the patients. It makes good sense: these multiple myeloma cells are protein factories, churning out immunoglobulins. This is probably very important. 25% of the cases have mutations in the NF-kappa-B pathway, distributed across at least six different targets there. IRF4-related genes — interferon regulatory factor 4 — there were only two mutations we found in the initial set, but they were the identical mutation: smoking gun. BRAF: one patient had a BRAF mutation. Ah, but BRAF is druggable, so we're interested. So we looked at another 161 patients: 4% of patients with multiple myeloma have BRAF mutations. The coagulation pathway: 16% of patients have mutations in the coagulation pathway, making thrombin. That's extracellular — what's that doing? Thrombin may be mitogenic through activation of the PAR-1 receptor. All sorts of things that we might not have suspected emerge from this. So there's a long list of things that come from just looking at tumors.

Now, this is just multiple myeloma. All across the TCGA, people are doing things. This is our scorecard at the moment: we've looked at about 1,400 tumor-normal pairs, mostly whole exomes, across a wide variety of cancers so far. And all sorts of interesting things emerge. Today's issue of Nature has a paper from Levi Garraway on prostate cancer, which shows that prostate cancers have these amazing daisy chains of translocations: A is translocated to B, to C, to D, to E, to F, and back again, in one big daisy chain. Not just that — these daisy chains involve genes all of which occur in that open chromatin compartment I was telling you about before, and the kinds of prostate cancers that don't have the ETS translocations have these daisy chains too, but theirs occur in the closed chromatin compartment. Ooh, that's interesting. Who knows.

Across the cancers, we're beginning to see very different rates and profiles of mutations. Some are centered around 1 per megabase, but some are centered around 10 per megabase; some even get up to 100 per megabase. Eric Green will be terrified to know — maybe Francis too — that for these cancers it will not suffice to look at 500 tumor-normal pairs. The calculation that gave 500 tumor-normal pairs depended on the background mutation rate; in fact, it will probably take several thousand tumor-normal pairs to reach the same sensitivity. That's okay — we're not as scared of that anymore, and it's better we know it now than later.
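The dependence of cohort size on background mutation rate can be sketched with a toy significance calculation: a gene is detectable only when its mutation count across patients beats the background at a genome-wide threshold. All parameters here are illustrative, not the actual TCGA power analysis:

```python
from math import ceil, exp, lgamma, log

# Toy recurrence-significance sketch: a gene of L coding bases, truly
# mutated in 3% of patients, must beat the background somatic mutation
# rate at a genome-wide threshold before we can call it. How many
# tumor-normal pairs does that take? Parameters invented.

L = 1_500                     # coding bases in a typical gene
ALPHA = 0.05 / 20_000         # crude genome-wide significance
HIT_FRACTION = 0.03           # gene truly mutated in 3% of patients

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    p, i = 0.0, k
    while True:
        term = exp(i * log(lam) - lam - lgamma(i + 1))
        p += term
        i += 1
        if term < 1e-18 and i > lam:
            return p

def pairs_needed(background_per_mb):
    lam_per_patient = L * background_per_mb / 1e6
    if HIT_FRACTION <= lam_per_patient:
        return None               # true signal drowns in the background
    for n in range(50, 20_001, 50):
        observed = ceil(HIT_FRACTION * n)
        if poisson_sf(observed, lam_per_patient * n) < ALPHA:
            return n
    return None

for rate in (1, 10, 100):
    n = pairs_needed(rate)
    print(f"background {rate:>3}/Mb:",
          f"~{n} pairs" if n else "undetectable at a 3% hit rate")
```

Under these toy assumptions, a 1/Mb background needs a couple of hundred pairs, a 10/Mb background needs thousands, and at 100/Mb a gene hit in 3% of patients never separates from the background at all — which is the shape of the worry voiced above.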
So what's the road ahead for all this? We're going to need larger studies. If we're going to get the rare variants that we want to get, that's 6,000 cases and 6,000 controls, times 100 diseases. If we're going to do these cancers, it's several thousand samples, times tumor and normal, across 50 cancer types. We should probably face it, Eric: we are looking at a one million genomes project. That is the right next step to be thinking about. It is about a one million genomes project, which, as one gets down to a thousand dollars a genome, is about a billion dollars — and as you spread it out across a couple of countries and five years or something, that's not that much money anymore, right? It's, you know, less than half your budget for a year or something. But the problem will be getting the samples. It's getting the samples to do the million genomes — getting through the consents, getting the samples. We have to start thinking that way.

Finally — last thing, and I'll say it very quickly — we've learned a lot about human evolution, human history. Back in the year 2000, we knew the out-of-Africa story: mitochondrial Eve, of Africans splitting and splitting and splitting. We knew one good story of positive selection — one good just-so story, well supported by data — which was the blood disorders and malaria. But we didn't know a lot more than that. Where are we today? Well, with really dense collections of genetic variants, we see that human population history wasn't just splitting and splitting and splitting; there was lots of mixing going on. India was a mixture, from north and south, of two populations. Neanderthals: some 4% of the human genome is Neanderthal — a remarkable thing that has been found in the last year or so. Positive selection: we've learned to read the signatures of selective sweeps, of positive selection, in the human genome, and can now read out more than 300 regions across the human genome that have been subject to strong positive selection, and the genes there preferentially have to do with infectious disease and immune responses and skin color, et cetera. And then positive selection reaching back a little further, to our divergence from chimpanzee: regions that underwent accelerated evolution in the human compared to other mammalian lineages, pointing to such things as HAR1, a non-coding RNA that works in a particular layer of the brain, which has undergone massive expansion in the human lineage.

All right. What's happened in the last 10 years? A lot. We've learned a lot about all these areas. A tremendous amount of what we thought about the genome before was wrong, or we just didn't even imagine the things that were in the genome. A tremendous amount of medicine has been cracked open, but it's just barely a start. Most of what we need to know about the genome still lies before us. The vast majority of what we need to know about the basis of disease lies before us — even before we think about getting to translation or something, there's an awful lot of foundation that has still got to be done here. We've also learned about building scientific community: the importance of building infrastructure, the importance of sharing data freely. In such a talk, it's traditional to put up acknowledgments. I was a little baffled as to what I was going to put up as an acknowledgment slide, and I eventually decided the only sensible acknowledgment slide was this. This is the work of the whole world, working together.
It's been fabulous to see the scientific community come together as a world, freely sharing the goals and the data, and it's fabulous to see that one of the great legacies of the Human Genome Project is not just the sequence lying around somewhere, but a generation of young people who like to set bold goals, like to work together in teams to tackle projects bigger than themselves, and really like to change the world. In the long run, that legacy will be even more important than any particular aspect of the human genome sequence. Thanks so much for the invitation.

We have time for one quick question, if anyone's going to be bold enough. Deanna? I'll be bold enough. Hey, Deanna. Hey, Eric, that was a great talk. But you, Eric, and Francis all got up and showed slides talking about the diminishing cost of sequencing, which I think was great. But what I'd really like you to comment on is where the increasing cost of analysis has gone, because these two are really tied hand in hand, and now you're talking about a million genomes, and we want to update the reference assembly; people don't want to reanalyze a thousand genomes. So could you comment on where we need to go to get the analysis faster, cheaper, and more reproducible?

Yeah, so that's a great question. There are really three parts to the cost of what we have to do going ahead — there are actually four parts: collecting samples, preparing the samples, sequencing the samples, and analyzing the data. The sequencing part is continuing to drop; that's very good. Every other part, we're going to have to put a tremendous amount of attention to. Sample prep will, within the next year or so, begin to match the cost of sequencing for whole exomes; we've got to nail that down. Collecting the samples — that's a non-trivial expense. And then analyzing the data. Data storage alone used to be a trivial sliver of the pie. It's now a visible part of the pie, and if you calculate another five-fold decrease in sequencing cost, storage alone becomes a significant part of the pie. A big problem of running faster than Moore's Law is that you no longer have Moore's Law to keep decreasing the storage cost at the rate you need. And so one is going to need to store only parts of it. One is going to have to have reduced representations, not just compression techniques. One is going to say: I've seen this genome a million times before; the salient features I need to store are the following. We are going to need tremendous input from scientists who think about these kinds of reduced representations, about efficient computing. But this is in the spirit of a genome project — it has always been about reaching out to different fields. And Lord knows, at this point, we are going to need help from a lot of fields to bring all of those costs down in parallel to really deliver on this million genomes project. All right. I know there's another talk. We're going to have to move on, though.
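A footnote in code on that last answer: the "store only the salient features" idea is, in miniature, reference-based compression — keep the reference once and, per genome, store just the differences. A toy sketch (real formats such as CRAM are far more sophisticated):

```python
# Toy reference-based storage: rather than keep every 3-Gb genome,
# store each genome as its differences from a shared reference.
# Real systems (e.g. CRAM) handle reads and indels with much more care.

reference = "ACGTACGTACGTACGT"

def diff(genome, ref=reference):
    """Positions where genome differs from the reference (SNVs only)."""
    assert len(genome) == len(ref)   # toy: ignore indels entirely
    return [(i, b) for i, (a, b) in enumerate(zip(ref, genome)) if a != b]

def reconstruct(variants, ref=reference):
    seq = list(ref)
    for i, base in variants:
        seq[i] = base
    return "".join(seq)

genome = "ACGTACCTACGTACGA"
variants = diff(genome)
print(variants)                      # [(6, 'C'), (15, 'A')]
assert reconstruct(variants) == genome

# Scale arithmetic: a human genome differs from the reference at a few
# million sites, so (position, allele) pairs cost tens of megabytes
# instead of ~750 MB of raw 2-bit bases -- and far less again once
# common variants are shared across a million genomes.
```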