I want to say a couple of thank-yous and a couple of other things. First, Jeff Trent. It is a tremendous honor to come and give the Trent Lecture. I think it's great to name lectures after people while they're still alive. It's better than coming to give the Trent Memorial Lecture: you give the Trent Lecture while there's still a Trent to enjoy it. And so I salute you for honoring Jeff for the wonderful thing he did in starting the intramural program at NHGRI.
I very much want to also salute the scientists of NISC: Eric Green and all of the people who have worked with NISC. At the beginning of the day they stood up and we saluted them, but many more people have come and gone over the course of the day and there's a bunch of new people. So if it's okay, I would like to salute again the fantastic scientists who created NISC, who continue to run NISC, and who have done this amazing thing of ensuring that the world's best biomedical campus, the NIH, has a world-class sequencing center. So if I could ask all the people from NISC who are here today, because many of them are, to please stand. I think we want to salute you again. We've had the pleasure of working with NISC on many, many projects and admiring many other projects: their role in the mammalian genome projects and ENCODE and the mouse genome and cancer genome anatomy, and brainstorming many projects still on the drawing boards and soon to get underway. So well done. Happy 10th birthday, and we look forward to many, many more and to watching the impact that you have on the NIH and on the world continue to grow.
I just want to thank all the preceding speakers, who are good and close friends from the world of genomes: Claire and Richard and Rick and Wiley and Rick and Andy and Evan and David. Many of the things I'll touch on are things they have already touched on, because in fact we're all interested in this broad common world of what you can learn from genomes.
So since you'll have heard bits and pieces of all of these amazing ideas from people over the course of the day, what I'm going to try to do is draw them together, run a thread through them, and really address the question of genomic information and what we can learn from it. Because I think the single greatest change in biology over the course of the last 20 years or so is the recognition that biology is, yes, the study of organisms and, yes, the study of molecules and things, but that at its very core it is about information, and that there is genomic information. By genomic I don't necessarily mean the DNA. I mean genome-scale, comprehensive, complete information about all the components of the cell: DNA, RNA, proteins, modifications thereof. By laying out all of that information we can transform the sort of questions that we can address. All of the speakers today have shown beautiful examples of that, and I'm particularly delighted to see so many young people, post-docs and graduate students, in the audience, because this is the world you guys are inheriting: a world where it is not just about the experiments on your own bench, but the experiments of the entire world laid out before you to pick through and figure out how to extract information from.
So that's the theme, and I'm going to touch on many different forms of genomic information if I can. But the granddaddy of all genomic information projects, of course, was the Human Genome Project. It taught us some very important things. It taught us it wasn't a bad idea to lay out some clear goals. Goal-directed science had a bad name originally, but the idea was that if we thought clearly about the things we had to get, and we could define some goals, it wouldn't be bad to lay out those goals, go for them, and hold ourselves to them.
It also taught us that if we're making a project about information, it is absolutely crucial that the information be completely, freely, and immediately available to anybody, because it was simply absurd to think that the people producing the information were the only ones who could use it well. We needed to enlist the ideas and the creativity of everybody around the world, in any country, in academia or in industry, and so that was an important lesson that emerged from it. We learned the importance of laying out concrete plans and timelines. There was a plan and timeline laid out for the Human Genome Project over the course of 15 years, and it actually pretty much worked according to plan. There were lots of innovations along the way, but there was a sensible plan and we learned how to plan together, including planning in the face sometimes of huge uncertainty. And we learned the importance of collaboration, the importance of international collaboration. The Human Genome Project again was the granddaddy here, involving six countries and 20 centers, but every project that we talk about has been an international project involving many groups in the United States and many groups in other countries, in this ever-changing mix of centers helping one another to stay at the edge.
In the case of the Human Genome Project, as you all know, a rough draft sequence came out in 2001. A finished sequence came out in 2003. There was another little lesson there: finished. Finished is a technical term in the world of genomics. It means the vast majority of it; there are still some 300 gaps, and that's okay. We're aware of it. Absolute completion shouldn't be the enemy of getting the vast majority of the information out. There are many things where we can state, and have stated, that we can't quite get the last little bit out, but we can get the first 95, 98% out, and we should get it into the hands of as many scientists as possible. And of course, what's been the impact of it?
Well, it's laid out before us the landscape of a human genome. It's a beautiful landscape with all of these interesting mountains and valleys, gene-dense regions, gene-poor regions, all sorts of these striking things. But the real test has been its impact on medicine. When the Human Genome Project started, only about 70 diseases, single-gene Mendelian disorders, had been identified molecularly. With the tools that have emerged during the course of the Human Genome Project, we're now up to some 2,600 Mendelian conditions for which we know the guilty gene, and people can study them in great detail.
So that was all fun, but that's past history. That was the Human Genome Project. What about beyond the Human Genome Project? What is the agenda today? What are the sorts of things that genome centers and people around the world are trying to ensure that we have freely available on the web for everyone? Well, the Human Genome Project had a goal: know all the sequence in the human genome. "All" is in italics because it means the vast majority; don't give me a hard time about the last percent or so. Here are some other things. We need to know all the genetic variation in the human population and its relationship to disease. We need to know all the functional elements in the human genome. We've been hearing about these things already from the speakers today. We need to know all the signatures of cellular responses. Cells only know how to do a limited number of things. I don't know if it's 500 or 5,000, but there's a limited number, and we're going to be able to recognize what those things are by some reduced signatures of cellular responses. We need to be able to modulate all the genes in the genome. We need to know all the mechanisms of cancer, and we need to know similar information about the genomes of all the major infectious agents.
That's a good to-do list, and it is not the to-do list for the 21st century, for goodness' sake. It's a to-do list for the next 10 years. Indeed, those of us involved in this know that on more than half the items on this list there has already been great progress, and we can begin to put check marks next to things because we're quite far along on them. There's nothing on this list, I think, that should take us more than the next decade or so, with the appropriate interpretation of the word "all". There will still be things to discover 30 years from now and all that, but we can get the vast majority out there.
It is helped, as one of the themes in this symposium, by the continuing innovation in technologies. The Human Genome Project was helped greatly by the appearance of first fluorescence sequencing, then capillary sequencing, and then we've had the appearance of all sorts of next-generation and next-next-generation sequencers: these 454s and Solexas and SOLiDs and Helicoses and others. I won't fuss over what their throughputs and read lengths are, because they're changing every day and people are continuing to improve these machines, but one's getting up to the point of a gigabase per run, or perhaps two gigabases per run. I've heard of four gigabases per run on some of these platforms, and there seems to be no reason why those things can't be achieved.
So I want to turn to the topics on that list. Human genetic variation: let me take that one first and just describe what has been a remarkable, remarkable period since the Human Genome Project. Now, as various speakers have referred to, there is a fair amount of polymorphism in the human population. It's actually not that large compared to most mammalian species; they're more polymorphic than we are. But we have about one heterozygous base per 1,000 bases or so, or 1,300 bases, in the human genome.
And if I take a random heterozygous base in you, the probability is greater than 90% that it's shared with other people in this room. That is, the vast majority of the variation in you is common genetic variation. It's not these rare Mendelian things that are private mutations. The vast majority of what you've got is common genetic variation. And what does it do? Well, we know some examples. As has already been mentioned, apolipoprotein E has a common genetic variant, widely referred to, that confers risk of Alzheimer's disease. We've got some other examples, such as a common genetic variant in CCR5 that confers protection against HIV. But we really had no systematic way of looking at what might be the medical implications of common genetic variation.
So in 1996, several folks, myself included, began to get very interested in the idea, even before we had the sequence of the human genome all tied up, in fact before we even had most of it, that we needed more than a sequence. We really needed to understand all the common genetic variation in the human population. Well, a simple back-of-the-envelope calculation could tell you that there are about 12 million common genetic variants. And the hallucination was this: that one might be able to simply write down all the genetic variants along the top of an Excel spreadsheet, write down all the diseases along the side of the spreadsheet, and human genetics might reduce simply to saying which genetic variants were enriched in which diseases. That would be very nice. It was also a kind of nutty thing to think about 10 years ago, because it implied having 12 million genetic variants, and we had nothing close to that. It implied being able to genotype these 12 million genetic variants in thousands and thousands of patients. And mind you, near completeness was necessary. If you could only do 10% of it, well, you'd only catch 10% of the things you were looking for. You really had to get the whole thing.
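As an illustration of what "which genetic variants were enriched in which diseases" means for a single cell of that spreadsheet, here is a minimal sketch, not the method used in any study mentioned in the talk. It assumes a simple allele-count comparison between patients and controls with a Pearson chi-square test; the function name `allele_association` and the counts in the example are hypothetical.

```python
from math import erfc, sqrt

def allele_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (odds_ratio, chi_square, p_value) for a 2x2 table of allele counts.

    Rows are cases and controls; columns are the variant (alt) and
    reference (ref) alleles. All counts are assumed to be positive.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    # Pearson chi-square for a 2x2 table (1 degree of freedom, no continuity correction).
    numerator = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    denominator = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
                   * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    chi_square = numerator / denominator
    # With 1 degree of freedom the upper tail is erfc(sqrt(chi_square / 2)).
    p_value = erfc(sqrt(chi_square / 2))
    return odds_ratio, chi_square, p_value

# Hypothetical counts: 1,000 patients and 1,000 controls, two alleles each.
odds_ratio, chi_square, p_value = allele_association(600, 1400, 500, 1500)
print(f"odds ratio {odds_ratio:.2f}, chi-square {chi_square:.1f}, p = {p_value:.1e}")
```

Filling in the whole spreadsheet would repeat this test for every variant and every disease, which is why both a complete variant catalog and cheap genotyping were prerequisites.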
But as these kinds of genomic information projects have taught us, put one foot in front of another consistently and you may be able to build to these goals. To indicate just how poor the information resources were when we started: one could publish, in fact we did publish, a paper in 1998 entitled Large-Scale Identification of SNPs that could report 4,000 SNPs and call it large scale. That was just an indication of where we were at that point. But through efforts like this and others, the idea came along that we should be able to collect SNPs in a systematic fashion. A public-private consortium, The SNP Consortium, was put together in 1999 with what sounded like an ambitious goal: 300,000 SNPs across the genome. That quickly proved to be underambitious, as The SNP Consortium reached 1.4 million SNPs within two years. Then as the Human Genome Project came rolling along, it quickly increased to 2 million SNPs, 3 million SNPs, then 8 million SNPs, now something like 10 million SNPs. The vast majority of the common genetic variation in the human population is already in the public databases. If we find a heterozygous site in you, we know empirically that the odds are very good it is already in the databases.
Now, the problem is still: how are you going to type 10 million SNPs across each patient? Could you get away with less somehow without sacrificing the information? Well, here some of the ideas from Mendelian diseases became very helpful in organizing the thinking. Some of the Mendelian diseases that occurred in isolated populations with single founder chromosomes reminded us that every mutation occurs on a single ancestral chromosome that has a bunch of polymorphisms on it. As it's passed down through the generations, recombination whittles away the markers at far distances, but nearby you still have strong correlation amongst the markers that are there.
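The "strong correlation amongst the markers" on an ancestral chromosome is commonly summarized as the squared correlation r² between two markers. The following is a minimal sketch under stated assumptions: each chromosome is a tuple of 0/1 alleles, the phase is known, and the function name `r_squared` and the four toy chromosomes are hypothetical.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation (r^2) between markers i and j.

    `haplotypes` is a list of chromosomes, each a sequence of 0/1 alleles.
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n           # frequency of allele 1 at marker i
    p_j = sum(h[j] for h in haplotypes) / n           # frequency of allele 1 at marker j
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n   # frequency of chromosomes carrying both
    d = p_ij - p_i * p_j                              # disequilibrium coefficient D
    denominator = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denominator == 0:
        return 0.0  # a marker with no variation carries no information
    return d * d / denominator

# Toy chromosomes: markers 0 and 1 always travel together, marker 2 does not.
toy_haplotypes = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1)]
tight = r_squared(toy_haplotypes, 0, 1)
loose = r_squared(toy_haplotypes, 0, 2)
print(tight, loose)
```

When r² between two markers is close to 1, genotyping one of them tells you nearly everything about the other, which is the sense in which a nearby marker can stand in for its neighbors.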
You still have linkage disequilibrium, and you could use it for mapping. For example, in places like Finland, without even using families, just by looking at the population of Finns with a rare genetic disease, you could map it by linkage disequilibrium, that signature of ancestral chromosomes. A very important paper from Mark Daly showed that even in a general European population in Toronto you could, if you were up close and personal, detect that linkage disequilibrium. He found in a population of patients with Crohn's disease that there was a highly stereotyped pattern of blocks of genetic markers that hung together so well that you only needed a couple of those markers to serve as a proxy for the entire block. And so that gave rise to the notion that if we only knew that correlation structure across the genome, the haplotype structure across the genome, we'd be able to pick out a mere three or four hundred thousand genetic markers and trace inheritance this way. Well, from a proposal of "wouldn't it be good to do that", the community swung into action. Within a year, a haplotype map project was launched. Again the same pattern: multiple countries, multiple centers, clear goals, free information sharing. By 2006 it was largely completed, and that nice correlation structure is quite evident in this correlogram here across a tiny region of the genome; the full slide would go on all the way across the NIH campus. Then you also needed technologies to genotype.
Even three hundred thousand is a big number, but here a variety of different ideas in both the private sector and the public sector came together to allow multiplexing of one marker, ten markers, a thousand markers, 50, and by last year half a million genetic markers being simultaneously genotyped on DNA chips. It's up to a million this year. And so suddenly one had to put up or shut up. You had the genetic variation in the human population, you had the tools for genotyping across people, so why not do it? Many groups around the world have been doing just that for the past year, and 2007 has been an annus mirabilis, a year of miracles.
Just to give you a graph here of the confirmed common variants involved in common disease: in 2000, a single very interesting report of PPAR gamma and type 2 diabetes; for Crohn's disease, two published in 2001; another diabetes gene in 2003; age-related macular degeneration in 2005; in 2006, several more; then 2007 through April, when the tools became available, through August, through September. I don't have October; I'm getting tired of continuing to remake this slide, and it's going to have a lot of trouble fitting by the end of December. But it's clear that there is an extraordinary explosion right now of disease genes, disease associations of common genetic variants. And why is that? It's because of the continued investment in infrastructure, in building the tools: human genome projects, SNP consortia, HapMap projects, genotyping arrays. It's the NIH behind many of these things, it's the private sector behind many of these things, it's public-private partnerships behind these things. It's the willingness to actually roll up sleeves and create that infrastructure and then make it broadly available to a community. What are we learning from these sorts of findings already, in what has been just about a year of this?
Well, with regard to the common disease, common variant idea, we've learned it works. You can find lots of them, and the significance levels are extraordinary. 10 to the minus 10th hardly impresses anybody these days; there are results at 10 to the minus 60th, 10 to the minus 120th. They're significant. We're learning that the vast majority of the genes that play a role are not the genes that were prior candidates on anybody's list. It's perhaps no surprise. We knew this from the Mendelian diseases: we were bad guessers, and we're bad guessers about the common diseases as well. We're also learning that many of the risk factors are not in coding sequences. They are non-coding; they are probably regulatory sequences. It's not a shock. We've already heard from the speakers that a significant fraction of the functional stuff in the human genome is non-coding. Well, a significant fraction of the variation that affects disease is non-coding too. We have our work cut out for us to understand it, but it's in the population, it does affect risk, and it's probably going to be a very good handle on what these things do.
It's revealing new pathways: the complement pathway in macular degeneration; autophagy, involved with multiple loci, in inflammatory bowel disease; beta cell function, and in particular all sorts of new things such as zinc transporters, in type 2 diabetes. It's revealing connections between diseases. Already referred to this morning was chromosome 9, this interesting region that has a myocardial infarction risk factor and a type 2 diabetes risk factor very close to each other. What does that mean? They're not the same, they're a little bit apart, but very, very close. We're learning that the effect sizes may be modest but the genes may be very important. PPAR gamma confers only a 1.3-fold increase in your risk, yet it is a target for a drug that's useful in type 2 diabetes.
We're learning that some of these markers, for example in type 2 diabetes again, can be very useful in a clinical sense for identifying which pre-diabetic patients will benefit most from early interventions. We're learning about ethnic variation and health disparities: about 8q24, a risk factor for prostate cancer that is present in all populations but at higher frequency in African Americans, and that may explain the somewhat higher frequency of prostate cancer in African Americans. We're learning that it's often hard to find the specific gene, the specific allele. A lot of work is going to be needed for that, and we'll come back to that. We're learning that more is more. Larger sample sizes will yield even more. I can tell you stories from inflammatory bowel disease: Mark Daly tells me that the first 1,000 or so patients identified 6 loci, but when 3 different groups pooled their data to get 3,000 or 4,000 patients, they got up to something like 30 highly significant loci. We're learning that there's still much more of the genetic variance to explain. We've explained maybe 50% of the variance for macular degeneration but perhaps 5% of the variance for type 2 diabetes. Why? Is it that we're missing the genes? Is it epistasis between them? Is it environment? Well, it's only been a year. Nobody knows. The dust hasn't come close to settling. But these are the sorts of questions.
So what do we need? Well, what we've really learned is we've barely scratched the surface of this. We've scratched the surface probably of the genes and barely scratched the surface of the biology. What do we need? Well, 3 things. First, larger samples and more diverse populations. Most of the work has gone on in European-derived populations. We know that different alleles are at different frequencies in different populations, and you'll have more power to spot different things if the allele frequencies are somewhat different.
And so African American populations will reveal different loci, not because there are fundamental differences but because the allele frequency fluctuations between populations make it easier to spot some things. Asian populations, Hispanic populations: this is essential to really being able to do the biology as well as being able to investigate health disparities.
Beyond that, as several of the speakers, notably Richard Gibbs, referred to this morning, we've only examined some of the range of genetic variation. With these genome-wide association studies we have really looked only at genetic variants with frequencies between 50% and 5%. Polymorphism in the human population, as the word is technically used, means down to about 1%. Common variation in the human population, segregating variation, that is to say variation common enough that if you've got a thousand patients you'd see it multiple times, enough times to recognize that it was an increased risk factor, runs down another log below that 5%, to at least half a percent. And yet the studies now are not powered to do that. We don't even have catalogs that run down there. And yet we know there's important stuff there. Take Helen Hobbs's beautiful work on PCSK9, with variants in the range of 2% to 3%: common genetic variation, but not yet assayed by the types of maps we are using. We need to have genome-wide projects, whole-genome projects; there are discussions of thousand-genome projects to collect all that genetic variation, so that in this HapMap-type fashion we can exploit all of it to do common variation studies. For now, as regions come up, people are extremely interested in sequencing those regions to find the lower-frequency variants. But since these variants are in fact common enough that we could collect them all, as Richard said, let's collect them all. And then of course there are rare mutations. There are private mutations, and they can be very revealing too.
Helen Hobbs has beautifully shown, in a population of patients with low HDL, that a couple of genes have just too many rare singleton mutations, and that too is a signature. It is a signature that can't be caught by the common genetic variation, and we need the tools for that. And I'll take it for granted, because Evan Eichler has made the point very well, that the human genome also has much more than SNPs. It has copy number variation in these interesting repeated regions, and we need to be able to put all of that into this pipeline as well and look at the copy number variation across the genome. For all of it there's a tremendous amount of sequencing that's going to have to go on in the next couple of years. But as with these other projects, I think it's guaranteed to give us the kinds of catalogs and tools we need to drive this problem home, at least with regard to finding genetic variants.
What do they mean? Well, we need tools to connect these genetic variants to physiology. We can't forget that piling up 20 things that might be involved in inflammatory bowel disease, or 20 things that might be involved in type 2 diabetes, is of course just the start. How are we going to keep up with that pace in the laboratory? I want to turn to some of the things we need for that. So let's put aside all this human genetic variation and collecting it; I'm confident that can happen. What about breathing functional meaning into the genome, so we can make sense of this human genetic variation and connect it with disease? I want to turn to talking a little bit about all the functional elements in the genome. There are two different ways that one can approach them that I'll at least mention. There are probably some others. The first is conservation maps, looking at the portions of the human genome that evolution has voted on as really mattering. David Haussler has referred to this quite beautifully.
By looking at the patterns of conservation across the genome, one can learn a lot about what matters in the genome, even if the mouse knockout doesn't show a phenotype. If evolution tells you it's not willing to change that base, I go with evolution. It knows what it's doing. And then I also want to talk about chromatin state maps, a new kind of map that I think we want to collect a lot of and put on the web. So let's turn to ways of annotating the human genome so we'll be able to make sense of some of these disease loci.
So, conservation maps. Clearly the first thing after the Human Genome Project was to get the mouse genome done, and many of the people in this room, including folks at NISC, played crucial roles in getting the mouse genome done. Then came using that mouse genome by lining it up with the human genome and with a few other genomes, the dog genome, the rat genome. Lining up just the first handful of genomes has revealed a number of important things. Genomic comparison has already revealed that the human gene catalog is very different than we thought. It's not the 100,000 that was in the textbooks a decade ago. It's not even the 30,000 or 40,000 that we all wrote in the human genome paper back in 2001. It's not even, I think, the 25,000 protein-coding genes that were in the current catalogs last year. In fact, comparative work across the handful of mammalian species by Michele Clamp has very nicely shown, in a paper coming out very shortly, that the human protein-coding gene count is probably in the neighborhood of 20,000 to 21,000, and that the current databases probably have only about 20,400 real protein-coding genes. Much of the rest are simply open reading frames that are there spuriously. I don't have time to go into the arguments, but you can pick that out by comparison. The number of really primate-specific things is modest, measured in the hundreds.
And they are the sort of things that Evan Eichler talked about, these very exciting gene families. They're getting born. There is new stuff. But for the most part the story with protein-coding genes is paring them down and whittling them away. Yet even as the coding things are getting pared away, the non-coding things in the genome are really crying out for our attention. They're burgeoning. As you look across the genome, as various speakers have referred to, we find that there are patches of clear conservation, ranging from these ultraconserved elements to smaller binding sites that evolution has lovingly preserved. Something like two-thirds of all the stuff evolution has preserved, which covers about five percent of the human genome, is this non-coding stuff. We know in a few cases that they're regulatory elements, because when you knock them out of a mouse you're able to see that it dysregulates genes nearby. But that's a pretty tough way to annotate half a million elements. Half a million mouse knockouts is daunting even for me to contemplate, which is big.
So the best way to really home in and clean this up is to increase the power of the data first. With just a human and a mouse or a dog, there's a limit to how much you can get. But evolution kindly made many mammals, and by comparing more and more genomes we're able to refine those signals, get rid of the noise, and pull up the signal. And so various groups came together, but here I particularly want to credit the folks at NISC, collaborating with some folks at the Broad, for proposing a concrete program to sequence a large number, about two dozen, of mammalian genomes. The NIH launched that program, involving all of the sequencing centers, with elephants and armadillos and rabbits and bats and cats and hedgehogs and all that. And the project's essentially complete.
There are aspects of it still being tidied up, but the vast majority of these data are already freely available on the web. David Haussler has referred to some of this already, and groups around the world are putting together all these two dozen sequences and saying: can we get down not just to the 200 base pair conserved elements, but 150, 10? Can we pick out 10 base pair elements, et cetera? There's just an explosion of interest among folks who are comfortable with both genomes and bioinformatics in squeezing out all of the information that evolution was kind enough to leave us from the experiment that's called the mammalian radiation.
So I'll give you some examples of things that come out if we're looking at genomes. Here's one; I'm fond of this one. If you line up many genomes and you start looking at what's conserved, you find a funny little site here, well, it's not that little, a funny site that's present about 5,000 times across the human genome, and when it occurs it's very well conserved. What in the world does it mean? So we took a biotinylated version of that piece of DNA, the sequence that contains that motif, bound cellular extract to it, and pulled down protein with it. When we pulled it down and ran it on a mass spec, we found the CTCF insulator protein. An insulator protein blocks the spreading of gene expression. Only about three insulator sites in the human genome had ever been characterized, but suddenly maybe the genome has given us 5,000 candidate insulator sites. How are you going to prove that they're really insulators? You're going to go knock them all out? It's a lot of work. It turns out, again, genomic information can give you a very good clue right away. Just take all the genes in the genome that are divergently transcribed.
If they're divergently transcribed and this thing is an insulator sequence, then when there's an insulator sequence in the middle, those genes should have uncorrelated gene expression. If there's no insulator sequence, they should have correlated gene expression. Get the public databases, look at their gene expression patterns, and it works. The pairs that have this site tend to be uncorrelated; the pairs that don't tend to be correlated. So you can pick that out of the information. Obviously you want to go do biochemistry after that, but it's very nice to be able to do this because you can do it in an afternoon.
There are other things that come out. You can take the things that David refers to, these ultra, ultra conserved sequences way out at the end, or ones a little less ultraconserved, maybe super conserved or very conserved or something: the 5% most conserved sequences across the genome, and see where they are. When you do that you find the following curious fact: the most conserved non-coding sequences across the genome are not in your genes. They're in gene deserts, gene-poor regions. But not regions with no genes, just gene-poor. What genes are in those gene-poor regions? Developmentally important transcription factors. Almost every one of those 200 regions that have peaks of highly conserved non-coding elements is enriched for developmentally important transcription factors or axon guidance receptors. Half of that very conserved stuff is focused around these regions. They must be very interesting. What do they do?
So we were curious about understanding what was special at these regions, and that led us into the second part of the work: chromatin state maps. We took a guess that maybe chromatin would be one way in which those loci were special. And so we began to explore the chromatin structure of these funny regions, and I'll tell you about that now. Chromatin structure is enormously complex.
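The afternoon-scale check on candidate insulator sites described above, comparing expression correlation for divergently transcribed gene pairs with and without an intervening site, can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the analysis actually performed: the expression profiles are made-up toy numbers, and the names `pearson` and `mean_pair_correlation` are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5

def mean_pair_correlation(pairs):
    """Average correlation over a list of (profile_a, profile_b) gene pairs."""
    return mean(pearson(a, b) for a, b in pairs)

# Toy expression profiles across four tissues (hypothetical values).
pairs_with_site = [
    ([1, 2, 3, 4], [3, 1, 4, 2]),
    ([1, 2, 3, 4], [2, 4, 1, 3]),
]
pairs_without_site = [
    ([1, 2, 3, 4], [2, 4, 6, 8]),
    ([1, 2, 3, 4], [1, 2, 3, 5]),
]

with_site = mean_pair_correlation(pairs_with_site)
without_site = mean_pair_correlation(pairs_without_site)
print(f"with site: {with_site:.2f}, without site: {without_site:.2f}")
```

In a real analysis the pairs would come from a genome annotation and the profiles from public expression databases, and the difference between the two groups would need a proper significance test.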
Histones have these tails that are decorated with all sorts of modifications, but for the moment I'll keep it simple and refer to only two histone modifications. One, lysine 4 trimethylation, which I'll color green because it's associated with active genes, and lysine 27 trimethylation, which I'll color red because it's been historically associated with inactive genes. One can then go look, and what we did was use chromatin immunoprecipitation on a microarray, a DNA microarray for just these special regions of the genome. We began to explore the chromatin structure of those regions, and we found that in mature cells sometimes they had the green mark, sometimes they had the red mark, sometimes they didn't have any mark, but you never saw both together, which was consistent with the literature that it was either an on or an off. Until we looked at embryonic stem cells. And in ES cells we found a very curious phenomenon. Right around those developmentally important genes in those regions, we found that in embryonic stem cells they were marked with both red and green, both an on and an off mark, and yet were silent, as if they were poised for either activation or repression according to which lineage they might go down. At least that was our hallucination there. Well, to really look at that in a serious way, one's got to expand to more cell types and expand to the genome, and as Rick Myers has already referred to, the idea of doing chromatin immunoprecipitation and hybridizing it to a DNA array is something that's so 2006, it's really not at all au courant. The right way to do it now is to do chromatin immunoprecipitation, get the DNA, and run it on one of these ultra high throughput sequencers that give you little reads, and you map them back to the genome. So we did that using a Solexa, and the data are, as they would say, comparable. The top line is sequencing, the bottom line is a microarray, and they look pretty much the same.
And so we could do this across various cell types and for a variety of different chromatin marks, and I'll summarize a bunch of data for the following sort of question. The question we really want to know deeply is, how does a cell decide to take up a career? When a cell goes from being an ES cell to a fully differentiated cell, it makes a variety of career decisions along the way. It loses potential. It makes commitments. We say that in developmental biology, but what do we mean by it? What are the molecular correlates of a cell being committed to do something or still having the potential to do something? We don't really have, in developmental biology, a clear, crisp way to read out what career decisions have been made and which lie ahead. So what we've been trying to do is study that with chromatin, and I'll give you a brief summary of where we're at at the moment. This will be slightly oversimplifying the data, but it's not a bad description of it. In embryonic stem cells, genes break up into three different categories. There are some AT-rich promoters and they're fickle. They come on, they come off in different cell types. They're very fickle, and my sense is these guys here come on or off depending on whether there's a transcription factor to turn them on or off. Very fickle. 70% of the genes have CpG-rich islands and they're housekeeping genes, and they're on all the time. 15% of the genes, somewhat more than just in those special regions but highly enriched in those special regions, are these bivalent genes that start off in ES cells in this bipotential state of red and green, and then in different lineages may go green or red, but we're finding now that they sometimes stay bipotential in some of those lineages. In which lineages do they stay bipotential, stay bivalent? Roughly speaking, in those lineages that still have choices ahead involving that gene.
So if we're looking at myoblasts, neural cells, and fibroblasts, and we're talking about a gene that's involved in hematopoietic cells, there are no more decisions to be made. It's made a final decision. But a gene involved in differentiation of some neurons is still bipotential here in a neuronal precursor, and a gene involved in differentiation of adipocytes, but not other descendants of fibroblasts, is still bipotential there. And so, very roughly, and this is the happy thing about when you only have a limited amount of data, you can make a very simple happy model. So the very simple happy model right now is that this bivalent mark is an indication of decisions still ahead. As we collect more data, the model will surely become more complicated, but happily I don't know enough yet to complicate you with it. So that's kind of the picture. These chromatin state maps are very interesting. They're revealing all sorts of things. Here's a gene in embryonic stem cells. The coding region here is the protocadherin gene that has a zillion different promoters. And you can see in embryonic stem cells, every one of these promoters is marked as a bivalent promoter independently, with a green and a red, except that one, which is just green, and it's the one that's used in embryonic stem cells. Oh, we also put CTCF, that insulator, on this, and it nicely insulates between each promoter. You can pick out the microRNA genes. Here's a microRNA. It's very hard to figure out what the primary transcript is for a microRNA, but in fact here is this green mark of activation, and this other mark, K36, that identifies transcribed regions, and it's very easy to pick out that this must be the transcript that results in this mature microRNA. Similarly, you can find new promoters for genes, FoxP1 instead of FoxP2, which was talked about before. Here's a little promoter here. Here's the transcript.
But in embryonic fibroblasts, there's another promoter being used and you can clearly read off the transcript there. You can read off which allele is being used, because you're sequencing. So you can tell polymorphisms apart in the little reads, and you can tell that in F1 hybrid mice the green mark is on one parental chromosome and a different red mark, called K9, is on the other parental chromosome. This is imprinted. This is active. That's the imprinted chromosome. You can read it off from the chromatin state map. And you can also tell different alleles apart here: all of the transcription is occurring here off the castaneus allele, not the 129 allele. And so you can pick that out, and you can do this with humans as well. And finally, going back to this human genetic variation, we began to look at marks, the K4 mark, not trimethylation, but dimethylation and monomethylation. I don't want to confuse you with too many marks, but these marks seem to indicate open chromatin and enhancers in particular. They're associated with hypersensitive sites in DNA. And you can kind of read these off as at least proto-enhancer marks. And I put this region up for one reason, which is, remember I said chromosome 9 had this funny bit that was non-coding that was associated with both myocardial infarction and type 2 diabetes? It's there. And it's got all sorts of interesting enhancer things over it. Now I note these enhancers are in totally irrelevant cell types. They're in an HL60 cancer cell and they're in human umbilical vein cells here. But nonetheless, one can get cell types now and mark up those enhancer structures in more relevant cell types. And my guess is there's a lot of interesting action going on over here in terms of enhancers. And maybe that'll help guide us in. Anyway, I'll just say quickly, and won't really talk about it, that we've been doing the same thing now with methylation.
We've been taking the DNA and studying its chromatin structure, its epigenomic structure, with regard to methylation. Some genes have CpG islands, which sometimes can become methylated and turn the genes off. And you can study this by treating the DNA with bisulfite, and you can then shotgun sequence. The problem is it's a lot of DNA, and so we've come up with, and I'll just mention, some interesting tricks where you can slice out 1% of the genome on a gel that contains just the MspI fragments of a certain size, and since MspI cuts at CpGs, these things are highly enriched for CpG islands. And you can assay about 90% of the CpG islands in the genome by sequencing about 1% of the genome, and you can pick out those regions that have, for example, become highly methylated in developed cells. I'll mention the following fact, which is that when you begin to measure methylation changes as cells develop, you take embryonic stem cells and you develop them into SOX1-positive cells and then to neural precursor cells and astrocytes, there's a huge change of methylation that occurs here. Very unmethylated, then a huge change to guys becoming methylated in this transition, and then they stay the same past there. This got Alex Meissner, who did this work, beautiful work, very excited. I mention it because Alex Meissner is also very careful. We now think this is a very interesting artifact. Now that we look at actual cells from in vivo tissue, as opposed to cells being differentiated in cell culture, we don't see this methylation. In fact, it looks like there are some very important changes in methylation that occur in cell culture in the same cell types but are not occurring in vivo. And this is of interest because the one place where you do see this methylation is in cancer. There's something very funny going on with regard to methylation.
I mention this because there's been some talk about using bisulfite sequencing, and we were very excited and about to go describe all this, and now it's very clear there are some very interesting artifacts that I think in the end will tell us more about cancer than development with regard to methylation. But I mention it anyway. All right, so those are those things. But those are annotating the genome. What about functional tools? What about the kind of genomic information that's going to shed light on cellular circuitry? I want to take a little bit of time and talk about tools for doing that. Not for marking up the genome anymore with variation, or marking up with conservation, or marking up with chromatin state maps, although I think all those things are very important and we've got to keep generating them and getting them out on the web. But the tools for somewhat more high throughput biology to explore pathways. And so here I want to describe work of a student, Piyush Gupta, to indicate that even the very sensitive cell biological experiments of a type that you might not think would yield to genomic approaches are being made to yield to genomic approaches. So I'll describe briefly Piyush Gupta, who came to our lab from Bob Weinberg's lab. He's a cancer person, Piyush is, and was extremely interested in deploying the tools of RNAi screening. So RNAi is, of course, a fabulous technology for knocking down the gene of your choice, and with a couple of groups, including our own, having built genome-wide RNAi libraries, you can at least imagine the idea of doing genome-wide screens with RNAi to find all the genes that might matter in a process. Well, the process Piyush cared about was the signaling of the ErbB2 receptor. He cared a lot about this problem because he was very interested in breast cancer. And breast cancer comes in five basic groups as defined by gene expression patterns.
Two of them, these first two, have very poor prognoses and we need much better therapies for them. And this first class here has prominent signaling through the ErbB2 receptor, and we need much better therapies for this class. So Piyush said, could I use high-throughput RNAi screening as a genomic information tool to tease apart the pathway? Now here's the problem. The phenotype is very subtle. Breast cancer cells start off clustered next to each other, and when you add heregulin they move apart a little bit and they get a little spiky. They put out filopodia marked by F-actin, and they separate a little bit. You can see it, but imagine trying to screen hundreds of thousands of wells for that phenotype. That's not going to be an easy thing to do, but that's what Piyush wanted to do. He wanted to use a genomic approach to screen a very subtle cellular phenotype. And here, happily, we had some colleagues who also think genomically, but with regard to image analysis. David Sabatini and particularly Anne Carpenter. So, Piyush... this takes a long time to come up. Did I get it? Yep, there we go. You can see the cells here without heregulin, and with heregulin they have moved apart a little bit and have got a little blotchy with F-actin. This is not a friendly thing to imagine doing a high throughput screen for. But Piyush was an optimist. So he took Anne Carpenter's software, which is very good at detecting all sorts of objects, shapes of cell boundaries here and other funny things, and used it to analyze lots of images and got all sorts of dimensions: counting F-actin puncta, nearest neighbor, cell-shape metrics, et cetera, et cetera. He got all of these different readouts of cells and then went away, being very smart and mathematical, and attempted to build a classifier. And after three months, this is the negative control here, he was unable to do it. Then he went back to Anne Carpenter and said, got any other tricks?
And Anne said, well, we've been working on something called cell classifier. It works like this. Cell classifier gives you 50 pictures. With your mouse you drag the ones that you think are in category A over to the left and the ones that are in category B over to the right, and it goes off and makes up its own rules. Based on its rules it gives you 50 more pictures, but this time it's divided them and said, I think these are A's and these are B's. Is that what you mean? And you move around the ones it got wrong. It goes away, gives you more back. After a couple of hours with CellProfiler, it's doing a mighty fine job, and in fact it was able to accomplish in one such sitting a pretty good classification of cells as either looking like they had been activated by heregulin or not. Anyway, to make a long story short, with this he undertook a high throughput screen involving about a thousand genes in this case, with five replicates, many hairpins per gene, and found a number of established genes and lots of new genes, but most interestingly they fall into very sensible pathways. Three pathways that had been known to be involved in ErbB2 signaling come out right away, the PI3 kinase, NF-kappaB, and JAK-STAT pathways, and one entirely new pathway, JNK3, not previously known to be involved. And it's an interesting pathway because there are inhibitors that have been developed against JNK3, but for neurodegeneration. Maybe they'll have a use here. In addition, recurrent functions come up: neurite extension, cell migration, ligand-induced receptor endocytosis. The vast majority of those genes sort out nicely into different pathways and make great sense.
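The train-and-correct loop described above can be sketched as follows. This is a minimal sketch under stated assumptions and not the algorithm used by Anne Carpenter's software: a nearest-centroid rule stands in for whatever rules the real tool learns, the two "features" per cell are simulated numbers, and `annotate` is a hypothetical stand-in for the person dragging pictures left or right. All function names are invented for illustration.

```python
import random

def centroid(rows):
    """Column-wise mean of a list of feature tuples."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit(labeled):
    """Learn one centroid per label from (features, label) examples."""
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}

def predict(model, features):
    """Assign the label whose centroid is closest to the features."""
    def sqdist(center):
        return sum((f - c) ** 2 for f, c in zip(features, center))
    return min(model, key=lambda label: sqdist(model[label]))

def train_interactively(features, annotate, batch_size=50, rounds=4):
    """Round 0: the person labels a batch. Later rounds: the program
    proposes labels, the person fixes the wrong ones, and it refits."""
    labeled = [(f, annotate(f)) for f in features[:batch_size]]
    model = fit(labeled)
    corrections = []  # number of proposals the person had to fix per round
    for r in range(1, rounds):
        batch = features[r * batch_size:(r + 1) * batch_size]
        if not batch:
            break
        fixed = 0
        for f in batch:
            proposed = predict(model, f)
            actual = annotate(f)
            if proposed != actual:
                fixed += 1
            labeled.append((f, actual))
        corrections.append(fixed)
        model = fit(labeled)
    return model, corrections

def make_cells(n, seed=0):
    """Simulated cells (assumption): stimulated cells score higher on two
    made-up features, standing in for spacing and F-actin spikiness."""
    rng = random.Random(seed)
    cells = []
    for _ in range(n):
        stimulated = rng.random() < 0.5
        shift = 2.0 if stimulated else 0.0
        features = (rng.gauss(shift, 0.5), rng.gauss(shift, 0.5))
        cells.append((features, "stimulated" if stimulated else "unstimulated"))
    return cells

demo_cells = make_cells(200)
demo_truth = dict(demo_cells)  # plays the role of the human annotator
demo_model, demo_corrections = train_interactively(
    [f for f, _ in demo_cells], demo_truth.__getitem__)
print(demo_corrections)
```

The point of the design is that the person never writes rules. They only sort examples and fix mistakes, and the number of corrections per round gives a rough sense of when the classifier is good enough to run on the whole screen.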
So I bring this up to say that even when you're talking about subtle cellular phenotypes, the genomic approaches can be quite handy and are quite tractable, and these are the sort of things that I at least am on record as having advised Piyush would be a terrible screen, but in fact turn out to be quite a reasonable screen, and you can get a lot of really good pathways emerging out of that. I'll talk about another kind of way to recognize cellular signatures, and I'll just refer to that, which is ways of recognizing cellular signatures based on gene expression. I just want to describe what's a beautiful project that's been continuing to grow, of Todd Golub and Justin Lamb at the Broad, whose idea is that we basically want to take any subtle process we're studying, whether it's a disease, the action of a drug, the action of a gene, and put them all in one common language, one lingua franca, so that whatever we're working on, the way to talk about it is its effect on perturbing RNA expression. And if we were to make a big database of that, we would pick up all sorts of connections by putting it in this common language that we would never otherwise have seen. They've demonstrated very beautifully that one can do this. They've put together now a database of response signatures to a number of human drugs, a couple hundred human drugs now, against a number of human cell lines, and their idea is this: for any biological signature you want, take your biological signature, run it against the database, kind of googling it, and out will pop the things that are similar to it. Any disease state, any other state, any gene inhibition, see if there are any drugs or other perturbations that are similar. I'll just show you examples of this. Treat rats with estrogen: a paper in the literature treats rats with estrogen and looks at gene expression changes in the uterus. Take those genes that go up and down in response, straight out of the paper in the literature, run it against this Connectivity Map database, and out pops all the known estrogen
analogs. Out pops something that wasn't known to be an estrogen analog but was proven to be an estrogen analog. If you put in the minus of that signature, down when it should be up, up when it should be down, you get the estrogen inhibitors here. You get tamoxifen, you get the selective estrogen receptor modulators. You read this stuff right out. A beautiful example is they took the signature of leukemia cells that are sensitive to dexamethasone treatment, some are, and leukemia cells that are not sensitive to dexamethasone treatment, some are not, and you get the differential gene signature, toss it into the database, and say, ever seen a drug that looks like it induces the signature of being sensitive to dexamethasone? And the database pops back and says the immune suppressant rapamycin does that. And then you say, wow, I wonder if rapamycin does more than just induce the signature of sensitivity to dexamethasone, but maybe it'll make cells sensitive to dexamethasone. And you do the experiment and it does. But who'd have thought of using rapamycin? We're certainly not smart enough, but a genomic information database is smart enough that if you simply ask it the question, it'll tell you it's the best fit. And similarly, I'm going to skip through this to simply say, in a screening experiment to find small molecules that could block androgen signaling, Todd and his colleagues found these two natural products from these two plants that block androgen signaling. They had no idea what they did, but of course you don't need to know anything. You just toss the signature into the Connectivity Map, and the Connectivity Map implies, boy, that signature looks an awful lot like HSP90 inhibitors. Even though your molecules don't resemble any known HSP90 inhibitors, they clearly must be blocking that pathway, and they've gone on to show it is blocking that pathway. What we need, I would say, is again genomic information databases. We need to have signatures of all the FDA approved drugs, of all the RNAis, of
all the bioactive compounds, freely available on the web. How are we going to get that cheap enough? Well, we've begun to realize that if we're going to do lots of this, even doing it on microarrays for gene expression is too expensive, but Todd is coming up with ways to do this by sequencing, and it may be that the new sequencing technologies make this affordable. Well, those are ways of doing cellular circuitry. I'll briefly mention, because it was referred to by Rick this morning, that we've still got to know all the mechanisms of cancer. That's the next thing on the list there. Very briefly, mapping the cancer genome is going to be one of the most important things over the next several years. These chips that let us track polymorphism in the human population also let you track deletions and amplifications in cancers, and this has become a very important and active thing. And sequencing: it's already been referred to by Rick that finding individual mutations, like EGFR mutations in lung cancer, has pointed out that there are subsets of lung cancer that have a distinct form of the disease and are responsive to particular drugs like Tarceva and Iressa. And so a task force at the NCI recommended a couple of years ago, I got to serve on this task force, that there ought to be a significant cancer genome project, and that has morphed into this pilot project, The Cancer Genome Atlas project, that is now underway with groups around the country and I think is increasingly involving groups around the world, as it must. The concerns that have sometimes been expressed about this are either that we already know all the cancer genes or that cancer is hopelessly complicated. I don't think either of those positions is justified by the data. I just put up a list of the 21st century cancer genes that have been discovered in major cancers here, and what's really striking is that virtually all of them have come out of genomic approaches, not prior candidates; that of the druggable genes in common cancers, all have emerged in the 21st
century from genomic approaches; that the genomic approaches have pointed us to new kinds of oncogenes we didn't know before, lineage-specific factors like MITF and TITF, and translocations in epithelial cancers that used to be thought to be confined to blood cancers; and that this is all, as Rick Wilson said, from screens that have been highly limited to really phosphatases, kinases, etc., and what we really need are unbiased genomic screens of the sort that have been talked about today. What is the future of cancer genomics? It will be: get a tumor, get RNA and DNA from the tumor, and sequence. Sequence what? Well, in the first instance, by sequencing in limited ways you can get whole-genome copy number and rearrangement, you can sequence all the exomes as Richard Gibbs has referred to, you can sequence from cDNA as Rick Wilson has referred to, chromatin and methylation maps, and all of that. All told, the bill is probably less than 100 million short reads, and 100 million short reads is not such a big deal anymore, or won't be such a big deal in the next couple of years. This isn't re-sequencing the entire cancer genome. The entire cancer genome is probably 3,000 million short reads, which is still unthinkable for the next 12 to 14 to 24 months or so. In the not-so-distant future, nobody will fuss over the first couple of lines, we'll go to the latter, but, you know, those of us who are highly practical say the first four lines there will be the focus for the next five years, and then there will be more and more focus on probably being able to do the whole genome.
Anyway, genomic information. There are so many kinds of genomic information. There's of course all the sequence in the genome, there's all the genetic variation of the population and its relation to disease, all these functional maps from conservation, from chromatin state, these signature maps like the Connectivity Map that let you look things up, or these tools like RNAi inhibition and the databases that are being built of the effects of RNAi inhibition. All of the cancer mutations: we're just barely at the starting point of that, but I predict we are going to see an explosion of that over the next five years or so. I haven't talked about it, but Claire Fraser has referred very much to the genomes of all major infectious organisms and really being able to detail those as well. For the young people in the audience, this isn't what biology looked like two decades ago. It really was a world where what you did on your bench was primarily the data you were looking at. Now what you do on your bench is the starting point, but of course it's in comparison to everything out there. All the genomic information out there in the world is at your disposal. We are by no means done. The Human Genome Project is a good start. There's a lot more still to do. There are many projects here and there are many more still to go, and I encourage all of you to be thinking, whenever you do any experiment, if I'm going to do it more than three times, what's the genomic resource that would have been helpful for me to have? It is a remarkable, remarkable period we're living through. It still is very much unclear where and when it will end. I think we keep thinking maybe it's going to top off, but I see no sign of it topping off for quite some time to come. Well, I want to close by acknowledging the obvious, which is that this is the work of an extraordinary community.
I want to acknowledge my own colleagues at the Broad Institute, many of them working in many of these areas. It's been fabulous to work with them, and I can't say enough about what a friendly and collaborative spirit there is in Boston amongst MIT and Harvard scientists and Harvard hospital scientists. But I also want to acknowledge something you often don't acknowledge, which is the extraordinary role of consortia. So much of what I've talked about was not the result of any one lab, not any one institute, not any one city, but was the result of being willing to put together consortia to get things done. And there's been this floating group of consortia, I just put down some of the ones whose data I've referred to here, SNP consortia, RNAi consortia, all sorts of consortia that have emerged over the years, and this has become such a powerful way to do science in the age of genomic information. And then lastly I want to make a special acknowledgement to the sequencing centers. Over the course of now almost 18 years, the sequencing centers have worked together in all sorts of combinations to help try to bring about this revolution and get data out rapidly, and I think we all feel an enormous bond to each other. I want to acknowledge WashU and Baylor and TIGR and Sanger and the Joint Genome Institute, the Stanford Genome Center and others, and I particularly want to acknowledge NISC, because it's its birthday party, for the extraordinary role it has played in making sure that this genomic revolution and genomic information that is happening all over the world is happening in spades here on the campus of the NIH. It's a great day, a great birthday party. The great thing about celebrating a first decade in this case is that one can be sure that the next decade is going to be vastly more exciting. So thanks for the opportunity to kind of tie it all up today, and hats off to everybody here for what they're doing. Happy birthday!
So we have time for a couple of questions before we adjourn to a reception, while people are finding their way.

Eric, the ability to generate vast amounts of data is outstripping, I think, most people's expectations, although I suppose it shouldn't be said that we weren't sort of warned about this. Are we going to keep up in terms of the analysis capabilities that we have to put together to make sense out of all this, or are we facing a mismatch in terms of algorithms, in terms of trainees? Are we in trouble, or is everything just nicely dovetailed?

I have enormous faith over the long term in young people. I think it's clear that the next generation has already figured out that there is no distinction between being a wet scientist and a dry scientist. They're all recognizing they're damp, that they are both. And we're seeing many more people going into biology now who consider it de rigueur to have done bioinformatics training and such. So if you say, over the course of the next 15 years will the young people lead us into this promised land by virtue of their understanding this new world? This whole generation may not fully enter that promised land, but the new generation will, and they understand it. Now, will they all fully show up in full force within the next 24 months to deal with the data, or will there be this deluge of data beyond what the existing training base is? We're going to be just overburdened with tons and tons of data, but that's okay. We'll manage to extract the most interesting things that we see in the data so far, and then as more and more people come in, more things will be extracted. The thing we've got to do is make sure that the training programs are there.
We've got to make sure, I hardly need to say this is something I think NIH believes deeply in, but NIH is the leader in training in the world here and we've got to make sure that essentially everybody going into biology even if they think they're going to be a cell biologist studying some cellular process understands how to connect to this world and also that we bring in large, large numbers of people who have real training in mathematics and computer science etc. So in a 15 year time horizon I think the whole notion of what it means to be a biologist will change and the young people here will solve it. In the short term well we're just going to do all the paddling we can do to stay afloat. Ok.