If you could all take your seats, please, we'll start the proceedings. Well, good morning, everybody. I'd like to, first of all, on behalf of NIH, welcome every one of you to this memorable day. There have been many memorable days at NIH, but I think some days are more memorable than others, and this is one of them. There's no doubt that this is a historic event for the entire community of science. And I'd like to also point out that the President sent us his greetings, the letter that you see on the screen. It's interesting to see how science can have watershed events that occur at intervals, and those events can drive a field for many years. This conference today is to celebrate two of those events: the elucidation of the DNA structure 50 years ago by Drs. Watson and Crick, and the completion of the human genome by an international consortium in an effort that is really unprecedented in science. We would like to welcome to the symposium the many scientists who have played a major role in leading to these achievements, and you can see from the program that the NHGRI, which has led the effort for the symposium and the celebrations of the human genome completion and the 50th anniversary, has done an outstanding job.

One thing I'd like to say is this: when you really think back at how science, and in fact our journey in understanding ourselves, evolves, I can't help but think of the exploration of the universe by astronomers. Over the past 50 years we've come to understand that the universe came out of a big bang 15 billion years ago, or 13.7 actually. And then you try to find parallels with this exploration of the universe, in which a whole community of scientists, astronomers, are going back in time to the first event that created the universe. Well, we have a similar event, probably, in our past, four or five billion years ago.
There was a big bang of life, and that big bang of life is still not well understood by any of us, and there's no doubt that we're engaged in probably the same pattern again, traveling back in time to understand those very first events. Although we do not understand the very first milliseconds of the big bang of life, there's no question that DNA arose at some point after it. And the journey that we are undertaking after the human genome sequencing effort is in fact to travel back and understand, even more than we have so far, how this evolution of four and a half billion years has led to where we are today. Today I think we're going to celebrate not just the completion, not just the end of a particularly challenging project which has come to a successful conclusion. We're really celebrating the dawn of a new era in research and medicine. So with that I'd like to introduce the leader of this effort, Dr. Francis Collins, who over the years has brought his leadership and passion to the entire community with a greatly successful effort. Francis.

Well, thank you, Elias. And good morning to all of you gathered here on this historic day. I wanted to thank a number of organizations and individuals that have made it possible for us all to gather here, connected up by the web and by satellite downlink and by another remote location here on the NIH campus, where various of our own staff and people from the outside are watching over in Masur Auditorium, because there was such interest in this gathering that we couldn't fit everybody into this Natcher auditorium despite its wonderful size. Particularly I'd like to thank the Department of Energy, our co-sponsors for this symposium, and Ari Patrinos will speak next and say a bit about that.
And also to thank the Foundation for NIH, a wonderful organization that put countless hours into helping out with a variety of activities that are happening over these three days, and particularly helping with some of the associated components of today and tomorrow. Our thanks very sincerely to the Foundation. Also to other institutes at NIH who worked closely with us in the design of the program and, in the case of NLM and the Office of Scientific Education and the Office of Rare Disease Research, helped in other financial ways to make it possible to do what we're doing today. And I should say thanks to my own wonderful team at NHGRI, the April team that worked tirelessly on this for about a year, led by Alan Guttmacher and Jane Peterson with able assistance from a host of other people, especially Susan Vazquez and Kris Wetterstrand. You guys were awesome. I want to say a word about the vignettes that you will be seeing as part of this program, an unusual component of a remarkably historic program. These were done by State of the Art, a communications organization here in D.C., with the executive producer being Larry Thompson, my communications chief. They carried out extensive interviews with some of the legendary figures of genetics and genomics and then put them together into these vignettes, which I think you will greatly enjoy and which you can carry off with you on the DVD that you've been given as part of coming to this symposium. So I think you will find that a memorable part of a memorable day. One brief substitution to tell you about: Jane Rogers is going to serve as the moderator for the first session this morning. Herb Boyer had a family emergency and is regrettably not able to be here, but you will see him in the vignettes. So you will get some of his words by that mode. Well, this is a landmark occasion, here in the very month of the 50th anniversary of the discovery of DNA's double helix. I am pleased and honored.
Perhaps I should say exhilarated to declare the goals of the Human Genome Project to be completed. That carefully constructed plan outlined with wisdom and boldness by the panel that Bruce Alberts chaired back in 1988, those goals have been accomplished. And in particular the International Human Genome Sequencing Consortium, made up of individuals from 16 laboratories in six countries and a host of other important participants, especially our friends at the databases, NCBI, Santa Cruz and EBI, have delivered on the promises, and they were bold and audacious, of having an essentially complete sequence of the human genome by 2005. But as you can tell, we are now here two and a half years ahead of that deadline and, I'm happy to say for any government types in the room, at a budget substantially less than what was originally projected by Bruce and others who tried to map this effort out in 1990. So this is in fact a grand occasion. And before going any further I would like to ask all of those who have worked on this project of finishing the human genome sequence to stand up so that we can recognize you and see who you are. So please rise to your feet, those of you who have worked on this. It is of course also an incredible anniversary, the 50th anniversary of DNA. It's also the 50th anniversary of a few other things. The Department of Health and Human Services, by the way, had its 50th anniversary just last Friday. The Corvette is experiencing its 50th birthday party. That means the Corvette has now got membership in the AARP. Think about that. TV Guide, Cheez Whiz and Marshmallow Peeps were also founded in 1953, but we're not here to celebrate those, remarkable though they may be. I can't help but note, however, that there's another 50th anniversary which is rather an interesting parallel, and that is the ascent of Mount Everest by Sir Edmund Hillary in 1953. He had to get up 30,000 feet. We had to find 30,000 genes or thereabouts.
There were some technological challenges involved in both cases. There was a lot of planning needed. It was an international effort and, I might say, there were a lot of white knuckles along the way. But here we are. So this morning you are going to hear from some of the legendary figures who made this all possible by the work that went on in 1953 and thereafter to set the stage for this project. This afternoon you will hear in some considerable detail what we've learned so far about our own instruction book. But then the Chinese proverb comes to mind: behind one high mountain lies yet a higher one. And so tomorrow we will be looking beyond the mountain called the Human Genome Project to the next phase, of how it is that we can apply this to the betterment of humankind, for advances in medicine, which were after all always the point. And that will also be, I think, a wonderful set of presentations. So thank you all for being here. It's a great privilege to have a chance to be the one to stand up and announce this remarkable milestone on behalf of all of my wonderful colleagues, with whom I cannot tell you what a privilege it's been to serve. Such a dedicated group of individuals, all focused on this goal and all determined to make all the data immediately available. That is unprecedented, I believe, in the history of biological science and hopefully sets the stage for many other such enterprises yet to come. So welcome to the genome era. Let me now introduce my good friend and colleague Ari Patrinos from the Department of Energy. And thank you all.

Thank you, Francis, and what a day indeed. I missed the guitar accompaniment, however, I must say. Thank you very much for the honor to welcome you all here today. My welcome is on behalf of the Department of Energy, from Secretary Orbach, from Secretary Abraham to my boss Ray Orbach, I just promoted him, Ray Orbach the Director of the Office of Science.
The welcome is also on behalf of, and to, the scientists from the DOE National Laboratories, the academic community and the private sector who have been associated with the DOE program over all these years. I also would like to add a very special welcome from my colleagues in the biological and environmental research program that I have the privilege of leading, who have also been with me in the human genome management trenches and have held steadfast through ups and downs for close to 17 years. This is for me the perfect occasion to express my gratitude and admiration to Francis Collins for his exceptional leadership and for his warm friendship. I also salute all our wonderful friends at the National Human Genome Research Institute for the camaraderie we have enjoyed over these memorable years. I stand in awe of this joyous occasion, and I feel also privileged to have had a front row seat and maybe even played a small role in this most noble of research undertakings. Welcome and thank you.

Thanks, Ari, and it has been a wonderful privilege working with you and your colleagues at DOE. They said that the government agencies can't work together. Well, here is the best example, I think, of why that statement is not correct. We really put our shoulders collectively to the wheel, and here we are, and thank you for your leadership. It's now my privilege to introduce the next two presentations, legendary presentations and legendary people. Francis Crick, who will be the second of those presenters, is not able to travel and so was not able to be here in person, but graciously agreed to prepare a video specifically for this occasion, which we will be showing you in a few moments. His collaborator in that remarkable discovery in 1953, James Watson, is here with us, and we are awfully pleased and privileged that that is the case. Jim in my view is probably the best known living biological scientist, or perhaps the best known living scientist.
His role in that discovery of the double helix of DNA is familiar to all of you. His role in many other areas, particularly what he's done with Cold Spring Harbor Laboratory, bringing it from an enterprise that was seriously flagging into one of the premier biomedical research institutions in the world, is another truly remarkable contribution. But I must say that for me, having the chance to introduce him today, it was his role in getting the Human Genome Project off the ground that made this project credible at a time when large fractions of the scientific community were not enamored of the whole thing. And Jim, with his style, which is direct and forthright and sometimes rattled a few cages, made this enterprise credible, put it on the map, and established all of the original plans about how to get it done in a way that was bound to succeed. And it was my great privilege to step in after he had initiated this effort and have a chance to lead it over the last ten years. So there is no person who deserves more to come to this podium right now and speak to you on this occasion, the 50th anniversary of what he and Francis discovered back in 1953 and the moment of announcing the completion of the genome project which he got off the ground back in 1988. Jim, please come and speak to us.

Over the past couple of weeks I've been very intrigued by a book by a local Washington historian, How the Scots Invented the Modern World. Now my name, Watson, is Lowland Scots, so I sort of warmed to the title. But I find it really of great relevance now to the completion of the human genome. What the book says was that Scotland was the intellectual center of the world all during the 18th century and into the 19th century. Not Oxford or Cambridge or Paris. It was Scotland. And why? It was there that modern moral philosophy was invented. They were worrying about human behavior. And the leader was David Hume, who said that human beings are not motivated by reason but by their passions.
Reason is a way of achieving your passions. Now the passions which to me are relevant today are, one, self-interest. I think this was the one that Adam Smith went on to say: that human beings work best when they work for themselves. Not for the state, not for the king, not for God, but for themselves. And, you know, when I was young that sounded a rather bleak thought, that self-interest dominated. And the second was that human nature gave man a great desire to help other human beings. So it didn't come from "I pick someone up because he'll later pick me up"; it's an instinctive reaction that if you can help another person, you do so. And then the other passion which I want to emphasize, which was not emphasized so much, was curiosity. You just want to understand things. So if you put these three together, curiosity, self-interest and compassion for others, then you can get a modern society. Now when Francis and I found the double helix we were motivated, of course, by curiosity. We wanted to understand life, and DNA seemed to be at the center of life. I didn't want to think about anything else, except girls. But, you know, that was another component of human nature which I don't want to emphasize. But curiosity was really there. And of course we had our self-interest; we wanted to find it ourselves. We didn't expect to find it ourselves. We expected really to find it working with the people at King's, particularly Maurice Wilkins. He was a good friend of Francis and I liked Maurice, and so it was sort of awkward when we found it, what you do about the people at King's. Well, Francis quite correctly said, well, we should have a joint paper. And I think Maurice, in part because of the complexity of the situation between him and Rosalind Franklin, said no. Afterwards he said it was a great mistake, because not enough credit went to King's. So that's how it happened to come out as the three papers.
There couldn't be one paper from King's. So I think, you know, when you do something in science and you know you've got a competitor, you feel better if you can publish together. Or at least a lot of people do. I think, you know, the civility of science demands that, sort of, everyone comes first sometime. That is, people who are competent. So now, to go on to the Human Genome Project, there was just natural curiosity. The human DNA sequence was, you know, the program for human development and functioning. So nothing could satisfy our curiosity better than to find out what we are. Again, you know, if we're curious about anything we should be curious about ourselves. There was also self-interest. Driving me in 1986 was the realization, the slow and painful realization, that we had a son with a very disturbed and unfunctional mind. And I thought we'd probably never understand why Rufus is sick until we can get at the genetic causation of mental disease. So I wanted to get the Human Genome Project done fast. Now to achieve it, it seemed to me very clear that it wasn't going to be one person; you were going to have to get a lot of people working together. And I always saw it as an international project. And I think we've really seen just the best of human beings in the way the consortium has worked together to come out with this sequence. The decision taken in Bermuda, just to put the data on the web the next day and not worry about you yourself using it, was an unbelievably important thing. I think it again showed that the human's instinctive reaction is that you've got to work together. And as for self-interest, the project was so important that you really had to focus much more on working together than on being the first. I think that was it. And that we're here is one reason I look forward to the future. I mean, you can take two views of society.
You've got all this knowledge, and it's going to be misused by people who are going to try and put the rich ahead of the poor and the developed world against the rest. I don't think our knowledge is going to be used this way, because we really want, and instinctively want, all people in the world to have meaningful lives like the people in this room have. And whether you believe the origin of this comes from God or your genes, it doesn't matter. I think it's what we have, and we should reward it and celebrate it whenever we can. Thank you.

This very interesting meeting. On Thursday, February 26, 1953, we had no idea what the structure of DNA was. The next day Jim and I were told by Jerry Donohue that the formulae in the textbooks for two of the DNA bases were given in the wrong tautomeric form. For all four bases, the keto, not the enol, form was correct. Jim had arranged for the workshop at the Cavendish to make scale metal models of the four bases. But waiting for them to be made, he became impatient and made some cardboard ones. On the morning of Saturday, February the 28th, he was idly shifting them around on the top of his desk when he noticed that an AT pair he had made resembled in shape a GC pair. He had not used Chargaff's rules, that the amount of T equals the amount of A, and similarly for G and C. But he immediately realized that his base pairs obeyed these rules. When he showed me the models, I saw that they had the right symmetry. That is, the glycosidic bonds, which join the base to the sugar, were related by a dyad axis perpendicular to the axis of the helix, implying that the two chains ran in opposite directions, which I had already deduced from Rosalind Franklin's data on the space group, face-centered monoclinic, of the A form of DNA. Jim had never liked this idea. He had tried quite unsuccessfully to build a backbone assuming the two chains were parallel. This implied a rotation of 18 degrees from one nucleotide to the next.
Whereas anti-parallel chains require this angle to be 36 degrees. The 18 degree rotation was much too tight to build properly, but with a 36 degree rotation it was easy. In a few days we had built a possible scale model. It turned out to be incorrect in detail, but its broad features were correct. Did we realize the significance of our discovery? Yes, we did. Jim recalls that I announced in the local pub, the Eagle, that we had discovered the secret of life. We spelt out our ideas on gene replication in a second paper in Nature, which appeared just five weeks later, on the 30th of May. Did we foresee the sequencing of the human genome? No, we didn't. We saw as far as the genetic code, but we mistakenly thought that the ribosomal RNA was the messenger RNA. And we did not foresee either introns or RNA editing. We thought then that sequencing DNA would be very difficult and time consuming, nor did we foresee recombinant DNA. But I think it's a rather general rule that one can seldom predict correctly more than about 10 or 15 years ahead. Unexpected discoveries can often change the picture completely. You're all familiar with the present situation; what of the future? Notice that, at least in prokaryotes, the determination of sequence from DNA to RNA to protein, which Jim's still calling correctly the central dogma, is, in informational terms, largely a pure feed-forward process. This is not true of the rates of those transformations, which are strongly influenced by other proteins, other gene products, including transcription factors. We are confronted not with a merely feed-forward process, but with a non-linear system, the theory of which is fragmentary, complex and confused. This and the interaction of large groups of proteins are the problems that now confront us. Curiously enough, they are somewhat similar to the problems of trying to understand the complex neural networks of the brain. There seems to be no limit to the problems that now confront us.
I shall not live to see their solutions, but many of you should survive long enough to see many radically new techniques and striking discoveries. Good luck to you.

Having heard from the two people who discovered the structure of DNA, we're now going to move on this morning to the first session and hear about the first discoveries that followed the discovery of the structure itself. We're going to start with a short vignette, as Francis mentioned. The first vignette is about how DNA works, and it encompasses a number of people who talked about the discoveries that they undertook and their experiences, and how their thought processes moved on from the structure itself to understanding more about life. So, the first vignette.

The 20th century opened with the rediscovery of Mendel and ended with the sequencing of the human genome. By the middle of that century, the structure of DNA was deduced by James Watson and Francis Crick using the X-ray crystallography images of Rosalind Franklin.

I just turned over a page in my life and said, this is the new book. None of these things could be understood until we knew the structure of DNA.

The characteristic of this discovery is that it was so desperately needed. It was desperately needed. Anybody who wanted to apply science to understanding life was just marking time, waiting for that to happen.

There one saw immediately how many of the questions that had arisen in genetics could be answered, and that the model actually explained how one could get a chemical basis, if you like, for inheritance.

So that from somebody like Jim and Francis' standpoint, they could say the critical question in biology was how does DNA work? And we'll learn about that by looking at its structure.

The questions had a linear flow to them, and with Francis there to tell us what they really, you know, to help formulate them. It was like a logic game.
I was meeting together with Jim and Francis and Sydney Brenner, who was visiting, and Jim points out the window and says, see that fellow down there? That's Frank Stahl, and he thinks he's pretty hot stuff. Let's give him the Hershey-Chase blender experiment to do all by himself in a single afternoon and see if he can do that. So I thought, oh, there's a poor guy down there. I better go talk to him. So I went downstairs and I introduced myself, and here's this fellow who is actually selling gin and tonics. He had a big bottle of gin and tonic and ice and limes, and people would come by and he'd sell them a gin and tonic and make a few for himself that way. And he was trying to solve a problem that involved radiation genetics of bacteriophage. And we got to be friends and started talking, and it turned out he was going to Caltech.

At that time, this would be in the 50s, early 60s, Caltech was a major center for the new molecular biology. Both of them were being done there, and in fact everybody who was active in the field would come through Caltech at one time or another and talk. The famous Meselson and Stahl experiment was being done at Caltech, where they demonstrated that the two strands actually do come apart on replication.

I had an idea which wasn't the right idea, but with Frank Stahl we got to the right idea eventually for how to test this semi-conservative replication. And so the idea of the experiment was to start by growing bacteria in heavy medium. We used heavy nitrogen, which you could buy in those days. You can still buy it. And then that would give only one band, in the place where heavy DNA should go. And then quickly separate those bacterial cells from their medium by centrifuging them, and resuspend them in light medium. So now any new DNA would be made out of light nitrogen, and that would form a band higher up in the tube, near the top. Now if DNA replicates semi-conservatively, that heavy DNA, both chains are heavy.
If it comes apart, and each chain separately makes a new chain, then you'd have DNA that has one heavy chain and one light one, and that would be half heavy. It would form a band in between the fully heavy and the fully light. And when everything is replicated once and only once, the band would be exactly in between, and it would be the only thing you'd see. And that's what happened.

Before Meselson and Stahl's experiment, there were several competing theories about how DNA replicated itself. Meselson and Stahl demonstrated decisively that only one of these was right. And this was the one that had been advanced by Watson and Crick in their original paper in 1953. The news of Meselson-Stahl traveled very rapidly through the world of molecular biology on both sides of the Atlantic.

It had been known for some time that proteins were not made in the nucleus, but they were made in the cytoplasm. And RNA was involved in this, and so people thought that RNA would be involved in the manufacture of proteins once this had come about. And so the puzzle was that once we had got to the idea that proteins were made in ribosomes, sorry, then the question is, where was the information? How did you get the information out of DNA and get it into ribosomes? And out of this came the messenger RNA hypothesis, with François Jacob and Brenner and Meselson involved in putting that forward.

And then it was called the messenger because it took the message from the DNA, you see, the message from the DNA into the cytoplasm, where it was transformed.

In England, Sydney Brenner, and in France, François Jacob, were devising a theory about how the information in DNA was carried out to the cell. This involved a molecule called RNA, and in 1960 they went to Caltech to use the Meselson-Stahl methods to see if they were right.

We arrived there to do this experiment. We had about three weeks to do it in, and it didn't work for quite a long time.
We were centrifuging ribosomes in very strong salt, and it didn't occur to us that we had to up the magnesium, because the salt was competing with it and everything came apart. It was a very delicate experiment, very difficult to do. And so once that realization came to me, on a beach, we ran back to the lab, because I got up and started to jump up and down and say, it's the magnesium, it's the magnesium, let's go. So we did the experiment again. It was the last time, and we actually found that it worked.

The key element in the central dogma was the adapter, the tRNA, another idea of Crick's.

Gamow had created the coding problem, formulated the coding problem as one in which he looked at the DNA and saw various cavities. And so he very naively assumed some properties about these cavities which were wrong. Francis Crick solved that when he suggested the adapter molecules, RNA, and the messenger concept, which came from Jacob and Brenner and others.

The adapter was nucleic acid, and you had an enzyme that coupled the amino acid to the adapter, right? And then the adapter would go and find its place on the nucleic acid. And of course at the time he put this forward, biochemists stood up and said this is impossible, on the grounds that had there been 20 enzymes, they would have already discovered them, so there aren't 20.

And that was the last piece of the puzzle, because now you had the transfer from DNA, the messenger RNA, the adapters that made it into protein, and then proteins could go out and do their thing.

But that still left open the question of the code. That is, how is it that the sequence of base pairs in the DNA actually instructs the ribosomes in the cell what kind of proteins to produce?

Watson and Crick wrote two papers. The first paper was the structure itself and the second paper was about the implications, as that last sentence of the first paper was. And in that it's already clear that we're talking about a code, that we're talking about ways of decoding.
Meanwhile the coding problem sort of bounced along. Brenner and Crick did their very ingenious experiment suggesting that it was probably a triplet code.

We needed then to see that we could explain everything by mapping one sequence written in a four-letter language onto another sequence written in a 20-letter language. And that formulated the problem of the genetic code.

Nirenberg did the dramatic experiment with the in vitro protein synthesizing system, showing that polyuridylic acid coded for polyphenylalanine.

One of the polymers we made was a polymer with just uridylic acid residues. And one day Marshall Nirenberg appeared, he worked down the hall in another lab, and he appeared at the door and wanted some poly-U. And that was the first opportunity to define one of the codons in DNA.

And of course the famous polyuridylic acid worked. It made a uniform product of polyphenylalanine, which you could test by incorporation. And so it became very clear that you could get quite far. We call that ultimately the genetic code.

And it took about 10 years after the DNA structure for the genetic code itself to be worked out. And then when it happened you had this enormous rapid development: getting the code, understanding mutation, understanding protein synthesis. It all had to happen like just a trunk opening up and spilling out.

Today Dr. Marshall Nirenberg is Chief of the Laboratory of Biochemical Genetics in the National Heart, Lung and Blood Institute at the NIH. In the early 60s he was a young post-doc working in the NIH and, as you just heard, was working on the decipherment of the genetic code. For this work in 1968 he was awarded the Nobel Prize with Har Gobind Khorana and Robert Holley. The title of his talk this morning is Deciphering the Genetic Code.

Thank you. In 1958, towards the end of my post-doctoral fellowship, which I'd been doing in enzymology here at the NIH, Gordon Tomkins offered me a position as an independent investigator in his section.
And I thought long and hard about what I should do. I decided to switch fields, and the thing I wanted to do more than anything else was to work on an important problem and to explore. And two of the most important problems that were being worked on at the time were cell-free protein synthesis, the mechanism of cell-free protein synthesis, and genetic experiments trying to find out how the beta-galactosidase gene is regulated in E. coli. And so I decided to work on cell-free protein synthesis. My initial objective was to find out if messenger RNA existed. This was 1958. At that time the major thing that was known about protein synthesis was that amino acids are incorporated into protein on ribosomes. And transfer RNA had just been discovered, but it wasn't known at the time if transfer RNA was an obligatory intermediate in protein synthesis. So my long-range objective was to get the cell-free synthesis of a small inducible protein, penicillinase. And so I began preparing ribosomal RNA, because I thought that ribosomes would have a small amount of messenger RNA present, and DNA preparations, and adding them to cell-free extracts to see if I could get the cell-free synthesis of penicillinase. I worked for about a year and a half on this problem, and I must say that I was acutely aware that this was an extremely dangerous project to work on, because I was working alone. Some of the best biochemists in the world were working on cell-free protein synthesis. And I wondered, what chance do I have, without any experience in the field, working alone, to find something useful? But actually the desire to explore was greater than the fear of failure. And so I continued to do this. After about a year and a half I was getting a small increase in penicillinase synthesis that was dependent upon the addition of ribosomal RNA. But it was clear that I needed a more sensitive assay.
Then Heinrich Matthaei came to my lab as a postdoctoral fellow and I immediately gave him the more sensitive assay to use. That is, to use amino acid incorporation into protein to see if the ribosomal RNA would stimulate radioactive amino acid incorporation. And the very first experiments worked. We used the same conditions that I had devised for penicillinase synthesis. The very first experiments worked and I literally jumped for joy, because I knew that we then had a cell-free assay for the function of messenger RNA. Now there were three rather trivial things that we developed that were extremely important in the long run. The first was that I devised a method of stabilizing the extracts so we could freeze them. We didn't have to make them fresh every day; we could freeze them in small aliquots and use them for a long time. The second was that the endogenous incorporation of amino acids into protein was very high and we had to devise a method of reducing the blank. This is incorporation of C14-valine into protein against minutes. Without any addition you can see that there's a lot of incorporation of valine into protein. But a number of workers had reported that the addition of DNase to E. coli extracts inhibited the incorporation of amino acids into protein. I confirmed this, and so I incubated the extracts for about 40 minutes until endogenous incorporation ceased. These were incubated in the absence of a radioactive amino acid. Then I could freeze the extracts, and the background endogenous incorporation was very low. And if you added messenger RNA preparations you could see a large increase. The other really important change in the methodology concerned the washing of the precipitates. At that time, the standard way of washing the precipitates was by repetitive centrifugation. And it was extremely laborious to remove unincorporated free amino acids from the protein by repetitive centrifugation.
It took an entire day just to do the washings. At that time, a new kind of filter had just appeared, Millipore filters. And so I tried to wash the precipitates on Millipore filters and it worked just as well as the repetitive centrifugations. And then I could do in one day what it would formerly take 10 people to do. And this kind of evened the playing field, and that proved to be very important. The next slide. I got as many different kinds of RNA as I could. One thing we did was to fractionate the ribosomal RNA, and we could show that only a small fraction of ribosomal RNA actually was active as a template, as we expected. I found that yeast ribosomal RNA also was active, and that tobacco mosaic virus RNA was extremely active as a template for protein synthesis. Whereas we were getting hundreds of counts incorporated with ribosomal RNA, with TMV RNA we got thousands of counts going into protein. And then I got some synthetic polynucleotide, poly-U, which we wanted to test for template activity. I started to work on the TMV RNA and I gave Heinrich the poly-U, and we went over the protocol for the poly-U experiment. The object of the experiment was to find out whether poly-U was active as messenger RNA. So we made 20 different amino acid solutions, each with 19 cold amino acids and a different radioactive amino acid, and tested each one with the poly-U. And Heinrich found that poly-U stimulated phenylalanine incorporation into protein. And the incorporation was remarkably large. So it was clear that a sequence of U's in poly-U corresponded to the RNA codon for phenylalanine. And this is a picture taken in 1962 of Heinrich and myself. So this provided a relatively easy, straightforward method for determining the base compositions of RNA codons, by using synthetic polynucleotides with different bases and different proportions of bases, shown here. In this slide is shown the minimum species of bases required for mRNA codons.
And we were able to determine the base compositions of all the codons. We could show it was a triplet code by varying the proportions, for example, of C and A in poly-AC and correlating the theoretical frequency of a codon with one C or two, or with one A, with the actual frequency of incorporation of an amino acid into protein. And Bob Martin played a major role in this early phase of determining the base compositions of RNA codons. We were subjected almost immediately to intense competition from Severo Ochoa. Ochoa was one of the best biochemists in the world and he was a fierce competitor. But I found that I actually didn't mind competition at all. I actually kind of enjoyed it because it focused me, it made me more efficient. I didn't really care about winning or losing, but I did care about trying to get to the bottom of the problem that we were working on. And at any rate I accomplished more with competition, so I think that that served a very useful function. But we needed to determine the nucleotide sequences. Also, as shown here, this is the percent of C14-phenylalanine incorporated into protein. It's 100% here, and this is the first antisense experiment. When we added poly-A to the poly-U we formed double-stranded and then triple-stranded helices here, which were completely inactive as a template for protein synthesis. Whereas the addition of poly-C had little or no effect on incorporation. But we needed to determine the nucleotide sequences. And we tried various methods that did not work. But one day I wondered what the smallest functional message would be that would direct the binding of aminoacyl-tRNA to ribosomes. And I wondered if a triplet would work. I went to Leon Heppel. At that time there were only six or eight nucleic acid biochemists in the world. And Heppel was one of the really expert nucleic acid biochemists. And he had some triplets that he had prepared and he gave them to me.
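The copolymer argument mentioned above, correlating the theoretical frequency of a triplet in poly-AC with the observed incorporation of an amino acid, can be illustrated with a short calculation. This is only a sketch of the arithmetic, not the analysis used in the lab: the 5:1 ratio of A to C and the function names are assumptions chosen for illustration, and the polymer is assumed to be a random sequence.

```python
from itertools import product


def triplet_frequencies(base_fractions):
    """Expected frequency of every triplet in a random copolymer.

    base_fractions maps each base to its fraction of the input mix,
    e.g. {"A": 5/6, "C": 1/6}. Assumes bases are incorporated at random.
    """
    freqs = {}
    for triplet in product(sorted(base_fractions), repeat=3):
        p = 1.0
        for base in triplet:
            p *= base_fractions[base]
        freqs["".join(triplet)] = p
    return freqs


def by_composition(freqs):
    """Pool triplets that share a base composition (order ignored)."""
    totals = {}
    for triplet, p in freqs.items():
        key = "".join(sorted(triplet))
        totals[key] = totals.get(key, 0.0) + p
    return totals


# Hypothetical input ratio of 5 A : 1 C (an assumption, not from the talk).
freqs = triplet_frequencies({"A": 5 / 6, "C": 1 / 6})
comp = by_composition(freqs)

# Frequencies relative to the all-A triplet, the kind of number that
# would be compared with relative amino acid incorporation.
for key in sorted(comp):
    print(key, round(comp[key] / comp["AAA"], 3))
```

With this assumed ratio, triplets containing two A's and one C are expected at about 0.6 of the frequency of AAA, and those with one A and two C's at about 0.12, so an amino acid incorporated at roughly one of those relative levels points to a codon of that composition.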
And the very first experiment worked, and that's shown on this next slide. This is the binding of C14-lysine to ribosomes due to the addition of either a doublet, a triplet, or higher homologs, as shown here. The doublet didn't work. The triplet directed the binding specifically of lysine tRNA to ribosomes. And an increase was only observed with the hexanucleotide, as shown here. Now at this time, most of the 64 triplets had never been isolated or synthesized. They were new compounds. And so our major problem was how do we synthesize the 64 triplets? Phil Leder found an advertisement in a journal. Somebody in Europe was selling a half gram of each of the 16 doublets. And they were being sold for a large price, $1,500 per half gram. $1,500 a couple of years earlier was half a year's salary for me. So it was a large amount. But we bought everything. And when they arrived, there were only 15 vials. And when I checked to see what had happened to the 16th vial, I found that the U.S. Customs Service had taken one of the vials and had tested the entire thing for drugs. But we used these doublets as primers for the synthesis of triplets. And we used two methods. The first method, which is shown here in the first slide, was synthesis catalyzed by polynucleotide phosphorylase. Maxine Singer was an expert on polynucleotide phosphorylase. And at that time, Marianne Grunberg-Manago was visiting her lab. Marianne had discovered polynucleotide phosphorylase together with Severo Ochoa a few years earlier. And Phil Leder, Maxine Singer, and Marianne Grunberg-Manago worked to find conditions that would add one or a few nucleotide residues to the three-prime end of a doublet. And they found these conditions, as shown here. And then Leon Heppel suggested an esoteric method, which was simply a one-sentence observation in one of his papers, that we used here.
That is, using pancreatic ribonuclease A, which he had found would catalyze the transfer of a pyrimidine two-prime, three-prime cyclic phosphate to the five-prime end of a primer, such as a doublet, to form triplets and higher homologues, such as shown here. And we used both methods. Mert Bernfield in our lab synthesized about half of the triplets using this method, and we synthesized the other half of the triplets using the polynucleotide phosphorylase method. It took about a year to synthesize the 64 triplets and to test each triplet against the 20 aminoacyl-tRNAs. I should say that Gobind Khorana also synthesized the 64 triplets chemically and actually confirmed many of our assignments and made some new assignments as well. So the final code is shown here, and it became clear that there's a systematic degeneracy, that the third base varies in systematic ways. For example, U and C, or A and G, such as shown here, and that all 64 triplets, in a fully degenerate code, function to code for amino acids. Now, there's an order in the code. For example, amino acids with hydrophobic side chains, such as shown here, have U as the central base, whereas many hydrophilic amino acids have A as the second base. And chemically similar amino acids, such as glutamic acid and aspartic acid, have chemically similar codons, as do glutamine and asparagine, for example. And so this suggests that the code may have evolved from a simpler code. And these are the kinds of degeneracies that we found. Bob Holley gave us some yeast alanine tRNA that he had sequenced. This was the first nucleic acid to be sequenced. And we tested it and found that this species of tRNA recognized GCU, GCC, and GCA. And the base that recognized the three bases in the third position, the wobble position, was hypoxanthine, a trace base. And others subsequently showed that the trace bases recognized the other forms of degeneracy. The four kinds of degeneracy are as shown here.
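The patterns described above, third-base degeneracy and U as the central base for hydrophobic amino acids, can be checked against the standard RNA codon table as it is understood today. The following is a minimal sketch, not anything shown in the talk: the compact string encoding and the helper names are assumptions made for illustration, and in the standard table three triplets (UAA, UAG, UGA) are read as stop signals, marked here with "*".

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in one-letter amino acid symbols, ordered by
# first, second, then third base over U, C, A, G. "*" marks a stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """All triplets assigned to one amino acid (one-letter symbol)."""
    return [c for c, aa in CODON_TABLE.items() if aa == amino_acid]


def amino_acids_with_second_base(base):
    """Set of amino acids whose codons have the given central base."""
    return {aa for c, aa in CODON_TABLE.items() if c[1] == base}


print(CODON_TABLE["UUU"])                 # poly-U template: phenylalanine (F)
print(CODON_TABLE["AAA"])                 # AAA triplet: lysine (K)
print(codons_for("A"))                    # alanine: GCU, GCC, GCA, GCG
print(sorted(amino_acids_with_second_base("U")))
```

Every amino acid with U in the central position (phenylalanine, leucine, isoleucine, methionine, valine) has a hydrophobic side chain, and the alanine codons differ only in the third base, which is the position where hypoxanthine in the anticodon pairs with U, C, or A.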
And on the last slide are all the people who were in my lab during the five or six years that it took to decipher the genetic code. Many of these people went on to become some of the best scientists in the world. I always thought that I was extremely fortunate to work with such wonderful people. After the code was deciphered, I switched fields again and went into neurobiology, went to explore problems in neurobiology, which I'm still doing. Thank you very much.
Thank you very much for those insights and memories. Our next speaker is Professor Stanley Cohen. He's a professor of genetics and a professor of medicine at Stanford University. And in 1973, Professor Cohen, together with Dr. Herbert Boyer, for whom I'm standing in this morning, discovered methods for manipulating DNA that we all rely on so much today. So Professor Cohen, would you like to come and talk to us about the pathway from the double helix to DNA cloning?
Okay, thanks. I should start by pointing out that once the structure of DNA had been discovered, it was only a matter of time until ways would be found to manipulate DNA and isolate the genes it contains. Remarkably, the road from the double helix to DNA cloning was traveled in a short 20 years, and the double helix itself provided the roadmap. Genetics has always been largely an experimental science. Its father, the Austrian monk Gregor Mendel, used grafting and selective breeding of pea plants to gain a basic understanding of the rules that govern heredity. While such manipulations were employed for the next 100 years and methods of gene transfer were developed for bacteria by Lederberg and Zinder and others, the inability to isolate individual genes from the complex chromosomes of higher organisms was a major obstacle to continued progress. This all changed in the mid-1970s when it became possible to transplant genes between cells of different species and cause them to be inherited by descendants of the new hosts.
This process, DNA cloning, sometimes known as recombinant DNA, has allowed gene structure and function to be studied at the molecular level and has enabled the construction of cellular factories engineered to produce biologically and medically important proteins. The usefulness of recombinant DNA depends on both the ability to splice together DNA fragments and the ability to propagate and isolate genes in foreign species. Ordinarily, viable offspring are produced in animals and plants only between members of the same or closely related species, and it had been widely believed that natural barriers, that's a term from the 70s, would prevent interspecies mixing of DNAs. However, in 1973, Herb Boyer and I and our colleagues discovered that genes from virtually any biological species can be cloned by linking them to DNA molecules that have the capacity for replication in the intended host. The invention of DNA cloning stemmed from two lines of basic research. One involved studies of bacterial plasmids, small, physically autonomous circles of DNA that can replicate separately from the chromosomes of the bacteria that harbor them. The other focused on manipulations of DNA itself. In 1968, when I joined the Stanford University faculty, I began to investigate the properties of a group of plasmids that confer antibiotic resistance on their hosts. There was little known about the mechanisms underlying resistance or how such plasmids evolved and were propagated. I wanted to understand these mechanisms. The field of molecular biology had developed in the 1960s as an amalgam of the disciplines of genetics, biochemistry, and microbiology. It had focused on bacterial viruses, partly because viral reproduction generates millions of genetically identical copies of virus DNA. To study plasmid genes, it would be necessary, I realized, to develop analogous methods for cloning plasmid DNA.
Approaches devised at Caltech by Jerome Vinograd enabled the isolation of circular plasmid DNA molecules from bacteria. Here you see some circular plasmids. They were isolated as supercoiled DNA. However, I wanted to be able to modify such isolated plasmids and reintroduce them into living cells. Mandel and Higa at the University of Hawaii reported in 1970 that the chemical calcium chloride opens up pores in the envelope surrounding the bacterium E. coli, enabling uptake of the DNA of a simple virus. While efforts to transform the genetic properties of bacteria using chromosomal DNA had not been successful, work in my laboratory showed that plasmid DNA circles could enter calcium chloride-treated bacteria and be propagated in their descendants. This enabled bacteria containing plasmids to produce colonies when exposed to the cognate antibiotic. In nature, genes for antibiotic resistance become attached to the replication machinery of plasmids by genetic recombination mechanisms working within living cells. I had begun to break plasmids apart by mechanical shearing and to introduce the resulting DNA fragments into E. coli. I hoped that novel fragment combinations would be formed in these cells by the same mechanisms. I also wondered whether plasmid DNA fragments could be linked together while the DNA was outside of cells. A biochemical approach for DNA joining had, in fact, been developed in the laboratory of Gobind Khorana: a virus-encoded enzyme called DNA ligase could splice together DNA molecules that had blunt ends, although this occurred at very low efficiency. However, it was known that efficient DNA joining could be accomplished by using a central feature of DNA. The ability of nucleotides on one strand to pair... Let's see, I don't see it on my screen, but not on... There's something wrong. Oh, there we go, thank you. The ability of nucleotides on one strand to pair with complementary nucleotides on the other.
Virus lambda uses this strategy when it replicates. Nucleotides present on single-strand extensions at the ends of lambda DNA can interact by hydrogen bonding between complementary nucleotides on one strand and the other strand, holding the two ends of lambda DNA together to form a circle. Martin Gellert and others had shown that the nicks in the strands of the DNA circles can be repaired by DNA ligase, producing a permanent joint. The complementary ends of lambda DNA provided a paradigm for DNA splicing. However, in nature, nucleotides at the ends of DNA fragments are rarely complementary. A 1969 thesis proposal by Peter Lobban, then a PhD student working in the laboratory of Dale Kaiser in the Department of Biochemistry at Stanford, suggested a solution to the problem. Lobban proposed to install a stretch of thymidine nucleotides onto the ends of one type of DNA and a stretch of complementary deoxy-A nucleotides onto another using the enzyme deoxynucleotidyl transferase. Lobban hypothesized that hydrogen bonding between the A's and T's would hold the separate DNAs together so that the nicks at the junctions could be sealed by DNA ligase. R.H. Jensen and his colleagues, working in an industrial laboratory, conceived and published an identical strategy independently at about the same time. However, whereas Jensen's efforts at sealing DNA junctions were not successful, Lobban discovered that additional enzymatic manipulations did enable efficient ligation. While the multi-step approach that Lobban and Kaiser developed for DNA joining was biochemically complicated, it nevertheless could splice DNA molecules together. Using A-T extensions and additional manipulations that Lobban had devised, Paul Berg and his colleagues, in a Nobel Prize-winning experiment, inserted DNA from the bacterial virus lambda into the monkey tumor virus SV40.
It was not determined whether the biochemically produced construct could be propagated in mammalian cells, and a concern about possible biohazardous effects of disseminating the tumor virus component dissuaded Berg from trying to propagate it in bacteria. Ironically, these concerns have proved in retrospect to be moot. The site of joining of lambda DNA to SV40 is now known to have interrupted a gene essential for lambda replication, and thus the animal virus DNA could not have been cloned. In the first successful DNA cloning experiments, Boyer and I and our colleagues used an approach to DNA joining that was fundamentally different from the ones I've just described. At a scientific conference in Honolulu, Hawaii, in November 1972, I reported my lab's newfound ability to transform the genetic properties of bacteria by cloning plasmid DNA. I listened with excitement to Herb's description of the action of the restriction enzyme EcoRI, a protein that he and his colleagues had purified from bacteria isolated from a patient hospitalized at the University of California. Herb's lab had found that EcoRI cuts the two strands of duplex DNA asymmetrically at the sequence GAATTC. As cleavage occurs between the G and the A on each strand, EcoRI produced a single-strand extension on every double-stranded DNA molecule that it cut. Moreover, every extension was identical to every other extension generated by the enzyme. This meant, I quickly realized, that all molecules of plasmid DNA cleaved by EcoRI would be fragmented in exactly the same way, and importantly that the ends of different EcoRI-generated fragments could associate by hydrogen bonding. By using the plasmid DNA transformation system we had developed and EcoRI, it might be practical, I thought, to clone individual antibiotic resistance genes. This surely would be better than the mechanical shearing procedure I had been trying.
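The cutting behavior just described, cleavage between the G and the A of GAATTC on each strand so that every fragment carries the same single-strand extension, can be modeled in a few lines. This is a minimal sketch that tracks only the top strand of a linear molecule; the example sequence and the function names are hypothetical and chosen purely for illustration.

```python
SITE = "GAATTC"   # recognition sequence stated in the talk
CUT_OFFSET = 1    # cleavage falls between the G and the A

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def digest_linear(seq):
    """Cut the top strand of a linear DNA at every recognition site."""
    fragments = []
    start = 0
    pos = seq.find(SITE)
    while pos != -1:
        cut = pos + CUT_OFFSET
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(SITE, pos + 1)
    fragments.append(seq[start:])
    return fragments


# The site reads the same on both strands, so both strands are cut the
# same way and the unpaired extension is the bases between the two cuts.
overhang = SITE[CUT_OFFSET:len(SITE) - CUT_OFFSET]

# Hypothetical sequence containing two sites (not from the talk).
example = "TTACGGAATTCCGATAGGCTGAATTCAAT"
fragments = digest_linear(example)

print(fragments)
print(overhang, reverse_complement(overhang))
```

Because the extension AATT is its own reverse complement, the end of any fragment produced this way can pair with the end of any other, which is the property that lets fragments from different molecules associate by hydrogen bonding before ligation.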
As I later learned, Vittorio Sgaramella, and also Janet Mertz and Ronald Davis working at Stanford, had already employed EcoRI-generated complementarity to join DNA ends in vitro. During a walk and evening snack with Herb and others at a delicatessen near Waikiki Beach, I learned that Herb had similar thoughts, and we discussed collaborating scientifically. Here, in a cartoon from a Honolulu newspaper published many years later depicting the scene, we see a black-bearded person, then black-bearded, gobbling an overstuffed sandwich while Herb is either insisting that EcoRI cuts both strands of DNA or, more likely, ordering another two beers. Our plan was to cut purified DNA of R6-5, a large antibiotic resistance plasmid I had been studying at Stanford, into multiple fragments. We knew that this plasmid carried at least five separate resistance genes. From the calculated frequency in that DNA of the six-nucleotide sequence that EcoRI recognizes, at least some fragments were likely to contain intact genes. We hoped that the plasmid's replication region would also not be interrupted by cleavage. The various plasmid DNA fragments would associate with each other randomly by hydrogen bonding, and then DNA ligase would splice the ends together. We would then transform calcium chloride-treated bacteria with the DNA mixture and select colonies that expressed different combinations of antibiotic resistances. As all cells in a colony would be derived from the same bacterium, the plasmids they contained would be identical, we realized. An analysis of the DNA of these cloned plasmids would identify the component fragments. Although the concept seemed straightforward, no one knew at the time whether novel plasmids constructed in this way would be capable of propagation in living cells. We began our collaboration shortly after returning from Hawaii.
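The "calculated frequency" of a six-nucleotide recognition sequence mentioned above comes down to simple arithmetic. The sketch below assumes the four bases are equally frequent and independent, which real DNA only approximates, and the plasmid length used is a hypothetical round number rather than a figure from the talk.

```python
def expected_sites(length_bp, site_length=6):
    """Expected number of occurrences of one specific site in a circular
    DNA of the given length, assuming random, equally frequent bases."""
    return length_bp / 4 ** site_length


def mean_fragment_length(site_length=6):
    """Average spacing between sites under the same assumptions."""
    return 4 ** site_length


# Hypothetical plasmid length in base pairs (an assumption for illustration).
hypothetical_length = 100_000

print(mean_fragment_length())
print(round(expected_sites(hypothetical_length), 1))
```

Under these assumptions a given six-base site turns up about once every 4,096 base pairs, so fragments are on average several kilobases long, which is long enough that at least some of them would be expected to carry an intact gene.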
Plasmids isolated at Stanford were transported to Boyer's lab in San Francisco for cutting by EcoRI and then back to Stanford, where randomly joined DNA fragments were introduced into E. coli. Bacterial colonies expressing only some of the antibiotic resistance genes were isolated and then analyzed by electrophoresis in gels at UCSF and by centrifugation and electron microscopy at Stanford. Missed something here, but okay. Annie Chang, then a research technician in my laboratory, carried out plasmid DNA isolations and bacterial transformations. Bob Helling, who was on sabbatical leave in Boyer's lab, did many of the gel analyses. The strategy that Boyer and I had devised in Hawaii worked even better than we had hoped, and the months in early 1973 were a period of almost unbelievable excitement for all of us. pSC102, the first recombinant plasmid we analyzed, was about one-quarter the size of R6-5 and contained only three of the original multiple fragments of the parent plasmid. It also expressed resistance to only kanamycin and sulfonamide. This result suggested that we had in fact cloned the kanamycin and sulfonamide resistance genes by attaching them to a DNA fragment encoding replication functions. Herb and I quickly recognized the importance of this finding but realized that in order to make DNA cloning a generally useful tool we should seek a simpler plasmid that would be cleaved only once by EcoRI and thus would contain replication functions and a selectable antibiotic resistance gene on the same EcoRI-generated fragment. If the EcoRI cleavage site did not interrupt genes essential for plasmid replication or the antibiotic resistance gene needed for selection of transformed bacteria, this plasmid could be used as a DNA cloning vector. In my collection at Stanford was pSC101, a 6 kilobase tetracycline resistance plasmid. Herb and I found that cleavage of pSC101 by EcoRI generated a single DNA fragment, as seen in this gel.
As has since been done countless times, the plasmid DNA circles were opened by cleaving them with EcoRI. The linearized pSC101 DNA was then mixed with the DNA of the pSC102 plasmid that also had been cleaved by EcoRI. Bonding between complementary single-strand extensions joined DNA fragments from the two separate plasmids. DNA ligase was added and the now closed circles of DNA were introduced into calcium chloride-treated E. coli. When the bacteria were spread onto culture media containing tetracycline and kanamycin, cells that concurrently expressed resistance to both antibiotics grew as colonies. Gel analysis of plasmid pSC105, which was isolated from one of these colonies... there should be a gel here but somehow it is not. Anyway, there really is a gel analysis, which showed that pSC105 contains fragments from both parent plasmids, pSC102 and pSC101, and therefore allowed us to identify the fragment of pSC102 that had been cloned onto pSC101. In May 1973 a manuscript describing our results was submitted to the Proceedings of the National Academy of Sciences. This paper, which was published in November 1973, was the basis for the Stanford University and University of California patents that were licensed to more than 300 companies and underlie much of modern biotechnology. The discovery that genes from E. coli plasmids could be cloned in the same bacterial species by linking them to plasmid vectors didn't necessarily mean that foreign DNA could also be propagated in E. coli. Scientific colleagues offered cogent reasons why the propagation of DNA from an unrelated species would not be possible. However, prior to the development of DNA cloning methods this could not be tested experimentally. Now, Staphylococcus aureus is about as different from E. coli as two bacterial species can be. These bacteria have profoundly different shapes, are enclosed by biological envelopes with very disparate properties, and are not known to exchange genetic information naturally.
To learn whether biological barriers between such different bacterial species could in fact be breached, Chang and I sought to clone DNA fragments from a Staph aureus plasmid, pI258. Whoops. I should say that the, okay. We knew that this plasmid carried a gene for resistance to ampicillin and hoped to clone this gene in E. coli using procedures similar to the ones employed for E. coli plasmids. We obtained transformant bacteria that expressed the tetracycline resistance of pSC101 and the ampicillin resistance of pI258 and found that their plasmids included a staphylococcal DNA fragment as well as the pSC101 vector in this new plasmid, pSC112. As we wrote at the time, these results argued that antibiotic resistance replicons such as the pSC101 plasmid may be useful for the introduction of DNA derived from eukaryotic organisms into E. coli, thus enabling the application of bacterial genetic and biochemical techniques to the study of eukaryotic genes. Shortly afterward, additional collaborative experiments between Boyer's lab and mine, with the participation of John Morrow at Stanford and Howard Goodman at UCSF, showed that eukaryotic DNA isolated from a frog could in fact be cloned in E. coli plasmids and its genes transcribed there, demonstrating the extraordinary range of scientific applicability of the approach that Boyer and I had devised. Later, Boyer and his collaborators succeeded in making a short mammalian protein, somatostatin, in bacteria and found that this peptide showed normal immunological reactivity. Then, in experiments done collaboratively with Robert Schimke at Stanford, Chang and I showed that bacteria could synthesize a biologically functional mammalian protein, the mouse enzyme dihydrofolate reductase, helping to further set the scene for the emergence of what has become the field of biotechnology.
Initially pSC101 was the only vector available for DNA cloning, and this plasmid, as you might well imagine, was quickly requested by and promptly sent by us to dozens of other laboratories, which proceeded to address important biological questions of their own using DNA cloning. Soon additional plasmids suitable as cloning vectors were discovered and were modified for special purposes using EcoRI and the multitude of other restriction endonucleases that were being identified and purified. Using DNA ends generated by these enzymes, replication regions from bacterial and eukaryotic viruses were linked to selectable genes, increasing the flexibility of DNA cloning. Systems for accomplishing efficient production of mammalian cell proteins encoded by genes cloned in bacterial and eukaryotic cells were devised, so that the synthesis of medically important products using DNA cloning methods is now carried out routinely. In closing, I want to say that it has been a joy for me to witness the ways in which the invention of DNA cloning has contributed to the advance of both basic and applied scientific knowledge, and I know that Herb has felt the same way. However, I also want to say that science isn't done in a vacuum. The discovery of methods for cloning DNA depended not just on insights that Herb Boyer and I had about plasmids and restriction enzymes but also on the fruits of prior efforts in other laboratories, as is the case for most discoveries. Notwithstanding the doubts of many in the scientific community that genes from one species could actually be transplanted to, propagated in, and cloned in a very different one, the approach that Boyer and I conceived to achieve these objectives is in retrospect astonishingly simple.
The roadmap in fact came from the properties of the DNA helix, and one can say with certainty that if Boyer and I had not collaborated in early 1973 to show the feasibility of DNA cloning, the cloning of DNA inevitably would have been accomplished later by others, probably approaching this goal from a different perspective. Such is and should be the nature of science. Thank you.
Thank you very much, again, for the insights and the memories and for working through the technical hitches, of which we've just seen the first few. We're now about to see the first change in the program, and in order to get a little ahead of ourselves we'd like to go ahead now and roll the second vignette. These again are memories of people who worked with manipulating DNA, and we'll then break after this for coffee, which will be out in the hallway. So, second vignette.
We were unable to actually isolate a gene. What is responsible for turning these on or turning these off? But the cancer virus itself had RNA as a genetic material. Well, let's take a look at this. Is this really dangerous? Once you had fused the DNA molecules it was extremely laborious and difficult. When I finally was called upon to testify, by this time it was one o'clock in the morning. And this was the starting of the biotechnology revolution. The underlying problem in biology after discovering the genetic code and the mechanism of protein synthesis was actually the question, how do you actually turn a gene off and on? How do you control DNA? Jacob and Monod approached this problem with a combination of genetics and physiology and a little bit of biochemistry, but mostly genetics and physiology, to try to understand a very simple system in bacteria. The system that bacteria have for using the sugar lactose as a source of their carbon and energy.
But back in the mid-1960s we didn't know what any of these genes did, and we didn't know in the case of the repressor whether the gene made an RNA molecule or a protein molecule or what. We wanted to get more into the details. Some of us doing genetics, people like Wally Gilbert doing biochemistry. How do you study a gene biochemically? You need to have at least partially purified that gene. John Beckwith came to work with me first and he actually worked on genetic suppressors. He was a great performer in that area. And so what he decided at one stage was to say, well, I'll purify the lac gene. The basis of our being able to isolate the lac gene comes from the structure of DNA itself. And what we had done was to isolate two viruses. If we took apart the two DNA strands of each of these viruses and slapped them together, the lac genes would form very nice pairs because they were identical to each other. They were complementary to each other and would form these tight bonds. So what you ended up with was a structure in which you had lac genes forming a nice double helix but these two other strands of the virus sticking out and not forming the tight bonds. And we were actually able to visualize that on a microscope, and first of all they were beautiful pictures because they were exactly what we had predicted, which was very nice. And this was the first time a gene had been taken out of a chromosome and purified in a test tube. We called a press conference on the day that the paper was published and announced our discovery. But at the same time, at this press conference, we also talked about some of our ethical concerns about genetic manipulation, and so the media reports on our finding, which were worldwide, were a mix of reporting genetic advances and suggesting how close we might be coming to curing genetic diseases, etc., which was a bit of a stretch, but I mean it's down the road, it's related to that certainly.
John Beckwith's electrifying feat demonstrated that you could obtain a single gene from a bacterium. However, his approach was not useful for obtaining genes from higher organisms, including human beings. But then in 1970 Howard Temin and David Baltimore came along and independently discovered an enzyme called reverse transcriptase that enabled the genetic information in an RNA tumor virus to be integrated into the DNA of a cell. I did the first experiment and I got this little blip. My wife actually was working in the lab there, so I turned to her and showed it to her, and she said no, she said, that's just background noise, and I said, well, you know, maybe it is, maybe it isn't, we'll have a look, and concentrated the virus, and then that was the whole story, because now we were tenfold over the background, so it was real, and I just more or less told everybody in sight about it because it was really very exciting. However, this was just a day before Nixon bombed Cambodia and sent me and everybody else out into the streets, so I went on strike for a week after having discovered the reverse transcriptase, as actually the first thing that I did, and the experiments weren't complete yet. I froze away the experiment in progress, came back after a week, and then finished that experiment and a few others to nail it. So once Baltimore had reverse transcriptase, he realized that you could use it to transform the messenger RNA back into, say, the DNA of any gene, including a mammalian gene. So we had captured genes, mammalian genes, for absolutely the first time in history, and this was in some ways the starting of the biotechnology revolution, because now companies could do that: insulin, whatever you wish. And reverse transcription to this day is the way to capture a messenger RNA.
So once you could isolate genes, you wanted to manipulate them, play with them, try to figure out what they do. Two of the scientists who had this ambition were Herbert Boyer and Stanley Cohen. They were at a scientific conference in Waikiki, Hawaii, in November 1972, and they found themselves together munching corned beef sandwiches in a deli late one night. They got to talking about this problem, and they decided they might have a way to accomplish this purpose. It involved using things called restriction enzymes that cut DNA at specific places, so once they could cut it, then they could insert the DNA of interest in a plasmid, put the plasmid into a bacterium, and watch what the DNA does. The experiment went as follows: Stanley would send us the DNA, we'd cut it up, put it back together again, we'd send that DNA to Stanley, he would transform it into the bacterium, he would send those bacteria back to us, we'd take out the plasmid DNA, cut it up, and look at it in the gel. That was sort of the division of labor. And so Bob Helling and I, when we did the first analysis of what we thought was a recombinant DNA molecule, we stained the gel, and it was late afternoon, and went to the dark room and turned on the black light, and there was this picture, and it was very evident that we had re-assorted the fragments of DNA, successfully recombined them, and we had recombinant DNA molecules. In the summer of 1973 Herb Boyer gave a talk at a Gordon Conference, and it was clear that you could take DNA from anywhere, from a human cell, from a banana, from a fish, you name it, and put it into a little bacterial plasmid and grow the bacteria up, and you would have enough molecules to be able to do chemistry on these things. But as soon as that was apparent, the same day that he gave a talk, several of the younger people there immediately raised the question as to whether you could in that way make things that might be dangerous. The questions led to a moratorium unprecedented in the history of science. It also
led to a meeting, an international meeting to which scientists came from all over the world, in Asilomar, California, in 1975 to discuss this problem. The people who were there were frankly conflicted. On the one hand they wanted to avoid the moral taint that had befallen the scientists who participated in the atomic bomb project, but on the other hand they wanted to get on with recombinant research. Their exercise in social responsibility led them to include the press, and there was considerable open debate. I remember feeling as though I never knew what was going to happen. I didn't know what the right way to proceed was, and here I was, a member of the organizing committee. It was just a tremendous amount of science to get your arms around, and there were huge public issues for which we had no background or training. We talked all this through and came out saying, in this case we don't know enough; we can't be absolutely certain that this is all safe. And that was a wrenching thing for people to say, because what we were really saying to ourselves was, we're not going to do certain kinds of experiments until we've gone through a whole process of convincing ourselves that we're not letting some genie out of the box here. At the end of Asilomar, the great significance was that we felt that experiments could go forward, and could go forward with some sense of assurance that we weren't going to be doing really nasty stuff. If we hadn't invited the press to Asilomar, then the stories would have been less good than they were, the information would have been less precise than it was, we would have been in worse shape than we were. Well, the National Institutes of Health, which was the principal federal agency that would sponsor recombinant research, took the Asilomar guidelines into account when it began drawing up its own rules for the conduct of recombinant research. The NIH issued its rules in June 1976, and they were designed to protect public health and safety. However, a number of people,
scientists, laymen, and government officials around the United States, were suspicious of these guidelines. Their main reason for suspicion was that the guidelines had been drawn up by the same people who were going to be doing the research. This apprehension expressed itself quite forcefully in Cambridge, Massachusetts. The city council held hearings on recombinant DNA, and they then set up a committee to evaluate the potential dangers of this research, and the committee was made up of citizens from across the spectrum of the city and included scientists and experts but also ordinary citizens, you know, working housewives, etc. At the time of the great Cambridge crisis post-Asilomar, the mayor of Cambridge, egged on by some of my friends, was trying to, you know, ban recombinant DNA in Cambridge. I must say that I've been surprised recently to find myself being put among those who are not concerned. I have been concerned and I continue to be concerned. I feel... When I finally was called upon to testify, by this time it was one o'clock in the morning, and it was a circus. You know, there was music, there were flags, there were demonstrations, and there were several different Nobel Prize winners arguing different sides of the issue to the Cambridge city council. There were only two libraries of clones in Cambridge: one was our yeast library, and the other one was Tom Maniatis's, you know, cDNA, what we would now call cDNA libraries, and we were both basically chased out of town for a while. In the end MIT defended me, and that was great, but Tom had to move to Cold Spring Harbor to continue his work for a while. There is an inevitability to the fact that if a community of scientists say there might be danger here, the general public is going to hear there is danger here, and we saw it all in the chambers of the city hall in Cambridge, in Congress, and in the press. We saw every bit of it, and you have to go through it. I don't see any way we could have avoided that. Secrecy is not the answer to
getting public acceptance. You've just got to go through a messy process. Take your seats. All right, the next speaker is Phillip Sharp, a professor at Massachusetts Institute of Technology. In the late 1970s Dr. Sharp conducted a series of experiments where he demonstrated that there was not an absolute and direct relationship from DNA to RNA to protein, but rather, particularly in eukaryotes, there was a process of RNA splicing before translation. This concept of discontinuous genes resulted in him receiving a Nobel Prize in 1993 with Rich Roberts. His talk today will be entitled Decoding the Information in DNA. Thank you. It's a great honor to be asked to participate in this program celebrating the 50th anniversary of humanity understanding the nature of its genetic code, and the celebration of the completion of the sequence of that genetic code, which humanity now can turn to interpreting as to what it means in a biological sense. The next 50 or 100 years is going to deal with the issue, as you will hear later in this symposium, of how we decide to use that information and of what benefit or otherwise it comes. Back in '71 I had just completed a postdoctoral stay with Norman Davidson at Caltech, where with Ron Davis he introduced the electron microscopy methods for visualizing DNA and the deletions and pieces of DNA that you could obtain from the study of first lambda phage and subsequently plasmids, and we have seen today how important that technology was in bridging the time between the study of DNA as a physical entity and the ability to clone DNA, as Stan Cohen represented, with recombinant DNA methods. I left Caltech and went to Cold Spring Harbor, first as a postdoc and then as a staff member, in theory working under the directorship of Jim Watson, but as you know Jim was head of the lab and at that time half-time at Harvard and half-time at Cold Spring Harbor, and I began to study DNA animal viruses. My interests in DNA animal viruses were multiple. First, I understood that it was
possible to introduce the infecting DNA into mammalian cells and understand the expression of the genetic code as expressed in mRNA in the cell, and proteins subsequently, and there were a number of pioneers in the field who had brought molecular biology to the study of DNA viruses. My particular interest at Cold Spring Harbor was to introduce, or work with, adenovirus, a larger cold virus that infects man primarily with the symptoms of a cold, and this viral genome was sufficiently large that one could begin to see the unraveling of a genetic program of expression of genes before DNA replication and then subsequent to DNA replication. It was a long enough genome that one could hope to address some of the very interesting problems in mammalian cell molecular biology related to the structure and expression of genes, and there, with Ulf Pettersson and others, we began to define the restriction maps and isolate fragments of the adenovirus and define the virus's transcriptional program during its course of infection. I was particularly interested at that time in some phenomenology that had been described in mammalian cells that was related to the expression of very large nuclear RNA called heterogeneous nuclear RNA, which was found expressed in mammalian cells but was not found expressed in either bacteria or, as was thought at that time, organisms such as yeast. This large nuclear RNA looked like the chromosomal DNA but was only about ten times larger than the message that people had characterized that existed in the cytoplasm of the cell. That also fed into another issue that was becoming obvious at that time: why were there such large amounts of DNA in the nucleus of higher organisms, particularly vertebrates and plants? It was clear that there probably was not a requirement for that number of genes, and the question was, what was the purpose of all this DNA sequence if it didn't code for genes? Subsequently at MIT in '76-'77, working with Sue Berget and a number of
others, we carried out an electron microscopy study, a comparison of the structure of a mature mRNA that was expressed in adenovirus-infected cells, and compared the structure of that mRNA to the genomic DNA of the virus. And what became obvious when we looked at the pattern of the RNA-DNA hybrid, shown here to the right, was that the RNA and DNA formed a duplex, as visualized here by electron microscopy, and then at the end of the duplex there were three loops that were formed, reproducible in size, and those three loops, as I understood from doing electron microscopy at Caltech with Norman Davidson and others, were due to the joining of DNA sequences, or sequences complementary to the DNA, that were non-contiguous on the genome. And what's illustrated here to the right is the interpretation of this electron micrograph with the RNA-DNA hybrid. We knew the polarity of the strands, so this was the three prime end, this was the five prime end, and we observed these three loops of single-strand DNA formed by the joining, or RNA splicing, of three segments of RNA together in the synthesis of this mature message. And what you see here on the bottom of the slide is the adenovirus genome, the messenger RNA body which I illustrated in the micrograph encoding the hexon mRNA, and the interpretation of these three loops as being created by the joining of sequences from A prime, B prime, and C prime over here to the end of the message, and these intervening sequences here forming the three loops of single-strand DNA sequences visualized in that electron micrograph. At the same time this work was going on at MIT, similar work was going on at Cold Spring Harbor under the direction of Richard Roberts and Louise Chow and others, and came to very similar conclusions about the steps involved in the expression of a messenger RNA in mammalian cells, and that's illustrated here. And the observation of that process that created those three loops, for which we coined the term RNA splicing, explained then the
presence of the heterogeneous nuclear RNA that was found in the nucleus of mammalian cells, and explained the shorter mRNA sizes vis-a-vis those of the nuclear RNA. And that illustrates here that the processes of expressing genes in mammalian cells pass through a process of RNA splicing, with initiation of transcription and then the capping of those RNAs shortly after initiation, the capping process being discovered by Aaron Shatkin and Bernie Moss; the splicing of the introns, removal of the introns, joining the exons, exons being a term Wally Gilbert introduced; and then the splicing to produce a mature message after polyadenylation at the three prime end of the RNA, the polyadenylation process being described by Mary Edmonds and Jim Darnell. And this explained the large heterogeneous nuclear RNA, which Jim Darnell and others had characterized over the previous four or five years, and explained that they were precursors to mature messages that appeared in the cytoplasm of the cell after this process of RNA splicing. Over the years that have intervened from that time, we have obtained from a number of laboratories, the first I believe was actually Wally Gilbert and Susumu Tonegawa sequencing an intron in the immunoglobulin genes, and also from Chambon with ovalbumin and others, the sequences of the introns of many genes in mammalian cells, and we find the process works in the following way. The exons are adjacent to a conserved-type consensus sequence at the five prime splice site, three prime splice site, and branch site, and Joan Steitz and others made the proposal that small nuclear RNAs were going to be involved in this process. The small nuclear RNAs that are involved in the splicing of this major form of intron are U1, U2, U4, U5, and U6, and they account for the splicing of this in the formation of a spliceosome. It was some 15 years later that it was recognized there were actually two types of introns in mammalian cells: this major type that accounts for about 999 out of 1,000, and a
minor type, which has a different set of small nuclear RNA particles forming a spliceosome much in parallel with the major family, and that there were actually two types of introns that had been present in the nucleus of higher cells since plants and mammals diverged. After the development of an in vitro splicing system where one could begin to study the biochemistry of this process, in my lab and in the labs of Tom Maniatis and Michael Green, it was recognized that the splicing of the introns was in two steps: the formation of a lariat intermediate in the first step, in a splicing complex called the spliceosome, and then the subsequent step excising the intron as a lariat, producing the mature message, which was transmitted to the cytoplasm of the cell. Work preceding this from Tom Cech's lab had shown that the intervening sequence within a ribosomal RNA, termed a group I self-splicing intron, was capable of using guanosine as a cofactor to carry out splicing in two steps very similar to the two steps observed in the nuclear process, and then shortly thereafter it was shown that in another type of self-splicing intron, group II, there was actually the formation of a lariat intermediate in the splicing process, further emphasizing the relationship between this nuclear process in the spliceosome and this self-splicing process giving rise to the excision of a lariat intermediate. And in recent times, particularly from the work of Tom Maniatis's lab, I mean Jim Manley's lab, it's becoming increasingly clear, or likely, that this process is RNA catalyzed, as the self-splicing process is an RNA-only reaction. The small nuclear RNAs that are present in the spliceosome are executing this excision of an intron, another process that harkens back to the RNA world, where RNAs were major components in the activities of biological systems. We subsequently defined that a spliceosome, as I mentioned before, was essential, and that there was a spliceosome cycle where small nuclear particles U1 and U2
recognize the 5 prime splice site and branch site and then form the complex spliceosome, which then led to the formation of a lariat intermediate. The spliceosome dissociates, the introns are degraded, the small nuclear particles are reassembled, and the exon is subsequently transported to the cytoplasm, and as I will mention or illustrate in a moment, actually the splicing process labels the exonic RNA, or the mRNA, for transport to the cytoplasm. So in the film that you saw preceding this, where Dave Baltimore described reverse transcriptase as an essential component in the creation of genes for expression in bacteria, it was because bacterial systems do not contain introns, and the only way, before extensive synthesis of DNA, one could obtain the genetic material for a protein was to take the mature spliced RNA and copy it by reverse transcriptase into DNA and insert that into bacterial plasmids for expression of pharmaceutically active proteins. Now introns have been shown to be very important in gene expression, particularly in RNA splicing. It was shown by a number of laboratories that removal of introns reduced the level of mature messenger RNA, a process still being investigated and, as I will illustrate in the last slide, probably accounted for by coupling between transcription and RNA splicing. But I want to point out here that as we have used the human genome sequence draft over the last few years, it has become increasingly clear that about half of all mammalian genes are expressed by alternative splicing, and in comparison of the human genome sequence to the mRNA sequences that are expressed, it's become clear, as you will learn in greater detail, that there are only about 35,000 genes in the mammalian genome as we now understand genes. That would account for only about 1% of the total genetic material being expressed as mature mRNAs to encode functional proteins carrying out the activities in a complex cell. However, I want to disabuse you of the thought, if you ever had it, that
once you have the human genome sequence we understand the biology and genetics of complex organisms. As I just pointed out in the previous slide, there are 35,000 possible genes, considering any one transcription unit as a gene, and I want to emphasize that the complexity of alternative splicing of a single gene could generate a larger number of different proteins than the total number of genes in a genome, meaning that once we know the presence of genes, it is not clear that we understand the functional unit that is being expressed in any particular cell to carry out any specific function. Let me illustrate this with a Drosophila gene from Larry Zipursky's lab. It's Dscam, it's involved in the guidance of neuronal axons, it's expressed in about 100,000 cells in Drosophila, directing the connectivity of axons from those cells, and that one gene can be expressed possibly in 38,000 different ways. So here's the structure of the Dscam gene as shown up here and illustrated for genomic DNA, and each of these bars represents an exon. And if you scan along Dscam here, you see that exon 4 can be expressed in 12 different ways; for exon 6, 48 different exons can be selected here; and 33 from exon 9 and 2 from exon 17. So in the expression of this DNA as an mRNA to make a protein, and note all this variation is within the protein, you pick one from exon 4, one from exon 6, one from exon 9, and one of the two from exon 17, and if you multiply 12 by 48 by 33 by 2, there are about 38,000 different possible proteins expressed from this gene. Not that all of these proteins have been shown to be expressed in Drosophila, but many have, from the work of Larry Zipursky, and it is highly likely that there are genes in our genome that are expressed in as complicated a splicing pattern. So given that we have the DNA sequence, we are a long way from being able to understand how these sequences are expressed as protein, and it's very, very important that we get the genomes of many organisms sequenced so we can
compare those genomes and identify the highly important functional sequences within the genome as compared to the non-functional sequences. The problem of interpretation of the DNA sequence, our problem right now, is that we do not understand how to look across the human genome sequence and identify this 1% of the sequences that code for exons as contrasted to the 99% that code for introns. If you just take these consensus sequences, which I've illustrated in the earlier slide, if you scan this sequence you find 100 different pseudo-consensus sequences for splicing within introns that are not utilized, whereas only eight are utilized for making a functional message. So given that we have the sequence, interpretation of the sequence with the current bioinformatics techniques, looking singly at one genome, identifies only about 75% of the exons correctly. Now we have learned a lot about the splicing process beyond that of the fundamental spliceosome and splicing system, and we've begun to realize that the splicing process is an integral part in a continuum between the DNA and the mature message in the expression of genetic material within higher cells, and I want to illustrate that in the last two points in the following way. We have learned in the last several years, and this is from a number of other labs, Lynne Maquat and Melissa Moore and Robin Reed, that the splicing of introns is coupled to both the transport of the message as well as surveillance of the message to see if there are mutations within the gene, nonsense mutations which terminate protein synthesis, that cause nonsense-mediated decay. And the cell has a way now, due to splicing, to recognize those nonsense codons within the middle of the gene and cause the decay of the mRNA, so that the fragment of protein produced from that messenger RNA will not interfere with the physiology of the cell. This comes from the recognition that in the spliceosome process, as it executes the
splicing of the intron, proteins are deposited, due to that physical process, upstream of the intron junction, illustrated here as the red and white, giving rise to a deposit of a complex which then follows the message to the cytoplasm. In the cytoplasm, those complexes appear on normal messages upstream of the last intron, and they are displaced due to translation, and the mRNA is stable, whereas in RNAs that have termination codons, the termination codon is encountered by translation before the last intron and the last complex, and the presence of that complex downstream of this termination codon then leads to the decapping and the degradation of the mRNA. This integrates splicing into transport and surveillance of the RNA for an open reading frame. And then the last step that I'll illustrate: gene expression is coupled from initiation of transcription through the elongation and splicing through polyadenylation. This is one integrated process from initiation of transcription to the production of a message. So here is illustrated, in a figure from Tom Maniatis and Robin Reed, that RNA polymerase initiates transcription, and it has a platform called the carboxy-terminal domain on which capping factors, splicing factors, and polyadenylation factors are associated, and then as the polymerase moves along, or the DNA moves through the polymerase, more precisely, these factors bound to the polymerase carry out capping, promote splicing, and execute polyadenylation to produce the mature message, and from the point of initiation of transcription through the synthesis of the message is a fully integrated process which we are only now beginning to understand. So I want to again thank you for being included in this wonderful symposium, and I wish us all as a biological community great success and exciting discoveries over the next decades due to the human genome sequence. Thank you. To make a gene that binds that radioactive molecule more tightly. A natural model organism, and it worked great. You said your
mother has a fatal disease. It's a genetic disease. By linkage and collecting families you could in fact find disease genes. Herb Boyer very quickly recognized commercial possibilities in recombinant DNA. Amid all the turmoil over the technique, he tried to interest several backers in this possibility. At first he failed, but then suddenly success came out of the blue and a new industry was born. Fall of '76, I get a phone call from a young fellow by the name of Bob Swanson, and he says, I've been reading about this recombinant DNA technology, and is it ready to be commercialized? And I said, yes, I think so. And he says, well, I think I can get some money and I want to start a company. He says, you think we could get together and talk about this? And so I said okay, and so he comes to my lab on a Friday afternoon, and this fellow walks in, not much hair, very young, he's 29 years old, got a suit and a tie, he really looks good, and people in my lab are looking at him: who is this guy? So he comes into my office and we sit down, and we talked about this, and we sketched out the idea of putting a company together. Swanson and Boyer's conversation quickly led in April 1976 to the formation of a new company. They called it Genentech, for genetic engineering technology, and then in September 1978 at a press conference they announced that Genentech, using recombinant techniques, had successfully and for the first time produced human insulin. The founding of Genentech was key because it brought together the strengths of the academic community with the strengths of the financial community to really move this technology forward, make a difference in human medicine, and it did. I later worked on the set of genes with other genes involved in that with other secret lactose to try to find the control product, the repressor for those genes. The question is, what is the repressor? What is it chemically? I mean, you could say it's a protein that recognizes DNA. Could it be another piece of DNA, or even an RNA
was considered. We isolated that operator region as a DNA fragment: taking all the DNA in the cell, putting the repressor molecule on it, destroying all the rest of the DNA, letting the repressor molecule protect a small piece of DNA just under it, and then isolating that and looking at it. To look at it, we could do a chemical experiment. We could say, here's a piece of DNA, it's about 20 bases long, a little piece of double-stranded DNA; what does it look like? That's the problem of DNA sequencing, immediately. This is back in the early '70s. It's impossible, actually, at that time, to sequence the DNA directly. We tried. Allan Maxam and I then spent two years working out the sequence of the 20 bases of DNA. There was a lot of technological development, the development of electrophoretic methods for separating DNA. Of course, out of that came the sequencing, the ability to do sequencing. I, a year or so later, began to do a set of experiments that tried to ask in detail how the protein touched the operator. Those experiments actually led to a DNA sequencing method, because the discovery we made was that we could decipher not just which bases the protein touched by these chemical modifications, but in fact the pattern of breakage of the DNA by these modifications was so sharp that we could recognize which bases were which in the sequence, and that led to our discovery of a DNA sequencing method. Allan Maxam and I developed a chemical sequencing method which was extremely rapid. At the same time, Fred Sanger and Coulson discovered an enzymatic method that also was extremely rapid. Both methods have in fact the same intrinsic speed. They depend ultimately on the same conceptual trick, and the conceptual trick is to identify the position of a base in DNA by its distance from some reference point. Now, you can find genes in the fruit fly using breeding experiments, but you can't use breeding experiments with people. But then at a genetics meeting in Alta, Utah,
David Botstein was inspired to propose a method by which you actually might be able to map the location of genes in the human genome. What happened was, this fella Kravitz was talking about linkage of hemochromatosis to HLA, all right, and they had a theory, and they did the usual human genetics thing, and in those days linkage in humans was almost impossible to do, and one of the few markers, maybe a dozen, that was useful was HLA, because it's so polymorphic. And they made a theory that hemochromatosis is recessive and linked to HLA, and the molecular biologists were having none of this, because there was all this mathematics and statistics. I actually understood what they were trying to do and was trying to explain to people that HLA was just a marker, and I remember saying something like, if you had polymorphisms all over the genome you could map anything. And I suddenly realized that, well, we do have polymorphisms all over the genome; with recombinant DNA we could see them with Southern blots. It was in the room, and he heard that, and he realized also right away that I had said something that actually was feasible. The polymorphisms that David Botstein had in mind are restriction fragment length polymorphisms (RFLPs), called riflips for short. These are just fragments of human DNA that vary in length from person to person. Now, if a disease gene occurs near a riflip in a person's genome, then the riflip can be used as a telltale sign for the presence of that gene. We knew that the entire human genetics field was missing this capacity, that if that capacity were there, we could find all those disease genes, cystic fibrosis and Huntington's disease and all those things, and we had a workshop with David Botstein, Ray White, a large number of other people from the Utah group, and there was a lot of conversation about, could you really find these little variations to look for a human gene? And in the hall at MIT, David Housman was at one end of the hall and David Botstein was at the other
end of the hall, and everybody was discussing riflips, everybody was discussing restriction fragment length polymorphisms. So they asked us what we thought, and we said, well, you know, your odds are really lousy of finding it without the map; with the map we can find anything. And that turned out to be true: without the map it depends on dumb luck. David Housman said, look, you know, why don't we just look for a disease gene, and maybe we really will get lucky. Well, dumb luck is what they had, and the eighth marker, I think it was, that they looked at is in fact linked to the Huntington's disease gene, and they said, we got it, we got it, we found it. You know: you found it, we got near the marker, this is it, we got the marker, we found the gene, and so forth. You know, I think it really revolutionized the way people thought about what you could do, because up to that point, really, even with the paper being published in 1980, by 1983 nobody thought you could do it. I'd like to introduce the next speaker, Horace Freeland Judson. Mr.
Judson is a professor of history at George Washington University. He's the author of The Eighth Day of Creation, a wonderful, definitive text of pre-1970 molecular biology. His talk today is titled Ideological Consequences of the Human Genome Project. The air conditioning is blowing away my text. It's difficult, in this burst of six months of celebration of the discovery of the structure of DNA and the examination of some of its consequences, for someone like me who's not a scientist to find ways to say things that most of you don't already know and indeed take for granted. The title of my talk: I had the good fortune to be lazy enough to turn in my title late, so I knew what some of the others were, and I saw that there would have to be some way to step outside that kind of thing and to introduce some other considerations. These are fairly simple remarks; I hope they'll make some sense. So we're marking the 50th anniversary of the discovery of the structure of DNA. Now, is this a convenient round number, or does it have any significance beyond ending in zero? Last February Nicholas Wade of the New York Times, and he's at this meeting and may be here in the hall, telephoned me to ask just that question. The 40th anniversary was indeed not much more than a round number. It was observed because the founders of molecular biology weren't sure that they would live to celebrate the 50th; I remember both Francis and Jim saying that at the time. As it turned out, though, the 50th coincides with an achievement that does mark a period, the closing of one era in molecular biology and the story of DNA and the transition to a new era. This is, of course, the sequencing, nearly complete, of the human genome. Historians have traditionally divided their subjects into periods or ages. In 1776 Edward Gibbon began The Decline and Fall of the Roman Empire with the age of the Antonines. In 1945 Arthur Schlesinger Jr.
launched a brilliant career as an historian of the United States with The Age of Jackson. Yet many find this practice suspect. They fear that such demarcations oversimplify the streamy richness of events and make gradual, complex intellectual evolutions seem more abrupt and clear-cut than in fact they were. As the examples suggest, an age is often appended to some particular personages, which also seems likely to skew the emphasis and distort the account. Yet the history of molecular biology falls pretty neatly into four ages, and indeed personages are associated with them one way or another. Before them, though, came a prehistory, conveniently marked off by the first coining of the term molecular biology by Warren Weaver of the Rockefeller Foundation in 1938. Then came the first, or Stone, Age, during which certain understandings became clear and certain essential techniques were developed. That age culminated dramatically in the elucidation of the structure of deoxyribonucleic acid, the double helix of DNA, the stuff of which genes are made, early in 1953. The golden age of molecular biology began with that discovery and encompassed the solving of the genetic code to regulate the processes of the cell. As early as 1968 leaders of the field were telling me that they had put in place an overarching explanation for those processes, for single-celled organisms anyway, bacteria and bacterial viruses. They were asking what new fields were ripe to go molecular; the imperialistic drive was already strong. At that time molecular biology began to change, and radically. The silver age, of the late 1960s to the turn of the century, has been characterized by four developments. The field has grown explosively. Large areas have become commercialized, exploiting the science for profit, the silver age indeed, and industrialized, the science scaled up to a remarkable extent. Even the most basic science has been driven by new technology. Most importantly, beginning about 1970 molecular biology
has divided and subdivided into a number of different fields. Matt Meselson once, in a conversation much like some of the ones that you've seen on these films, said, you know, up to about 1970 molecular biology was a clear, rushing mountain stream, and then it hit the sands and spread out into many different things. At the opening of the new century, our century now, coinciding happily with the 50th anniversary, the sequencing of the human genome was roughly completed, and with it the fourth age in the history of molecular biology is dawning. Will it be iron, will it be diamond? Likely both, and although we can glimpse possibilities, not for another decade at least will we know. So now DNA is a term on everyone's lips, in every day's newspapers and broadcast news accounts. Recently I saw a new line of modernistic furniture called DNA. We are told relentlessly of the wonders that are beginning to flow from the sequencing and manipulation of DNA, from our detailed understanding of the structural biochemistry of molecular genetics, and from our growing skill at altering genes. The technologies of the gene can identify criminals or exonerate the wrongly accused. They can alter food crops, animal or vegetable, we are assured, radically increasing production. Perhaps they can change organisms in stranger ways, so that cows or bacteria, for example, will produce drugs. They can improve the diagnosis of various diseases. We are told that perhaps they will begin to offer cures for certain diseases. From the first years in the development of the technologies of the gene, in the early 1970s, the hope arose that genetic disorders, those where a single gene is missing or defective, could be cured by supplying patients' cells with the correct gene. After more than a quarter century of fruitless efforts and the spending of more than 100 million dollars, gene therapy is becoming possible, although at absurd expense and with risks not yet evaluated. The next hope is in sight: perhaps therapies for diseases where
multiple genes are involved, like cancers. At the least, we are told, the technologies will bring great improvements in preventive medicine, where readings of appropriate stretches of an individual's genome will detect genetic components of susceptibilities to certain ailments. Then there is talk, still, sort of, of making improvements in human hereditary characters by adding new genes to the germline so that they are transmitted to the offspring. One that intrigues me, but I have no basis for saying that it's at all remotely possible: chimpanzees have a metabolic pathway to make their own vitamin C; we do not. Such developments are indeed the chief subject of this conference. I'm not here to advertise them. The consequences of the discovery of the elucidation of the structure, sorry, that's redundant, the elucidation of the structure of DNA, of the half century of burgeoning of molecular biology, of the human genome project, the consequences reach far deeper. To see what some of them are we must turn back the clock, not to 1953 but to 1957. In September of that year, in London, at the yearly symposium of the Society for Experimental Biology, Francis Crick gave a talk. The overall subject of the symposium that year was the biological replication of macromolecules: polysaccharides, lipids, nucleic acids, and proteins. The symposium lasted two days and presented 16 papers. Among them the organizers allowed one about the biosynthesis of oligo- and polysaccharides, and the eminent chemist who delivered that paper began with the announcement, there has been intense activity and dramatic progress in this field during the past decade or more. Every one of the 15 other papers had to do with nucleic acids or with proteins or with the relationship between them. The fact had long been obvious that among the giant biological molecules, though the starches and the fats were chemically repetitious and conceptually boring, the nucleic acids and the proteins were the powerful, the unpredictable and the
interesting subjects, substances. In 1957 Francis Crick was known and highly regarded among scientists for his ideas and his organizing flair as a theorist of biology, but his reputation had spread wider than the particulars of the ideas themselves. These were still confined for the most part to the people, as yet scattered and not numerous (it's hard for us in this room to realize how small the field still was), who thought of themselves not as biochemists nor microbiologists nor geneticists but as molecular biologists. Today we all know that the Crick and Watson structure was a discovery of sovereign importance, but in September of 1957 the essential experiments to test that structure had not yet been carried out, or rather the most decisive of those experiments was just then being performed by Matthew Meselson and Frank Stahl at Caltech and was months from publication. The shuttles of corroboration were only beginning to weave the structure into the fabric of scientific thought. Watson and Crick had of course gone on to other problems. Watson had tried several things, but chiefly to determine the structure of ribonucleic acid, or RNA. This endeavor had proved fruitless for the moment, except insofar as negative answers discipline the imagination. Crick had taken up a more general problem, which was how the genes are, in fact and in mechanistic detail, expressed in the building of the organism. In the first place, the expression of the genetic instructions meant the synthesis of many and various kinds of protein molecules. Crick's endeavor had barely begun to make any progress either by September of 1957. Nonetheless Crick gave his talk the bold title On Protein Synthesis. He began: it's an essential feature of my argument that in biology proteins are uniquely important. (When I read stories to my children I prefer to try to do the voices rather than saying quote and unquote.) They are not to be classed with polysaccharides, for example, which by comparison play a very minor role. Their
nearest rivals are the nucleic acids. Watson said to me a few years ago, the most significant thing about the nucleic acids is that we don't know what they do. By contrast, the most significant thing about the proteins is that they can do almost anything. In animals proteins are used for structural purposes, but this is not their main role. The main function of proteins is to act as enzymes. Almost all chemical reactions are catalyzed by enzymes, and all known enzymes are proteins. (By the way, you'll note the extraordinary parsimony and care with which Francis makes statements of that sort: all known enzymes are proteins.) It is at first sight paradoxical that it's probably easier for an organism to produce a new protein than to produce a new small molecule, since to produce a new small molecule one or two new proteins will be required in any case to catalyze the reactions. I shall also argue that the main function of the genetic material is to control, not necessarily directly, the synthesis of proteins. There is a little direct evidence to support this, but to my mind the psychological drive behind this hypothesis is at the moment independent of such evidence. Once the central and unique role of proteins is admitted there seems to be little point in genes doing anything else. Crick then went on with the same energy and assurance to organize the state of knowledge of molecular biology as it then stood, by means of a series of observations about other people's experiments on the relationship of genes to the organism, and by two general principles. He called the principles by the curious names of the sequence hypothesis and the central dogma. Crick's paper of September 1957, and particularly those two principles, proclaimed the new way of thinking about the processes of the cell that characterizes contemporary biology. The paper stated for the first time the simple formal scheme upon which that way of thinking is founded. The fundamental logic of molecular and cell biology has much more to it, of
course, than the sequence hypothesis and the central dogma, including essential elements that were hardly envisaged in September of '57. Another 10 years were to be required before even a schematic outline of the problem of gene expression, including mechanisms of transcription and translation of the genetic message, the details of the genetic code, and the patterns of interaction by which expression of the genes is switched and regulated, was firmly in place, and that only for simple, well, one-celled organisms. Nonetheless the rest of that logic turned out to conform to Crick's two principles. So Crick's paper on protein synthesis was more than a summary of research and problems. It was a manifesto. Crick regarded it as such. Eighteen years after the paper he told me: at the time, you realize, not so many people were thinking along these lines. I mean, much of the audience, when the lecture was given... I ran over time and didn't get it over very well, and the ideas were pretty new. They weren't new, though, in the small group where people had been discussing it. So it was propaganda, you realize. The sequence hypothesis and the central dogma, and I'll read them out in a minute, set forth a manifesto that worked. They proclaimed the research program that was rising then and that of course dominates today. Molecular biology is not summed up in the central dogma any more than the theory of relativity is fully stated by the equation E equals mc squared. Yet, like Einstein's equation, the central dogma epitomizes a change that touches our most intimate sense of the way things are. Now, one brief historical point. Work that had begun in Sweden and Belgium before the war and continued in France in the mid-1940s had suggested, out of a morass of conflicting evidence, that the second kind of nucleic acid, RNA, served as an intermediary in whatever biochemical relationship there was between DNA and protein. The formula that DNA makes RNA makes protein was first put into print in January of 1947 by two
Frenchmen in Strasbourg, or rather by the nameless editor who wrote the English-language abstract that appeared with the article; the abstract is clearer than the paper. Sometime in 1952 James Watson wrote that out on a piece of paper and stuck it to the wall above his desk in the room at the Cavendish that he shared with Francis. That was the first statement of what has been called the central dogma. If you could give me the first slide, please. In September of 1957, in that paper before the Society for Experimental Biology, when Crick came to give what he called an outline sketch of my own ideas on protein synthesis, here is what he said. And we can go on to the next slide, which is actually a blank; I don't need anything more for a bit. Here's what he said. My own thinking, and that of many of my colleagues, is based on two general principles, which I shall call the sequence hypothesis and the central dogma. The direct evidence for both of them is negligible, but I have found them to be a great help in getting to grips with these very complex problems. Their speculative nature is emphasized by their names. It's an instructive exercise to attempt to build a useful theory without using them; one generally ends in the wilderness. The sequence hypothesis: in its simplest form it assumes that the specificity of a piece of nucleic acid is expressed solely by the sequence of its bases, and that this sequence is a simple code for the amino acid sequence of a particular protein. This hypothesis appears to be rather widely held. Its virtue is that it unites several remarkable pairs of generalizations: the central biochemical importance of proteins and the dominating biological role of genes, and in particular their nucleic acid; the linearity of protein molecules and the genetic linearity within the functional gene; the simplicity of the composition of protein molecules and the simplicity of the nucleic acids. The central dogma: this states that once information has passed into protein it cannot get
out again. In more detail, the transfer of information from nucleic acid to nucleic acid and from nucleic acid to protein may be possible, but transfer from protein to protein or from protein to nucleic acid is impossible. Information here means the precise determination of sequence, either of the bases in the nucleic acid or of amino acid residues in the protein. This is by no means universally held, but many workers now think along these lines. As far as I know it has not been explicitly stated before. Of those two, the central dogma got the attention, partly for a rhetorical reason, the curious name that Crick gave to it. I once asked him about that. Oh, that's a very, very interesting thing. It was because, I think, of my curious religious upbringing, because Jacques has since told me that a dogma is something which the true believer cannot doubt. But that wasn't what was in my mind. My mind was that a dogma was an idea for which there was no reasonable evidence. And most of us can recall Francis's armor-piercing laugh. In June of 1970, in a pair of papers back to back in Nature, David Baltimore and Howard Temin published their discovery, their independent discovery, of an enzyme that transcribes RNA sequences into DNA. By the way, Francis, I've liked those films enormously. I think they're very good; there are very few errors. There's one significant omission: Howard Temin deserved a lot more recognition for reverse transcriptase than that film gave him. The publications engendered excited comment in Nature and elsewhere, saying that the central dogma had been overthrown: the unidirectional flow, DNA to RNA to protein, could be violated. Crick was provoked to a testy, brief reply, which Nature published later that summer. He had not stated the central dogma as DNA makes RNA makes protein. He cited the source, his formulation in the talk in September of '57: the central dogma, this states that once information has passed into protein it cannot get out again; information here means the precise determination of
sequence. He wrote dryly that colleagues had not sufficiently appreciated the precision and parsimony in his formulation, and he published this slide, this diagram. Next slide, please. As he said, DNA can make DNA, DNA can make RNA, RNA makes protein. It is possible in the laboratory, by certain tricks, I'm told, he said, that DNA can make protein; and it's possible for RNA, we knew, to make RNA; and it had by then been confirmed that indeed RNA can make DNA. Since 1957 the evidence for the central dogma has grown to be overwhelming. From time to time efforts have been made to show exceptions; they have failed. Though the concepts of replication, transcription, and translation are sublimely simple, the biochemical machinery at every type or step of information flow has turned out to be complex. Think only of transfer RNAs and how many are needed. There's no trace of a back-translation machinery. If there's one statement from the new science that deserves general currency, it's this assertion of Crick's. Most immediately and narrowly, the central dogma defined the difference between the functions of nucleic acids and proteins. Most widely, the central dogma is the restatement, radical and absolute, of the reason why characters acquired by an organism in its life, but not from its genes, cannot be inherited by its offspring. The inheritance of acquired characters is intuitively so natural a notion; it's still natural today for many. The central dogma provides the biochemical basis for the rejection of Lamarckian inheritance in all its forms, and this includes the ideas of inheritance of acquired characters that troubled Darwin, the ideas propounded fraudulently by Trofim Lysenko, Stalin's agronomist, and the mischievous notions like the collective unconscious or inheritance of racial memories that delude some of the followers of Carl Jung. The central dogma thus turns out to have profound ideological consequences. It affects fundamental tenets of belief systems. The technologies of the gene will drive at least two
great political changes. The first of these: we will not be able to deal equitably with what the human genome project can do for the health and medical status of individuals without moving to universal single-payer medical coverage. Given the present and foreseeable political climate, of course, we may not make that move, but no legislation that tries to prevent discrimination in private health insurance on the basis of DNA screening has any chance of being watertight. More startling, we recall what happened on January 10th and 11th, what the governor of Illinois, George Ryan, did just as he was leaving office. He pardoned four men who had been convicted of murder and condemned to die, and commuted to prison terms every one of the other death penalty sentences in that state. Ryan, when he took office four years earlier, had been a defender of the death penalty. His was an act of conscience and courage. But consider this: the hard evidence, the new evidence, that the death penalty has killed innocent men. This evidence comes from the cases where DNA identifications have exonerated prisoners on death row. But the further fact is that DNA evidence would only be relevant in a portion of the cases of prisoners on death row, and we must extrapolate the incidence of wrongful convictions to those others. The practical, the political, the ethical and moral arguments against the death penalty haven't changed, nor have they changed public opinion much in the past half century. Now, for the first time, we have irrefutable scientific proof that the death penalty is wrong. My real interest, though, is in another consequence, of a radically different order, reaching beyond the practical, beyond the political, to touch and reshape our understanding of who we are and how we came to be. First, another bit of history. In the summer of 1945 Vannevar Bush, electrical engineer, formerly president of MIT, director of the wartime Office of Scientific Research and Development, delivered to President Harry Truman a report titled
Science, the Endless Frontier. Vannevar Bush believed that basic research is technology's feedstock and was convinced that after the war the federal government would have to continue to organize and pay for scientific research and the training of new scientists. He called for and predicted the astonishing, high exponential growth of the enterprise of the sciences that we have all witnessed. The uniquely compelling part of Vannevar Bush's proposal, though, was its comprehensive ideology. Basic research would surely produce practical payout, but how long that might take, he wrote, and from what particular lines, could not be predicted. What a license for curiosity. Sorry, I've lost the last bit. Biologists, molecular, cell, and biochemical, biomedical biologists, have been living that ideology ever since. The promise of medical advances has generated ever larger budgets, and when biomedical scientists are directed by the politicians to attack some particular problem, say Nixon's war on cancer, the cry goes up that the scientists' choices of problems in basic research are in danger, even while the scientists are quietly learning to write their applications for grants for basic research in terms that conform to the latest dictate. All this is fine. Cancer is a blip on fundamental cellular processes; the human immunodeficiency virus is a window into the intricacies of the immune system. But consider the contrast with other sciences, those that cannot sell themselves with any similar promises. I remember the day in the early 1950s when a strange report was on the front page of the New York Times: two Chinese-American quantum physicists named Tsung-Dao Lee and Chen Ning Yang had overthrown the parity of weak interactions. They got the Nobel Prize for that in record time, but on the front page of the Times? Consider the continuing arguments over the Big Bang, the origins of the universe, the news of quasars and black holes or of quarks, the accounts of fragments of the comet flashing into Jupiter, or
of new evidence of the vast meteorites that we now think caused mass extinctions, or the excitement generated by the finding of a new proto-human fossil, a Mayan city in Yucatán, or a pharaonic tomb in Egypt, or the frozen body of a Bronze Age hunter shed by an alpine glacier. Every year, every month, such things are front-page news. Yet what practical result can come from any of this? Why do we pay for the Hubble telescope and its successor now being built, for rockets to Mars, or for new archaeological digs? Since the dawn of human time and in every culture, origin stories have been central. We are all fascinated by origins: of the solar system, of life, of species, of humans, of language, of civilization. With gathering momentum for more than a century, the sciences are telling better, more comprehensive, more grounded, more verifiable stories of origins. Molecular biology has one of the greatest origin stories, as yet only partially told. Two modes of explanation run through all of biology. We have questions of how, and these are about physiology, about how creatures eat, reproduce, run around, and so on, about molecular genetics. The last survivor of the great evolutionary biologists of the 1930s has called these proximate questions. The other mode speaks to questions of why. These are about adaptation or extinction, the changes in species and the struggle for existence across the immense depths of geological time. In other words, they are about evolution. They are ultimate questions. Now recall that the talk of the human genome project, in just those words, is a gross misnomer, because it's not just the human genome but bacteria and roundworm and fruit fly and mouse and soon chimpanzee, and more genomes over the entire range of living creatures. So compare them. Comparative genomics, as we've already heard this morning, is already beginning to give us in full detail the fusion of Mendel and Darwin, of genetics and evolution, of how and why, of proximate and ultimate. One quick example, familiar no doubt to
most people in this audience, but covered by the media too. In the last two years molecular biologists in England and Germany have discovered what appears to be a gene essential to the human ability to use language. The gene, FOXP2, controls the action of other genes affecting several functions, and nobody said this was going to be simple. Some members of a large family in London are afflicted with a single mutation in that sequence that severely limits their ability to use words, to learn and employ normal syntax. Chimpanzees have that gene too. We have acquired two point mutations in the gene, and the statistical analysis strongly suggests that these mutations have been selected for strongly. Whether this is right or wrong, it's the kind of place where some of these answers, the kind of answer, is going to be found. Here's what my friends the biologists ought to be advertising. Here are the transcendent answers. We will thrill to the answers to questions of origins. Darwin said it in the poignant last paragraph of The Origin of Species: there is grandeur in this view of life. Here is the triumph of the scientific worldview. Thank you. Biology was going to drive the kinds of technologies that we would develop. The ultimate of this would be the human genome. For this to be done, this is a major science project. It will have lots of implications to find out how to cure most of the tremendous maladies of humankind. Mapping and sequencing the human genome would require a huge project. Now, the Department of Energy was accustomed to such projects, and it was ready to sign on right away. But biologists in the United States, who tended to work in a kind of cottage industry, were highly resistant to anything that smacked of big organization, major goals, and high technology, so they opposed the project. Nevertheless Robert Sinsheimer, who was then the chancellor at the University of California, Santa Cruz, thought it was worthwhile at least to consider this. I went to one of the first meetings that was ever held on
the human genome project, in Santa Cruz in the spring of 1985, and Bob Sinsheimer had called this meeting because he was thinking about building a center to sequence the human genome. We invited Lee Hood, we invited Wally Gilbert, Dave Botstein, John Sulston, essentially all of the leaders, to consider what could be done. Initially, I have to say, there was a great deal of skepticism. When I received the invitation I thought this was totally silly. What an absurd idea: we had sequenced some million bases; we were talking about 3 billion bases. Once you started to think about a physical map, that that might be feasible, the idea of a sequence stopped being so crazy. It was one of these things which was on the table forever, but who knows, maybe there would be better technology. We talked about where the technology really was at that time, and I realized that in fact the underlying technology, Fred Sanger's techniques and our techniques, were capable of sequencing about 100,000 bases a year per person if intensely applied. And when I left the meeting I became convinced across that time that it was technically feasible to sequence the human genome, and what I saw was basically that the problem was purely an industrial-scale problem. Most people in the scientific community said, that's just machine building, that's not real science. And I don't think that there were many people who had any intimation at all of the kind of revolution that, for instance, the DNA sequencer just by itself, never mind all the other instruments that have been developed, that that would have. I remember one afternoon when three or four of us were sitting around and kind of the whole vision of how you should do this just came out. I mean, one person said, well, rather than four different lanes with radioactivity, let's use four different fluorescent groups, and we can mark each of the four different bases with those fluorescent groups. And another said, why don't we use capillaries to separate the DNA fragments, with each fragment being marked
by the identity of its terminal base? And why don't we have a laser system at the end that can read out the colors as they migrate by? In that one afternoon we kind of laid out a lot of the different ideas, and what was really marvelous about the DNA sequencer is it required the integration of chemistry, of biology, of engineering, and of computer science. The biological community was split about it, I think. There were those who were really enthused and saw that this could move biology into a new era, and there were those who didn't want to go into a new era, liked biology the way it was, and were very much opposed to the idea of big science entering biology. The biology community in the United States could not expect to determine whether or not the nation proceeded with a human genome project. The reason was that the Department of Energy had gained the support for the project of a very powerful U.S. senator, Pete Domenici of New Mexico. I have two giant, great national laboratories in New Mexico, so even in the '83, '84, '85 time frame I was used to having presentations made on big science projects. One of the scientists from Los Alamos, Dr.
Charles DeLisi, he came to my office as a DOE person telling me about a great opportunity. It didn't take me very long to just get committed to the idea, hook, line and sinker. And lo and behold, the NIH went to Lawton Chiles, who was a chairman of an appropriations subcommittee that funded the National Institutes of Health. And instead of arguing, you will note that from the beginning it was a funding program for DOE and NIH. So the great compromise was, don't try to take it all, DOE; let's go at it as partners. We got together, and without authorizing legislation, kind of a great feat, it was appropriated. The critics charged that this would amount to a $3 billion crash program that would do nothing more than produce tedious, routinized, and intellectually unrewarding work, and that would, perhaps more important, deprive other parts of biology of necessary funding. What really did turn the tide was there was a National Academy committee set up to interrogate the pros and cons of the human genome project, headed by Bruce Alberts. He was completely uncommitted on the question of whether sequencing was a good idea or not, so he was truly neutral, so he was credible to all the parties, so he was the perfect choice. Once you realize that the project was technologically feasible and that the ultimate benefits would be very great, then the question becomes a purely practical one. Do we want to spend this amount of money? What is it really going to cost? What's the most optimal way of doing that? Most of the objections were dealt with in the American way, by a sensible compromise. It's one of the few examples that I know of where science policy was made by a bunch of middle-aged folks in Washington who actually came out with the right answer. It was good. In 1988 NIH created an Office of Genome Research and announced that James Watson, who was Mr.
DNA, would be the head of the venture. This decision amounted de facto to a commitment that NIH would take the lead on the biological side of the genome project. I think the reason why I'm so excited by really finding out what our DNA messages are, I think it's really two reasons. One is just very practical: I think if we find out these messages we will understand at a very deep level some diseases which now effectively totally baffle us and sometimes almost destroy us. The other reason is, as a scientist, DNA is really the message of life. What Jim Watson did is he brought to the human genome project a validity and a stature and an imprint that perhaps no one else could have done. Never met anybody quite like him. I would say we'd attribute to him and Lawton Chiles broadening the base beyond the Department of Energy. As we get these genetic messages there will be ethical considerations that will come up: who should know about our genetic imperfections, should they be open knowledge? Looking at the ethical, legal, social, psychological, economic implications of the human genome was pretty extraordinary, because that was quite out of character with anything that the NIH had ever done. I think from the very start NIH should initiate discussions so that as this knowledge is accumulated the public is prepared to deal with it. Before the genome project there had been several efforts mounted by the federal government to focus on ethical issues, in medicine in particular, and this one was really aimed at basic science, and I thought it was about time. There were a lot of critics, and the critics said, no, no, no, you can't do this, you're just going to derail the science, and it's going to be like Asilomar and there's going to be a moratorium, and all these people are going to be naysaying and talking about problems, it's going to be just hopeless, it's going to be a morass. And there were a lot of people who were lobbying him from the very beginning. There were some very good senators
who began to get worried about the potential for using this inappropriately, for things we wouldn't want to do or that should be prohibited. Senator Mark Hatfield got involved. My recollection is that there was a set-aside, that a portion of this money would be used for ethics. There were issues of privacy, keeping information private from third parties. So it shouldn't come as a surprise that, working with the science as we knew it then, and we were all just getting started, the first five-year plan of ELSI actually focused on some of those issues. And lurking in the background was always the question of whether genetics in medicine was going to be different than other parts of medicine. To date the ELSI program hasn't interfered with the practice of genomic science, but it has drawn major attention to a whole host of issues raised by the practice of genomic science, questions for example of insurance, employment and medical care. At the same time the ELSI program has also heightened sensitivity among commentators and public officials to the importance of using human genetic information in ways that maintain privacy and avoid discrimination. The job at the beginning was perceived to be so big, and here it turns out we accomplished it with far less money, far less time. We're here to celebrate the completion of the first survey of the entire human genome. Without a doubt this is the most important, most wondrous map ever produced by humankind. The human genome project has already paid enormous dividends across the whole domain of biology and medicine. It's revealed, for example, that we have only 30 to 40,000 genes, about half of earlier estimates, and not necessarily a lot more than some lower organisms. It's emphatically demonstrated the necessity of international collaboration in such a project, and also the necessity, equally, of keeping the data that comes in freely and publicly available. I didn't think, when in the early stages, that I would really be around when they announced the mapping, and then
the sequencing, particularly the sequencing, would be too far off, but here I am, and now we're going to get even that part done. What I was told: mapping is one thing, sequencing is the ultimate. When we get there, then it's going to be up to scientists to figure out how to use it. Well, as you can see from the program, the entire afternoon is devoted to the topic of the human genome project, and appropriately the first speaker is Charles DeLisi, who deserves the credit for first proposing that this be a large, multi-year, international program, proposing that in the mid-1980s. At that time he was in the Department of Energy. He left in 1990 to become the Dean of the Engineering School of Boston University, where he still is, although he's given up the deanship. He's now both a faculty member and a founding member of a biotech company. So welcome to Charles DeLisi. It's been an absolutely exhilarating experience for the past 15 years watching this program unfold, evolve, develop and then finally be brought to completion. What I think is more impressive than the actual goals themselves was the process by which those goals were met. I think this is a remarkable achievement in the sociology of science. It's something that sociologists and historians will be studying for decades, and I hope they understand how it worked. There's certainly a lot of self-interest involved, as you heard, within labs, between labs, within centers, between centers, within agencies, between agencies, within the international community, and between the private sector and the federal government and academia, and it's just amazing, the rapidity with which this project has been completed. It's a minor miracle, in fact. So I think that's extraordinary, and I'll return to that in a few minutes. Before I moved to DOE in 85 I spent 10 years here in the intramural program, and it was a 10-year period that had an incredible impact on my development and the way I thought about the biomedical sciences, the way I viewed what was important, and
it's what I took with me when I moved to DOE. When I was at NIH I was interested predominantly in immunological problems, but by the early 80s it had become clear to most of us who were given to quantitative extrapolation that the rate of sequencing was going exponential, and it seemed to me then that it would soon get to a point where experimentation wouldn't keep up with the information being generated, and we needed to begin to think about that. Minoru Kanehisa was in my lab at the time; we set up what I think was the first relational database, we developed pattern recognition algorithms, and that was my mindset, even though that was a small part of my research, that was my mindset when I moved to DOE. In fact I remember musing with Jake Maizel here, him saying it should be possible and me thinking, well, I think it's economically not possible. I thought it was technically possible, but for economic reasons I thought there wasn't a chance that it could happen anytime in the near future. When I moved to DOE, one of the things I found is that people were interested in understanding the genetic basis of variation in resistance to low levels of contaminants, what Mark Bitensky used to call disease diathesis and resistance. To get at human polymorphism a reference genome seemed essential, but I had already dismissed that, and when I changed my mind was when I saw the report from the Office of Technology Assessment a few months after I had arrived. That was where I first saw the mention of sequencing the human genome. I immediately called Mort Mendelsohn, who had been on the committee and who was the chairman of our advisory committee at the time. He told me about the Sinsheimer workshop, Lee Hood's interest, Walter Gilbert's interest. That was a defining moment, because then I began to think about the cultural problem. The question then was how extensive it was, whether there was a broad-based underlying interest in this that we didn't know about. David Smith then
made what I thought was a key suggestion. He suggested a workshop of leaders at which we would discuss, with which I charged them, the technical feasibility, the costs and so on. And it's kind of interesting, can I have the first slide, compiled from letters that I had received after the workshop. I still have these in my files, and it's quite interesting to read these 15 years later. Some of them are very, very prescient, both with respect to the problems that we were going to face and with respect to the consequences. I can remember setting aside a couple of hours on an otherwise hectic Tuesday afternoon, this is it, at which I read these and summarized them in this memo to the assistant secretary, who at that time was Al Trivelpiece, basically the 10 to 15 year projection. What was most uncertain in my mind, and I kind of have this optimism about human creativity, I think anything that's solvable, what was the greatest uncertainty was economic. That is to say, I didn't know, obviously no one knew, what the economy was going to be, and to me it was going to be central in determining whether this thing really moved forward. In fact the gods were with us. We had an economy in the 90s that was unprecedented. No one could have predicted Craig Venter, of course, and that turned out to be very, very important. In general the leadership, obviously it's the scientists that made this happen, but the leadership that set the conditions: Francis Collins here at NIH, his predecessor Jim Watson, Jim Wyngaarden, Ari Patrinos and Dave Galas, and a staff, and I could speak personally here about DOE only, a staff that not only supported but that led, and Dave Smith in particular is a name that's not mentioned enough in this. He was absolutely central as far as the Department of Energy is concerned. This memo and subsequent briefings that I gave in the department were enough to generate a certain amount of support from DOE. Looming large in front of me was the Office of Management and
Budget, which I feared greatly, but I found out that OMB was very, very supportive, and they were supportive, I think, because this was presented as an engineering project with milestones, goals, a beginning, a middle and an end, which it was, and not as a scientific project, which it was not. This was to create an enabling resource, and that was the distinction, and it was a very, very important distinction. That left the Congress, and as you just heard, Pete Domenici, for me at least, was a very crucial figure. This is just a quote from the many pieces of legislation he introduced over the course of a decade. You can see this was one of the earlier ones. He was a staunch and steady supporter of this project, and we would have had rough going, I think, if he weren't. So this is a wonderful way to end a century of revolution, not just in biology but in many areas of science. My feeling is the major challenge of the next century, the century we're currently in, is not going to be scientific. I think science is going to continue to accelerate; it's too intrinsic a part of human nature not to, it's almost an organic necessity. I think the major challenge as far as this area is concerned is access, and what I have in mind in particular, there are a lot of problems of access, but in particular the convergence of reproductive technologies and genomics, as we begin to develop a command and ability, a perfection of the Darwinian theme, playing out the Darwinian theme both in vivo and in vitro: full genome scans followed by embryo selection. That, I think, given the current economic conditions in the world, is almost guaranteed to introduce forces that are going to lead to greater polarization. I do not subscribe to what I sometimes refer to as the Yahoo-Houyhnhnm hypothesis, that is to say, that we're on the road to alphas and deltas, to Eloi and Morlocks. I'm a lot more optimistic about the future than that; in fact I'm totally optimistic about the future. I think part of that is that I'm just an optimistic person,
but also a large part of it is the genome project itself. The genome project gives us cause for social optimism. It is, I think, in a non-trivial sense an enormous experiment in the sociology of science, which has been successful beyond any of our expectations. I think there's a tremendous amount to learn there. I hope the sociologists study it, and I hope we can apply it to solve the very problems, that is, the sociology of the project will help solve the very problems that some of this technology is going to exacerbate. And these are problems which are not peculiar to this particular age. The sorts of problems that exist are very fundamental to human nature; they've been with us for millennia, but in fact their resolution is now crucial, it really is crucial. We're faced with a power law distribution of wealth, not just within this nation and industrialized nations but between nations, and that sets us up for polarization with these technologies. I think that the human genome project, I hope, will give us at least part of the answer. If it does, it will tell us much, much more than just about our biology. Thank you very much. Second speaker of the afternoon is Dr.
Bruce Alberts, who's president of the National Academy of Sciences, and you've already heard about his role in the genome project from the movie. At the time that he served as the chairman of the academy committee studying the feasibility of the human genome project, he was chairman of the department of biochemistry and biophysics at the University of California in San Francisco. And as indicated in the movie, but hard really to appreciate unless you were there, that committee was absolutely central in gaining sufficient support in the life science community to go forward with this project. Prior to that committee, the opposition to it within the biological community was quite intense. That dissipated with the report of the committee, so I view that as a critical development in launching the project back in the 80s. Welcome to Bruce. But let me start by congratulating everybody who's actually been involved in getting to today. I think it's a remarkable achievement to have the final completed human genome sequence. Unfortunately we've been crying wolf several times prior to this, and I don't think the public understands quite well enough what today actually really means. I'd like to start by telling you a little bit about the academy. I'm here to talk about the academy's role in this project. The first slide, could we have the slide up on the board? Well, I don't have a Macintosh, so I can't figure out why this one's not working, but anyway. Okay, thank you. This is where the academy is, right near the Lincoln Memorial and the Vietnam Memorial in Washington. That's a big statue of Albert Einstein, and I like this slide because it symbolizes the image that we would all like to have for science: something accessible and something that the public can understand. And that's a major component, as you've heard, of the outreach programs of ELSI and other parts of the human genome project. The academy is a private organization, chartered at the time of Lincoln to exist as such in Washington DC, and we had this
little piece of our charter that nobody understands that turned out to make all the difference. In order to exist as a private organization we must, whenever called upon by any department of the government, investigate, examine and report upon any subject of science or art. Art at that time meant technology. Here's the catch: but the academy shall receive no compensation whatsoever in the United States. And this led to a great volunteer tradition. Subsequently three other organizations were incorporated under the same charter: the National Academy of Engineering and the Institute of Medicine, so we have science, medicine and engineering as honorary associations with some 5,000 members in all, and then, during World War I, the operating arm, the National Research Council, which allowed us to bring not only engineers and medical professionals into these committees but lawyers, teachers, whoever else we needed, depending on the question that the government was asking. And today we call ourselves the National Academies, because we really do work as one in almost everything we do. We're very active, more than one report every working day. The reason why we're valuable is that even though the government pays for most of what we do, study by study, and that means they pay for the transportation and the meals of the volunteers and for the staff work that's needed, even though they pay for it they're not part of the process, and we don't negotiate the answers that we produce. We produce a report that is released both to the government and to the public and to the press at the same time, and that's why we are respected by the people who ask us for advice. There are two kinds of reports, basically. Most of our reports are what we call science for policy: what are the facts about substance X that might make it dangerous, so that then somebody in Congress or the administration could make a policy about it, for example. There are many, many such examples; I'll show you one. What we're talking about today, though, is a second
class of report, a minority class but a critical and important one, policy for science: how do we keep science effective and healthy in the United States and the world? Here are a few examples of reports, one of each kind. This is arsenic in drinking water. At the end of that big debate in the Bush administration in the early days about whether they should or should not accept the Clinton standard, the administrator of EPA, Governor Whitman, came to us asking about the human health damage at levels of arsenic at 5 parts per billion, 10 parts per billion and so on. We produced that report; the government then decided what level to set the standards. We do that over and over again, trying to create a factual basis on which policy makers can make wise decisions that involve much more than science. Here's an example of a recent policy for science report, a report on how we keep the enterprise healthy with a growing number of postdoctoral fellows in holding patterns for jobs: how do we support these young people so that they will have productive careers and we will continue to attract the best young people into science? Of course this is the report that we're talking about; you've already heard about it several times. It's of course a policy for science report, like the postdoc report. Our work started, and I got involved, in 1986, at which time there was a huge debate going on in the scientific community. This is just an example, part of that debate. This is a meeting at Cold Spring Harbor that Jim Watson called in the summer of 1986, and there's Wally Gilbert on the pro side and David Botstein on the anti side arguing about this project. So in retrospect you might wonder why it was so controversial in 1986, and here's my answer. First, more than 95% of the DNA sequence was viewed as junk, so it was argued that it would be much more efficient just to sequence cDNAs; why sequence all that junk? Second, the largest continuous DNA sequence completed at that time was about 150,000 nucleotides, and our committee eventually could calculate that
with available technology our genome would require 30,000 person-years to sequence. So the image at that time in many scientists' minds was filling the jails with people sequencing DNA; we couldn't imagine that kind of science. Last but not least, it was broadly viewed as a big science threat to small science culture and small science funding. So in the late fall of 1986, shortly after the Cold Spring Harbor meeting, I got a call, sitting in my office in San Francisco, asking me if I would chair this committee that had already been established, with all these wonderful people like Jim Watson already on it, to see whether there should or should not be a project in the United States to map and sequence the human genome. I must say I was completely shocked. As you saw in all those previous slides, I hadn't attended any of the previous meetings. I told them I hadn't even thought about it, why would you want me to be the chair? And they said, that's why we want you to be the chair. In fact I think there was a second reason. A year earlier I had published this article in Cell about limits to growth in biology: small science is good science. And I still very much believe that most of the creativity comes from laboratories where the principal investigator can really understand the nature of the experiments going on. And so I was viewed as somebody who would probably be biased against the project from the start, and if they could convince me, then maybe it would be more widely accepted. That was probably why the advocates accepted the idea of me as chair. So we started in December 1986, and we had a divided committee. What we do at the academy on any issue like this is we get all the reasonable points of view, both sides of the argument, on the committee, and then we ask the committee to argue it out based on evidence and come to a consensus view. In this case we had people against the project and people rooting for it, and so rather than try to take a vote at the beginning, the strategy was of course to collect
evidence; that's what scientists do. So we had some 22 different experts come in to tell us about what was really happening on the front lines: how hard would it be to map and sequence the human genome? One of these experts, Maynard Olson, was so effective that we quickly invited him to join the committee so that we'd get real ground truth on the committee itself. Here's the next speaker before he aged. The final committee is listed here. It was a wonderful group of people, and basically it was a learning experience for us all, certainly for me, and I think for everybody. We all learned a lot about the science and what was possible. I have a few photographs taken at the time. John Burris was the staff officer; he was so effective he's now president of Beloit College. Of course Sydney Brenner, John Tooze there. The only one who's no longer with us is Dan Nathans, unfortunately, the Nobel Prize winner, who was a wonderful contributor to the committee. Part of the strategy of running committees is whenever it gets tense you take a break, and we spent a lot of time eating. That may account for my girth, actually. Here's more eating. And that actually works. So despite the very different opinions at the start we quickly reached a consensus that some special project was needed, and in retrospect here's, I think, why. We knew that rapid progress in genomics research would depend on a free sharing of all genomics data, and we heard all these, I would call them horror stories, about the people who were not sharing such data, which got us upset and made it clear that the business-as-usual mode of operating would not work. We felt that funds for a special project could be used to create a new set of obligations for resource sharing among genomics researchers, change the sense of the community and create much greater progress thereby. And the result of course would be to make the final maps and sequences available as a public good, so this data would benefit everyone's science. This is a relatively short report as reports
go from the academy, only about a hundred pages. This is the outline of the chapters, and I'm going to now just go over very quickly the major conclusions. First, we concluded, and this was unanimous, that there should be a special project funded by the federal government to map, sequence and acquire increased understanding of the human genome. Second, the initial focus should be on producing high-resolution genetic linkage and physical maps. Third, with regard to sequencing, which was the most controversial part of this, immediate efforts should concentrate on improving techniques, with large-scale sequencing efforts delayed until a major drop in cost, and we actually set a target for that. And fourth, we strongly urged that a comparative genetic approach be taken, that it should not just be a human genome project. We needed the genome sequences of model organisms; in particular the ones shown here were specifically called out. This data would be necessary to interpret the sequence data on the human genome, and of course that really did turn out to be true. At the time we were criticized for tossing this out as a political sop to the community, but it wasn't that at all. And finally, by a mysterious process that only Jim Watson understands, we came up with this number of 200 million dollars per year in 1988 dollars, and it should take 15 years. That part of the process, which was a subcommittee working on this that included Jim, was completely beyond my understanding, how you'd ever figure that out. Now, I wish all the Academy reports were this successful. We released it in February of 1988. Thanks to Jim Wyngaarden, your chair, this immediately went into action at the National Institutes of Health. In contrast to the Department of Energy, the National Institutes of Health had initially appeared to be very skeptical about this idea. They were reflecting really the community's views, but as soon as our report was released, Jim Wyngaarden, who was then director, really very strategically moved immediately to set up a
series of planning meetings, which led to the appointment of Jim Watson as the director of this human genome research effort at the NIH only about 7 months after our report had been released. Now, if all reports had that effect I'd be extremely pleased. Looking back, I think the report was successful; that's why I'm speaking here. But one of the things we really got right you've already heard about today, at the press conference and elsewhere. We say, and this is a quote: sequence of the entire genome is certain to reveal unsuspected sequences having important functions. For example, one of the great challenges of a genome sequencing project is to identify potentially important functional domains involved in gene regulation and chromosome organization. The identification of such sequences will require sequence comparisons between the analogous intergenic regions in multiple species, including human versus mouse. And this was to counteract the argument that we should only sequence cDNAs, which was a strong point of view on the part of many people. We didn't get everything right. One important thing we missed: we had no expertise on intellectual property on our committee. We had no patent attorney. We understood what was going on; we were told that DNA could not be patented without a strong utility, and what we had to worry about instead was that somebody might want to copyright the sequence. And so instead of addressing the patent issue, the committee wrote about DNA copyrighting and recommended that such copyrighting not be allowed, so as to keep genomic data freely available. I can't help but wonder, given the course that actually happened in the United States, of what I would view as excessive patenting of sequence information without adequate utility, I can only wonder, if our committee had been better informed, whether we could have had some effect on that process. Part of my job as President of the Academy is going out around the world meeting with scientists
from other nations, and I can tell you that the image of the United States for sequestering, quite unfairly in their opinion, in the form of patents, DNA sequences without actually demonstrating strong utility for those sequences is viewed as a grab by this nation and greatly resented. The human genome is viewed as an asset and the property of the whole world. Nobody objects to real utility patents when you really discover something, but there's a general feeling around the world that we have gone too far in what we've allowed to be patented. Thus the President of the Royal Society at the time, Aaron Klug, and myself produced this editorial in Nature in 2000. There are many other issues on the agenda right now of the policy for science variety that I'm talking about. There is of course still the issue of how much intellectual property is enough, what should we do about requiring other countries to accept our own intellectual property provisions. The whole issue of plant sciences, of international agriculture, is affected by these discussions. There will be a major report that's been in the works for several years on intellectual property issues released this summer, co-chaired by Rick Levin, the President of Yale. So the academy continues to deal with many of these issues. Others of course include issues of stem cells, human reproductive cloning, the sharing of data and materials; we just had a major workshop report chaired by Tom Cech. The whole issue now about sensitive but not classified information in biology: how do we deal with that, what is it? If we don't do this right it threatens to paralyze the whole enterprise. So the academy very much will continue to have a central role. I just want to end by referring you to our website. We made a conscientious effort to put everything that we publish up on our website. You can read it for free; 2500 books are up there. We are also trying to organize it and prepare it in ways that will be useful, particularly for people around the world
who don't have access so easily to the rich intellectual resources that we have in the United States. Thank you very much. Our third speaker this afternoon is Maynard Olson, who has been partially introduced by Bruce. Maynard was a member of the academy committee, as you've heard. At that time he was at Washington University in St. Louis. He has since moved to the University of Washington in Seattle, where he heads their genome project and is a professor of medicine as well as genomics. At the time that this project was begun, as Bruce emphasized, the technology was rather primitive. In fact I think it was estimated it would take some 55-60,000 E. coli clones to deal with the entire genome. One of the major early technological advances is owing to Maynard Olson: that's the development of the yeast artificial chromosome, the YAC chromosome, which immediately reduced the number of required clones by a factor of 10 or more. So Maynard, welcome to the podium. Well, thanks, Jim. This is a historic occasion both for our community and our interplay with society, also for me personally. I'm used to coming to historically oriented symposia and being entertained by what Jim Watson or Sydney Brenner looked like when they were young, but I'm not so used to being amused by pictures of what I looked like when I was young. Today is largely a day of dreams, retrospective dreams and prospective ones. An interesting feature of the program, which sets the stage for my talk, is that I was preceded by Charles DeLisi and Bruce Alberts setting the policy stage for the human genome project, and the next scientific talk on the program, by Eric Lander, is entitled Beyond the Human Genome. There's clearly something missing here, and I have 15 minutes really to represent the activities of a vast number of people over a period of 15 years, really the toil, sweat and tears part of the human genome project, that is, how was this particular dream made real? When the Alberts committee issued its report, as this graph indicates, the state of
cumulative sequencing experience in the world was still lost in the baseline. In 10 years of experience with methods introduced in the late 70s, as we've heard, by Fred Sanger and Wally Gilbert, the worldwide community, on all organisms, had sequenced a total of about 15 million base pairs. The quality of this sequence was highly uneven and the average segment length was about a thousand base pairs. So relative to this unpromising start, this project clearly faced an enormous challenge to scale up, and it's the success in meeting that challenge that we're here to celebrate today. Somehow they started me with my last slide, or nearly my last slide. This is the graph showing cumulative progress, lost in the baseline in 1988, and the faster than exponential growth since that time. There were two predominant technical challenges that needed to be faced to implement the recommendations of the Alberts report. The mapping aspect really involved a very serious problem in establishing long-range continuity on a scale that really had never before been imagined. At our press conference at noon it was announced that the average base pair in the human genome now resides on a sequence of some 20-plus million base pairs, and to have achieved that goal is a truly astounding technological accomplishment. I can't summarize even a tiny fraction of the technology development, not to mention the hard work, that went into that. Let me comment just very briefly on an aspect of the problem that I was involved with. Working with Eric Green in the late 1980s, we developed a method that came to be known as STS content mapping, the idea really being to take advantage of the then newly available technology to put mapping on a purely informational basis. That was really the key idea here. Previous ideas about physical mapping had been very clone-based, including work in my own laboratory, and held up this specter of truly vast clone collections that would have to be maintained over a period potentially of decades and kept
straight and stable and, worse yet, locked the project into a 1980s recombinant DNA environment. As we all know, the styles of recombinant DNA work, the availability of specialized vector systems such as the BACs that really carried the major load at the end of the human genome project, turned over several times since the late 1980s. This idea of getting mapping on a purely informational basis, so that you define mapping landmarks in terms of sequence, and that ultimately these landmarks could simply be found by DNA sequencers, as they were in sequencing the genome segmentally, was one contribution to the successful development of an international distributed program using a lot of different technologies and a process that could keep up as time went by. A number of prominent early members of the human genome project, so many of those vignettes, joined me in sort of advocating this approach to physical mapping in a 1989 Science piece, and really the key idea was this advocacy of using short tracts of single-copy DNA sequence as landmarks to define positions on the physical map, and the key notion that these landmarks could allow integration of information from many different sources. And that's actually what played out really just within a few years. I wish that all of the opinion pieces that I've associated myself with, just as Bruce said about the NRC report, would have this much impact this quickly. But within a few years, using really a big variety of techniques, prominently the radiation hybrid method introduced by David Cox and Rick Myers, clone-based STS content mapping, and genetic mapping, all of which fit into this sort of informational paradigm for achieving long-range continuity. Nonetheless, the mapping problem coming under control only brought more focus on the staggering problem of increasing the scale of DNA sequencing beyond the really minimal cottage-industry levels that my first slide showed to the scale that was required for this project, much less the grander ambition of
this comparative genomics that Bruce already talked about and that we'll hear more about this afternoon. So the second big challenge was how to scale up sequencing throughput, and particularly to do it without compromising finished sequence quality. I already mentioned that there were clearly uneven quality control issues associated with early historical sequencing, and that was despite the fact that every one of those base pairs was carefully curated by a graduate student or a postdoc. And now we were talking about sequence untouched by human hands, untouched by human minds, that would flow into some kind of highly automated information processing system to produce this enormous end point. So I thought that I would simply take one critical aspect of the large-scale DNA sequencing technology and talk about it in slightly more detail, and this is really how the four-color fluorescence method, which was introduced from Lee Hood's lab, as we've heard, in the early 1980s or the mid-80s, in a project headed by Lloyd Smith, how the four-color fluorescence method, which emerged ultimately as the dominant technology, got there. It didn't have the inevitability that all successful technologies have in hindsight, even viewed as late as the early 1990s. Again, the story in its full ramifications would exceed the scope of this talk, but let me just give a few examples, some of these papers I suspect not even well known to the genome center participants here, which addressed some critical limitations of 1980s four-color fluorescence technology. The biggest limitation of this technology is that it really had intrinsically rather poor signal-to-noise characteristics. A lot of people still don't realize that, because the solution to these problems came not from any one technology but from a whole series of both biological and engineering technologies. I'll just touch on a few examples. In the late 1980s, in John Majors' lab, the cycle sequencing method was introduced, which was a spin-off of PCR technology with thermostable DNA
polymerases. This idea that you could simply repetitively run the dideoxy sequencing chemistry and gain more than a factor of 10, a factor of 20 or 30, in signal-to-noise ratio greatly reduced the template requirements of the method and, in the process of reducing the amounts of template required, also reduced requirements for stringent purity, and this really was critical to the high-throughput methods that carried the weight in the human genome project. Lee Hood talked about capillaries in his afternoon of insight that led to four-color fluorescent sequencing, but the first commercial implementations of it, for a variety of reasons, were on slab gels, and indeed the early years of the human genome project depended on slab gel technology: very cumbersome, essentially unautomatable. A number of companies went broke attempting to automate the casting of prefabricated slab electrophoretic gels which would have enough resolution to do DNA sequencing on. The, surprising to me, resolution of this issue came in the early 1990s with the sudden recognition (Norm Dovichi's lab played a key role, as well as a number of others) that linear polyacrylamides, non-cross-linked matrices, could actually achieve the exquisite resolution required for DNA sequencing. This led to pumpable, replaceable matrices in a truly automated process, and it's that process by which we collected most of the data for the human genome project. Energy transfer dyes, first introduced by Rich Mathies and Alex Glazer, added additional leverage on the signal-to-noise front. Mutant DNA polymerases played a critical role, particularly the ability to genetically engineer the Taq polymerase, the thermostable polymerase made essential by the cycle sequencing, so that it lost most of its discrimination between dideoxy and deoxy nucleotide triphosphates. These are all rather technical issues, and I use them just as an example that there was a lot going on during what now in retrospect looks like a lag phase of the human genome project. All
exponential processes have such lag phases, and really we were in no position to do a sensible scale-up of this technology until some of these component technologies really came in place. And of course they were occurring in a complex context of competing technologies that had to be sorted through by a really disciplined pilot project mechanism. Only last on my list here, due to my own colleague Phil Green, is what I think was really the major accomplishment of the mid to late 1990s, and that was putting the interpretation of raw sequence data on a firm statistical basis. His so-called Phred software, which underlies Phrap and Consed and other very widely used programs, really turned the raw base pair, the raw color fluorescence base pair, into a commodity, so that all of these genome centers around the world could talk the same language, could freely trade data, as went on on a very large scale, especially toward the end, and proceed with this kind of confident quality control basis and essential independence, except for the most difficult special cases, of this expert curation which was the legacy of earlier experience with DNA sequencing. So by the mid 1990s it was becoming apparent that the mapping problem was largely in hand, not solved in all of its details but largely in hand, that the technologies were coming into place for scaling up of four-color fluorescent sequencing. The private sector was very active, particularly in the implementation of these techniques through better reagents and instruments, and it had become a time to sequence. I think it was John Sulston and Bob Waterston who really deserve the primary credit, in the mid 90s, based largely on their experience in the C. elegans project, to argue strongly that the time really had come to move to very large scale production, and I joined forces with that. I think the technology was finally there. This technology did have an industrial component. This is a rather interesting picture with respect to this project. It's a photograph of
the MIT center, Eric Lander's center, taken by Bob Waterston. Bob told me that they had a better camera angle at MIT than back in St. Louis to sort of capture what this type of facility looks like, but either way I'll put this forward as evidence for the close interactions between these major centers, or at least the possibility that they had active espionage programs. But industrial-scale sequencing was finally there, and it was distributed. We avoided, I think, something that many of us on the Alberts committee wanted to avoid, which was excessive centralization. We wanted a technology that could disseminate around the world and throughout the community, and these facilities proved relatively large, but they were not, did not have a characteristic, they were readily replicated in many sites, and indeed we've seen that replication in many sites for many purposes. So that led to exponential growth in the production of sequencing. If you look at the blue lines here, you'll see that the endpoint that we celebrated this morning, that we're celebrating in this whole symposium, actually followed a really smooth course. Behind this were, of course, missteps and difficulties of every manner, the sweat, toil and tears dimension of this project, but the aggregate statistical effect was rather a smooth approach to the sequence that we've now released. The red line tells a somewhat more complex story, and that really involves the rough draft phase of this project, and I'd be remiss if I didn't make just a brief comment about that. That's that stage of this story leading up to the publication in 2001 of the two papers, from the public consortium and the private sector, of a rough draft sequence, and the rapid surge of red bars at that period was associated with this sudden commitment to produce this intermediate rough draft form of the sequence. I don't want to plunge deeply into this story, but it is one much in the spirit, I think, that Charles DeLisi indicated in his comments. I think that we do need to
learn from the experience of the human genome project. It has been one of the extraordinary successes of science policy, of interplay between science and society, and it's not going to be the last. Indeed, I think it will be seen some time historically as a sort of foundational example of the difficult ways in which science and society interact. I would like to encourage both scientists and non-scientists alike to take an interest in really how the human genome project worked. I tried to contribute to this last year with a short article expressing my perspective on the public-private competition. This is just one person's view, but I think that this brief quote from the abstract does capture some of the flavor of what happened in the late 1990s: that this project was launched during an extraordinary confluence of rapid scientific progress, rampant technological optimism and exuberant entrepreneurial capitalism. I think I chose such flowery language to indicate that this was a somewhat manic social phenomenon of the late 1990s, and it's one that enveloped the human genome project in very complicated ways, and certainly the resultant strains on basic scientific values were exemplified by this public-private competition, and the way that these tensions developed and were resolved is something that's worthwhile to study. So finally, I want to present my last slide, really for the immense number of people who devoted years of their lives to making this dream real. It was certainly thousands of people who had a direct hand in the sequencing of the human genome. They played many roles in scientific leadership, technical roles, support roles of a variety of types. And I think really the bottom line is that we did it. Thank you. In the movie that was shown at noon, Jim Watson recounts the first press conference after his appointment at the NIH, and during that conference he announced that 3% of the genome budget would be devoted to studies of the ethical, legal and social implications of human genome research.
What he didn't mention was that that was an idea that had not been discussed in advance. It was dropped on us as a bombshell. It turned out to be a stroke of genius. It was very well received by the press and, more importantly, by the Congress, and I think it's true that that has become the single largest fund for studies of ethical issues in biology in the country and maybe in the world. The first director of that program was Eric Juengst, who after some years at the NIH moved to Case Western Reserve, where he is a professor of bioethics. Eric, there you are. Thank you. Well, I want to add my congratulations to Maynard's to the scientific community, but especially to my former colleagues on the staff of the institute. If there are likely to be unsung heroes in this story, it will be those program officers. I mean, it's probably a good idea that you didn't, as tempting as it was, take up the offer from the guy in Alabama who had already sequenced the human genome in his garage. He called a number of times and was willing to give us the complete sequence of the human genome in return for a Congressional Medal of Honor or something, but it was probably good that you pressed on. Well, we're here to celebrate the 50th anniversary of the first model that Jim Watson helped build. I want to talk about another important model that he helped launch. It's the 15th anniversary this year of that famous press conference in which he stuck the agency's neck out for them and committed the NIH and the genome project to formally sponsoring research on the social environment in which genomics would unfold and the issues that would come out of the interaction between the science and its social environment. It's a common misapprehension, I think, reflected even in Senator Domenici's remarks, that the ELSI set-aside within the human genome project's budget was an idea that Congress imposed on NIH when it approved the federal agency's plans to add this National Center for Human Genome Research to its roster of institutes. Actually,
the congressional appropriations subcommittee that funds NIH was as surprised, and in fact skeptical, as the rest of the biomedical research community about Dr. Watson's initial announcement, and I'll come back to that interesting episode with Congressman Obey and others in a minute. The fact that there were ethical and social issues to be addressed in genome research was not news. Both of the major feasibility studies for the project, the NRC report and the OTA report on mapping human genomes, had noted the fact that there could well be downstream consequences for society of isolating and identifying the complete list of human genes. In the congressional hearings that followed the OTA report, witnesses like Tom Murray also pointed these potential issues out and suggested that someone in the government ought to take the responsibility to fund discussion and research of these in advance. What wasn't clear until Jim made his commitment was whose responsibility that was in the government to take up these kinds of issues. To the agency's credit, NIH stood behind Watson's decision and incorporated this new mission into its joint efforts with the DOE to develop a five-year plan, a first five-year plan, implementing the ideas that had come out of the NRC report on how to design a human genome project. I don't think it was even clear at that point to many people how in the world this ELSI set-aside, this attempt to engage social issues, should be approached. Fortunately, we had the great luck to have Nancy Wexler and Jim together at a conference in Valencia, Spain, one of the first of a series of incredibly ponderous affairs, and at one of the breaks, she, having heard his announcement, sequestered him and they began to talk about how this would actually be implemented. Subsequently she was appointed to the committee, as punishment for raising the idea, to chair the working group that would work alongside the sequencing working group and the mapping working group to put together a five-year
plan, and she found other good people like Tom Murray and Patricia King and Jonathan Beckwith and Victor McKusick to sit on this committee and try to figure out how a scientific research institution like NIH could get into the business of, essentially, social impact assessment of the science. The work of the ELSI program, in the report when it came out, was structured by two goals: first, to develop programs addressed at understanding the ethical, legal and social implications of the human genome project, and secondly, to identify and define the major issues and develop policy options intended to address them. The methods for achieving these somewhat vague initial goals were also prescribed in the plan: to adapt existing NIH review and funding mechanisms to create extramural grant support for education, research and public participation projects on these issues. In other words, we used the tools we had at hand, the system at NIH for supporting work not by federal employees or consultants but by scholars and academics and others out in the field. Of course, there were still those inside NIH who were willing to challenge Watson on the wisdom of this move. Shortly after I got there, we had the chance in the genome staff to present the five-year plan to the assembled institute directors. One of the NIH director's lieutenants, who will remain nameless, but mainly because I can't remember his name, after Jim had sketched the plan, including this attempt to fund research on the ethical and social issues, said, well, I just don't understand, Jim, why you want to spend all this money subsidizing the vacuous pronunciamentos of self-styled ethicists. Whew. Jim said in fact he didn't think it was that much of an option in this day and age, that the cat was out of the bag in the public's mind that important research like this could have a social impact, and the fellow said, yes, but why inflate the cat? Why put the cat on TV? And indeed, why should the human genome project fund its own social impact studies when there are plenty of folks
like me at the time, and others, professional science watchers, who would be sure to comment on those implications on our own steam? Well, Dr. Watson wrote in 1992, with an insignificant co-author, that it is a 20th century truism that science is not done in a vacuum and should not be pursued as if it could be. Good science affects its social context, and the practical effects of good basic science are often the most wide-ranging of all. Science in turn is constantly affected by the professional norms and social policies and public perceptions that frame it. Doing science in the real world means anticipating those interactions and planning accordingly. By pursuing the study of ethical, legal and social implications, the NCHGR assumes its responsibility to help make that planning timely, well informed and productive. Doing the genome project in the real world means thinking about these outcomes from the start, so that science and society can pull together to optimize the benefits of this new knowledge for human welfare and opportunity. That was the vision, that was the mission, and the model that came out of it is interesting to take a look at. There are plenty of models around for thinking about how to develop science policy in the US. In recent history, the main model has been some form of the blue-ribbon commission: you have a science policy problem, and you either commission the NAS to do the study or you put together your own government commission. This is a very different kind of model, because it was committed to using the extramural grant mechanisms of the NIH to create a community of scholars that would be around much longer than any commission would be, to keep an eye on those fluctuating interactions between genome science and society. What I'd like to do in the remaining time I have is to tell you just a little bit about that model and then to make a few remarks on whether it's been a success and how we would know if it had been. One of the complaints from the scientific community
and others about the very idea of launching this model was that genome research didn't seem to have any special need to be singled out as ethically problematic. Almost all good basic science, as I just said, raises social implications. Why pick on genomics? There were a couple of answers to that that we would usually give, two short ones and a longer one. The short ones were: right, there is nothing uniquely problematic about genetics, but the genome project, by speeding up the pace of discovery, by broadening the scope of what we know about genetics, is going to produce an avalanche of new information, and it's those economies of scale that are what drive the need to pay special attention here. We need to pay more attention here because it's going to come faster and be more confusing to the clinicians and the public. Another short answer was: special? Everybody should be doing this. Why not have ELSI programs at the other NIH institutes? The longer answer was: in fact, there is something special about genetic research. Those who develop and use new genetic risk information find themselves caught in the scissors action of two broad forces. One is our society's inclination to invest genetic information with occult power to define our identities and predict the future, and the second is the lessons of our long history with other attempts to use genetics for the public good. The former inclination is understandable enough, given the deterministic paradigms that the public uses to understand genetic health problems. Most of us think about genetic disease in terms of examples like Huntington's disease or Tay-Sachs disease, that do unfold in a lockstep fashion. But it also leads to overinterpretation, and the history of social discrimination is replete with examples of well-meaning programs gone awry, including the involuntary sterilization programs of the eugenics movements and the XYY screening programs of the 1970s. So caught in that pincer movement between those two forces, geneticists
do face a unique constellation of questions that people in other corners of biomedicine don't have to address squarely. Well, have we been successful at trying to identify these major issues, at developing those policy options? When I got to NIH, it wasn't clear how we were going to have this conversation between the disability activists on the one hand and the genome scientists, basic scientists, on the other. Over the last 20 years, bioethics had established a fragile but productive working relationship with clinicians, with medicine. Researchers were from a whole different country. I wondered if I would even speak the same language with my colleagues when I got there. Then I watched some conversations between my biologist colleagues and the informatics folks and realized, actually, I didn't have anything to worry about: much larger cultural gaps in the genome project. But I think it has been a successful model, and the things that I would point to to demonstrate its success, paradoxically enough, are not even things that the ELSI program, the genome project or NIH can take credit for. Well, then how could they demonstrate its success? It's because they are developments that could have only been possible, I think, with the inspiration of this model and the information, the research, the data that it produced. For example, the majority of states now do have in place some form of legislation to protect genetic privacy, to protect people against genetic discrimination. Of course, it's a wild gemisch of legislation, some better than others, and the ELSI program, the genome project, wasn't out there lobbying the state houses, but the conversations that we've had in public have stimulated the public to put into place those kinds of protections, both at the state level and now a few important pieces at the federal level as well. Another sign of the times is the fact that when new genome centers spring up at important top-level institutions like Johns Hopkins
or Duke, they now have built into them an ELSI program, an ELSI component, a concern to look at these issues. I think maybe an even more significant feature is the fact that in planning the post-genome-project initiatives, like the plan to create a haplotype map of the human species, the two parts of the genome project have come together. The planning committees for that project are integrally focused on the ethical problems, the social problems and the scientific problems. It's the sort of work, increasingly, that genomics will be getting into that can't be done without considering those sorts of issues. It's quite an accomplishment and is only the beginning, I think, for this kind of two-cultures conversation. The new vision for the future of the genome institute reflects this increasingly. As you'll see, the ethical dimensions of the work are being integrated throughout the spectrum of the initiatives. I understand that the plan is to eventually move the ELSI budget from 5% to 95%. That's a joke, but when you look at the topics on the list of initiatives, you begin to see that increasingly this is going to have to be a natural part of doing the science. I'm thankful that we can say in the genome community that in the process of producing the human genome project we can quote Watson's famous line: it has not escaped our notice that this work has implications for the larger society. Thanks. The final speaker of our first session is Eric Lander, who is the director of the human genome center at the Whitehead Institute at MIT, and Eric has a career history that has fascinated me from the beginning, starting out life as a mathematician and really going quite a long way down that road before he moved into biology, where he has also distinguished himself. And I'm almost tempted to ask whether this is his third incarnation that we're viewing this morning, as a futurist. He's going to speak to us on beyond the genome. Well, thank you very much. That's on? We'll see. It is on. Well, my
assignment is to now make the transition from a look at the past, which has been the subject of the symposium up to now, to a number of talks looking at the future. In the spirit of the human genome project, we laid out 25 minutes to do that; there remain three. This makes me feel right at home. In the spirit of the human genome project, we just had the press conference at lunch to announce the completion of the genome and we're already behind the eight ball. But before I start, though, let me just share the sense that I know so many of our colleagues are feeling, a couple of thousand people around the world, that it's an extraordinary day for so many people who have spent so much of a very important part of their life doing this. It was extraordinary to wake up this morning and realize it's done. There were more days than not when we woke up with doubts, and to wake up this morning with no more doubts and just to see the product there was tremendously satisfying. And I want to say personally to the many people who are here, or my colleagues, I could not imagine working with a more extraordinary group of people, could not ever hope to, and don't imagine that I ever will again. So we should all cherish this day, but we should also look forward. We got here for a reason. We got here because in the mid 1980s we saw what could happen if we could just delay our gratification a little bit and build the infrastructure, build the tools, build the foundations. In the mid 80s everybody knew we wanted to go off and find disease genes, do all sorts of things, and of course a lot of that has happened in the course of the 80s and the 90s, but we knew that the solid foundation had to be laid, and we're finally here. So where does it go?
Well, my assignment is to talk about beyond the human genome. If I look back at the history of biology, the field has thought about biology in three different ways. For most of the history of science, biology was thought about in terms of organisms, descriptions of individual organisms and their properties. The major intellectual change occurred just before the 20th century, in the 1890s, with the beginnings of biochemistry, when it was realized that you could in fact purify functions away from the organism, purify away what were thought to be vital functions that could digest molecules, enzymes in yeast, and it led to the idea that we might be able to purify away each and every function in a test tube. The apotheosis of this came mid-century with the purification of the most extraordinary function, heredity, in the form of the DNA molecule, and then the height of the structure-function relationship in the determination of the structure of that molecule in the DNA double helix. What world did that give rise to? It gave rise to a world of molecular machines. Now we view biology both in terms of organisms and in terms of its molecular machines, and we have gained in the next half century very rich descriptions of biology in terms of these molecular assemblies. But the anniversary that we celebrate this year of the DNA double helix is special because it was both the high point of this view of biology as molecules and the beginning of a distinct view of biology as information. Just as DNA had been purified away from the cell, it became clear, at least in principle, that one could purify away the information from the molecule, and that has been the work of the next 50 years: the purification of that information, the laying bare of that information in its purest form as DNA sequence, something we celebrate today. What then in the next 50 years? Where does this go? What is that picture of biology as information?
To me the picture is this: it is a picture of life as an extraordinary library, a library of information gathered over 3.5 billion years of experimentation. That information is as powerful, and perhaps even more powerful, than the experiments we could ever hope to do at our bench. Yes, this view of biology as information is not exactly experimental. It can never stand on its own without the ability to do experiments and to manipulate, but it is the laboratory notebooks of evolution over 3.5 billion years. And evolution has a larger budget than Dr. Zerhouni, evolution has longer grant periods than Dr. Zerhouni, and for this reason evolution has engaged in far more curiosity-directed investigation and has saved the laboratory notes, well, at least of the successful experiments. That is not actually enough, but in any case we have all the successful experiments, for each species, for each individual within each species, for tissues within each individual, and the question is how to read this information. Well, I will, because of time, try to be brief, and I'm just going to touch upon some of the things that I have slides here for. But most of the power that comes from this informational view of life comes from comparison, from comparing the experiments of nature, and the comparisons I'll talk about are comparisons between species, comparisons between individuals, comparisons between cell states. All of these will very much occupy us in the years ahead. Between species: well, one of the most important species for biomedical research, the mouse. Although we celebrate the finishing of the human genome project today, which was expected, perhaps, we have as a bonus a highly advanced draft sequence of the mouse genome to compare it to already. And in comparing the human and the mouse genome, we can line up their 3 billion bases, 2.5 billion bases, give or take, and we can begin to say how similar are they, and what can we learn from that which has been preserved between the two from their common ancestor in evolution
75 million years ago. Well, if we take any stretch of human DNA, a typical stretch of mouse DNA corresponds to it, of over 1, 2, 3 million bases or so of DNA, because in fact these both represent the same ancestral region passed down along two lineages over the course of those 75 million years. And one can in fact build a cross-index lookup table taking you from any region of the human genome to any corresponding region of the mouse genome and back and forth again. So now for every bit of the human genome we can compare it to something in the mouse. If we take sequence, this is just meant to be illustrative, if you take a sequence of apparent gibberish in human and mouse and put them through a computer, it's capable of finding hidden messages like, if you've spotted it, 'this is hidden', and it's there. And computers are good at finding these sorts of things, where they're better preserved than they ought to be by chance. And when you do this in reality on stretches of sequence, like a favorite gene of ours, PPAR gamma, you find that there are all sorts of hidden sequences here, that is, sequences that evolution has preserved lovingly over 75 million years, and they correspond to exons, the protein-coding regions of genes. But to our surprise, we also find that there's about an equal number of non-exonic sequences scattered across the genome, and these correspond to, we don't know. And that's what's so much fun about the human genome project, that's what makes it so much more than just the engineering: we're now at a loss to explain what about a quarter of a million of these are. Well, I know what two or three of them are. I recognize a few of them as enhancers that people have studied, but most of them are mysteries to us. And in fact, if we look across the whole genome, that is the picture: that there is about 5% of the human genome that is better preserved by evolution than it should possibly be, and I'll spare you the analysis that leads one to reach that conclusion. And much of the work ahead is going
to be figuring out what that does, what those are: are they untranslated regions for genes, regulatory elements, genes we've missed, RNA genes, structural elements, etc.? And the young students today are getting tremendously excited, because they have no intention of doing this one region at a time. They're trying to come up with high-throughput methods that still can extract that information. Well, one of the great things is we don't just have to look at human and mouse. The power of the human genome project allowed us to get the mouse quickly and cheaply. It's allowing us to get the chimpanzee, which will be along sometime by the end of the spring, I'm sure. Dog is coming, cow is coming, chickens are coming; all sorts of animals are lining up to get sequenced. And I think we will see the ability to get more and more of these sequences and thereby to really extract signal from noise, to refine the parts that are really functional, to identify these functional elements. Now, of course, we're impatient folk, and so when we're impatient we try to go to a model organism to get some inspiration. And so already folks in the genome project have begun trying to play these comparative evolutionary games with species like yeast. Yeast was sequenced in 1995, one of the foundations for sequencing, but there is lots more to learn, it turns out, by comparison, by looking at evolution's lab notebook. For example, we sequenced three different species of yeast, separated from Saccharomyces cerevisiae roughly like human to lemur to dog to mouse. Comparing those four together, we can see all sorts of things. First off, genes, just like mouse and human, are roughly in the same order; it is very easy to line up any particular region with any other region. But when we look closely, we find that although this genome has been known for eight years and annotated very richly, the annotations aren't fully right. There are more than 500 putative yeast genes; they ain't really genes. It's very clear by evolutionary comparison that they are not
real genes but were spurious open reading frames. New genes are found, genes are merged, the stop codons have to be changed in about 75 cases, the start codons in 200 cases, 60 new introns are found, and the entire yeast annotation is actually much more powerfully done by asking evolution than by running any single genome analysis program we have today. But not just that: we can go beyond genes, by looking at the intergenic spaces, to begin to study regulation of genes. We can look for these telltale signatures of sequences that are far better preserved across these different species. And I'll just give you an example, a favorite of yeast geneticists: the Gal4 binding site, a particular regulatory site. It's kind of crummy: it's just got, in a lot of cases, CGG and CCG. It's all over the place; you can't make much sense out of it. But if you intersect all four species, most of them go away. The ones that stay are the ones, preferentially, that appear in front of genes. In fact, they are much better preserved than a random sequence would be in front of genes compared to inside of genes. Oh, but that's a signature. You could look for any old sequence that shows this funny property of being better preserved in intergenic regions and say that must mean something. And in fact, of course, with computers you can look at every sequence, and when you do, you find out most of them don't show any special properties, but quite a remarkable number do show this very unusual property of being preferentially conserved in some places more than others, and our friend Gal4 is right there. So you can begin to generate a long list, and I've shown the beginning of the list of a hundred of these different elements across the genome there, that say they must do something. But now, what do they do? Again, you can begin to combine biological information already available on the web. You can take those particular binding sites and ask where they occur in the genome. If we just look at one yeast: all over the place. But again, if we intersect
four yeasts, they correlate very strongly to carbohydrate metabolism, which is of course just right, because that's what Gal4 does. But of course we knew that one. But it turns out you can apply many different databases on the web: functional classifications, chromatin immunoprecipitation, etc., etc., etc. And it turns out that the vast majority of the sequences that can be extracted by pure information can actually be assigned tentative meanings by making use of pure information found on the web, often experiments involving expression data sets or proteomics data sets, but data that you yourself didn't collect but could download from the world. And I'll skip that in the interests of time. So, beyond the human genome, we need to identify all of the functional elements in the human genome and then infer function through correlation and experiment. And it will take many new things: much more genome sequence; we have to be able to read where and which proteins sit along chromosomes; chromatin structure; correlation of structure and expression. But already the young people are imagining just what data they need, like we needed the sequence of the human genome, to be able to extract this information. Second type of comparison: the comparison between individuals in our species. This is some human sequence. The human genome project produced a reference sequence, but any two people on the planet differ from this reference sequence, and they differ by about this much. Is that for you?
That is about one letter in 1,300. Any two people in the world differ from each other, or from the reference sequence that went on the web today, by about one letter in 1,300. Not a lot, as these things go. Two chimpanzees in Africa differ by 2 to 3 times as much. Two orangutans in Southeast Asia differ by 8 to 10 times as much. Humans have relatively limited variation, and that's because we're a pretty tiny species, descended from a small founding population of about 3,000 generations ago, living in Africa, with an effective population size of about 10,000 individuals. And simple population genetics tells you that such populations can't have that much variation. They don't have that much variation; they can only have variation of about 1 in 1,000, 1 in 1,300 nucleotides. And so the variation we walk around with today is largely the variation we walked around with in Africa, because 3,000 generations since we left is the blink of an eye with respect to the generation of much new diversity. Humans have a low rate of genetic variation. Most of the variation we have is actually common variation shared across the world. Many of these, or some of these, common variants are likely to play an important role in the risk of disease, and therefore human geneticists are getting pretty excited about another post-genome-sequence project (not post-genome project, but post-sequence project), and that is a new paradigm for human genetics: let's just enumerate all variation in the human species and correlate it with disease. The picture is actually very simple. We know a few examples. We know that on chromosome 19 there are variants here in the apolipoprotein E gene, three variants in the population: E2, E3, E4. Homozygosity for E4 vastly increases your risk for Alzheimer's disease. Very important; it points us to mechanism of disease. We can state a few dozen examples of this today, but what we really want to do is get comprehensive: write down all the variants along the top of a matrix, write down all the
diseases we're interested in, and just fill in the matrix. Is that variant enriched in this disease? Is that variant enriched in that disease? And all of human genetics can become, at least conceptually, one very large Excel spreadsheet in which we can tick off these sorts of boxes and get all the variants. But the number of variants: there must be 8 or 10 million variants in the human genome, which used to be a big number, but it's not as big a number as it used to be. Already by the time of the publication of a draft sequence, more than 2 million of the common variants had been identified. By today, 3 to 4 million of the common variants have been identified, and if we need all 8 to 10 million, that is something that can now be done. Beyond that, in fact, we've learned that these variants don't just occur at random. They tend, in fact, to be locally correlated with each other. There's structure to the variation, reflecting again our ancestral origins in the population back in Africa and in populations that have left Africa since, and we can begin to study the correlation in these variations. If we look at regions of the genome, we find they don't occur independently at all. 86 kilobases here, lots of sites of variation, but they occur almost completely in a lockstep fashion, in one of four flavors. These are called haplotypes. And if we can begin to define all of the haplotypes across the length of the human genome, we can pick out a modest number, perhaps 300,000 or 400,000 representative variants, to proxy for all the key haplotypes in the human genome, and then we can analyze them. That is another goal, the human haplotype map project, that the NIH has in fact launched together with partners around the world. And I think we will see in the next several years a nearly complete (you'll never be totally done with human variation) but a very rich characterization of the human variation that exists in the population and its correlation in these haplotype blocks. And what that will do is energize a new generation
of disease scientists to correlate the individual variants and the haplotypes with disease. We know examples already. This is an example from inflammatory bowel disease, across a certain cytokine region, where it's the haplotype structure that gives you screamingly clear evidence that these particular cytokine genes are involved somehow in risk of inflammatory bowel disease. And in the past couple of years we've seen additional examples. In fact, just in the past year, we've seen examples with inflammatory bowel disease, with type 2 diabetes, with schizophrenia, with cardiovascular disease, and I think this is but the tip of an iceberg. Here we are even seeing surprises that we didn't ever imagine were encoded in the information in the genome. I'll digress for a brief moment on just one of them. We can actually read out aspects of evolution, recent human evolution, simply by looking at human variation and how it's scattered across the chromosomes. Here's the idea; stay with me if this seems to get a bit complex. If there's a variant in the population that is fairly common, simple population genetics tells us that it has to have taken a long time to get common. Things drift up to high frequency slowly. Well, if they took a long time, there has to have been a lot of recombination in the course of that history. If there's a lot of recombination, it means that the correlation between this and things nearby will have largely broken down. So common variants shouldn't show long-range correlation. Well, if something has been subject to positive selection because it has been advantageous, there's a telltale sign, because it could have risen to frequency very quickly, before the long-range correlation is broken down. So in theory it might be that you could simply look at many, many alleles across many, many different variants and correlate their frequency with their extent of long-range correlation. And when you see something that has unusually high frequency and unusually high long-range correlation, it's
the signature of positive selection. And so a postdoctoral fellow, actually at the Whitehead Institute, tried this for two genes implicated in resistance to malaria, and bingo: screaming signals. In both cases the putative malaria resistance allele shows extremely strong evidence of selection over the last 5,000 years. This has now been shown for several other genes. Lactase persistence has an even larger extent of these blocks. And so we can read out the history. But note, we actually didn't need to be smart enough, really, to know about lactase or malaria. One could in principle take any variant in the human genome and just compare its frequency to its long-range correlation, and scan across the genome for the type of positive selection that has occurred over the last 5,000 or 10,000 years. Now, this particular technique only goes back 5,000 or 10,000 years, but golly, that's an interesting time. It corresponds roughly to civilization: density of populations, high infectious disease, change of diet. Well, I'll settle for that. But of course others are going off, coming up with different ways to detect selection in our history going back much, much further. And all of this is coming together to tell us there is more information in this book than we had realized. Beyond the human genome, two: we need to characterize all of that human variation. We need to systematically correlate that variation with disease and disease resistance, which will prove as interesting as disease susceptibility. Many new techniques are needed, large data collections are needed, patient populations are needed, but again, the next generation of young scientists have their sights set clearly on this. The last thing: the comparison of cellular states. Well, there has been a huge explosion, made possible by the information from the human genome and by the technologies to now interrogate cellular readouts, technologies such as DNA microarrays (again, all of these things you'll hear more about), where you can take RNA from tumors, grind it up,
label it, and wash it over microarrays, and get a readout of the expression levels of now all the genes in a genome. And this has made possible tremendous work by many groups around the world, now classifying cancers not based on what they look like in the microscope but based on what they look like under the hood, based on the complete expression patterns of these cancers: breaking apart, for example, the acute leukemias into two parts (this is of course a demonstration of something that was known), but further splitting of the ALL leukemias, of lymphomas, of lung cancers, of breast cancers, of prostate cancer, of medulloblastomas. And we will enter upon a world in which tumors are classified based on information, because of course the tumors know more about what they're intending to do than we can possibly see by any other method than simply asking them. And already the young scientists are thinking of ways to integrate all of this information together, all the information from so many different sources. I mean, the human genome sequence, they take it for granted now. They want to integrate all the pieces. I'll close with a small vignette of what our postdocs and graduate students and MD-PhDs are thinking. It's a vignette of a particular disease gene. It is responsible for a form of cytochrome oxidase deficiency that is enriched in the Saguenay-Lac-Saint-Jean region in Quebec. Some years ago we had mapped this thing to a particular region of chromosome 2, but because there were relatively few individuals and families, it was a very big region, with no obvious candidates, and I'm afraid it simply had sat on the shelf, and it would have taken too many resources to go plowing over each nucleotide to find the gene. But then someone showed up in the lab who said, I claim I can do this with no new experiments. That fit the price we could afford on this. And so his idea was: combine everything we know about DNA, RNA and protein, and I'm gonna find the candidate gene. He took the developing sequence of the human genome and all its annotations of potential genes, he
took expression patterns from those cancers I was telling you about and looked for all genes whose expression patterns were similar to known genes that might be related to the function of this disease, in this particular case mitochondrial function, and then he took proteomic experiments about the particular organelle involved, the mitochondrion. And all he did, once he got these databases to be interoperable (there's a lot under the hood there), was simply integrate them, and one gene popped out as the obvious candidate. And then he did an experiment: went to the lab, ordered primers, resequenced: two mutations, right gene. That's what our students wanna do. They don't wanna not do experiments. We're not gonna tell them not to do experiments. It's not that an informational biology is experiment-free. It's that an informational biology means that each time they go to the bench, they go to the bench armed with the best hypotheses, the best ideas, and can therefore get the most done. Beyond the human genome, three: we need to build almost a full connectivity map of particular states in health and disease and in response to perturbation. And for this too we need powerful tools to characterize the readouts, to characterize tissues, and to be able to perturb these tissues with tools like chemical genomics, part of the new plan for NHGRI, with RNAi and new computational tools. But this world of biology as information is rapidly coming upon us. The human genome project is often thought of in terms of its engineering component, but it is very much an intellectual revolution as well: complementing biology as molecules with biology as information; complementing the long-standing paradigm that you formulate your own question and collect your own data, the experiments of individual scientists, with a new and complementary paradigm that you formulate your own questions, your own personal questions, but you consult common data, the experiments of nature. This human genome project has been a wonderful thing to be
part of. It's extraordinary to be part of. But it's not going to hold a candle to what's going to come next. And so I thank you all, and thank you very much. Okay, thank you. We have an action-packed final session for you this afternoon, and the first presentation will be the Watson lecture, given by the president of Princeton University, Shirley Tilghman. And to introduce her we'll invite Kay Redfield Jamison from Johns Hopkins University, the chair of the Genome Action Coalition and author of several wonderful books, to introduce Shirley. Thank you. As the chair of the Genome Action Coalition, I'm delighted to welcome all of you today for the James Watson lecture. The Genome Action Coalition is an alliance of more than 140 patient advocacy and professional groups, as well as pharmaceutical and biotechnology companies. The purpose of the coalition is to create an environment within the government and within the general public in which genome research, and public education about that research, will continue to enjoy strong support. Our coalition decided early on, and with great enthusiasm, to name our annual lectureship in honor of James Watson. Our previous Watson lecturers have included Francis Collins, Vice President Al Gore, and the Surgeon General of the United States at that time, David Satcher. This year, in the 50th year celebrations of the discovery of the structure of DNA, we're particularly delighted to name Professor Shirley Tilghman as our Watson lecturer. Dr. Tilghman's work in molecular biology is well known and extensively honored. A member of the Princeton faculty since 1986, she served there as Howard A.
Prior Professor of Life Sciences and subsequently as the director of the Lewis-Sigler Institute for Integrative Genomics. She's a member of the Royal Society and was appointed by Jim Watson to be a member of the first National Advisory Council for the National Center for Human Genome Research. Two years ago Professor Tilghman was chosen to be President of Princeton University and by all accounts has done a marvelous job. Shirley Tilghman has been an innovative scientist, outstanding teacher, and a wonderful advocate for science. She's also been a wonderful advocate for women in science. Oops, sorry. We're delighted she will be giving this year's James Watson lecture. Instead of plaques or glass bowls, which are the usual fare in Washington, we made a decision years ago to give a copy of the book as our award to the... sorry. That reminds me, there's a line in a recent biography about Jim Watson that said how hopeless he was at laboratory science, that he'd go into the laboratory and things would start falling off the walls. It seems to spread. Anyway, we decided, instead of plaques and glass bowls, we would give our honorees a copy of Jim Watson's marvelous book, The Double Helix. And I've often thought that Jim, I mean, obviously the science contributions are the main thing, but his book, as an example of just unbelievable writing, is a tour de force. So I would like to ask Jim if he would come up and give Dr.
Tilghman an autographed, leather-bound copy of The Double Helix. This will definitely look better on my shelf than the dog-eared paperback that I have of this book. Jim, thank you very much. It is a great honor to be with all of you today, so many of you friends for over 25 years, to give this lecture in honor of James Dewey Watson and to celebrate his monumental contribution to science as well as the completion of the sequencing of the human genome. The 50th anniversary of his co-discovery with Francis Crick of the double helix structure of DNA, surely one of the most beautiful forms in nature, has been appropriately celebrated today and this year, across the country and across the world. The unveiling of the double helix in 1953 marked the beginning of a revolution in the then nascent field of molecular biology, and the ripples of that seminal moment continue to be felt, most especially today, with the completion of the human genome that we are celebrating. But as profound as that discovery proved to be, in my own mind it is matched by Jim's extraordinary leadership of his beloved Cold Spring Harbor Laboratory and his impact on the broader scientific community. Jim is a consummate institution builder, by having a phenomenal nose for scientific talent, the highest of high standards that he applies ferociously but fairly, and a commitment to reaching out beyond the shores of Long Island to create a place that became a home away from home for the entire life sciences community. The lab has been a magnet for the brightest minds interested in science: from middle school students who discover the wonders of DNA at the Learning Center, to the graduate students and postdocs at the Watson Graduate School who are receiving an innovative and intense graduate education designed to give them a fast start to an independent career, to students and postdocs who come to the lab to study, and each summer come to take courses that will change their scientific lives by opening up new areas of
exploration, to the thousands of scientists like myself, from ages 18 to 80, who attend the meetings, always held in an atmosphere of high energy, heated scientific debate, lots of beer and no sleep, to the judges, policy makers and philosophers who come each year to the Banbury Center to explore with scientific experts the interface between science and society, and to all of us who benefit from the publications of the Cold Spring Harbor Press. It's a simply breathtaking assembly, a broad intellectual footprint, and it's due to Jim's vision and to his enormous skill at making it a reality. Jim's leadership was critical at the dawn of the genome era, when he saw, certainly before the vast majority of other scientists, the absolute imperative to sequence the human genome. By the force of his intellect, which is prodigious, and the force of his personality, which is, shall we say, interesting, he kept our eye on the prize and then set in place the forerunner of today's enormously successful National Human Genome Research Institute. In so doing he broke every regulation in the NIH rule book and was, it's rumored, single-handedly responsible for every gray hair in Building One. There are some eyebrows that have yet to descend back to normal levels. But you should be glad you weren't director then. But we have the human genome ahead of schedule and under budget, and while there are many who share the credit for this enormous accomplishment, without Jim's visionary leadership at the very beginning, I predict we would not be here celebrating now. I have been given the task of talking today about the impact that the human genome project will have on biology. Now, I suspect that the reason I've been asked to give this lecture is because I gave a similar lecture almost seven years ago at the Cold Spring Harbor genome sequencing meeting, in which I said a lot of nice things about genome biologists. So I'm guessing that Francis Collins was hoping that I'm in a similarly generous mood today and will roughly do the
same thing. What he's failed to take into account is the fact that I now spend my day thinking about the sorry state of the economy and college athletics. Consequently, I've developed a really big mean streak. But in preparing for this lecture, I returned to the lecture that I gave seven years ago, just to see whether some of the predictions that I made at the time have come to pass. And I thought it would be fitting to begin as I did that time, by quoting Richard McIntosh, a distinguished cell biologist at the University of Colorado, who said the following in 1996: "Future biologists will be working in an environment defined by a wondrous wealth of information about genome structure. It is mind-boggling to think of the ways in which our experimental lives will be changed as a result. No field of biology will be untouched." This was a bold prediction to make in 1996, but especially bold because the person who made it is not a molecular geneticist but a very ultrastructurally oriented cell biologist. His work could not have been further from the world of genomics. Seven years later, has this prediction, which I agreed with at the time, held up to the test of time and money? The most profound impact of the genome project has been the way in which it has demonstrated and legitimized what can, for want of a better phrase, be described as data-driven science, to distinguish it from hypothesis-driven science. This, I think, is precisely what Eric Lander was referring to when he said we have become information. The way I have put it is: we have come to realize the degree to which information is power. It is not that hypothesis-driven science, the tried and true method whereby a scientist sets out to test a very specific hypothesis, is diminished in its importance. This has been the overwhelmingly dominant model for 20th-century biology; its success is simply beyond dispute. It is rather that biologists have come to appreciate a different and equally powerful way of promoting the progress of science, by generating
a large body of data and then using it to construct hypotheses to be tested by the tried and true method. I do not wish to suggest, by the way, that this was an entirely new insight for biologists. One could argue, for example, that the genetic screens conducted by Janni Nüsslein-Volhard and Eric Wieschaus in the early 1980s, which garnered for them the Nobel Prize in medicine in 1996, were forerunners of this paradigm. These two extraordinary scientists created a large community resource in the form of mutant fly strains that were defective in some aspect of early development. Those strains became grist for the mill of the Drosophila developmental biology field and kept many scientists busy for the next 20 years. But examples like this were relatively rare. I've often told the story of my visit to the Princeton physics department not long after the NRC report. The physicists were curious about this new initiative and wanted to know what all the excitement was about. Despite my own enthusiasm for the project, I was trying hard to be even-handed and explain why the biology community was not embracing the idea of sequencing the genome without reserve. I raised with them the criticism that we did not yet have the tools for analyzing the genome, and therefore it was possible that the data would sit in databases for many years without the means of interpreting it. They were incredulous that this was a concern. They pointed to shelves of books that contained data that they had collected, for example satellite sky surveys, that remained to be explored. For physicists, this was standard operating procedure. Today we biologists have a great deal more respect for the power of uninterpretable data hanging over the heads of scientists to drive innovation. Had we not created the need for better algorithms to identify patterns in DNA, it is unlikely that such programs would exist today. The key lesson learned: data is inherently good. The genome project has also taught us to take full advantage of efficiencies of scale
when the goal is to create a resource that will benefit the community. Biology has always been a cottage industry, the unit of work consisting of a small group of scientists, almost always trainees. It was obvious to everyone from the beginning of the project that the genome was not going to be sequenced in that model, despite the fact that the model has spawned some of the most creative and important science of the 20th century. So, in fact, a new lesson learned is taking advantage of the efficiencies of scale. And you've already seen this picture, which until today I did not know was taken by Bob Waterston, but in fact what it clearly depicts is the ecological niche of a genome scientist, which became a place that looked fundamentally different from a traditional biology laboratory. The decision to adopt such a different model was not taken lightly, but it was a critical decision, because without it the genome would never have been sequenced. The power that derives from adopting a high-throughput, cost-effective way to generate large amounts of data is now being adopted in many projects across biology, and the lesson we have learned here is: if something is worth doing once, it's worth doing 384 times. And let me just give you some examples of this. This is an example taken from structural biology. This is a group of scientists in the northeast who have banded together to develop methods to do automated NMR analysis. Now, for anyone in the audience who knows anything about NMR analysis, what you know is that it is incredibly slow and time-consuming and usually is done on a one-protein-at-a-time basis. What the genome project, I think, put in place in the minds of NMR scientists is the idea of in fact doing NMR analysis on a large scale with random proteins, as a way to begin to assign function for proteins. They have not succeeded yet; this is very much a work in progress. But I think the idea of doing it in a high-throughput way, to get around the slow and laborious NMR
analysis, has as its inspiration the strategies that were put in place for the genome. We are going to hear from Pat Brown about the real revolution that has occurred for those of us who study gene expression: the notion that we have gone from studying the expression of a single gene, painstakingly and carefully, to now being able to measure the expression of all genes in a cell, all measured simultaneously. An interesting variant of this strategy has been developed by Anthea Letsou and Jim Metherall in Utah, who are taking the idea of microarrays one step further and looking in a high-throughput way at very specific expression patterns in Drosophila embryos. So, beginning with a random cDNA library, each one of those is then, in a high-throughput way, being used in in situ hybridization to identify random cDNAs that have highly interesting, unique patterns of expression that suggest that they might be important in development. Again, the notion that you could take a technique like in situ hybridization and use it in a high-throughput way, again, I think, is inspired by the genome. Here's another example that is actually taken out of science just a few weeks ago, and that is the idea that you could screen for the function of genes using RNAi analysis in a whole-genome-wide way. This is an experiment that was done in Gary Ruvkun's lab and is clearly identifying genes in C.
elegans that are involved in the regulation of fat, by essentially taking random cDNAs and looking for those that affect the pattern of fat deposition in worms. Once again, I think you can see the flavor of the genome project coming through a project of this kind. I would argue that all these projects were inspired by the powerful example of the genome project, especially in the way in which data gathered by brute-force repetition of a powerful technique can move science forward. What these approaches are also pointing us toward is a fundamental paradigm shift, from a reductionist approach to science, in which we focused on a single gene, a single protein, a single cell surface modification, toward an integrative way of thinking. For the past 25 years biology has moved forward by investigators approaching the study of an organism much as in the fable of the blind men surrounding the elephant, each one touching a different part of the elephant and therefore describing the elephant in very different ways. Legions of scientists have spent a lifetime studying one protein on the surface of a cell, describing in exquisite detail the ways in which the protein transduces information from the outside of the cell to the inside. Now we have the potential to know the identities of all the cell surface proteins expressed in a cell. We can begin to ask an entirely new kind of question. Does the cell coordinate the activities of all these cell surface proteins? Is there a conductor orchestrating the music of the cell, or is it a cacophony, with the loudest instrument winning the day? Using a different metaphor, this is the difference between taking the radio apart and putting it back together. These integrative approaches will inexorably lead us to a new kind of biology that is far more quantitative and will therefore call upon biologists who have much more rigorous training in mathematics and in computer science. The capacity to extract information from large data sets, as in the example that Eric
just gave us, and to use that information to develop theoretical models for experimentalists to test, will become increasingly important. The close interplay between theory, modeling, and experiment has dominated many other branches of science, particularly physics and astrophysics, but it has had little impact on biology until now. I used to refer to the Journal of Theoretical Biology as the cure for insomnia. No longer will I be able to say that. The genome and its vast accumulation of data has the potential to change all of that by opening the doors for scientists with more analytical and theoretical bents. Such scientists tend to be very smart and to think completely differently from experimental biologists, and that is a good thing for biology. The project also reminded us, in the starkest terms, of the central role that technology, and particularly the development of new technology, has in science. It is now much more widely accepted that, as much as ideas are fundamental to the advancement of science, technological innovation is the engine of scientific innovation. As Emmanuel Farber, one of my pathology professors, taught me 25 years ago, ideas are a dime a dozen; it's experiments that count. And without advances in technology, ideas will often remain a glimmer in a scientist's eye. Technology is often the rate-limiting step between a good idea and an experiment to test it, and the genome project made it crystal clear that biology had undervalued the importance of technology and thus had not organized the infrastructure to support its development, which is both expensive and risky. Hopefully that has now changed. I'd now like to turn, in the last few minutes, to some specific examples in biology that have, in my view, been transformed by genomics. In the first example, and this could actually apply to all of the examples, the genome project has delivered a classical good news, bad news message. The good news is that it has created research projects, and thus gainful employment, for young scientists
for many years to come. The bad news: the genome project has revealed our great ignorance about the nature of the information contained within the genome. No one who has studied the genome, as I have for the past 20 years, can be anything but humbled when the human and mouse genomes were aligned and compared. And here I was intending to make a point that I think was just beautifully made by Eric Lander, and I know is going to be reinforced by several of our next speakers. But clearly, what is immediately apparent when you look at any part of those two genomes that have been compared is that evolution has indeed been hard at work, conserving far more of the genome than we could explain by genes and their closely allied regulatory elements. This is just a close-up example of one random part of the genome that has been picked, in which these blue boxes are recognizing homologies that correspond to the exons of a gene. But these red boxes, with these very high lines indicating homology between mouse and human, here and up in here, are elements for which we have no knowledge about what they could possibly be doing. Scientists should have a field day trying to understand what evolution had in mind when she paid so much attention to these little segments of DNA. One can't help but be reminded of Sydney Brenner's great exhortation to us at the beginning of the project, when we were making pronouncements about all the junk DNA in the human genome. What Sydney reminded us was: don't forget the difference between junk and garbage. Junk is what you store up in the attic because you know it's going to be useful someday; garbage is what you throw away. Had we not made the critical decision to sequence all the genome, and not just the cDNAs or the genes, we would have missed the junk, which I predict is going to be some of the most interesting parts of the genome. Indeed, there's no field of biology that is going to be more affected by the genome project than evolutionary biology, and again, I think you got a
foretaste of that from Eric's talk. Some of the questions that are now open to evolutionary biologists, some of which they have been debating for a hundred years, are now possible to answer. Where did genes come from? How often, when a gene arises in evolution, is it through a duplication? Is it through mixing and matching of exons, through recombination within genes? And how often are genes literally created de novo from the junk of the genome? How did genomes change with time? What drives the expansion of a genome, like maize, or the contraction of a genome, like Fugu? How do those genomes change with time? A point that Eric just mentioned: how much of the genome is under positive selection versus neutral selection versus negative selection? How often do we select for something as opposed to selecting against something? And finally, how much are mutation rates affected by where you are in the genome? Is there an architecture of the genome inside the nucleus that makes it more or less likely that point mutations will develop in those genes? These are critical, fundamental questions to evolutionary biologists, and these are questions that now can begin to be approached by genomic tools. Here's one of my favorite evolution stories. This is a place where genomics has come face to face with Charles Darwin; we've spanned the 150 years. On the left are depicted two evolutionary biologists at Princeton, Peter and Rosemary Grant, who for 25 years have been traveling each year to the Galapagos Islands, off the coast of South America, to live for three months on a rock and study how populations of Darwin's finches have changed as a result of very dramatic climatic differences. And you can see that these are in fact quite different if you just look at the beak of this large ground finch, which has evolved to crack very hard, tough nuts and seeds, which are the only things that survive drought, versus the quite narrow beak of this finch, which is designed for times when the islands are
wet, when there are flowers and where they need to stick their beaks deep into a flower in order to extract the nectar. What the Grants have been doing over 25 years is characterizing these birds morphologically, so they understand in enormous detail the structural differences among them. What they are now doing is sequencing the genomes of these and correlating these structural differences, which are critically important to the survival of these finches, with the genetic changes that led to them: Darwin meeting the 21st century. Likewise, conservation biologists and ecologists are seeing an enormous impact of the genome on their work as they begin to define the level of biodiversity on the planet. This is a slide that I took out of a review in Nature that was written several months ago, in which the authors demonstrate the enormous ignorance that we currently have about the level of biodiversity on the planet. For example, in vertebrates, it's estimated that of the total number of predicted organisms, only a tiny percent have actually been identified and studied. What genomics is now beginning to help do is define the relationships between animals within these groups, to begin to understand population dynamics. Scientists who study the dynamics of forests want to understand who is related to whom in a forest: when a large tree sends out sprouts, are those sprouts its daughters, its cousins, its second cousins twice removed? All of these tools that were developed for genome sequencing are now being used by these wonderful scientists engaged in some of the most important scientific questions that face us today. What is clear is that DNA-based taxonomy is going to be of enormous value in resolving many of the serious and fundamental problems that they are facing, and that go back all the way to Linnaeus. The availability of genome sequences of obscure organisms, and obscure from the point of view of mainstream biology, of course, not necessarily obscure in the view of Mother Nature,
will have, I predict, a dramatic effect on biology going forward. When the genome project began, it began with what Gerry Fink called the Security Council of organisms. These were the privileged few whose genomes were identified for sequencing other than the human genome. They were chosen because, in fact, they had good genetic systems and large communities of scholars working on these organisms. I would argue those were the right choices. But today, because of the improvements in technology, we now have the United Nations, and the United Nations is a much broader, much more curious and interesting group of organisms, some of which are going to allow us to ask questions that we could not ask with the classical model organisms. I was having lunch with Marc Kirschner yesterday, many of you know him, a very distinguished cell biologist at Harvard Medical School, and he was regaling me, on the placemat of the restaurant, with stories about the acorn worm, a small animal mollus that he is now working on, and the questions he was asking with this organism that could not be asked with any of the Security Council organisms. This is going to be a wonderful boon for biology. The ability to sequence the genomes of lots of different organisms is really going to make biology a much more inclusive and interesting science. So in this short lecture, I've been able to mention only a handful of ways in which I see biology changing in the next decade as the result of the output of the many genome projects that will be undertaken. It's a wonderful time to be in biology, for the frontiers feel decidedly different to me and hold out the promise of an extraordinary ride ahead. However, I would only urge us to take this ride with the last and most important lesson that I learned from both Jim Watson and from the leaders of the project, which is to aim high and to be bold. Thank you. Having the genetic code changes the way we look at everything. Really allows the individual, and individual creativity
and insight and intuition to really come to the fore again in biology. In 1960, there were so many things we said, will we ever be able to do these things? I don't think we ever even thought about sequencing the genome. Much simpler things seemed daunting at the time. And here I was, sitting 30 years later, and there was a sequence of the genome in front of me, and I just had this sense of history coming together, the pace of history moving at an incredible rate. And in fact it has changed biology. There's just no question we've moved into a new era. We're accumulating data orders of magnitude more rapidly than we were before. In fact, I would go so far as to say, for the first time we now have the tools commensurate to the task of understanding biological systems. There's going to be a bright line in history, and the bright line in history is going to start with the first genome going forward, because it's so revolutionary. It completely revolutionizes how science is done, the pace at which science is done, the pace at which discoveries are made, and will ultimately revolutionize medicine. Most of the analysis of genome sequences is really focused on the genes, for obvious reasons: the bits that code for proteins, and how many proteins they make, and what those proteins are, what the so-called proteome of each individual genome is. For protein-coding regions, we have the genetic code; we've had that for some time. We don't really understand how to interpret the DNA sequence of the rest of the genome, and the best method, it turns out, for doing that is to rely on evolution. Let me take the example of the fruit fly. If I showed you a Drosophila melanogaster and a Drosophila pseudoobscura, for all intents and purposes they would look identical to you, yet they're 60 million years different in evolution. And if you look at the genome sequences, only those sequences that had some selective advantage, that were important, would be conserved. I think a lot of the challenges in genomics are really to
understand what the rest of the DNA does in the organism. The future of DNA sequencing over the next 10 years is going to see another perhaps 6,000-fold increase in throughput, and the way we're going to do that technically is by being able to sequence single molecules of DNA. Because with every area of biology, the key to future analysis is going to be nanotechnology, the ability to interrogate single molecules or single cells. So my view is, in three years we can have instruments, 10 of which could probably equal the entire sequencing output of all the major centers in the world today. You will not be able to do biology going forward without being incredibly computer literate. We'll be able to eliminate 99% of the random experiments and use the computers to narrow down, using all this information, what is the key experiment to do that will advance science. Having the genome sequences, having it available on a computer, really allows the individual, and individual creativity and insight and intuition, to really come to the fore again in biology. Now, when everything is in the database, people can begin to think and design particular experiments. They can go back to the original idea, the creative idea, and they can be much more enabled. So I think it really enables the small research scientist. It also enables the theoretical biologists. So if you had asked me five years ago to name the successful theoretical biologists in the world, I would have named Francis Crick, and there would have been a long pause, and I'd say, I'm sure there must be somebody else, but I can't think of who they are. I think now we're really engendering a field where these massive data sets are being generated and made freely available. There's really an opportunity for people to come in and interpret data. That really brings the intellectual content back into research and really makes the individual, not even the individual group but the individual person, a real unit of scientific discovery again. So I think it's
tremendously liberating. I think this is the groundwork for human biology and medicine of the future. In medicine, we can actually show that it's economically feasible to prevent diseases or catch them early, versus waiting until symptoms appear. This is very clear-cut in some types of cancer, for example colon cancer. So genomics, coupled with this new preventative medicine paradigm, will enable individuals to have degrees of control of their own lives for the first time in human history. My favorite way of looking at the impact of the genome is a story which actually didn't involve sequencing of the genome; it involved sequencing of a gene. It was a gene that we discovered years ago in a cancer-inducing virus, which controlled the growth properties of cells, was an oncogene, and was discovered to be, in a human cancer, the critical element that led to the human cancer. And Novartis, the drug company, took that information and built a compound which inhibits that enzyme, and that has now revolutionized the treatment of the disease, chronic myelogenous leukemia. It's very specific. That's the poster child, because that's what basic molecular information can lead you to: it can lead you to drugs that are specific, that are powerful, that are safe. So we can design preventive drugs, or design proteins, or design genes, or whatever, and in doing so I expect we'll add easily 20-plus years to the average lifespan, and those 20-plus years will be productive and creative years. And parenthetically, that poses a really interesting challenge for society, of how we treat older people, and that generates all sorts of social and ethical issues. Well, knowledge, particularly large increases in knowledge, always has major effects on society, and this is knowledge about living organisms, and of course that includes us. Like component to almost everything, there's a huge difference between the physical and the actual life outcomes, and the future of medicine will have to deal with that difference. And this quite naturally leads to the whole
issue of genetic privacy. So if we know all of your predispositions, who has a right to have that information? Does your insurance company? Does your employer? Does your family? Do your friends? I mean, where do we draw the lines for these kinds of things? A lot of patients are going to push their doctors for information, which is going to force the doctors to educate themselves, even if they don't want to necessarily. So if a patient comes in and says, you know, I have breast cancer, what's the likelihood my kids, my daughters, are going to get breast cancer, you'd better go look it up. We're going to see miracles. We're going to see miracles of all sorts coming out of the molecular information translated through clinical experience, and that's what this is all about. Thank you. We're now moving to a series of short talks on targeted areas from leaders in the field, and the first speaker hardly needs an introduction. It's Bob Waterston, the Gates Professor in the Department of Genome Sciences at the University of Washington in Seattle, best known, of course, for the nematode and the human and the mouse. He's going to talk to us today about the mouse. Thank you, Richard. It is a great pleasure to be here, and like Eric, I'd like to acknowledge the many people throughout the world who have made this a special day. They have been fantastic, and it is really their day, and I congratulate them all. Now, if I can go back here to the beginning. Okay, good. So I'm going to bring us back to reality here a little bit, and I'm going to talk about the mouse sequence. I think Eric gave a little bit of everybody's talk in this next session, so maybe I'll be done sooner. So anyway, one of the issues that I'd like to address is finding genes, and I'm going to use the mouse genome and its comparison with human to illustrate how that's contributing. So I'm going to give you first a brief tour of the findings of the mouse genome and then show how it can be used to improve gene prediction. You've seen this; underneath it is the human sequence
paper, and it is really the comparison of those two genomes, the mouse and the human, that was the subject of the mouse sequence paper and really made this a fun paper to write. I don't have to tell this audience why you have to do mouse, and I won't bother to do this, except that it is at this key evolutionary distance where most of the genome has diverged through chance events in the genome, whereas functional sequences remain similar. To do the mouse sequence, we used a hybrid strategy. We started out with a BAC-based physical map, and that was completed last summer. We've done now a draft sequence, which is what was the subject of the publication, and now we're going on to finish this, to produce a high-quality sequence, just as high quality as the human, because I think that's really going to be necessary to get full interpretation, and that's well under way. What I'm going to talk about today, though, is based on the draft sequence. Just some technical details: this was about a sevenfold sequence coverage. It was assembled; it had some 176,000 gaps, so it had lots of holes in it, but we were able to order and orient those pieces into just 88 pieces covering the genome, with 96% of the sequence being represented. And this just gives you an idea of what that means in terms of scale. These are the mouse chromosomes, and you can see that in many cases these ultracontigs, as we call them, these segments of ordered and oriented sequence, span much of the chromosome. These other chromosomes, where there are more, we don't understand why there are more, and maybe there are different things unusual about those chromosomes. The analysis was carried out by an international group of really stellar scientists, not only at the genome centers who contributed the sequence, at Whitehead, WashU, and the Sanger Institute, but these various other groups and still others across the globe. We met for more than a year, talking by phone. It was a fantastic experience. The mouse indeed shows us how dynamic
genomes are over this period. The genome is about 14% smaller than the human, and we estimate that it's lost over a billion bases since the last common ancestor, but that's been partially replaced by the insertion of repetitive elements. And this has occurred in small regions through the genome, so that in fact the genes are left more or less intact in spite of all this churning in the genome. And Shirley was asking about some of the kinds of things that might influence the creation of genes and so forth, and this kind of dynamic activity must surely have something to do with it. The mouse has two times the mutation rate since the last common ancestor compared to human, and if you take a static view, now, for some reason, the mouse mutation rate is about five times that of human. There are local family expansions, where some genes have been amplified in mouse; these tend to have to do with reproduction, immunity, and olfaction. And indeed, by comparing these two genomes, we see clear evidence of positively selected genes. But in spite of that, much of the architecture of the genome has remained the same, and Eric alluded to this. This is aligning human chromosome 20 with regions of mouse sequence, and basically the two sequences are fully in alignment. Here's a region where it's more interrupted, where the human chromosome is represented in many more different fragments in the mouse. But nonetheless, this conservation of gene order is a very powerful tool for interpretation. And so we have indeed a map where we can take and place the human chromosomes, designated here by these colors, and lay them onto the mouse genome, and basically begin to think about reconstructing the nature of the last common ancestor. In terms of other things that are the same, 99% of the mouse genes have human homologs. That doesn't mean they're exactly the same, but they're certainly not new genes. They're different species at a very
frequent rate. So rather than different genes, totally novel genes, accounting for the differences between us and mice, it's actually implementation, how the same genes are employed in making this. Each genome seems to have fewer than 30,000 genes, and that accounts for less than 1.5% of the sequence, and yet, as subsequent speakers will deal with in more detail, and others have alluded to as well, some 5% of the genome is conserved. But I'm going to leave that for others to speak of. Instead, I want to talk now about how this is being used to aid in our finding of genes. We've talked about these genomes as if we had the parts list in hand and all we had to do was begin to understand the regulation, but the dirty truth is that we probably really only have about half of them firmly in hand at this point, and what I want to do is show how we can begin to use the two. Now, one way you can do it, obviously, has been alluded to, and that is that you can find new genes, new exons, and so forth by this comparison. Here's a region of mouse and human where the spikes represent regions of greater conservation. Here's a whole new gene that was discovered, an ApoA gene, in the midst of a locus that's been heavily studied. This is from Eddy Rubin's lab, and indeed this was very revealing, and it's the subject of active study now. But another major problem is not just finding the genes, but then distinguishing them from pseudogenes. And this is often ignored, but indeed our genome is littered with copies of degenerating genes. There are two classes: processed pseudogenes, copies of mRNAs that are reverse transcribed and reinserted into the genome in new locations, and then duplicated genes that are degenerating. As one example, the gene GAPDH: it has apparently one functional gene in mouse but more than 400 copies, and 118 of these satisfied the criteria for gene prediction programs and basically were contaminating genes in the mouse gene set. As part of the mouse analysis, we undertook
to see what we could learn about where these pseudogenes were really causing us problems. So there are 22,000 genes that had some kind of evidence supporting their prediction from any of these sources. About 12% of those had no gene in the corresponding human region, and 23%, I mean 5,000, are members of local gene clusters, that is, these duplications where there are multiple copies of a gene side by side. When we looked at these two classes by inspection of the gene predictions, about three quarters of this group appeared to be pseudogenes, and about 30% of these were pseudogenes. So that's actually a quite sobering percentage of the total set. It does suggest a way, though, of getting around the problem. We can capture a lot of these pseudogenes, depending on homology, by inspecting members of local gene clusters carefully. But we have to be careful when we deal with these local gene clusters, because in there seems to be a lot of the unique biology. There are 25 clusters in mouse that appear to be specific to the mouse, and they involve proteins involved in reproduction, immunity, and host defense. Some of these are the genes undergoing positive selection, again a clue as to where new genes may be coming from. So we have to be careful to maintain these while we're trying to get rid of pseudogenes. So anyway, basically, we're going to try to take advantage of the fact that processed pseudogenes can be eliminated by conserved synteny; for these duplicated forms, we can take advantage of relatively rapid divergence; and then also we can look for single-exon genes where multiple-exon forms exist. So we carried this out on chromosome 7. We found clear evidence based on cDNAs for 605 genes, and then, using gene prediction programs and filtering them based on the suggested criteria, we ended up with another almost 550, bringing the total genes to almost 1,200 for chromosome 7. We then turned around and searched intergenic
regions for evidence of protein homologies and pseudogenes, and we found a remarkable number, some 940, so almost as many pseudogenes in this chromosome as real genes, at least by this count. So we wanted to have some kind of check on this, to see if we were on target or not, and so we took advantage of the fact that there are synonymous and non-synonymous changes. Synonymous changes don't change the coding potential of a gene, whereas non-synonymous changes change the amino acid and will change the function. And so one should have a neutral rate and the other will affect function. So when we applied it to chromosome 7, only 5% of these 1,150 genes appeared to be likely pseudogenes by this statistical analysis. By contrast, of the 941 pseudogenes that we had, again by this kind of analysis, 97% appeared to be pseudogenes. So we had actually done a reasonably thorough job of both getting real genes and getting rid of pseudogenes. I think this is promising for the future. And indeed, when we looked at these pseudogenes, 80% of them appeared to be processed pseudogenes; these are made by this reverse transcriptase mechanism. 20% came from duplications. Interestingly, it also appears that genes that are in these regions of recent duplication that Eric alluded to appear to be under relaxed constraints. So anyway, I think that this is a very powerful benefit of being able to compare sequences. I think you'll hear from further speakers more on the same theme, but with this and maybe one or two other genomes, I think basically we'll have a complete and thorough parts list for human, and that'll be a very valuable thing for the future. You've seen this already, with all the features of it. I should acknowledge all the members of the mouse consortium who participated in this, and the participation of Evan Eichler and LaDeana Hillier, as well as all the folks at WashU, was critical in the human chromosome 7 analysis. And of course, all the data is freely available. Thanks. Next speaker
is also on the subject of comparative genomics. It's Eric Green, who's the scientific director here at the NHGRI, moved there after his period at WashU, and has made many contributions to genomics. Well, today we celebrate a hallmark event, of course, in the history of science, with the completion of a finished human genome sequence. But as we marvel at our accomplishment, we also look to the new scientific horizons that are brought about by this informational resource. And one of the many areas that you've certainly been hearing about in this session that we're going to pursue is the comparison of the human genome sequence to that of other organisms, as you just heard about for the mouse, and as Shirley Tilghman alluded to, and certainly Eric Lander talked about. If I can have the first slide. But really, it's appropriate to pause and ask the question, in thinking about new frontiers in comparative genomics: why sequence additional genomes, especially beyond those that have already been completed by the Human Genome Project? And I think there are several reasons that can be cited for this. First, sequencing other genomes provides intrinsic biological insight about other organisms. For the experimental models, the data facilitate research with those organisms, each with their distinct anatomical, physiological, and developmental features. But there's also great interest in gaining insight about the genetic blueprint of livestock and also companion animals. The second reason is that comparative sequencing provides tremendous insight about the basic underpinnings of genome evolution. In the case of comparative sequencing, you gain insights about the mutational mechanisms and DNA repair, the acquisition of new functions through evolutionary change, and of course the molecular basis of speciation. But I think most important for this session is the fact that comparative sequencing is key for interpreting the human genome sequence, and because of its importance, I really plan to emphasize this last area
in my talk today. So shown here is one ten-thousandth of one percent of the human genome sequence. The next great challenge in genomics is to identify all of the sequences in the human genome that are functionally important. In essence, this is going to involve using computational tools and experimental tools to essentially highlight the subset of the sequences that have a functional role. First, we need to identify the one to two percent of the sequence that corresponds to exons. But far more challenging, we need to go in and identify the larger amount of non-coding sequences that function as regulatory elements to control gene expression, that mediate chromosome dynamics, and in fact that function in as yet undiscovered ways. Indeed, I think it's really important to make a distinction between functional elements that do and do not code for protein, because the paths towards their identification are really so different. So for simplicity, I will make the general claim that coding sequences, that is, genes, are relatively easy to identify. This is because we mostly know what we're looking for, we have complementary data sets, such as ESTs and full-length cDNA sequences, available to us, and because of our understanding of basic gene structure, we have ever-improving tools for using computational methods for doing gene predictions. And so, in short, while I recognize there's a huge challenge, as Bob just alluded to, in finding all the genes, I'm not particularly worried about our ability to find all of the coding sequences in the human genome. In contrast, we have just a huge task ahead of us in identifying all of the non-coding functional sequences, because these are very hard to identify. Why? Well, first of all, little is known about actually what we're looking for in the first place. There are virtually no complementary data sets available to help us, and our general ignorance has basically left us in a state where we don't have very good computational methods that are available
for actually identifying sequences among non-coding DNA that are functionally important. Indeed, the conclusion from my talk, which I'm happy to give now, is that a major role for comparative sequencing is going to be the identification of functionally important non-coding sequences in the human genome, and I now want to provide you some evidence to support this statement. Well, recognizing the value of comparative sequence analysis, the generation of the human genome sequence was quickly followed by the generation of the mouse genome sequence, and detailed comparisons have been performed, providing tremendous insight about the mammalian genetic blueprint, as you've heard about from several earlier speakers. I find several findings from the initial comparisons of mouse and human particularly relevant for today's talk. First, 40% of the human genome sequence will align with the mouse genome sequence, but as you just heard, only about 5% of the genome is estimated to be under evolutionary selection, implying that about 5% of the mammalian genome is functionally important. Of course, we don't yet know precisely which bases constitute this 5%. That functionally important 5% is divided up into about 1.5% that is protein coding and the remaining about 3.5% that is non-coding. The challenge thus becomes to identify this 5% from among this 40%: identifying the 1.5% which is protein coding, which I've already told you is relatively easy, but then identifying this 3.5%, which I have already argued is actually going to be quite hard. Well, a powerful approach for tackling the challenge is to identify sequence from additional species, especially those selected at distinct evolutionary positions, and then look for sequences that are in common. Sequences that are shared over large evolutionary distances are likely to be functionally important, and indeed efforts are already underway to sequence additional vertebrate genomes, such as the rat and zebrafish and chicken and chimpanzee and cow and so
forth. But one limitation is that only a handful of additional genomes can be sequenced, at least for the foreseeable future, mostly because it costs upwards of about $100 million to sequence a vertebrate genome using today's technologies. Well, the comparative genomics program at NIH is interested in generating data to address the question: what are the best combinations of species to sequence for identifying all of the functional elements in the human genome? And rather than tackling entire genomes, our program is sequencing limited or targeted regions of the genome, allowing sequences to be obtained from multiple other species. In turn, sequence comparisons can be performed with a more evolutionarily diverse set of species, providing quick insight about how to use such data sets for finding functional elements. We are now sequencing literally dozens of different targeted regions across the human genome, in some cases in greater than 20 species. I want to briefly show you some illustrative data generated for this region of human chromosome 7, which contains these 10 known genes. The one I want to highlight is the famous CFTR gene, the gene mutated in cystic fibrosis. This gene was identified in 1989, and since then it has become one of the most intensively studied genes in the human genome, with hundreds of investigators poring over this gene for the past 13-plus years and publishing well over a thousand papers. With so much work performed on this gene, one would think we now know everything there is to know about it, including all of the regulatory elements and all of the functional elements embedded within it. Indeed, this is far from the truth. So to illustrate this point, shown here is a 30 kb interval within the CFTR gene encompassing exons 11 and 12 along with the intervening intron. The various geometric shapes shown here simply correspond to different repetitive elements. Several years ago we sequenced this region in the mouse genome
and compared it to the human sequence, which was generated in partnership with the Washington University Genome Sequencing Center. Shown here is an output, a version of which you have actually seen, of a sequence comparison tool called PipMaker, developed by Webb Miller's group, with the middle section simply blown up here at higher resolution. PipMaker generates pairwise alignments between two related sequences and then generates a percent identity plot, or pip, that depicts the positions of the resulting alignments along the length of the reference sequence, in this case human, as a function of the percent identity, from 50% to 100%. So in short, as you look across this interval and you see all of these little dots and lines, they simply indicate regions where the human and the mouse sequence are highly similar, ranging from 50% to 100% identical. Immediately notice the striking sequence conservation in the two exons, which is exactly what you would expect. But in addition, if you look in the intronic region, you immediately see the point made earlier, that just under half of the human and mouse sequence aligns, with most of the alignments being within the non-coding intronic region. Now, presumably some of these intronic alignments correspond to functional elements, but how do you find the functional elements amongst all of the noise? Clearly one vertebrate species in addition to human is not sufficient, but what about using multiple other species? Well, we have now sequenced this region in multiple other vertebrates. So primate sequences, such as from chimpanzee and from baboon, reveal very high sequence similarity to human, as expected, well over 90% across the interval for both species. Various other mammalian species, such as cat and dog and cow and pig and rat, give patterns that are more like the mouse, with varying degrees of sequence conservation across the intron. Moving beyond mammals to a bird, the chicken, and a fish, one encounters a very, very different pattern, with the
intron now essentially being devoid of any sequence conservation. Indeed, the only consistent feature among all of these vertebrates happens to be the presence of sequence alignments within the two exons. As you can see, the patterns of sequence conservation among the intronic regions of the mammals certainly appear quite chaotic and quite noisy. Nonetheless, we sought to investigate whether we could use these sequences as a group to try to identify discrete regions in that intron that are highly conserved among multiple species. Towards that end, a postdoctoral fellow in my laboratory, Elliott Margulies, working with Mathieu Blanchette, a postdoctoral fellow in David Haussler's group, is developing algorithms for identifying highly conserved regions using multi-species sequences. Specifically, they take all of the sequences from these various species, and then through alignments and various sophisticated analyses they can detect actively conserved sequences, and then in a final distillation step they identify those sequences that are conserved in multiple species. We call these highly conserved sequences MCSs, for multi-species conserved sequences. Well, by analyzing sequences from all of these species, we can identify a small number of MCSs within the interval. For example, MCSs, shown now in pink, overlap both of the exons, as expected. Another MCS just happens to sit right there. Interestingly, a little bit of database searching reveals that this region actually contains a rarely expressed, alternatively spliced exon. Two other MCSs sit there and there, and it turns out that these were found to overlap sites that are sensitive to cleavage by the enzyme DNase I, and such DNase I hypersensitive sites are known to be associated with regions of transcription factor binding. Thus, anecdotally, there appears to be evidence that indeed MCSs within this intron do correspond to functional sequences. In addition, though, there are a handful of additional MCSs within the intron, the functions of which are currently
unknown. In total, we find over a hundred MCSs across the CFTR gene, with roughly two-thirds of these residing in non-coding sequences. So despite the huge effort of studying the CFTR gene for all of these years, we can now identify literally dozens of new non-coding sequences within the gene that are highly conserved in multiple species but about which virtually nothing is known at a functional level. By analyzing the larger 1.8 megabase region encompassing the CFTR gene, we now find that about 4% of the sequence falls within an MCS. This value, of course, is nicely consistent with the estimated percent of the mammalian genome that is thought to be under selection. Of the 4% corresponding to MCSs, about 27% corresponds to known coding regions and a bit more falls within UTRs, leaving about 68% that reflects unknown, presumably non-coding regions. The bottom line is that sequences from multiple vertebrates, each of which individually looks hopelessly noisy and confusing, can be used to find discrete, highly conserved regions that represent candidates for functional elements. But at the same time, I make absolutely no claim that the species that we've sequenced so far represent the best set for finding such elements. Indeed, if you look at the evolutionary positions relative to humans, you can quickly see that while we've sampled a nice distribution of vertebrates, we've really emphasized these conventional mammals. As our data reveal, there's still quite a lot of sequence conservation between humans and these mammals. In fact, there's almost too much conservation. But this falls off dramatically when you go all the way over here to a bird. In fact, often there's too little conservation for some comparative studies. Sequences from species residing in evolutionary positions between these birds and these mammals might prove to be very useful, such as marsupials and also monotremes. Indeed, if we return to that same region of the CFTR gene, sequence from two marsupial species, the North American opossum
and the Australian dunnart, reveals a much simpler pattern of sequence conservation, in particular within the intronic region: less conservation than what you see in these mammals, but certainly more conservation than what you see in the bird or the fish species, suggesting that marsupial sequence may prove to be a valuable resource for detecting conserved non-coding sequences, in particular when you consider the fact that these smaller regions of conservation very nicely overlap exactly where I told you previously the MCSs reside. So in short, our multi-species sequencing effort is providing a framework that we believe will help guide the selection of future genomes to sequence. This includes establishing approaches for assessing the relative value of different species' sequences for doing comparative analysis, with the detection of MCSs being just one of many metrics one could imagine for doing that assessment. But it's also providing sequence data from previously unexplored genomes. And while the results of our program are still preliminary, I think ultimately it's going to show that we should sequence an additional set of mammals, probably a marsupial and a monotreme, maybe another bird and a fish and an amphibian, maybe a reptile or two. And I don't even have time to tell you a whole other set of reasons why we would want to sequence multiple primates, but I would send you to a recent paper that was published last month in Science from Eddy Rubin's lab that I think nicely makes a case for that. So in summary, the generation of a complete human genome sequence brings new challenges that will be greatly aided by a Noah's Ark approach for multi-species sequencing. Such evolutionarily diverse genome exploration, coupled with new powerful computational methods, will, I think, help illuminate all the functional elements in our genome and facilitate the comprehensive understanding of our genetic blueprint. And in closing, I just want to
acknowledge the dedicated members of my laboratory as well as the NIH Intramural Sequencing Center. I want to also acknowledge the various collaborators, many of whom I know are sitting in this audience. But really, my presence on this stage today reflects a thrilling journey I've had in genomics for the past 14 years, including my extensive involvement in the human genome project since its inception. But I really owe that presence to the remarkable opportunities and support that my parents provided to me, and they actually journeyed here from St. Louis to attend this important symposium. And so in closing, I want to acknowledge them, and, thinking about my journey, I want to thank them for their contribution to the human genome project, and I want to thank all of you for your attention.
Thanks, Eric. We're all convinced to sequence more and more genomes, if we weren't before. The next speaker is Pat Brown from Stanford, and I think the most important thing I can say about him is he's just done wonderful, wonderful things with DNA chips and studying gene regulation and so forth, and he's going to talk to us about the genome's life stories. That's okay, I think Eric has your slides. So he has Eric's talk on the computer, not his, and I think we can work with that. I'm afraid I can't spend any more time introducing Pat. He's the one of the five speakers that I really don't know. Let's see if we get it sorted out. David, are you willing to come next? David? Oh, you have it?
Okay, it looks like we're set. Thanks, Pat.
Thanks very much, Richard. Thanks, everyone, for your patience while I mess around with my computer here. The idea that variation in genes gives rise to variation in the form and physiology and behavior of people and other living things is one of the most powerful ideas in biology and really the central idea of genetics. Inherited variation in the sequences of genes leads to the kind of variation between species and between individuals that's illustrated in this picture from this somewhat infamous issue of Science magazine, and that's easy to see just by looking around this room. There's another kind of variation in genes that's much less familiar but much richer in variety and arguably even more important, and to see it we need only look closely at one person. The tens of trillions of cells in your body are so diverse in appearance and behavior that if we didn't know any better, we would think they were thousands of completely different organisms. And yet every one of these cells has an identical genome. The variation that produces this extraordinary diversity is not in the sequences of genes but in their patterns of expression. And even though the sequence of the genomes in these cells doesn't change over time, the structure and physiology and behavior of these cells is constantly changing in response to changing signals and conditions and life experiences. During development, a simple tiny round cell, too tiny to see with the naked eye, can turn into a gigantic neuron that's more than a meter long, elaborately branched and electrically active. And this nice, quiet resting lymphocyte here, when it senses the presence of a bacterium, can turn itself into an angry, lethal cell that will hunt down and kill the bacteria. Well, if each of these cells has an identical genome, how can they be so fantastically different?
Yes, each cell has an identical genome. The words in the genome's vocabulary are identical. But just as you can use your vocabulary of about 30,000 words to write a million different stories just by using them in different combinations and patterns, the genome can use its vocabulary of 30,000 or so genes in different patterns to write a different script for the life story of each cell in your body, and that's exactly what it does. Well, wouldn't it be wonderful if we could read all those life stories? There's a line that's often quoted, at least it's often quoted by scientists, from Marcel Proust that says the real voyage of discovery is not in seeking new landscapes but in seeing with new eyes. And I must say I find it quite remarkable how much his handwriting resembles a PowerPoint font. Well, as Francis Collins rightly emphasized, the genome sequence is a public resource available to everyone, and anyone can use it to make new eyes that enable us to see ourselves in a new way. And here's an example of new eyes made possible by the genome sequence: a DNA microarray, which is made by simply printing tiny spots of DNA representing each of the 30,000 or so human genes in an ordered array on a simple glass microscope slide using a simple benchtop robot. And you can barely see the pattern of these 30,000 tiny spots in visible light. It really doesn't look like much. But when we use the microarray to look at the gene expression pattern in a cell or a tissue or a tumor, it lights up. Each of the colored spots that you see here represents one of the 30,000 genes, and we can visualize the gene expression pattern using red and green fluorescent dyes to label the messenger RNA that we isolate from cells, so that the color of each spot shows us whether a gene is expressed and at what level. This particular picture actually comes from an experiment that my son Zach, a high school sophomore, did last summer, looking at how the gene expression pattern in his dad's blood, he particularly enjoyed the
blood-drawing part, actually, changed hour by hour over a couple of days. And I think it's an example of how the big science of the genome project enables lots of smaller-scale science: a high school student with his own hands and eyes can see the whole genome come alive. Well, with these new eyes we've now embarked on a voyage of discovery, reading the life stories that the genome writes for every cell and every tissue in the human body, and we're just getting started on this, of course. But okay, so we have lots of colored dots that I'm claiming represent a life story of some kind, but if it is supposed to be a story, it's not in any language that we understand. But there's hope, because the genome's language has a beautiful intrinsic logic and order that mirrors the intricate molecular choreography of the cell and the diversity of cells and tissues in our body. And we can use this inherent logic to organize the data that we collect on this voyage of discovery into a new kind of map of the genome, one that's not a map of its physical structure but a map of its expression program, its voice, so to speak. And the format of this map is like a table. Each of the horizontal rows you see here represents a single gene, and in this particular map we're only looking at the 6,000 that vary the most in their expression in these samples. And then each of the vertical columns represents one of the samples we analyzed, and it's pretty much a grab bag of a bunch of human cells and tissues, half of them being non-malignant and half of them being cancers. And then each individual pixel, of the several million pixels in this map, represents the expression level of a single one of these genes in a single one of the samples. And we use a color code, with a very dark blue color representing no detectable expression and the brightest yellow representing the highest levels of expression, and all these shades of turquoise, to allow visualization of the data. And the final thing that we do to reveal the
intrinsic order in this program is we use a statistical clustering method to group together genes that have similar expression patterns, and we do the same thing to cluster the cells and tissues according to the similarity in the gene expression scripts for those tissues. So notice how diverse and intricately patterned this gene expression program is, which really parallels the molecular diversity and the functional diversity of the cells and tissues in our body. Now, if the gene expression patterns are really the genome's script for the life story of each cell, as I said, then we would expect that similar cells and tissues should have similar scripts, and they do. For example, right here, this series of about 12 samples are all normal brain samples, and you can see how they have a very characteristic and distinctive gene expression pattern, with subtle differences depending on the region of the brain, but one that distinguishes them readily from all the other samples in this set. And right next door we have a set of samples that are all brain tumor samples, and you can see that again they have considerable similarities to the normal brain but also distinctive differences that separate them from the normal brain samples and also from all the other samples in this set. And you could do the same thing with the dozens of other kinds of cells and tissues and tumors here. Each one of them has a very characteristic and distinctive gene expression pattern that causes it to be localized in a particular way in the map here. Now, this means that we can use these gene expression programs, in principle, to identify and classify an unknown sample. So for example, if one of you were to walk up to me with a little tube containing a sample of RNA from an unknown tissue, I could analyze it using a DNA microarray, and by comparing it with a map like this I would have a pretty decent chance of being able to tell you not only what tissue it came from but also quite a lot about the
physiological state of that tissue, and I think it's easy to see how this can become a pretty powerful diagnostic technique. The expression patterns also tell us a lot about the genes. We learned the words in our vocabulary, the vocabulary that we use in our everyday lives, basically just by observing how they're used, and we can use the same strategy to learn the meaning of the words in the genome's vocabulary, the functions of the genes. In these detailed pictures, if you just read along the rows here, you have a very detailed specification, distinct for each gene, of each gene's expression pattern. That's specified basically by evolution, which has said it improves fitness to have that gene expressed in what tissues and under what conditions. And by looking at these rules for usage, we can start to build up a pretty well developed idea about the function of the many thousands of genes that are newly discovered by sequencing and for which we have no experimental data as to their function. And then as we learn this language, we begin to use what we know about specific sets of genes, the words that we're able to recognize, to read the scripts, just in a sense as we might read the script of a play and then visualize in our mind's eye the action on the stage. For example, this little cluster of genes right here is a set of genes all of which encode proteins that are exclusively localized in the mitochondria, and they have a role in the function of the mitochondria, the cell's energy factory, so to speak. And so their expression basically gives us an idea of the energy needs of the cells and what the oxygen demands of those cells are, because they're involved in aerobic respiration. And you can see, obviously, that the brain needs a lot of oxygen, and so does the heart here. Okay, and this little cluster here is a set of genes that are exclusively expressed in cells and tissues from males, and so basically this tells us which of these samples come from men and boys,
and then this cluster of genes here is specifically expressed during cell division, in dividing cells, and so its pattern of expression shows how fast the cells are dividing. And you can see, for example, that in the normal brain there's very little cell division going on, whereas in the brain tumors there's substantially more cell division, and the same is true for the many other cancers that are represented in this sample. They tend to be dividing more quickly, which is no great news. So that's just three clusters of genes, and of course if you look at this map you can see how there are many hundreds of additional clusters of genes that we can use to build up, for each of these tissues, a reasonably rich picture of what its molecular characteristics are and some ideas about what its physiology is likely to be. Okay, well, this is just sort of a static picture of the different cells and tissues in our bodies, and we know that every one of these cells and tissues is sort of continually reinventing itself. The living genome doesn't impose sort of an authoritarian, hard-wired program on the cells, but it sort of sits right at the interface between nature and nurture, so to speak, and it is continually bringing new pages to the story in response to its life experiences. And so the corollary of that is that we really want to be able to get dynamic pictures of the expression program under all the different conditions to which each of these cells can respond, and that's going to be a very huge task that we've barely begun. I just want to illustrate one important example of that, and that's the cell division cycle. This is a map very similar to the one I showed in the previous slide, but in this case we've organized it such that you can see, for this set of 900 genes that are periodically expressed during the cell cycle, the waves of expression of those genes as the cell progresses through the cell cycle. And by recognizing that these 900 genes are regulated during the cell cycle and
looking specifically at where they're expressed during the cell cycle, we can build up a better picture of the molecular choreography of this important process and also the role that each gene plays in it. Well, we're just really beginning this voyage of discovery, but already the results are starting to have some influence on how we think about important medical problems, as Eric alluded to briefly in his talk. And for example, here's a map of the gene expression patterns taken from several hundred human cancer samples, and this map gives us a very rich picture of sort of the molecular composition and the wiring of each different kind of cancer cell, its interactions with surrounding cells and tissues, what systems are active, what are present, and even what potential drug targets are available for treatment of the cancer. And you can see from the color coding of the tumor samples here, kidney, brain, prostate, breast, stomach, ovarian, etc., these little branches at the top here, that each of these cancers that we can distinguish by sort of classical pathological means also has a very distinctive and characteristic gene expression pattern. This already suggests the potential of this kind of analysis for identifying molecular markers that could be used for early diagnosis and screening and so forth. But if you look more closely, it's also evident that cancers that originate from the same tissue, in this case breast cancers, that we would today give the same diagnosis, are not all the same in their expression patterns. In fact, there's a lot of diversity in their molecular programs, and this parallels the clinical observation that cancers that we call by the same name can behave very differently, and this is a very challenging clinical problem. So these gene expression portraits of cancer have now become the focus of a large-scale effort by hundreds of groups around the world that's starting to define a new, more precise and hopefully more predictive molecular classification of cancer that I think in the next few
years will lead to more individualized and hopefully more effective treatments. Now, there's an important subtext to this story, and also to every story that you've heard and every story that you'll hear today and tomorrow, and it's probably the most important and the least appreciated story of the genome project, but it's been touched on by virtually every speaker this morning. Francis Collins rightly emphasized that the human genome project has been a model for the value to science and society of immediate, open and unrestricted access to new scientific discoveries, in particular genome sequence information. None of this would have happened were it not for the fact that DNA sequences have had a unique status among scientific information. They've been kept in an open public resource, freely available to everyone from online databases like GenBank, for anyone to use for any research purpose. Our ability, for example, to make and use DNA microarrays is absolutely dependent on our ability to find, organize, analyze and copy every gene sequence. And just to give a fresh illustration of this, ripped from the newspaper, literally, of the public benefits of open access: Joe DeRisi, a brilliant young scientist at UCSF, with these colleagues here, made a DNA microarray by simply downloading every viral genome sequence that was available in GenBank, a little over 10,000 of them, and synthesized little bits of DNA for each of them and printed them in this microarray that you see here. So when the Centers for Disease Control got their hands on the first sample from a SARS patient, they immediately sent an aliquot to Joe. With a single microarray analysis, Joe had tested this tiny sample for every single one of these 10,000 known viruses and was able to report back to CDC that it was almost without a doubt a novel coronavirus, previously unknown in humans but with features of many animal coronaviruses. So what's amazing about all this is that DNA sequences are unique in being
treated as public resources. Essentially every other kind of published information becomes the private property of the publisher, who thereafter has monopoly control over who can read it and how it can be used. It doesn't matter if the research was sponsored by public funds or for the public benefit. Just try to imagine if DNA sequences were treated the way every other kind of published scientific research is treated: not freely available for you to download and use creatively and use to develop new and powerful algorithms, but only available on the publisher's terms, at the price the publisher sets, in a thousand different journals, each with different restrictions on access. You can't search the sequence itself; you can only search it by the name of the gene and a few keywords. All the things that we've taken for granted that we can do with sequences wouldn't exist today, and the value of the sequence information to the scientific community would be so diminished that it's very unlikely that there would have been a genome project. So why should any other kind of scientific information be treated differently from the information of genome sequences in GenBank? Imagine if all other scientific publications were open, available for creative development of new tools like we've seen in genomics, to search, annotate, interlink and integrate not just DNA sequences but the whole body of scientific knowledge, what Shirley was alluding to, and many of the other speakers. I think you'd see a transformation of science and medicine that would match what we've seen in the genome project. And why should a college or a high school student with a passion for science, or a cancer patient who wants to learn about the research on her disease, or her physician, the people whose taxes paid for the research, or for that matter why should a physician or a scientist in Addis Ababa or Bombay, the people around the world who could benefit from ready access to the latest research, why should they have any less access to the
published results of publicly funded research than I have at a wealthy research institution like Stanford that can afford to pay for access? Is this what scientists and the public and the government want? Of course not, but that's what we have today. This system is a loss to science and society, and I think we as scientists owe the public who supports our research the full benefits of the research they support, and we owe each other the full benefits of the work we do as part of the scientific community. And this is an easy problem to solve. It's not rocket science. So let's fix this system. There's an active grassroots movement of scientists to make all scientific publications freely and openly accessible. Check out this website. And in closing, I want to urge all of you to do what you can to help us make all scientific research a public resource the way the human genome is today, and I want to just say thanks and congratulations to the genome sequence team.
Thank you so very much, Pat, for the beautiful work and also those great comments about free data release. And now we're going to shift over to some computational genomics and welcome David Haussler from UC Santa Cruz, and he's of course very well known to us for the beautiful work out of his group on the browser and the work in assembling the draft.
Thank you very much, Richard. Well, I'm in the awkward position of returning us back to comparative genomics a little bit here, and I would say that Eric Lander has already said a lot of the things I'm going to say, and then Bob Waterston stole my opening joke that Eric Lander has already said a lot of the things I'm going to say, so we're in some kind of a recursive function here, moving backwards, I don't know. So I'll do the best I can with what I have left that's unique in this talk. So I want to talk about three things. I guess I'm here representing an interesting subculture here, the cyber geeks and the math nerds that have all come into this field and I think have
invigorated it in an interesting way. So what are some of the things that they've done in these big analysis projects? I'll hit on just three in this talk: sequence assembly, which is a very important thing that has to be done computationally; genome browsers, which I want to emphasize are a kind of new computational microscope on the genome; and then computing evolution's path, which returns us to the comparative genomics theme. So, assembling the genome. Some of these stories have been told at length in the popular press, where in the draft phase of the sequencing there were two great projects. One was led by Gene Myers and Granger Sutton, who actually pulled together that whole genome shotgun data, which was an enormous computational task. And then there was also a somewhat last-minute and now famous effort by Jim Kent, who was then a graduate student, and he pulled together all of the data from the public project into a coherent draft sequence. And during the hectic days, the month before the June 26th announcement, he typed out some tens of thousands of lines of code, icing his wrists at night, just furiously trying to put together a program that would pull together that sequence into as coherent a picture as was being pulled together by Myers and his team at Celera. And I think this is one of these cases where you have 400,000 pieces. Of course, the groundwork for those pieces was laid by Phil Green, another one who fits this prototype quite well. We never could have gotten the pieces that needed to be assembled together without the brilliant advances of Phrap and the other mechanisms. But you see that these people have pulled together data unfathomable to do by any kind of hand analysis, in the nick of time. And one thing we're proud of here at Santa Cruz is that that draft, that first working draft, was posted on the World Wide Web from UCSC July 7th. This shows internet traffic. Now, outgoing is the green traffic. This is the weeks of the year 2000, and you can see it's
rather quiet accessing the UCSC site until the news that the first working draft of the human genome was set free, emphasizing some of the previous speakers, and unrestricted. You could go ahead and get a glance at your own genetic instruction book, and that hit caused half a terabyte of information to be broadcast out on the net that day. July 7th broke all previous records. And you know, scientists all over the world had access to it. Who actually looked at it? Other cyber geeks. They were looking for mysterious messages, counting the number of times GATTACA had occurred in the sequence. It wasn't exactly scholarly work on July 7th, but at least it was free, it was out there, everybody could jump at it and get to it, and so I'm very excited about that event. If we look recently, other important advances have been made in assembly. The Arachne assembler is a landmark, putting together again the whole genome shotgun sequence of the mouse, by David Jaffe and Serafim Batzoglou in the Lander group. And also there was another assembly I want to mention here, the Phusion assembler, put together by Jim Mullikin. So there were actually two assemblies of the mouse genome, and it was beautiful to see both teams in this friendly competition. We were able to compare the assemblies. It was tremendous, it was almost a dead heat, they were very, very close, and we needed to settle on one for the paper, and Arachne was chosen. And then most recently, the wonderful announcement in November 2002 of the first draft of the rat genome, built by yet another technique, which is an interesting technique developed by people at Richard Gibbs's group to pull together a combination of clone-based and whole genome shotgun data. So these are all advances, and computer scientists need to keep track of what's going on in the field and be there, ready to pull together the new kinds of data sets. If you look at this next phase: so that gives us the sequence, but what do we do with it? Now I want to
move into the second part of the talk, where we look at genome browsers as microscopes. We heard about the Ensembl browser at the press conference. There's another browser called the NCBI Map Viewer, and then there's the UCSC Genome Browser, which was also created initially by Jim Kent, and now we have a large team at Santa Cruz of several staff and graduate students who work furiously on this in times of urgency, and no more so than in the last four days. I have to say, credit to these people. You can go to genome.ucsc.edu right now, click a button, and you will see this current draft that we only got on Wednesday. Normally, with the thousand computers that we have there, it takes longer than that to actually map all the information in GenBank onto the genome sequence and give you a browser experience, but they've made incredible time here. What does it mean to have a browser of the sequence? Now if you look at the way we had to access the genome before the sequencing project, this was our view of the genome: a cytogenetic view of the genome or, as we heard on some of the beautiful film clips here, a genome that's based on maps of genetic markers or radiation hybrid markers. These maps were crude. The distances were measured in centimorgans. Even in the radiation hybrid map, if you go back, you can't get more than a resolution of 100,000 bases or more with this. Now we have an opportunity to go much richer. And if you think about the great things that happened based on these early crude maps, it is amazing. If we think about the discovery of the Huntington's gene, I was just talking with Nancy Wexler about this, the story is amazing: Lucky Jim Gusella getting it on the first marker in Housman's lab. That's amazing. Nowadays you shouldn't need that much luck. So the bad news is we're kind of taking away the luck factor here with these browsers. If you look at this, we can take a piece of, in this case, chromosome 4 and start to zoom in on this
and the first thing you notice is that that old world, the cytogenetic world, the way we thought about the chromosomes in terms of bands, has been mapped onto the genome sequence along with the full genetic maps and radiation hybrid maps. This is the work of Terry Furey in my laboratory, who's here today, and it was a very, very difficult informatics task to actually connect the two worlds. Our previous maps of the genome, there's so much literature and culture that lives in those old worlds, it has to be pulled forward. So you zoom in to a certain part of the genome and immediately you start to see these large blocks of synteny that Eric Lander talked about. You can see the relationship with mouse at this very, very large level. These are pieces of mouse chromosome 5 and 7 that are orthologous, or derived from a common ancestor, with these regions. And we can choose then to go ahead and zoom in again on a piece of that. So we take this and zoom in again and we see a cluster of genes, and inside of that is the Huntington's disease gene, right here. These genes are depicted; the boxes are depicting exons, the hashed lines are depicting introns, which Phil Sharp told us so much about. And at this point it looks like there are a lot of exons, but actually, as you zoom in, you start to see that at this level we really can't see gene structure. It's kind of blurred together. So let's zoom in again a little bit and blow up this Huntington's disease gene. You see the enormous number of exons and the complexity of the gene. Here we see now a more refined version of the comparison between human and mouse. This is a careful plot of the similarity between human and mouse in 50-base segments, normalized to what we've estimated as a neutral rate of evolution in that segment, and you see we're peaking around exons, but you can't see that here. We see the single nucleotide polymorphisms that we've already heard about as well, and they tend to blur together. So let's take an exon here and zoom in a little
more, and there we see the exon basically swimming in an ocean of introns. Here are the high conservation levels that you see that are often associated with exons. Here are some of these interesting spikes that Eric Green was just talking about. And here's a polymorphism; I've expanded this track. If you zoom in on that further you can get down to the base level. This would get you right through to the dbSNP database at NCBI. You can look this up; it turns out to be a coding SNP, so it actually will change the codon in that exon and may be of significance. And so this is a launching point here for going into various parts of the experiments, as Eric said. So if you look, what else can you see? There are all kinds of other types of information that can be tied to this, not only information from the literature; here is the start of one of the 5,000 articles on the Huntington's disease gene. Or if you click through to the mouse ortholog of this, you can see the expression of this in the head fold of an 8.5-day embryo in this image. If you look through the expression data that Pat Brown was so elegantly talking about, you can select data. This actually is data from Affymetrix technology that's put on the web free by the Novartis Foundation, but there's also a track showing the data from Pat Brown's lab as well, and that is very important information. You see the higher expression in brain tissue there. So all of the different types of data that can be created, either archivally or from high-throughput methodologies, can be linked to the genes. And so I want to present, in this last comment about the browser phase of this talk, a kind of a vision: what are these new microscopes that we can put to the data? My goal is to see our lab try to produce a browser that is essentially a continuous engine for discovery. It would be possible in the not too distant future to look at a browser that
could take multiple streams of high-throughput genomics data of every variety, chromatin immunoprecipitation and all kinds of other new methods, as they're generated, fed into nightly updates of the browser and displayed, and at that point that browser becomes a new tool for looking at the genome. Right today, and I want to emphasize this different philosophy, right today you can go to that website and you will see all of the data that we could suck out of GenBank mapped to new regions of the genome. Before all of this high-throughput data, that is already mapped there. If you go to one of those places you will see a configuration of data that no human being has looked at before. This is produced nightly by machines and collected and put out there. It is a microscope. You're seeing new stuff. You're not seeing things that have already been sorted over and picked over. It's amazing, and it's an exciting thing to give it away in that sense. The last part of my talk is about challenges in genome analysis, and we've seen this: determine the functional elements, including the coding and non-coding. And I want to emphasize again that we want to get to a full systems biology, and the hope is that the computer nerds and the math nerds will get some kind of model of the cell, but we are not anywhere near being able to do that until we get the players right, and we haven't even gotten that right completely. These show gene predictions that are made by computer. This is the Ensembl gene prediction in red here, done by the group of Clamp and Birney et al., and then this is the gene prediction by David Kulp and his team at Affymetrix. These are the two predictions, primarily evidence-based predictions, that were used in both the human and the mouse papers to identify initially a crude cut at what the gene set was. You can see the evidence from expressed sequence tags very much highlights this particular gene, the insulin-like growth factor binding protein 6. So this is a case
where the gene predictions are quite accurate, but unfortunately this is the exception. If you look at other places where the computational predictions from several different groups have been made, they disagree to quite a large extent, and so we are at a point where we haven't solved automatic gene prediction. So that's really a significant challenge. We hope, as has been said before, that conservation with other species will help. This is an image of the conservation levels that you've seen before, in 50 base pair segments with mouse, looking at another member of the IGF family, the acid-labile subunit. If you look at this, these peaks represent a 1 in 10 chance, 1 in 100, or 1 in 1000, on this log scale, of seeing that much conservation under a neutral model. And outside of the coding region there's a distinctive peak here, and this is in fact associated with a known transcription factor. So we can in principle identify these. These might be real as well, but this also might be noise. And so I want to emphasize what Eric Green was saying: we need more than just the mouse on this. If you look at the data that Eric Green's lab has produced, you can get a fascinating glimpse at some of the other structures that lie hidden from our current view. The computational analysis can bring these out, and here I'm emphasizing a different type, just to be different. Instead of looking at the kind of sites we've seen before, here we're looking at Mathieu Blanchette's program, blowing up the region here on the 3 prime side of genes that has a structure to it. So computers can also be applied, in this case the Mfold program, maybe in combination with the high conservation in this structure, to give a stronger prediction of functionality. And maybe even a hint, and here I go really giving away something that is kind of crazy; we're currently looking at this one with many areas in the laboratory. This is about a 90 base pair region in the middle of the ST7 gene, which is associated with tumorigenesis
and even with autism in one study. This thing is far from either flanking exon. We have 90 bases of DNA that is conserved with only one substitution in all nine mammals, and look at its structure. This thing does something. I have no idea what it does at this point. I'm just stupid enough of a geek to give it away at this point for anybody to look at, I guess, but it's fascinating. It's fascinating to see the kinds of things you can discover with sensitive computational analysis, and I was very gratified to see Eric's data on how accurate the computational predictions were when you got the marsupials, essentially confirming them. Well, okay, just to wind up here. We've seen this: we estimated about 5% of the human genome was visibly under selection. This was based on comparison with a neutral mutation rate estimated from ancient ancestral transposons that we believe are evolving at a neutral rate. There are a lot of caveats on this. I want to say, let's be cautious about this. We haven't actually observed the selection in this; we need more species to really get a model of this. We want to go far beyond these simple two-species experiments and crude, crude estimates like this 5%. We want to actually have a model of the way molecular evolution is happening. Adam Siepel in my group has got an exciting new model that's starting to get at the way codons evolve, with about 500 different parameters. We can get much more richly parameterized mathematical models, continuous-time Markov models of the evolutionary process in different types of functional sites, and those will be ever more powerful microscopes on the sequence. In summary, I think all of the previous speakers have led up to this, but I'm not sure that anybody has said it in kind of simple terms. There is a grand challenge out here of comparative genomics from the point of view of the human genome, and that is to reconstruct the evolutionary history of each base in the human genome, going back as far as we can. We
should do that. I love the analogy of origin stories that we heard earlier. We care about how things got here not just for recognizing their function; I think we all care about how we got here. Thanks. So I want to acknowledge a lot of the people in my lab. Again, they can throw me off the stage if I read all the names. I acknowledged a number of people there in the previous talk, but there are a huge number of people that go behind the scenes, under the hood so to speak, on all of this, and they deserve credit. And then of course Eric spoke eloquently of this amazing collaboration, the most amazing experience to work with this group of people. We all hated the conference calls, but in fact it was an extraordinary experience. And this is unfortunately a very limited list, so I won't read it so they don't drag me off with a hook. Thank you. David, thank you. Our last of these five targeted speakers is David Bentley, a human geneticist for a long time and one of the people who was wise enough to move to the Sanger Centre in its earlier days, because I think he saw the power of what he was getting into. He's going to talk to us today. Richard, thank you, all, for the pleasure of hopefully taking a brief exploration of human sequence variation, which is very much a story about uncharted territory, an exploration which starts of course in the foundation of genome sequence information. We have just one reference genome sequence with us today. There are some six billion individuals out there which remain relatively uncharacterized, and there's an enormous wealth of information that we can bring to bear. What can we use from genome sequence information? Clearly the two important areas: the sequence providing us with this metric on which to lay important information, and particularly the annotation of the genes, incomplete but growing all the time, and to overlay with that, on the same coordinate system of course, our increasing knowledge of variation at all points in the genome, a couple of bits of
information, allele frequency and so on, and to focus, combining those two bits of information, on the all-important subset of functional variants, most of which of course might be expected to lie within protein coding sequences, but perhaps now the emphasis is also on variants which lie in or give clues to other possible functional sequences, and therefore the requirement to really continue this effort of annotation alongside discovery of further sequence variation. What about the origin of sequence variation? Just very briefly, of course, it arises from an ancestral sequence by a combination of substitutions, of insertions and deletions, but also by recombination, which results in the reassortment, new combinations of previously existing variants, providing an increasing number of independent but co-existing haplotypes which can then undergo selection, positive or negative, or genetic drift, which results in gradual alterations in the frequency with which haplotypes occur within a population. As we move to look at a present-day population, the majority are SNPs, of course, perhaps particularly attractive because of the possibility of new automated methods for acquiring the information about SNPs in multiple individuals, the suggestion of collecting personal genetic information on a very large scale, the technology being available therefore to ultimately provide a barcode of an individual. Whether or not it's the right thing to do, it is certainly technically very feasible. A review of SNPs in the genome: 90% of all variants, the remainder being insertions and deletions; one SNP per 1.25 kb, between any two genomes that is; and of course, given that each one of us has two genomes, one maternal and one paternal, each of us then carries something like 2.5 million variants just between the two genomes in our diploid genetic content. In trying to focus more on the particular functional variants, we can take one look towards the exons. If we consider this fraction of 1.5% or so of the
genome containing exon protein coding sequences, perhaps 60 to 120 thousand of these variants in every individual might be within the exons, and perhaps a third of those are non-conservative and therefore really top candidates for possible functional variation, because they in some way visibly alter the protein structure and possibly function. But of course this again neglects perhaps one third or two thirds more of variants that might lie outside exons, and so we should not focus all our attention on the exons themselves. Examples of functional variants are familiar to you, but this does illustrate a point. In some cases variants are immediately predictable because of the nature of the change they give to the annotation of the gene in which they lie. In other cases, even for a well-known variant, it is perhaps rather less obvious from the change that it creates in the amino acid, and further functional verification is needed in order to prove or disprove a mechanism by which such variation might actually give rise to a phenotype. And thirdly, of course, again a well-known example, but involving altered interaction between two proteins: the factor V clotting factor, the Leiden mutation, abolishing the ability of the clotting factor to be degraded and resulting in elevated levels and a potentially higher risk of thrombosis as a result. This is an example I choose because it illustrates the principle of the association study, the beginnings of looking at more complex diseases, which have so far been successful in taking, of course, a collection of well-phenotyped cases and carefully matched controls, to eliminate the potential variability between the two collections before proceeding with a test of such a functional variant, and then to demonstrate that there is a distinctly different distribution of those variants in the controls compared to the cases, illustrating significance, and also illustrating of course the substantial amount of additional information that we as yet do not know about other factors which confer risk, which may
be additional genetic variants or also environmental factors, suggesting as well the long road to dissecting out every risk factor involved in any complex disease. This then is an example of the direct approach for association studies, where we have the gene, we have candidate functional variants, and we go ahead to test them in determining such a genetic association. By contrast, of course, given that we do not have a complete knowledge of all the genes or other functionally important regions, we may not indeed have knowledge of either the genes or the functional variants. It has been suggested, and much discussion is now focusing on the value, the prospects and the limitations of, an indirect approach using large panels of SNPs spread across the genome, placed in the reference sequence, and selecting subsets, perhaps those nearby, which on the basis of their performance in association studies, to some extent or completely, may reflect the as yet undetected functional variant and allow a targeted search to complete and characterize the functional variant from the data from the indirectly assayed SNPs. I'd like to spend a few minutes just exploring one slide to explore the direct approach and the implications of such an approach, and also of course the implications surrounding evaluating the indirect approach as part of the haplotype map project. The direct first. If we consider the possibility of discovering SNPs, perhaps by targeted sequencing of the exons, one provides the possibility, by such a targeted sequence-based system, of going deeper should one wish to find additional variants, additional variants which therefore may be rather rare, and the possibility is therefore there for going to any depth to examine the possibility of the involvement of very rare variants in disease once one has established perhaps the major associations. The caveat in such an approach, of course, applies to these other regions which may turn out to be functionally important; one might wish to go back to sequence those regions as
well as the annotation develops in the next few years in the genome, and this then of course results in perhaps a more complete view, but still one is worried perhaps about the unknown. Ultimately one would argue that it would be jolly nice to get complete genome sequences from a large number of individuals. That would be one of the grand challenges facing all of us around the world as we look for technologies for cheap whole genome sequencing. To return to the indirect approach for a moment, I wish just to take you through the concept of linkage disequilibrium. If we consider two SNPs, which might either be SNPs in the map or one SNP and an actually unknown functional variant, a biallelic variant, if those are in equilibrium there is nothing tying them together and one sees all possible combinations of the two alleles. There is no linkage disequilibrium, or no LD. By contrast, if there is in fact complete LD, on a scale of 0 to 1 a D prime value of 1 indicates complete LD, there has been no ancestral recombination in the population to disrupt the two ancestral haplotypes, the two alleles, and therefore one sees only a subset, a limited number of the theoretically possible haplotypes. And this principle then gives the first view of the possibility that one might obtain a simplified view of the variation in the human genome on the basis of the LD between SNPs, a simplified view which would be enormously useful in applying more refined analyses of association studies. Haplotypes then can be defined. These are the limited number of haplotypes, and LD therefore is reflected in this limited number of observed haplotypes in the genome. To look at the properties of LD: association studies done indirectly will only work if this LD applies across the genome. LD does decay with distance. From the early study we took on the chromosome 22 data, led by Ian Dunham's group, to illustrate that basically SNPs which are close together do have a high LD, and it decays over 50 to 100 kb in general, looking at all pairwise comparisons of
SNPs. If we look along the entire chromosome 22, you see this marked feature of the very variable levels of LD: great peaks covering several hundreds of kb where there's very high LD between SNPs, indicating therefore that the association between variants in these sorts of regions would be likely to be very substantial and very useful for association studies, but also other areas, troughs, where at this level of density of analysis, it's an approximate figure, it indicates there's more work to be done. The all-important correlation, given that LD is perhaps highest in areas where there is least ancestral recombination, correlates very nicely with the observation of present-day recombination in pedigrees, giving an important clue or confirmation of the correlation between present-day recombination and the presumed driver of LD, ancestral recombination. If we look in greater detail on chromosome 20, this is now a 10 megabase region from a current study. As you increase the density of SNPs in the study, the actual pattern of LD is very robust; these LD plots are robust to marker density, so a reliable, robust dataset, and furthermore again you see the correlation with genetic distance, the flat areas correlating with the areas of high LD. I now want to show you just two views, as we begin to develop views of looking at LD within the human genome, views which as yet will stand further refinement, but ways perhaps to try to infer useful reagents for association studies and also useful ways of looking at the underlying processes which have given rise to LD. First of all, the proposal from the Whitehead group: examining the D prime values between pairwise markers, and if high LD extends over all combinations of pairwise comparisons within a group, one can declare a block of high LD. On that basis, therefore, one assumes that there is little or no ancestral recombination in the entire region, and therefore as one carries along the genome one then selects from these regions a subset of reference SNPs which actually still captures the
limited set of common variant haplotypes which occur within these regions. These other regions are not studied at this stage in sufficient detail and additional work is required, but we have on this basis one way of looking at a simplified view of the genome, clumping together the variation, of developing a simplified set of reagents to reduce the cost of association studies without loss of information. We look on chromosome 20 for a moment, a slight contrast to the previous data that I showed you. Here, as you increase the SNP density, you do see marked changes, development of the block structure as you move to higher density, so the analysis is not particularly stable at low marker densities and really does require gradual development up to, in this case, something like 65%, two thirds of the chromosome, lying in well-established blocks of high confidence. So towards an end point, but not there yet. The other view I'd like to share with you, again an early and incomplete view, is the concept, instead of looking at the blocks of high LD, of looking for the events of recombination, the minimal number of recombination events which are required to fit the data for the pairwise LD between SNPs. If we look at a simple run of the genome along here, in comparison of pairwise LD, what is striking then is that the areas which are white reflect very high levels of recombination which must have occurred; red reflects the absence of detected recombination. An incomplete data set as yet, but clearly one can see this very heterogeneous distribution of recombination and a very marked hotspot. If we look at this part of the plot, it's a hotspot in a West African population, the data provided by the Whitehead group, analysis done at Oxford, but also the same point identified in the CEPH pedigrees, indicating a level of correlation between different populations which needs further exploration. I don't have a slide, but just to mention briefly, much of this work does involve the acquisition of further data. One looks
hopefully to the work of Alec Jeffreys and the Leicester group, where a very small region of the genome was nevertheless covered to 95% by high LD blocks of 60 to 90 kb, and the areas in between, only 1 to 2 kb, were directly confirmed experimentally to be rich in recombination events, which would be an attractive and important outcome if the same pattern could be developed over the entire genome. So what about the prospects for applying our increasing knowledge, and hopefully simplification, of sequence variation to the discovery of disease genes? Let's reflect for a moment on the progress to date. During the course of the last 15 years, this data from Online Mendelian Inheritance in Man, provided by the NHGRI, illustrates certainly an encouraging growth in the discovery of genes involved in human disease, but one should point out, of course, that most of these are Mendelian, high-penetrance alleles, monogenic disorders on the whole. And therefore this is certainly a valuable approach and we should not ignore it. It will continue to be valuable. There are additional Mendelian traits segregating in families that can be approached by this sort of approach, which has really been the foundation of much of disease gene discovery over the past 15 years. But looking ahead to the complex disease, I'd like to show you the iceberg, the tip of which Eric showed you before, indicating encouragement for the prospects, in that a number of associations are emerging in different complex diseases, well-known complex diseases, where the all-important allele frequency of the functional variant is indeed high, and therefore a common variant can be detected, given a sufficient relative risk, in moderate-size, at least by tomorrow's standards, association studies, which can be done on increasingly refined sets of SNPs in the absence of direct knowledge of the functional variants but by working with the product of the haplotype map of the human genome. These two factors are critical in the success with which such associations can
be done, reduced allele frequency of course requiring larger populations, and also low relative risks requiring much larger samples to detect. And there is a point perhaps, if this arrow continues downwards, where one moves out of the realm of the common variant, common disease hypothesis. There's every reason to suspect there's a complete allelic spectrum of variants causing human disease, and therefore for these rather rare variants one should look again at the sequence-based approaches through genes. We should not cut ourselves off from being able to go deeper and deeper in obtaining a fuller knowledge of sequence variation in order to have, as somebody mentioned earlier, a full understanding of the genetics of human disease. By way of conclusion, I'd just like to share a couple of remarks with you which reflect the importance of the completion of the human genome sequence, the international availability of it and the importance to everybody in the world. These remarks are drawn not from the literature but from recent quotes. To start with, that of Elias Zerhouni, who said at the end of last year, in sober reflection, "we have never faced a problem like this in the history of science as we move forward." Interesting to note, Francis Crick made the same observation: "there seems to be no limit to the problems which now confront us." It's all right and good to have the responsibility of realising the challenges that face us, bigger than ever before as a result of the production of the human genome sequence. What we also need, of course, is the sheer enthusiasm of young investigators who are prepared to dive in and use the information, perhaps not seeing the full picture as yet, as exemplified by "now we have a little sequence of everything," which was told to me by a young postdoc in India on reflection on the release of the human genome sequence. Thank you. Thanks very much, David Bentley. We're going to hear from two more speakers about the future. The first one, it's an
enormous pleasure for me to introduce, Ari Patrinos, the Associate Director for Biological and Environmental Research at DOE, really needs no introduction to you. But Ari, I'd like to take this chance to really thank you publicly for your wisdom and cross-agency leadership, across sector and across oceans. I think without you we wouldn't be where we are today. Thank you, Richard. Those are very, very kind words. It falls on me to tell you now the road ahead for the Department of Energy, and even though I must say our paths will diverge somewhat from the National Human Genome Research Institute in the years ahead, many of the talks that you've heard today in fact have highlighted and have emphasized how much still remains very common to our pursuits, and that opens tremendous possibilities for other cooperation in the future. So let me first start by joining others in congratulating the many scientists that have been involved in the completion of the human sequence, and of course we take great pride in the sequencers at the Joint Genome Institute and their colleagues and collaborators at Stanford University and at other laboratories around the DOE complex that have been so instrumental in completing their piece of the human genome sequence, chromosomes 5, 16 and 19. Those folks shown here, in fact our team, deserve a lot of congratulations and a lot of thanks for the many contributions they made, and I salute them publicly for all their contributions. I'd like to say that our genome facility at Walnut Creek, the Production Genomics Facility of the Joint Genome Institute, will continue its sequencing tasks in the years ahead to serve the sequencing needs of the DOE program as well as of the Genomes to Life program, but more importantly will become a scientific user facility, like the other scientific user facilities that the Department of Energy is well known for, serving the broader needs of the scientific community, in this particular case serving the sequencing needs
of the scientific community following a rigorous peer review and evaluation in the selection and the queue in which those organisms will be set. You already heard many talks about the value of sequencing many organisms, and of course I am particularly biased towards the microbial world, but in general I can use another famous man, a famous quote, for why we need to go further into sequencing other life forms. The invitation program that you had today started in fact with a reference to Aristotle, who looked at the causative force between an acorn and an oak tree. He made many mistakes when it came to biology, but this is one mistake he did not make, when he said that we should venture on the study of every kind of animal without distaste, for each and all will reveal to us something natural and something beautiful. These are great words to live by and to use as a compass as we proceed in this new century of biology. The Human Genome Project, in our opinion, has been a lot more than just a revolution in understanding human biology, a lot more than making major contributions to human health and human medicine, although they are of course the most prominent and maybe even the most important. For us at the Department of Energy, it has also mattered in terms of setting the stage for the future, in terms of developing the technologies, pursuing aggressively computational science, building the resources, emphasizing high throughput and, yes, building better public and private partnerships. For us it's been more than just biology, even though for the little islands that we represent within the Department of Energy biology has always been central, and we see it as encompassing the other disciplines: computational sciences, engineering, chemistry, material sciences and so on. In the Department of Energy we are committed, as I mentioned with the sequencing facility, to building and operating for the broader scientific community the scientific user facilities
that will be central to this new century of biology. The light sources that the Department of Energy has built and operated are now dominated primarily by the science of structural biology and, yes, by structural genomics, a little example of which you heard from Shirley Tilghman a little while ago. The Department of Energy is committed to maintaining these facilities and in fact building the next generation, the fourth generation of light sources, perhaps based on the free electron laser, that can do even more for structural genomics and structural biology. Very important in this vision for the future is computational science, and in this field also we intend to play a major role in order to support the broader scientific community's pursuits and needs in computational biology. Here's a little graph we came up with in the context of the Genomes to Life program, representing our future in the post-Human Genome Project world. This is one example of the kind of challenges we face in computational biology and the various problems we have, starting from where we currently are with respect to comparative genomics and maybe a bit in terms of genome-scale protein threading. This is in fact our current computational capability, which is in the 10 teraflop range, for those that are teraflop literate. The fact of the matter is, if we are to accomplish some of the goals that you've heard about with respect to molecular-based cell simulation that Francis spoke of, or the protein machine interactions that Eric alluded to, we need orders of magnitude greater computational prowess. We need something in the petaflop range and the multi-petaflop range; this is 10 to the 15 floating point operations per second. And in order to accomplish this we need to make significant investments in the computing science, the basic computing science, to realize that vision. Clearly for us the future in the post-Human Genome Project world is the program that was identified earlier, and that's in the issue of Science that you picked up
perhaps this morning. It's the Genomes to Life program, a labor of love from many, many people, starting with our statutory advisory committee, the Biological and Environmental Research Advisory Committee chaired by Ray Gesteland, and many of his colleagues across the scientific community that came together over the last three years to give substance to this vision. And it's a vision that has significant and potentially valuable applications to the Department of Energy's goals and missions. Here are some perhaps futuristic ones, but they are nevertheless realizable given the successes that this project, the Human Genome Project, has had over the years. I show three examples here. One is in terms of bioremediation and the potential to save billions and billions of dollars in toxic waste cleanup and disposal, which is one of the tasks the Department of Energy is saddled with from the legacy of the Cold War. Another one is to help stabilize atmospheric carbon dioxide through carbon sequestration to combat global warming. And another one is to contribute to U.S. energy security through biohydrogen-based industries, for example; in essence, clean energy through the miracles of biotechnology. One can say these are futuristic dreams. They are, however, dreams that could be realized with the proper investments in the basic science that is described in the Genomes to Life program. In just a few words, the Genomes to Life program has four basic goals. The first goal is to understand the fundamental multi-protein molecular machines of life that make a cell work. The second one is to understand the regulatory networks that drive these molecular machines of life. The third goal is to relate the functional diversity of microbial communities; our tasks are primarily focused on microbes and microbial communities, especially, in the case of bioremediation, microbial communities that exist at our contaminated sites. The fourth goal, which underpins all others, and it's one that I made reference to, is
the computational infrastructure that needs to be built to support these goals, that you'll hear from Francis and that you've also heard from other speakers earlier today. So join us in realizing this potential of the genome revolution, and read about us in Science magazine, our article that you received today, or visit us at our website, doegenomestolife.org. I think we all stand at the edge of an incredible region of undiscovered territory, and we are excited and thrilled by the potential of exploring it properly and putting it to use for our goals and for the good of humanity. I hope we will be able to meet that challenge and accomplish those goals, and I thank you for your attention. Thank you very much, Ari. And it's over to Francis, who's going to give the closing talk and close out the session. I might not get a better chance ever to thank Francis publicly, not just for your stewardship, Francis (I have no doubt that we wouldn't be here but for you), but for your staff, who I think we've just developed some strong relationships with and respect so much. And I know you're going to be thanking them publicly elsewhere, so you won't name them all. But also to say thank you for bringing us to an end, and thank you for letting us focus and go out with a bang and not a whimper. Thanks, Richard. Well, yes, bringing us to an end not only of the genome project but of a long day, and a wonderful day it has been. I hope you also have had the same sense of awe at the lineup of presenters that have come to this podium since this morning, and a similarly spectacular lineup will be here tomorrow. I am impressed by you, and by the souls who are still here, and I will not abuse you by going on at great length, but I did have a couple of things that I wanted to share with you, particularly about the vision for the future of where we might go next. Before I do that, I'm not sure that all of you saw, because it was just released at noon today, a really quite remarkable statement coming from the heads of
government of the six countries regarding the completion of the human genome sequence, these six countries who worked together to do this. And so I'm going to read this to you, because I think it is unprecedented for such a statement to have been made, and it undergirds the historic significance of what we're celebrating today in a rather remarkable way. It reads: We, the heads of government of the United States of America, the United Kingdom, Japan, France, Germany and China, are proud to announce that scientists from our six countries have completed a sequence of three billion base pairs of DNA of the human genome, the molecular instruction book of human life. Remarkable advances in genetic science and technology have been made in the five decades since the landmark discovery of the double helix structure of DNA in April 1953. Now, in the very month and year of the 50th anniversary of that important discovery by Watson and Crick, the International Human Genome Sequencing Consortium has completed decoding all the chapters of the instruction book of human life. This information is now freely available to the world without constraints via public databases on the World Wide Web. This genetic sequence provides us with the fundamental platform for understanding ourselves, from which revolutionary progress will be made in biomedical sciences and in the health and welfare of humankind. Thus we take today an important step toward establishing a healthier future for all the peoples of the globe, for whom the human genome serves as a common inheritance. We congratulate all the people who participated in this project on their creativity and dedication. Their outstanding work will be noted in the history of science and technology, and as well in the history of humankind, as a landmark achievement. We encourage the world to celebrate the scientific achievement of completing the Human Genome Project, and we exhort the scientific communities to rededicate themselves to the utilization of these new
discoveries to reduce human suffering. And it is signed by His Excellency Jacques Chirac, the Honorable George Bush, the Right Honorable Tony Blair, His Excellency Gerhard Schroeder, His Excellency Junichiro Koizumi and His Excellency Wen Jiabao. Those are the heads of state of the six countries. Isn't that amazing? So I couldn't deny myself the chance to read that to you; that was just too stunning, to find that coming out here in the middle of today. Well, what I want to do in a few minutes, and it will be only a few because I'm going to exhort you to read more about it in this article that you were given as you came to the meeting today, is to say a word about this vision for the future, a vision which we at NHGRI have been working on for more than 18 months. And many of you in the room, if you didn't run fast enough, found yourself being pulled into that process by being invited to workshops, to Airlie House meetings, by being called on the phone by various of us to seek your advice. So this is very much a joint enterprise, and well it should be. We've heard a lot today about the Human Genome Project, about the things that were accomplished during its 13-year enterprise, all of which are detailed in this visual, which is the centerfold of the article on the vision for the future. But it's the "to be continued" down there in the corner that I think we should particularly now be thinking about, and which will be the major focus of what is being discussed tomorrow and has already occupied a lot of this afternoon. Because after all, we have accomplished all these original goals, and the question on everybody's lips is, what's next? And preparing for that has been a process involving more than a dozen topic-specific workshops that we have organized, some with our colleagues at other institutes, over the course of the last 18 months, and that has involved both an early meeting at Airlie House and then a more recent Airlie House meeting where we tried to pull all this together, as well as significant
involvement by our council, having been after all the ones to finally endorse the enterprise as it came out. And I want to express my gratitude particularly to the council and to the subcommittee that led this enterprise, particularly to the 600 scientists who were involved in the effort to distill all of these wise ideas about priorities into this vision, and most particularly to my colleagues Eric Green, Alan Guttmacher and Mark Guyer, who slaved through more revisions of this document than you could possibly count in order to produce what you have in front of you. And of course if you turn the pages you'll quickly encounter this metaphoric image, and it is an image that we intended to carry a certain message, and I hope it does that for you. Because you will notice that this is a building, a building we want to now put together. We need a good architect in order to design it, and this in fact is an article that aims to lay out the blueprint of this building, to design its architecture. It has a solid foundation; that's the Human Genome Project down there. And it has three floors, this building does: genomics to biology, to health and to society. It also has these six cross-cutting elements, which touch on each of the floors because they're important for each of the floors, and they are, as you can see, labeled resources, technology development, computational biology, training, ELSI and education. Please note, if you will, the pathway up to the door is handicapped-friendly. We are very much interested in reaching out particularly to those who need medical advances to happen quickly, because that is our dream for what the genome project will produce. And notice also that the door is wide open, because we will continue the process of having our data open and accessible to all, and that is very much underlined by the words in this document and by a recent important meeting in Florida where the importance of pre-publication data release of all kinds of data that represent community resource projects was laid
out in a fashion that I think was very well received by both data producers and data users. Because I don't have time to go through the 15 grand challenges that are outlined in this document, I am just going to mention a few in passing. And again I think it's important to note that this is unlike the previous plans for the genome project that you may have read in previous years in Science, where we were basically putting together a description of how we would achieve the goals of the Human Genome Project. Those are now achieved. So in a sense we are really doing what Bruce Alberts and his panel did back in 1988, starting with a blank slate and saying, where should we go next? And of course the slate is not really blank; it's already been written upon by a lot of people with good ideas, and it's our task to try to sift through those. So in the genomics to biology floor you will see things like defining the structure of human variation, obviously now already underway, as in the elegant talk you heard from David Bentley, with the production of the human haplotype map and all the ways that that will get used. The sequencing of lots of additional genomes, also very much underway, as you heard from Bob Waterston and Eric Green and others. To develop new technologies for sequencing, genotyping, expression analysis and proteomics: those are critical. I think that point has been made repeatedly today, that we should never underestimate the value of advances in technology; they will often be the things that allow the real leaps forward. We need to identify all of those functional elements of the genome. We are initiating a project called ENCODE, which stands for the Encyclopedia of DNA Elements, that invites all investigators, be they experimental or computational, be they in the public or the private sectors, be they in the U.S. or elsewhere, to focus their energies together on a carefully chosen 1% of the genome, already identified, and to get everybody to put their heads together and figure out how can we
really identify which parts of the genome do what. And that is a very exciting idea, a cooperative enterprise to figure out how to determine the parts list of the genome, and if that succeeds, and we expect that it should, then we'll be able to extrapolate that to the other 99%. We need to push the proteomics agenda, and there again I think technology is going to be critical, in order to identify for mammalian cells all of the proteins and their interactions. And ultimately, though I take the points already made by my friend Ari and by David Haussler that we have quite a challenge if we're really serious about developing a computational model of the cell, it is still a worthy goal, one that sort of electrifies the imagination. I mentioned technology. Let me just put a few challenges out there which, if achieved, could change a lot. Of course we heard about the haplotype map and its construction, but the utilization of that map in a broad sense, to look at virtually any disease in a case-control whole-genome association study, will be dependent upon a drop in cost for genotyping, so that you could carry out such studies for $10,000 or less on thousands of DNA samples. That cost curve is coming down; it needs to come down quite a bit more. So I'm hoping that within this room, or over in Masur if there's anybody still over there, or the people who are listening to this on the web or by satellite download or wherever, we'll begin to think about new ideas about how to do that. Or to do this: maybe we wouldn't worry about genotyping so much if we could just sequence the whole genome and be done with it for $1,000 or less. And there are new technologies that you've heard about, briefly at least, at this meeting that sequence single molecules using approaches like nanopores, which, if they can be reduced to practice, hold the promise of allowing us to jump to a whole new Moore's law curve, different from the one we're currently on. The one we're currently on is pretty cool, because we're dropping the cost of DNA sequencing by a
factor of 2 every 22 months. But wouldn't it be great if we could jump to another curve and start down that one? Because if we want to get to $1,000 for a mammalian genome, we need a new technology or we need another 15 years, and we're impatient. Wouldn't it be nice if you didn't have to go to all the trouble of reconstructing and cloning and putting together the fragment that you want to use to make a transgenic mouse or do whatever experiment you're interested in, and you wouldn't have to keep freezers full of all of those clones that you've mostly forgotten where they were, and they've lost their labels and so on? We would love to move into an era where, when you want a particular DNA molecule, you just punch it into the computer and out it comes in a couple of hours with high accuracy and low cost. And moving into epigenetics is an area we need to push, too. We need to understand methylation and other epigenetic phenomena, and we need to of course look at modification states and abundance of all proteins, and here's a challenge, in a single cell in a single experiment. Again, these are quantum leaps. They will be enormously difficult to achieve, but we are intentionally putting them out there as challenges for the scientific community. Perhaps the floor that deserves now the most intense attention is this promise that we've always had, that the genome project was going to move its benefits into health once we had a lot of the materials together, and now we do. And so in the genomics to health arena, the focus on identifying all the genetic and environmental risk factors for common disease is a real possibility, as is the ability to develop sentinel systems that can be used on currently healthy individuals to identify early onset of illness even before symptoms have appeared, and those same kind of systems, as elegantly described by Pat Brown, to develop a new molecular taxonomy of all disease. One that I'll come back to in a minute is the business of chemical genomics, where we try to take
advantage of the wealth of small molecule technology and make it available to academic researchers. I'll say more about that in a moment. If we're going to see all of this really reach into all the places that it should, we're going to need large human cohorts in order to assess genotype-phenotype correlations in an unbiased way, and we need those cohorts to include not only people who are ill but people who are unusually healthy, because they may tell us interesting things about genes that confer resistance to disease. Along that same line, our study of genetic variation may very well help us sort through the complex area of health disparities. Most health disparities, I suspect, don't have much to do with genetics. They have to do with access to health care; they have to do with socioeconomic status and culture and other things. But we have a chance to begin to unravel that. And furthermore, as David Weatherall will talk about at the end of this meeting tomorrow, we have an obligation and an opportunity to apply genomics not only in the developed world but also in other parts of the world that are very much in need of such advances. Let me say just a word about this high-throughput robotic screening, because I think it's a topic which is less familiar to many academic researchers but which in fact could be a paradigm shift if we can learn how to appropriately develop the opportunities. Think about the genome translation toolbox that has now been developed. What's in it? Well, things like arrays to look at transcription, things like knockout mice, things like full-length cDNAs now being produced by the Mammalian Gene Collection program here at NIH, things like siRNAs to knock out expression, things like the databases that we all depend on. And small molecules? Well, what about that?
Of course, this is a central focus of what the pharmaceutical and biotechnology industry are doing as the first step towards drug development: you make an assay, you screen a library of a very large number of molecules, and you identify those that have an interesting effect, be it agonist or antagonist. Why could we not in fact make that same technology more broadly available to academic researchers, in order to facilitate the identification of very useful probes for biology and the identification at an early stage of compounds which might ultimately become therapeutically useful? There are reasons to believe that such an initiative is an idea whose time has come. This new paradigm of scope may be required to assign function to the genome. Access currently to small molecules is limited, certainly in academia, and actually the technology in pharmaceutical industries is also ripe for further advances. There are four convergent developments in the last five years that particularly make this an attractive initiative at this time: the Human Genome Project; advances in combinatorial chemistry that allow you to develop large libraries of molecules that do a good job of representing chemical space; compound brokers from whom you can buy such collections already put together for reasonable prices; and automation technology from the genome project that allows you to do very high-throughput screening at a reasonable cost. And putting all those things together suggests the opportunity of a public sector screening and chemistry initiative. Just quickly to say how this might work: those of you involved in drug development would be familiar with this pipeline, which begins with identification of a target and then development of an assay; screening, high-throughput or otherwise; some limited medicinal chemistry to take that first set of hits and convert them into compounds of reasonable specificity and solubility; and then the really hard work begins if one is going to go on to put this into clinical trials
with lead optimization, toxicology, safety, and then phase 1, 2 and 3, and sometimes more than that, trials. The success rate and the cumulative cost curves for this pipeline look like this. Now what is the public sector's current role in contributing to this? Well, it actually is fairly limited. It is probably primarily, in most academic labs, devoted to the business of identifying targets. But notice this sweet spot that's here in terms of the success rate and the cumulative cost. What we are proposing, among many other things in this vision, is that we contemplate the idea of moving the public sector contribution to this enterprise to the right, to encompassing the ability to do assay development, high-throughput screening and those first steps of medicinal chemistry optimization, in order to generate useful compounds which investigators can use to probe pathways and, if they turn out to be interesting, to hand off through an appropriate licensing agreement to a company that will turn that into a therapeutic agent. I think that is a very exciting paradigm. It's an unfamiliar one to many academic researchers. There are a few labs out there beginning to do this, and it has a feel of something that could really change the way we approach problems in biology and medicine. Finally, before I run out of time, if I haven't already: genomics to society. That third floor is just as critical, maybe even more so, than the rest of the building, because if we don't focus on these issues and give our energies to them, these advances may not have the benefits that we all dream of. We need to continue to push the privacy and discrimination agenda, in the hope that the United States Senate, currently wrestling with two fairly similar bills on genetic discrimination, will figure out a way to craft a compromise and will in fact put that through, and this will be the year where the genome project was completed, the year of the 50th anniversary of DNA, that is also the year that federal legislative protection against genetic
discrimination is achieved in the United States. We need to think about patenting and licensing practices, not just about genes but also about haplotypes and protein structures and a whole host of other entities that lie ahead of us, for which I think unfortunately we're not all that well prepared in terms of how we're going to handle intellectual property. Very importantly, all of this study of variation is going to have profound consequences for our understanding of race and ethnicity, and we are obligated as scientists to learn what we can and then translate that into information that will be usefully included as part of the often contentious dialogue about race in our society. As we learn more about variation and its role in disease risks, undoubtedly we will also learn about its role in behaviors, including some that are quite controversial, and we need to be prepared for how that kind of information is going to be incorporated into our thinking. And I would say that it is also our obligation as scientists to define that there are certain areas that ought to be considered off limits. And I shouldn't have said "as scientists": in society together, with scientific input and societal norms, there may be areas, such as for instance what almost everybody currently considers reproductive cloning to be, that we don't wish to go into. Who decides, and then how does one enforce those boundaries, is a question that we have to wrestle with. So there's our building. Read more about it in this article, if you will. We'd love to hear more feedback from you about it. It is something that I think will be guiding us, although it will require revision, I'm sure, on a regular basis. It is, I think, inspirational to consider this as a real opportunity for the future, and to consider what we might do here, I offer you this quotation, which forms the final words in this particular article, taken not from a scientist but from an architect, an architect who built quite a few buildings here in DC as well as in Chicago and
other places, Daniel Burnham. "Make no little plans," said he. "They have no magic to stir men's blood and probably themselves will not be realized." This might have been the motto of the Alberts panel, I guess, but it could be our motto here today as well. "Make big plans; aim high in hope and work, remembering that a noble, logical diagram once recorded will never die, but long after we are gone will be a living thing, asserting itself with ever-growing insistency." A wonderful goal for the next several decades. Thank you all very much.