Dr. J. Craig Venter, President and Chief Scientific Officer of Celera Genomics Corporation and founder of The Institute for Genomic Research, TIGR, Rockville, Maryland. Educated by his parents, for whom he was a discipline problem, and by three years of beach surfing in Southern California, which was no problem; educated by the Vietnam War, which was absolutely transformative for him; University of California at San Diego, a BA in Biochemistry with Honors, a PhD in Physiology and Pharmacology; educated by teaching at the University of Buffalo and researching at the National Institutes of Health; educated by Washington bureaucracies and venture capitalists, and by his own strategies for the sequencing of genes that have revolutionized the biological sciences in the 1990s. Dr. Venter has emerged this year as Time magazine's scientific icon. Many of you probably saw it: standing there in his white lab coat with his hands on his hips. He has become an American icon. I don't know what else you want to know about this icon, but what I know is that Dr. Venter is brilliant, courageous, very collegial, compassionate, and generous, and it's a privilege to present him to you.
Thank you very much for that very kind introduction. Mr. President, Mr. Ambassador, students and faculty, it's a real pleasure to be here to tell you about what we're doing at the forefront of decoding genomes, including the human genome. I'd like to take you through a sort of personal journey of the last 15 years to set the stage for what we've been doing and what you'll hear from the other distinguished scientists here. So if we go back to 1984, 1985, my lab of 20 scientists had just moved to the National Institutes of Health. I'd been working for 10 years trying to isolate the adrenaline receptor. I've often thought that people study diseases and genes that they have a particular interest in.
I don't know what drew me to the adrenaline receptor, but I've had a fondness for adrenaline throughout my life. After 10 years of work, we finally had enough protein to get a tiny bit of peptide sequence, which allowed us to clone the gene for this receptor from the human brain. In fact, it was, I think, the first neurotransmitter receptor cloned from the human brain. But in that first year at NIH, we completely retrained ourselves in molecular biology. We decided the tools of biochemistry were not sufficient for going where we wanted to go. So we had the absolute privilege of being at NIH, where we could just stop what we were doing and retrain ourselves in a new field, and in doing so, cloned this gene. We had just finished doing this and had published our paper describing the brain adrenaline receptor, and it became clear from the work of others that these receptor proteins were part of a very large gene family. It looked like they could be as much as 5 to 10 percent of the human genome, in fact. And we were looking for ways to characterize this very large family when I read a very exciting paper published in Nature by tomorrow morning's speaker, Lee Hood, describing what seems like a simple but extremely elegant approach that improved DNA sequencing immeasurably. And that was attaching four different fluorescent dyes to the four different letters, the bases, of the genetic code. And that allowed them to be read sequentially. Before that, everybody had to use radiographic methods that were extremely slow and tedious and even somewhat dangerous in terms of radiation exposure, depending on who was doing the experiment. So this was a tremendous breakthrough, and I found the company that he had licensed this technology to, Applied Biosystems, and arranged for my lab at NIH to be the first test site for this new instrument.
And we got the first automated DNA sequencer in February of 1987 and a few months later published the first paper in the scientific literature actually sequencing genes with this technology. That was simultaneously when the first discussions about the genome project began. They began in several places, but one principally was in the Department of Energy, I think in part trying to find uses for the national laboratories for peacetime purposes. But several senior scientists, Professor Dulbecco at the Salk Institute at the time and others, chimed in and thought that this was a key project. There was a lot of debate because the technology clearly didn't exist for it at the time. But I was one of the few non-geneticists who got very excited about this project, in part because I had just spent 10 years trying to isolate one protein and one gene. And if this project had the promise over 15 or 20 years to get all of the human genes, it seemed like a fantastic scientific adventure. Using the first automated sequencer, we actually produced one of the first sets of human genomic DNA sequence. And it's important, in terms of understanding where we are now, to understand what the issues have been and the changes of strategy. Initially it was thought that only small clones, small pieces of DNA, could be sequenced. At the time the largest pieces were clones called cosmids. They were about 35,000 letters. Early in the 1980s Fred Sanger's group published the sequence of the bacteriophage lambda genome, which had 48,000 base pairs. It was the largest single sequence until in 1995 we published the Haemophilus genome. And so the assumption was that you had to do a mapping stage first, where these small clones were ordered. They were lined up by various methods, and only once you had the complete order would you start to sequence them. I think the project started very naively, not in terms of building new technologies and new approaches, but in terms of just what tools were at hand.
Fortunately it evolved quite rapidly. We started by sequencing the tip of chromosome 4, trying to find the gene that's linked to Huntington's disease, and part of chromosome 19, where there was a gene linked to myotonic dystrophy. But what we found is that even when we had these very large stretches, over 100,000 base pairs of genomic sequence, our computers could not interpret this genetic code. The early assumption was that all we needed were personal computers, and we would run the sequence through those and find all the genes, as sometimes works in bacteria. There are tremendous differences between the code of the human genome, where only about 5% of it actually codes for genes, and that of bacteria, where over 90% generally codes for genes. So it was very difficult to find these genes. In fact, one of the key findings we made early on is that we could find them only by going to cDNAs. cDNA clones are derived this way: in our cells we make copies of the actual gene in something that's called messenger RNA. In heart cells you find the messenger RNAs for the genes expressed in the heart, and in brain cells we have just the messenger RNAs for genes in the brain. We decided to use the cells as our supercomputer, because each cell, each of our 100 trillion cells, knows how to find the genes necessary just for its function. This was possible due to another key discovery early in the history of molecular biology, the enzyme reverse transcriptase, which converts messenger RNA back into DNA, and that's called complementary DNA or cDNA. We decided just to randomly pick 1,000 clones from a brain cDNA library. This is a library of genes expressed in the brain. And we just sequenced in from the ends instead of taking time to sequence the entire clones. And it was amazing to us that we made hundreds of new gene discoveries in a very short period of time. Because this technique worked so well, we had to name it, so it was called expressed sequence tags. Expressed because, coming from the mRNA, we knew these were expressed genes.
And they were tags because they were just partial sequences from one or both ends of the clone. This rapidly became a dominant technique, and we published the results of this in Science in 1991. This paper described 337 new genes in the human brain. At the time, and this is just the start of this decade, out of the estimated approximately 100,000 human genes, we knew the structure of fewer than 2,000. So if you stop and think of all the medicine, all the knowledge that we have, it's ignoring almost all of the molecular biology of our own bodies. For the human brain, there were only two dozen genes known. So even though this seems like a very small number to us now, adding 337 in a few weeks was a major advance. A few weeks later we published a paper in Nature describing over 2,000 new genes. And this technique has gone on, and now about 73% of all the genes in the public databases are from the EST method. This is a statistical method, and it can discover a great percentage of the genes in a complex species such as our own, but it can't discover all of them. The other problem goes back to when we were trying to put together these pieces of DNA: one key fact is that the sequencing machines can only read around five to six hundred letters of genetic code. So if you want to sequence something that's a million letters long or three billion letters long, you have to come up with different strategies. The one that was derived out of the NIH program was to try and sequence these small clones, but to sequence a small cosmid clone you had to sequence a thousand fragments and try to put those back together again. And that was sort of the limitation of our computers and the software at the time, to deal with a thousand pieces of DNA. All of a sudden with ESTs we had first thousands, then tens of thousands, hundreds of thousands, and now millions of sequences that we wanted to congeal in the computer to represent the unique set of human genes.
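The read-length arithmetic above can be made concrete with a small sketch. It uses the figures from the talk (reads of roughly 500 to 600 letters, a cosmid of about 35,000 letters, about a thousand fragments per cosmid). The Poisson estimate of how much of the target is covered is an idealized textbook model that assumes reads land uniformly at random; it is an assumption added here for illustration, not something stated in the talk, and the function names are hypothetical.

```python
import math


def coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Average number of times each letter of the target is read."""
    return num_reads * read_length / target_length


def expected_fraction_covered(cov: float) -> float:
    """Idealized Poisson estimate of the fraction of the target hit by
    at least one read, assuming reads start uniformly at random."""
    return 1.0 - math.exp(-cov)


# Figures from the talk: ~1,000 fragments of ~550 letters for a ~35,000-letter cosmid.
cosmid_cov = coverage(num_reads=1000, read_length=550, target_length=35_000)
print(f"cosmid coverage: about {cosmid_cov:.1f}x")
print(f"expected fraction covered: {expected_fraction_covered(cosmid_cov):.6f}")

# How many 550-letter reads would the same depth require for a 3-billion-letter genome?
reads_needed = math.ceil(cosmid_cov * 3_000_000_000 / 550)
print(f"reads needed at the same depth: {reads_needed:,}")
```

The point of the sketch is only the scale: the same redundancy that means a thousand reads for one cosmid means tens of millions of reads for a genome of three billion letters, which is why assembly software limited to about a thousand pieces was a bottleneck.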
So one of the biggest breakthroughs that we came up with in forming TIGR was new mathematical algorithms that allowed us to deal with large numbers of sequences. I left NIH in 1992 and formed the Institute for Genomic Research. The initial acronym was just IGR, which we read as "Igor", and we were concerned about the bad press we were already getting. So my wife and colleague Claire Fraser suggested we add the article in front and make it TIGR, pronounced like Tigger from Winnie the Pooh, thinking that would help, until a French reporter came to interview me and he said, you named it "tiger" to be particularly aggressive against the French, didn't you? And I said, yes, of course. And it's been "tiger" ever since. But what we did at TIGR is we made cDNA libraries from about 300 major organs and tissues and sequenced hundreds of thousands of these EST clones. We spent a year and a half analyzing the data and wrote up a 200-page manuscript and submitted it to Nature. Now, this is only unusual if you understand that Nature never publishes a paper longer than five pages. So the editor was very distraught over this but finally decided to publish the first ever special issue of Nature with this paper in it, and asked us for some art for the cover, and we suggested this drawing that you see here. But the then editor of Nature thought this was far too graphic for the readers of Nature. And I hope at least some of the students of biology have noticed what the problem is, which is that this is actually a hermaphrodite. I at least hope somebody here noticed that. And it turns out the Nature editor was concerned about confusing the physicists who read Nature. So he asked if we could come up with some tamer artwork. We found an Italian artist to help us. Now, I thought this had too many arms and legs, but it seemed to be acceptable. In this paper, in this special issue that was published in 1995, we described ESTs that collapsed into maybe 35,000 human genes.
But of those we could only put names on about 10,000, because most of them were new and didn't match anything that had been seen before in biology. But it was a dramatic change in four years from the 2,000 genes that we had in the database when we started. Well, when this was coming out, we were reflecting on what we could use these dramatic new tools for and how we could go back and come up with new approaches for the human genome and others. And my friend and colleague Ham Smith, who got the Nobel Prize in 1978 for discovering restriction endonucleases, the molecular scissors that cut DNA at very precise, recognized points, suggested that we sequence the genome of E. coli, where we would actually try the experiment: instead of creating a map of the clones up front, we would just take the entire chromosome out of the bacteria, break it apart, sequence the little pieces, and use our mathematical algorithms and our computers to try to solve this jigsaw puzzle. Ham and I wrote a grant and submitted it to NIH proposing that we would sequence not E. coli, because it was already being funded and was in, I think, its eighth year of funding at that time from NIH. The first three years were spent just creating a clone map of E. coli. And we decided to choose Haemophilus influenzae. Ham Smith had spent a lot of his career on it. He isolated the first restriction endonucleases from it. It's a key human pathogen, being one of the major causes of meningitis. And it's one of the few bacteria that children are actually vaccinated against. We proposed that we would use this method, but we were skeptical that we would get funding. Fortunately, we had money in the endowment at TIGR that we decided to use. And we had the project about 90% completed when we got word from the NIH that our project was impossible and they weren't going to fund it. And just a few months later, we published the first genome of a free-living organism in the journal Science.
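The jigsaw-puzzle idea described above, breaking the chromosome into random pieces and letting the computer put them back together from their overlaps, can be illustrated with a toy sketch. This is a naive greedy overlap merger written for illustration only; it is not the algorithm used at TIGR, and the function names, the tiny made-up sequence, and the `min_len` parameter are all assumptions. It ignores sequencing errors, reverse-complement reads, and repeats, all of which a real assembler has to handle.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if that overlap is shorter than `min_len`."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap
    until no pair overlaps by at least `min_len` letters."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping fragments of a made-up 21-letter sequence, in shuffled order.
fragments = ["GTACGTTAG", "TTAGCCTAGGA", "ATGGCGTAC"]
print(greedy_assemble(fragments))
```

The all-pairs comparison here grows quadratically with the number of fragments, which hints at why handling tens of thousands of reads, rather than a thousand, called for new algorithms.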
Even though this was not our intention when we started, as it was just a basic science experiment to see if these techniques would work, seeing the first complete genome really transformed my own thinking, in terms of knowing that we had to have the complete genetic code of virtually every species that we were working with, because it made such a tremendous difference in what we could do with this information. The circles you see here reflect that many bacteria have a circular chromosome, and the different colored bands that you see going around the edge represent the different genes, color-coded by function, in the genome. We also have linear maps that allow this to be interpreted more easily. I could talk for the next several hours just on all the findings that came out of this single genome, but I think the one that affected me the most was this: in front of all the genes associated with creating the cell surface antigens, for example enzymes associated with lipopolysaccharide biosynthesis, iron transporters, anything that was expressed on the cell surface, there was a unique piece of genetic code that was a repeat of four letters of the genetic code. Richard Moxon at Oxford had found this in front of one gene and proposed that this was a novel mechanism for antigenic variation: every time DNA polymerase passed over this, one in 10,000 times there would be an error which shifted the genetic sequence, so that you actually got a different protein, or a stop codon was put in and you got no protein expressed. What we found, as I said, is that these were in front of almost every key cell surface gene and enzyme controlling cell surface molecules. And what became clear is that what Darwin proposed in evolution, that it was just a random error system, just random chance for variation, is not quite correct.
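The reading-frame arithmetic behind this switch can be sketched in a few lines. Because the repeat unit is four letters long and codons are three letters long, gaining or losing one copy shifts everything downstream by one position in the reading frame, so only every third copy number leaves the gene readable. In the sketch below the start codon, the repeat unit `CAAT`, and the downstream sequence are invented purely for illustration (the talk says only that the repeat is four letters), and the helper names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

START = "ATG"
REPEAT_UNIT = "CAAT"         # illustrative four-letter unit, an assumption
DOWNSTREAM = "GCTGACGTAACC"  # made-up coding sequence with no in-frame stop


def codons(seq: str) -> list[str]:
    """Split a sequence into consecutive three-letter codons, dropping leftovers."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def build_gene(copies: int) -> str:
    """Start codon, then `copies` repeats, then the downstream coding sequence."""
    return START + REPEAT_UNIT * copies + DOWNSTREAM


def has_premature_stop(copies: int) -> bool:
    """True if reading from the start codon runs into a stop codon."""
    return any(c in STOP_CODONS for c in codons(build_gene(copies)))


for n in range(2, 7):
    state = "OFF (stop codon)" if has_premature_stop(n) else "ON (open frame)"
    print(f"{n} repeat copies, frame offset {(len(REPEAT_UNIT) * n) % 3}: {state}")
```

In this toy gene, three and six copies read through cleanly, while two, four, and five copies hit a stop codon, which is the on/off behavior the slipped-strand mechanism would produce.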
And now that we've done dozens of key human pathogens, we've found these mechanisms, of one type or another, in every single genome, where basically it's pre-programmed into the genetic code to have variation. So genomes are not static. That's the reason virtually everybody in this room has Haemophilus influenzae in their airways. It's evolving in real time, fooling our immune systems, changing its cell surface antigens, and surviving through real-time evolution by these pre-programmed mechanisms. When we were looking at this genome, we found there were roughly 1,800 genes, and Ham and I were having a little celebration when the Science paper came out, and it was sort of a time for true confessions. And I said, you know, Ham, I have to confess, I'm really glad you understood all this, because I didn't. He said, me? I thought you understood it all. And what we decided was that a genome with 1,800 genes was beyond either of our mental capacities to understand how they all work together to form a living cell. And so there were several choices open to us. We could do what happened with the yeast genome once it was finished a few years later, where yeast researchers around the world are doing knockouts, where they take out one gene at a time to try to see what that does to the function. We decided to look for a simpler biological answer, and we found an organism being characterized by Clyde Hutchison in North Carolina called Mycoplasma genitalium. This is a picture of it here. It attaches to human cells by a small foot. Clyde had indicated that this genome was probably the smallest seen for any free-living organism, and we decided it would be a great one to sequence. I don't know if this shows up. These are two different views of Mycoplasma genitalium, one an electron micrograph and one a Gary Larson cartoon. It says, hey, I've got news for you, sweetheart. I am the lowest life form on earth. We decided this was a great opportunity to study the basis of life.
Claire Fraser led the team that sequenced this genome. It took only three months to decode it, and this is the genetic map. There are 470 genes in this chromosome. And so upon finishing this, we immediately asked, well, how can it live with 470 genes when Haemophilus needs 1,800 and we need about 100,000? Is 470 really the smallest number that's needed for life? It turns out there are a lot of things that happen in evolution. There was a second mycoplasma that was sequenced a few months later, Mycoplasma pneumoniae, associated with walking pneumonia. That is slightly larger. It has about 680 genes, but one interesting finding was that all the genes that were in Mycoplasma genitalium had counterparts in Mycoplasma pneumoniae. The pneumoniae genome was just roughly 200 genes bigger than genitalium. So sometimes evolution works, particularly with pathogens, by these organisms throwing out DNA as they form a symbiotic relation. Both these organisms are human pathogens, and basically everybody in this room has both these mycoplasmas associated with their physiology. And so they derived from a much larger gram-positive organism, with as many as 5,000 genes, by throwing out genetic material as they developed in concert with our bodies, which provide the key nutrients that they need. This was a perfect test bed for evolution, and we started a project on transposon insertion. Transposons are like small viruses that can insert randomly in the genetic code. But because we had the complete genetic sequence, we could work out exactly where each of these insertions took place. So we could create a map of the genome. And the assumption was that the 200 extra genes in pneumoniae should all be able to be knocked out without killing the organism. It turned out that was the case. We sequenced about a thousand junctions from each of these species, and we found that we could disrupt a significant number of genes in each of these.
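The mapping step described above, placing each sequenced transposon junction on the known genome and asking which genes tolerate an insertion, can be sketched as an interval lookup. The gene names, coordinates, and insertion positions below are invented for illustration, and the function name is hypothetical; the sketch also assumes genes do not overlap. Note the asymmetry in what it can show: a gene with an insertion in a viable cell is a candidate for being dispensable, while a gene with no insertions is only a candidate for being essential, since it may simply not have been hit yet.

```python
from bisect import bisect_right


def disrupted_genes(genes: list[tuple[str, int, int]], insertions: list[int]) -> set[str]:
    """Return names of genes containing at least one insertion site.

    `genes` holds (name, start, end) tuples with inclusive coordinates,
    assumed non-overlapping and sorted by start position.
    """
    starts = [start for _, start, _ in genes]
    hit = set()
    for pos in insertions:
        i = bisect_right(starts, pos) - 1  # last gene starting at or before pos
        if i >= 0 and genes[i][1] <= pos <= genes[i][2]:
            hit.add(genes[i][0])
    return hit


# Hypothetical annotation and insertion sites recovered from viable cells.
genes = [("geneA", 1, 300), ("geneB", 400, 900), ("geneC", 1000, 1500)]
insertions = [150, 350, 1200]  # 350 falls between genes

hit = disrupted_genes(genes, insertions)
candidates_essential = [name for name, _, _ in genes if name not in hit]
print("tolerate disruption:", sorted(hit))
print("never disrupted (candidate essential):", candidates_essential)
```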
To make a very long story short, we went down to around 300 genes that we decided were probably essential for life, because we couldn't knock those out and keep a viable species in the laboratory. Looking at the functions of these different genes was really stunning to us. Genes associated with energy metabolism, translation, replication, transcription, things that are part of the basic cellular machinery: most of these genes were essential and we couldn't knock them out. Two groups were different. In the unknown class, which represented about a quarter of the genes in this genome, we could knock out roughly half of them, but there were 103 genes that we have no idea what they do in biology that proved to be essential for life. The other category was genes associated with the cell envelope, things associated with antigenic variation that should be variable. I think this was a very stunning finding to us: out of the 300 genes for creating the smallest living organism that we can conceive of at the present time, for one-third of those genes we have no idea whatsoever what the biological function is. The philosophy we developed is that the exploration of genomics became a very humbling experience, proving how little we actually know about the biology of any species, let alone the most simple and minimal ones. Well, the third genome we decoded came from an unusual source. This is the research submersible Alvin, out of Woods Hole, which went to this black smoker off the coast of Mexico, a mile and a half deep in the ocean. The temperature in the center of this plume is about 400 degrees centigrade. The surrounding water is about 2 degrees centigrade. The Alvin broke off a piece of this chimney right here and took the samples back to Woods Hole, and this bug was isolated out of that chimney wall. It was named Methanococcus jannaschii after Holger Jannasch, the expedition leader.
And this is a true autotroph that has some very unusual properties, at least unusual relative to our human-centric view of life. At our body temperatures, this organism is frozen solid. It comes to life at about 60 degrees centigrade. It's happiest at 85 degrees centigrade, but it can absolutely survive and be happy in boiling water. The other thing this organism does is take carbon dioxide and hydrogen and make everything it needs for life. It's called an autotroph. It doesn't need sugar. It doesn't need to bring in amino acids. It makes everything just from carbon dioxide and hydrogen. Well, a few years ago this sounded like something that would have come from outer space, not something that might be ubiquitous on this planet. And when we decoded this genome and published the results in Science in 1996, the biggest discovery was that over 50% of the genes in this species didn't match anything we'd seen before. And there were whole sets of genes in a row, called operons, which we assume code for whole new physiologies that we don't have the slightest clue about, embedded in these chromosomes. This changed our view of the world. Mycoplasma and Haemophilus had a lot of genes in common, and we had begun to think that the Earth's gene pool might be a lot smaller than we had expected. This very dramatically changed our views, and it's only grown exponentially since that time. The initial plan from NIH and DOE was to determine the genetic code of only one bacterium, the laboratory tool E. coli. It was thought that all of the microbial world looked alike and we didn't need to do any more species. Well, since we published the first two in 1995, we're clearly in an exponential growth phase in terms of the number of different bacterial species and microorganisms that are now being decoded.
And just a few months ago, the American Society for Microbiology had a meeting on microbial genomics and decided that we now need to do at least 500 microbial genomes just to get a hint of what's out there, and that each one of these genomes is providing such tremendous new information that it's stunning everybody. I'll give you a few more examples. Deinococcus radiodurans was isolated in the 1950s during attempts to irradiate meat for sterilization. Regardless of the dose of radiation used, a red-pigmented bacterium kept growing out of it. It turns out this organism can take phenomenal doses of radiation, up to 3 million rads. I was showing this to the Department of Defense, and I said this was their data on humans and this was E. coli, and a lot of people looked very embarrassed. In fact, both were different bacteria. So I think somebody did that experiment once. But these are phenomenal doses of radiation, and what happens is the chromosome gets blown apart. Here are two glass beakers. This is after about half a million rads of radiation. The beaker started to burn and crack and melt, but the bacteria in the bottom just kept happily replicating and saying, you know, could you turn up the heat a little bit, please? The chromosome gets blown apart into hundreds of little fragments, and miraculously, over 12 to 24 hours, it stitches these fragments back together to re-form its chromosome and starts replicating again. I think this is one of the most amazing processes we've ever seen, and you can ask, well, why would an organism evolve, particularly before atomic energy, to survive millions of rads of radiation? But the other key fact about this organism is its complete resistance to desiccation. It's been found ubiquitously on the planet. It's been found on granite surfaces in Antarctica, completely dried out.
And we think the process of desiccation and absorbing cumulative doses of ionizing radiation over a long period of time are probably linked biologically. Francis Crick and others have proposed panspermia as the origin of life on Earth: life came in from some other planet or some other part of the universe and established itself here. This organism would be a great candidate for panspermia because it can absorb huge doses of radiation. It could survive a space environment. It would reach an aqueous source, stitch its genome back together, and start replicating again. But don't get too surprised when you hear Dan Goldin announce that they've discovered panspermia in outer space, because every time they flush the commode on the space station or the shuttle, millions, if not billions, of copies get launched into outer space. And it's definitely out there floating around, and it will come back at some stage. I think one of the most exciting projects coming up is the attempt to intercept a comet tail to see if there are any microbes in the frozen ice. This genome has just been decoded, and it actually has three chromosomes and a plasmid, which was a very different structure than the scientists studying it had assumed. And the different chromosomes contain genes associated with different functions, leading us to believe that this evolved through some unusual steps in evolution. Well, earlier this year we published a paper describing the genome of Thermotoga maritima. This is Carl Woese's evolutionary tree showing the assumed three branches of life, which were in fact confirmed by our sequencing of Methanococcus, at least at some level. The Thermotogas are among the most deeply rooted bacteria, and Thermotoga is thought to represent a truly ancient bacterium. So we decided it would be a great genome to decode. What we found on decoding this genome was that a lot of the genes in fact looked like they came from the archaea through a process called lateral gene transfer.
I don't know if this shows up or not, but there were whole cassettes of genes that you can see with the gene order and the sequence highly conserved between archaeal species and bacterial species. And we think this is a very key part of evolution. Evolution is not just mother-to-daughter transmission. Organisms are constantly exchanging DNA through unusual mechanisms, with viruses and with other mechanisms. And I think this is the first very clear-cut evidence of this. And so I don't think we're going to end up with an evolutionary tree. I think it's going to become something like a neural network, with cross branches everywhere, making it quite difficult to deconvolute, but having all the genetic codes, I think, will help a lot. In terms of pathogens, we've now decoded a number of them. I'm going to show you a few just to show you the kinds of information that are coming out of these. Tuberculosis is one of the leading causes of death of adults in the world. There are over 3 million deaths annually, but there's a large pool of close to 2 billion individuals with latent TB, up to one-third of the population, and TB seems to be coming back. A couple of years ago, the CDC tracked a very unusual case of tuberculosis in Tennessee at a clothing factory, where the index case was a 21-year-old male. It's not clear where he contracted TB, but in a very short period of time, all his family members, 75% of his co-workers, and 80% of his social contacts became skin-test positive, and many of them developed tuberculosis. This is one of the most virulent strains of TB, in terms of transmissibility, ever seen. Fortunately, it was fully drug-sensitive, but treating tuberculosis is not a simple thing to do, and had this 21-year-old clothing factory worker been working at a major department store in New York, we might have a new TB pandemic underway right now.
So we decided this was a great strain of tuberculosis to decode, particularly because the Pasteur Institute and the Sanger Centre in England were decoding the genome of a laboratory strain of TB. What we found really surprised us: it turned out this so-called Oshkosh strain actually had many more genes than the normally understood tuberculosis. In fact, we found some mechanisms. Tuberculosis is characterized by the number of these insertion elements that insert into the genome, and here there are two quite close together that look as if, when they inserted, they caused three of the genes that are in the Oshkosh strain to be spliced out. But there's a finite number of these, and so now we can go to the computer, and we have a number of very clear-cut testable hypotheses as to whether any of these genes are associated with the tremendous increase in transmissibility of this new TB. But the other thing that became clear to us, because it has more genes, is that this is probably not a new strain of tuberculosis but an ancient strain that's re-emerging. This is the deer tick that carries the Lyme disease spirochete. This is the first spirochete that was decoded. And again, the type of hypothesis that we can generate out of having the complete genome decoded and having the metabolic pathways is that we can come up with at least potential new therapies. One of the unusual things is that when this spirochete is grown in culture, it requires very high concentrations of a chemical called N-acetylglucosamine. And what we had found on the surface of the cell is a transporter that transports dimers of N-acetylglucosamine into the cell. And we were trying to understand the biological relevance of this until we realized that the tick host is basically a walking grocery store for N-acetylglucosamine. Chitin, the entire exoskeleton, is a polymer of N-acetylglucosamine.
So scientists at Yale and other places are now seeing if they can just block this transporter, to see if it would disrupt the entire life cycle of this key pathogen. And so we're anxiously awaiting those results. Even if it turns out that it's not effective as a therapeutic, it shows the power of starting with the computer and going forward. Malaria is not a bacterium. It's a eukaryote, with a genome similar to our own. And this slide shows the spread of drug-resistant malaria over the last several decades, to the point now where, traveling in some parts of the world, you're really taking your life into your own hands because of potential death from malaria. The U.S. military realized that, and realized that in the first few months, if not the first few weeks, of sending troops to some of these areas, it could result in as many as one-third of their troops becoming casualties from mosquito bites because of drug-resistant malaria. The malaria genome was thought to be unsequenceable because it has a very unusual abundance of just the letters A and T. In fact, the malaria community had trouble cloning genes, and the notion was that it was going to be undecipherable. We decided to do an experiment by isolating chromosome 2 out of a gel and doing whole-genome shotgun on it to see if it was, in fact, sequenceable. But because there were no genetic maps, we had to use some unusual tools. Several groups have developed techniques that can get single molecules of DNA onto glass slides; shown here is a single molecule of chromosome 2 from malaria. What was found with these single chromosomes on these glass slides is that you could treat them with restriction enzymes and the enzymes would still cut.
So what's seen here is that these gaps are where the enzyme cut. Because there's surface tension on the glass, the ends would pull away, and so we could, amazingly, do restriction digest maps of single molecules of chromosomes. We used this information to verify the final structure that we ended up with, and published this last year in Science. This is probably impossible to see, but this is a lower eukaryote. A lot of the genes are broken up with introns, about 40% of them, and you'll see as we look at higher and higher species that we see more of these introns, which allows more and more genetic variation to take place.
At TIGR the first plant chromosome has now been completed. This is from the international project to sequence the Arabidopsis genome. There are five chromosomes. Based on genetic maps, it was thought that chromosome 2 was the smallest; when we actually sequenced it, it ended up being a lot larger than was anticipated. Now a European group has just about finished chromosome 4, and these two chromosomes will probably be published together later this year. These chromosomes resemble our own chromosomes in terms of having centromeres, and telomeres at both ends, and for the first time we're able to scan along and look at the DNA content and the gene density. Here in the centromere the gene density goes way down, but it's pretty uniform along the rest of the chromosome. These random insertion elements go way up in the centromeric region, but amazingly we still found key genes in the centromeric region, even though they were at far lower density. For example, an RNA helicase was a key gene that was found early in this region. One of the surprises we found in sequencing this chromosome was that there was a complete copy of the mitochondrial genome inserted into the Arabidopsis chromosome. We thought at first this was an artifact, but we could sequence off both ends of this mitochondrial insertion and prove that it was actually inserted into the chromosome.
What this graph shows is that if it was an exact match, you would see just one solid line of identity. What happened is that the chromosome underwent rearrangements. Again, we think of chromosomes as being static entities, but it turns out the mitochondrial sequence rapidly rearranged. We don't know if this is a functional copy of the mitochondrial genome that's actually in the chromosome, but in every genome we look at, every chromosome, we find some absolutely amazing findings.
By this point we were thinking that we were able to deal with a wide range of genomes, from a very high content of G's and C's, which people thought would be non-sequenceable, to a very low content of G's and C's, which people thought would be non-sequenceable. All of these worked extremely well with the whole-genome shotgun method, and we were looking to expand this to larger species, but we were waiting for the technology breakthrough. Also, this slide is a summary of all the gene finding in all the different genomes sequenced, and this is a very humbling number. I don't know if people can read it; it says 47%. On average, 47% of the genes found in each of these genomes are totally new to science. We've never seen them before and we don't have any idea what they do. I think it's going to be a tremendous process going forward.
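The "line of identity" graph described above is what is usually called a dot plot. As a rough illustration only, here is a minimal Python sketch of the idea. It is not the software used for the Arabidopsis analysis; the sequences are made up, the word length `k=4` is an arbitrary choice, and the function names `dot_plot` and `render` are hypothetical. It also ignores matches on the reverse strand, which a real comparison would have to consider.

```python
from collections import defaultdict


def dot_plot(seq_a, seq_b, k=4):
    """Return (i, j) pairs where the k-letter word starting at seq_a[i]
    is identical to the k-letter word starting at seq_b[j]."""
    # Index every k-letter word in seq_b by its start position.
    positions = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        positions[seq_b[j:j + k]].append(j)
    # Look up each k-letter word of seq_a in that index.
    points = []
    for i in range(len(seq_a) - k + 1):
        for j in positions.get(seq_a[i:i + k], []):
            points.append((i, j))
    return points


def render(points, n_rows, n_cols):
    """Draw the matches as a text grid: rows are seq_a, columns are seq_b."""
    grid = [["."] * n_cols for _ in range(n_rows)]
    for i, j in points:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)


# Made-up 12-letter sequence with no repeated 4-letter words (illustration only).
reference = "ACGTTGCAAGCT"
# Hypothetical rearranged copy: the two halves of the reference swapped.
rearranged = reference[6:] + reference[:6]

exact = dot_plot(reference, reference)
moved = dot_plot(reference, rearranged)

print(render(exact, 9, 9))  # an exact match: one unbroken diagonal
print()
print(render(moved, 9, 9))  # a rearranged copy: shorter, offset diagonals
```

With an exact copy every match falls on the main diagonal. With the rearranged copy the matches break into separate diagonal runs away from it, which is the kind of pattern the speaker reads as evidence of rearrangement.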
Mike Hunkapiller at Applied Biosystems (Mike came from Lee Hood's lab to form Applied Biosystems, and developed three more generations before this latest machine) called me about a year ago and said he had a fantastic breakthrough in the DNA sequencing machines, and would I come out and look at it, because in addition they were thinking of putting up 300 or so million dollars to sequence the human genome. You don't get asked twice for those things, so I decided to go out and look, and I was actually amazed with the technology that they had developed. All the other machines needed tremendous manual intervention. This was the first truly automated sequencing machine that can run 24 hours a day. Basically, instead of a gel it has these very fine capillaries. They're only 50 microns in internal diameter, but a liquid gel gets pumped into these, and the gel helps separate the DNA one molecule at a time in a size ladder. The DNA, with the fluorescent dyes that Lee Hood initially developed, flows off the end of the capillaries into solution, where a laser beam shines through that solution, activating all the dyes simultaneously, and the genetic code is read into the computer.
We liked this enough that we got a number of these. These little boxes cost $300,000 each, and we ordered 300 of them and have set up the largest sequencing factory in the world, by quite a bit. But the amazing thing, due to all the automation, is that only about 50 people are required to do all this work, in contrast to a year ago, when over a thousand people were needed. It requires a very large amount of air conditioning and a large amount of electricity a day. But I think one of the things you'll see in a minute on a video is that the most important thing we've built is the world's second largest supercomputer facility, which is what we need to interpret this genetic code. It's such a huge calculation, it requires so much compute power, that even with a tremendous number of the new Compaq Alpha processors, I think we're still going to
be computationally limited, even though we can do over 250 billion of these complex comparisons per hour, per chip.
If I could have the video, please. This video is just going to show a few of the key players and some of the facilities. Let's go to the videotape. This worked so well this morning when we tried it. So, what you see behind me: this is Ham Smith, along with his assistant Cindy. They are the first step in this critical process. Ham's a remarkable scientist. He's never had a technician in his entire career; he's always done his own work with his own hands, and that's why he's the best in the world, especially with the large samples of DNA. Making these critical libraries is an essential part of what we do. These are robotic devices that pick 10,000 clones an hour, with only three people that do all the work. Each of these little bacteria contains one piece of human DNA that we then prepare in the laboratory for sequencing. We have a small number of people and we have these robotic workstations, I think approximately 150 of them, and each one has these pipette heads that deal with 384 samples at a time. There's machine reading, there's barcode reading, there are cameras on them that track every step. This is just one day's worth of 384-well plates. It costs about $250,000 a day to run this facility, and it's a very large number of samples processed by a very small and dedicated team. Here you can see the room full of these robots; they do every step. These are the PCR machines, based on the work of Kary Mullis. We can copy the DNA, and everything is part of the amplification process. This is the robot arm on the sequencer, where it automatically loads the samples into the gel. The samples are under foil, and it pierces the foil just to load the ones it needs at a time, so we don't get oxidation. So we have, you'll see in a minute, a giant room full of these machines, but only a handful of people that go through and do quality checking and quality control work. Here's Ham Smith, who wanders through
the lab all day long trying to make sure people do a good job with the libraries that he's made. Everything's digitized, everything's computerized. This doesn't show up well, but that's the genetic code coming off these machines. Each one of these boxes you see costs around $300,000. Each one has a laser in it, and so the air conditioning in this room would probably be enough to cool any major building complex anywhere in the world. What you see is a lot of machines all working, and you only occasionally see people that tend to them, add solutions, check the data, and check the samples on them. This is the main sequencing lab. We have four labs; this one holds 200 of the DNA sequencers and generates about 4 terabytes of data every 24 hours. The biggest challenge is dealing with the data flow out of this operation. We have switched optical fiber on each floor, and the data from these machines is fed over to our computer facility. This is an optical fiber network out to our computers. This is the new ES40 computer; each one of these has four of the new EV6 Alpha chips. The entire Library of Congress can be held on just four of those disk boxes. So I think you can see it's a very unusual facility. It's an unusual biological facility in terms of doing high throughput.
What are we doing with this? Over the first few years we plan to finish the genome of the fruit fly Drosophila, the human genome, and the mouse genome. We are also potentially sequencing the rice genome, and will be incorporating the Arabidopsis genome. Why these five genomes? Drosophila is the first insect, and insects not only have a huge impact on disease but also cause billions of dollars of crop damage a year. Rice and Arabidopsis are the two model organisms that cover as many as 250,000 different plants, and we'll be able to layer additional genomes on top of these two base genomes. Any additional insects we'll be able to layer on top of the Drosophila genome. The mouse is absolutely essential for understanding the human genome, as many key scientists have
shown. In fact, this shows the color-coded human chromosomes overlaid with the mouse chromosomes. Even though they don't correspond one for one, because of recombination, basically you can cover the entire mouse genome by layering the bits of its chromosomes on top of human, and that's what we'll do with this process. We can go from biology in the fruit fly: take the Pax-6 gene. If you have mutations in that gene in Drosophila, it leads to an eyeless phenotype. Mutations in the same gene in mouse lead to blind mice, maybe three of them. Mutations in the Pax-6 gene in humans lead to a disease called aniridia, where these children are actually born without an iris, so they can't regulate the amount of light going into their eyes, and they usually end up going blind at a very early stage. In every case, can we rely on data from the fruit fly or the mouse? No, but I think in most cases we can.
The Drosophila genome, for which we announced we had completed the sequencing phase a little over a month ago, is one of the most important models in all of biology. The history of Drosophila genetics is basically the history of human genetics. The first genetic maps were generated in Drosophila at the start of this century, and those techniques have been copied into human. Now, when we sequenced the Haemophilus genome, which was 1.8 million base pairs, we had to sequence about 26,000 clones. It took us about four months at TIGR to do that, with about 24 staff, and we had one person, Granger Sutton, who developed the key algorithms for assembling that data. With what we just finished in Drosophila, the genome is 77 times bigger. We had to sequence over 3 million clones. It only took four months with a staff of about 40 people, but the number of algorithm specialists has expanded substantially. Granger Sutton came over from TIGR, and Gene Myers came in from the University of Arizona to head up our algorithm development team. We've now compared the assembly of Drosophila against the known map of Drosophila, and out of 1700 STS
markers we found only 12 discrepancies. The first thing Gerry Rubin, our collaborator at Berkeley, said when he saw this was that he didn't realize his map was that good. More importantly, we compared the sequence that we generated to about 22 million base pairs that had already been sequenced, and out of all the matches there was only one discrepancy. Rubin went back to his lab and they checked the clone and found that they had an error in that clone. It was actually a chimeric clone that had rearranged, and now we're down to zero discrepancies between our mathematical assembly and what had been done a clone at a time. Drosophila, as I mentioned earlier, is a key model for human disease genes. Of the human disease genes on the NIH website, the biggest number of matches is to Drosophila: 73 of them have counterparts in Drosophila, 52 in C. elegans, and 25 in yeast. So I think Drosophila is going to have a huge importance.
Now, the shotgun sequencing model. What most people don't realize is that you get most of the data very early in the process. After 3x coverage we essentially have the entire genome; 3x means we've sequenced the genome an average of three times. The key basis of how the mathematics works is remarkably simple in concept; in execution it's a little bit more complicated. We decided to sequence two different sizes of DNA clones, those that are 2,000 letters long and those that are 10,000 letters long, and one of the key features is that we sequence 600 letters at both ends. In addition, we have both ends from BAC clones that are 150,000 letters long from the Berkeley project, and in human we have those from the TIGR effort and the effort in Lee Hood's lab doing BAC-end sequencing, a method that Lee Hood described with me. A key assumption here, and I credit Gene Myers with this, because you've probably read that everybody said the repeats in the human and other genomes would totally foil this process: he realized that if you just ignore the problems, we can fundamentally put 99.7% of the genome together without even
worrying about the repeats. So the philosophy was to concentrate first on what we knew and could easily solve, and when that number is 99.7%, that's pretty good. The key part is these mated pairs from all the clones. In addition, for joining scaffolds we require two different sets of these mates, which means that there's one chance in 10 to the 15th that we would make an error if we require this type of scaffolding. An assembly progression actually goes together very quickly. I don't know how well it shows up here, but we're at 10x coverage; before we do any other work, right out of the computer, essentially the entire chromosome is ordered.
Now, with the human genome we're going up a little bit in size, to about 3.5 billion letters. We have to do 64 million sequences, in contrast to 3 million for Drosophila, and we think it's going to take us on the order of 12 to 18 months. We've added a few more people to the algorithm group. But you've heard now about the changes in the public program, which we consider very complementary to what we're doing, and you've heard this term "draft sequence." Well, draft sequence equals gene discovery, and a very important part of shotgun sequencing is that between 0 and 1x coverage we get essentially 100% of the genes that are multiple-exon genes. We're going to make the two processes go together. Our plan is very aggressive. We expect to have 1x coverage of the human genome by December of this year, and we're close to that goal already. That'll give us 67% of the genome and about 90% of the genes. By early spring, in March, we'll have 90% of the genome covered, and by April we'll have over 97% of it covered. The public effort is working on the clone-by-clone approach, where they hope to have partial coverage of a large number of BAC clones covering the genome, and what we found is that we can put these two together and, we think, move up the timeline for having the human genome sequence completed to some time next year. This is what draft sequence looks
like at the initial stage: you have a number of different-size pieces, all less than 20,000 letters, where the goal is to get it all put together in 150,000 letters. If we add to that just 1x coverage of our data, it shifts the spectrum very dramatically in the size of the assemblies. At 2x, almost all the BACs are ordered completely. Laying the data we'll have by April on top of the public data, the sequence is ordered, with only occasional sequence gaps for some repeats. This is what the two ends give you in the computer: they actually give you all these scaffolds that span the repeat areas, and so it's not a problem. It's hard to imagine how you could have 60 million of these links, but I think it gives you some sort of intuitive feeling for how this will go together. In the year 2000 we will have a complete sequence of the human genome, minus the regions that are being ignored by all the projects, like the telomeres, the centromeres, and the ribosomal operon regions.
Another key advantage comes from the whole-genome shotgun sequencing. We're taking the complete DNA out of sperm and white blood cells from five different individuals. Consider just sequencing one individual: because you get one set of chromosomes from your mother and one from your father, and those sets of chromosomes differ from each other in about 3 million letters of genetic code, roughly one in a thousand letters is different. So just sequencing one individual, we would end up with about 3 million polymorphisms, or single nucleotide polymorphisms, SNPs, if you've heard them referred to that way. In addition, comparing just one individual to the public data would give us more than 6 million SNPs in our database, and if we do all five individuals we'll have on the order of 20 million variations. I think one of the most important things to come out of sequencing the human genome is going to be understanding individual genetic variation. We're hopeful that it's going to be the key
basis for the future of medicine, leading to individualized medicine, leading to the empowerment of individuals with knowledge of your own genetic code, allowing you to deal with preventative medicine paradigms. We think the importance going forward is understanding the logic of the genome. We have 100 trillion different cells. Each of those cells has the same genetic code, but each of those cells expresses different genes dynamically in real time. So comparative genomics is what we learned from the microbial genomes: the only way we're going to understand the human genome is to have a large number of other genomes to compare to it, to help understand it.
We've had a very narrow view of biology. We've been dealing with this dogma that there was one gene, one protein, one function, one disease, and that just doesn't hold up anymore. This was like looking under the lamp post for your lost keys. There was a major news announcement in 1989, when teams headed by Lap-Chee Tsui in Toronto and Francis Collins in Michigan found the so-called cystic fibrosis gene, and since that time there have been hundreds and hundreds of studies published characterizing spelling differences in that chloride ion channel, linking it to cystic fibrosis. But last year in the New England Journal of Medicine there was a series of articles published showing that the same spelling changes in the same gene can lead to multiple medical outcomes. You can have cystic fibrosis and all these different diseases, or you can have just chronic lung disease, just male infertility, just asthma, just chronic pancreatitis, just chronic liver disease, or, most disturbing of all to most people, a large number of these people had no apparent illness whatsoever. If we're asking for too simplistic an answer, we're going to get fooled like this. With the dynamic changes that take place as we go from a single cell to 100 trillion cells, with all these genes interacting, we're all going to be different in key and subtle ways. This is going to
be important in terms of whether you're one of the 60% of the population that most drugs help, versus the 40% that they don't help or that they're actually even toxic to. With cancer chemotherapy, only 30% of the patients respond to any therapeutic regimen. If we can predict in advance who will respond and who won't, I think it'll result in a major change in medicine. One example: most people think if you take a baby aspirin a day it'll save you from having the side effects from a heart attack or a stroke. It turns out that's only true for one out of three individuals, but because it affects so many people, everybody's told to do it, particularly by the aspirin manufacturers. It may be a trivial example, but understanding your own genetic code and your genetic differences will help you to understand whether you're that one out of three, and the same goes for the literally millions of other examples that will come up. I think the impact of having these sequences is going to be a new starting point in biology and medicine. It's not an end point on its own; we view it as the starting point for what we hope to do. Every technique that's ever existed in biology will come back to be absolutely essential. And you're going to hear, obviously, in many of the following talks, particularly from Dean Hamer, about genes and behavior. We're going to start to understand the genetic basis of traits, personality, intelligence. We'll be able to really get down to the issues of nature versus nurture, and I think it's going to be a very exciting ride.
There are a few caveats to throw in there that I'm going to end with. I don't think it's far away before we're going to have this United States Department of Genetic Identity, or some other version of this. In the future, it's not far away that before a baby leaves the hospital, their complete genetic code will be determined and given to the parents on some sort of computer chip. In Sweden, every child born since 1950 has a blood sample stored on a little card. In Sweden it'd be possible to go back and genotype
the entire population of Sweden, everybody born there, in understanding the genetic outcomes. This is a physicist that actually works at TIGR, who put together a much nicer genetic profile for himself and ended up on this slide. I altered it somewhat, but gave him the chance to remove his picture. He was so delighted that you'd be looking at his picture that he decided to leave it, which makes me think some of my adjustments might be very appropriate. But I think it's very important to note that he has a 0% chance of getting ovarian cancer, so I think genetics is a wonderful thing.
I think the concern, as we get into trying to understand behavior, is the difference between statistical inferences and actual causal effects. The literature is becoming full of sequence variations in the genetic code linked to different behaviors, and you're going to hear about some excellent studies from Dean, but I think we have to be very cautious about these. In the 1930s, studies done at Cold Spring Harbor led to the eugenics movement, which led to some of the atrocities that happened in Nazi Germany. It's very easy to look back on science and see the foolishness. It's very difficult to look forward and see it. We have to be really diligent, both as scientists, to ensure that only top-quality science gets through, and as a society. The efforts you're making here to educate yourselves are actually essential. If you don't become educated in these issues, you're going to abdicate all the decisions to politicians and to scientists. Art Caplan just did a very interesting experiment that I think is summarized nicely on this slide. He asked the Pennsylvania legislature where their genome was. A third of them thought it was in their brains, a third of them thought it was in their gonads, and a third didn't know. And this was a group about to pass laws on banning human cloning, when they didn't understand the most fundamental issues that they were dealing with. And this is a very disturbing quote from
an Arizona state senator; it's an abortion of a Shakespeare quote: "People used to think their destiny was in their stars. Now we know it's in our genes, it's in our DNA." But I don't think you'll find any scientist on this program that believes in absolute genetic predeterminism. It doesn't work for Mycoplasma genitalium, with 300 genes, which is a single cell. It certainly is not going to work for us, with 100,000 genes and 100 trillion different cells. So I think we need to be extremely cautious as we go forward. Thank you very much for your attention.
Dr. Keller, he's over here.
Dr. Venter, that was a wonderful, wonderful talk, and testimony to the great advantages of going private. Could you talk about some of the problems that we as a society ought to worry about, or to be thinking about, in response to the extent to which private investment is driving biological research today?
I think it's an excellent question. I think you have to look at the history of science in this country over the last multiple decades. Before World War II, most of the funding for science came from private industry, and the government only took over with the so-called golden age of science, with NIH and DOE and NSF getting into funding. I think we would not have had the breakthroughs in this field without both. Lee Hood is sitting down the table from me; it would have been an obscure finding in his laboratory had he not helped start a company to make these instruments, which then went on to invest hundreds of millions of dollars a year in research to go forward. It's a combination of the two approaches. I think it's more that scientists have to have access to unique resources. I think a lot of people thought of something similar to the EST approach, but the day after I thought of it, flying back on an airplane from Japan with nothing to do for 12 hours, I could go in the laboratory and do the experiment, because in the intramural program at NIH the scientists have the advantage of having money to go do the
experiment. If you have to spend a long time, a year or so, writing a grant and waiting for somebody else to like a new idea, most ideas go down the toilet. If we had not had independent money to do the Haemophilus genome, that experiment would have been years away from being done. So I view it, obviously, in my own history, as an absolutely necessary component. I would not have been able to do most of the experiments that I have done and am now doing without private capital going into it, and I think there would not have been a genome project without private capital.
Those speak to the advantages. Now what about the disadvantages? What happens?
Well, there are lots of disadvantages. The good news is we're a public company on the New York Stock Exchange. The bad news is we're a public company on the New York Stock Exchange, so I have thousands of bosses. But I think in this case the goals of science and the goals of commerce are totally in sync. I think that's a key part of our model: the business will only succeed if the science is spectacular, because it's trying to move forward the applications of human genome research to individuals. That doesn't mean they always are in sync, but I think it's something we strive for in our own particular case. But nothing comes for free, whether it's from the government or from private industry.
You know, in that same vein, one of the interesting things about Celera when it was first announced is that Craig made the commitment to make much of the data that we're talking about public. What I'd like to ask you, Craig, with a thousand bosses, or even with the board of directors: will you for certain be able to execute that kind of commitment? Because it must be a temptation for a board of directors to say, look, we've invested; if we make it public, then others benefit from the investments we've made. So I'd be curious about your thoughts there.
In fact, it's a question we get all the time, and it's a very simple one to answer. When this opportunity first arose, I met with what were
then the Perkin-Elmer executives, and they said they would give me the technology and the money to sequence the human genome with our whole-genome shotgun method. I said I would only do it under one condition, and the condition was that we were able to release the data publicly and that the company would not attempt to hold it privately. I didn't have to convince anybody. I think everybody feels that with this particular genome there is no other choice. It is the moral thing to do, and it's the best way to move science and medicine forward. In fact, the stockholders had to vote for the reorganization of Perkin-Elmer to form PE Biosystems and Celera, and if you go back and read the prospectus, everybody that voted to do that, which was an overwhelming majority of the stockholders, did it on the basis that the genome was going to be released. So that was one of the fundamentals for forming the company in the first place. Tony White, the CEO of Perkin-Elmer, gave me a unique challenge. He said, fine, we'll give you the money and the technology to sequence the human genome; now come up with a business plan that allows you to not blow my 300 million dollars. That's where my creativity was put to a real challenge, and the model that we're trying for is the information business model. Lexis-Nexis and Bloomberg are organizations that take data that's largely publicly available and process it and make it usable and interpretable for people, and I think that's what we're going to do. When you see the massive computer infrastructure that we have, I think most biologists do not yet realize what's coming and what's going to be required in the computational effort going forward, and so I think it will be justifiable for people to subscribe totally on the basis of that alone. But for the board of directors of the company and the stockholders, this has been a premise from the beginning, and nobody's ever wavered on it even for one microsecond.
A number of questions have come up from the audience just in this vein, and one of
the questions asks: who owns the patents on all these genes?
So, patents are obviously not a simple issue, except when you back off and look at the overview of how medicines and therapies get developed in this country, and it's a trade-off we make as a society. The best examples come from individual genes that became medicines on their own. The best example is insulin. Before human insulin became available from recombinant DNA techniques, most diabetics were treated with pig insulin isolated from pancreases in slaughterhouses, and diabetics gradually became more and more resistant to the pig insulin over time. Genentech and Eli Lilly patented the human gene for insulin, which allowed them the right to commercially develop insulin as a product. They don't own my insulin gene. They don't own your insulin gene. All they have is a right to commercially produce insulin, and if they didn't have that right and that period of exclusivity, they wouldn't have invested the hundreds and hundreds of millions of dollars it takes to get a new drug on the market. There is no purpose that I know of for having a patent on a human gene other than trying to further the development of new therapeutics and diagnostics for the benefit of the American public. Some people take huge financial risks; more biotech companies, I think, go under than succeed. They only succeed if they come up with a new therapeutic that really helps the population. So who owns the patents? The companies, the investors, the NIH, the universities. The NIH has filed more human gene patents than Celera has to date. I don't think that's the key issue, as long as they're truly available for the pharmaceutical companies to make new therapeutics. Otherwise we won't have new medicines coming out of all the science.
Several questions have come up from the audience in regard to genomic structure; maybe I'll combine both of these and have you or anyone else comment on them. The first one is: if so many genes are genes of unknown function, how do we
know that so much of DNA is so-called non-functional and can be virtually ignored? And then the question that goes along with this: if only about 5% of DNA codes for genes in humans, what could be the purpose of the rest of the DNA?
Those are both excellent questions. Quite often the DNA outside the genes is referred to as junk DNA, and I think that's mostly scientists expressing their ignorance. We don't know what it is, so it must be junk; we know it's not genes. Sydney Brenner makes the unique distinction between junk and rubbish. He says you store junk in your attic for some potential use later on; rubbish you throw out. I think certainly all the repetitive elements in the genome have a role in evolution, in the rearrangements of DNA. I assume Lee Hood will be talking about the tremendous work he's done on the T cell receptor domains and others. The genomic sequence in our lymphocytes is different from the genomic sequence in the egg cells and sperm cells. We're constantly evolving and changing. The DNA between the chromosomes from our two parents interchanges and forms new chromosome structures. The so-called junk areas are actually essential for that. They're essential for chromosome replication. I think we're still learning the genetic language. We know lots of three-letter words. In the last year I learned a lot of four-letter words. There are clearly six-letter words; there's advanced punctuation. I think one of the first things we hope to find out of the genome is tissue-specific promoters, the signals that say this gene should be expressed only in these cells in the heart, and not anywhere else, at this certain period of time. Inder Verma at the Salk Institute has used some of these tissue-specific promoters to develop new routes for gene therapy. Actually using an inactive AIDS virus and a human tissue-specific promoter, he got the human gene to go to the right place and be expressed. The thing with gene therapy is, we're not just a big bag of protoplasm; we have a hundred
trillion different cells, and the challenge is to get the genes to go to the right place. So that's part of the junk DNA that's in there. I think it's largely an ignorance factor.
I'd like to expand on what Craig said, because we've been thinking about the genetic information, as discussed this morning, in a somewhat linear form. But in fact the genes are expressed from a spherical body, the nucleus. The chromosomes have architectures which are quite complex and are determined not only through development but also at different stages of expression of genes. So one of the things that I think is exciting to think about, in terms of what we're calling junk DNA, is the architectural contribution that this DNA makes, because genes do get expressed in spatial ways as well, even within a single nucleus. So this might be one of the things that begins to emerge as one looks at the very exciting data sets coming out of whole chromosomes. One could maybe start thinking about those and not just the genes.
Dr. Fred.
I would add just one thing further. In a sense, you can think about chromosomes as digital strings that contain a multiplicity of different languages. So one language is the language of the coding regions. Another language is the language of the information that turns the genes on and off at the right place at the right time. There are a series of languages that are associated with the unique functions chromosomes carry out in being the transmitter of genetic information and the generation of information in cells. And I agree completely with Craig: the repetitive sequences, 40% of your genome, do play fascinating and critical roles in evolution. But I think the really interesting question is, what are the other languages out there that we don't understand? And it's here that there are really wonderful opportunities for students in computer science and applied math to think about a future of deciphering the digital language of life.
And a final question before we break for lunch today that said
that the human genome was probably going to have 140,000 genes and not 100,000 as previously thought. Do you have any comments on that?
That was one of the downsides of private investment in genomics. I can tell the story of what happened early on with the formation of Human Genome Sciences. We had just published a paper in the scientific literature saying there were 60,000 to 80,000 human genes, and the late Wally Steinberg, who was the initial investor in Human Genome Sciences and who was in the process of doing a $125 million deal with SmithKline Beecham, called me up screaming obscenities at me and said, "What do you think you're doing, saying there's only 80,000 human genes?" And I said, "Why? What's the problem? That's what the number looks like." He goes, "Well, I just sold 100,000 of them to SmithKline Beecham."
Can I make one comment? The other question that is very interesting for the audience is that we don't know how to define a gene very well. What has turned out to be the fascinating truth is that something that in the past we would have called a single gene might have 30 different forms that carry out different biological functions. So there are wonderful ways for amplifying information, and what a gene is is a matter of controversy and debate.
This afternoon at 1:15 for music and 1:30 for the second lecture. And I want to thank Dr. Venter and our participants here one more time.