Thank you, Andy, and it's a pleasure to be here. It's truly remarkable that 15 years ago we started this course. The idea was that we thought there was going to be a lot of interest across NIH in genomics, and it was popular. The next time we did it, it was still popular, and Andy and I and Tyra have agreed we will just keep doing this every 18 to 24 months, as long as we fill Lipsett Auditorium and our remote sites. So my lecture this time is probably going to be quite different than it was 15 years ago, because so much has happened in genomics in that time. I guess maybe it's a sign of getting older, seniority, or something, but I find that a lot of the lecture I'm going to give is historic at this point.
So what I'm going to do is really just set the landscape, set a context for the entire class, if you will, this set of lectures. I'm going to give a lot of history, because there's a lot of contextual information that I think will be very helpful. And I will drill down in a few areas where I have a bit of expertise and where I know other lecturers are not going to drill as much. It's going to be a wild ride, because the landscape is so big these days and there's just so much to cover.
I will start with some foundational milestones, if you will. There is a lot of history worth setting out even before the term genomics ever got invented. There are a lot of places I could start; I choose to start with Mendel and his peas and understanding some basic principles of genetics. Miescher figured out that DNA was a molecule and described it for the first time in 1871. Avery was the one who figured out that DNA was the hereditary material through some of his creative experiments. But really one of the most important developments in genomics was the 1953 report by Watson and Crick of the double helical structure of DNA.
That is the insight that provided knowledge about the structure and function of DNA and really let us figure out how it was that DNA was the informational molecule that carries the blueprint from one cell to the next. That double helical structure, and knowledge of it, set up a whole cascade of things, starting in the 1960s with the elucidation of the genetic code, something that took place in part on this campus. Later, in the 1970s and early 1980s, the molecular biology revolution followed, which yielded a set of techniques that allowed the manipulation of DNA in the laboratory and found applications in virtually all areas of biomedicine. And in many ways that set up what took place in the 1990s, the genomic revolution.
The genomic revolution, of course, had as its centerpiece the Human Genome Project, which got out of the gates in 1990 and set a course through a series of milestones, depicted in iconic form here, of learning how to map and eventually sequence complex genomes, starting with smaller organisms and then working our way up to larger or more complex genomes, particularly that of human. This arguably was the most historic single undertaking in biomedical research, and indeed, some people believe it was probably the scientific accomplishment of the century. This is the context on which this course is entirely based, what really was set up by the genome project. In many ways, every speaker you're going to hear from throughout this lecture series is going to build on developments that have transpired since the completion of the Human Genome Project.
So what I want to accomplish today, over the next hour and 15 minutes or so, is really four things. Number one, provide you the fundamentals of how we map and sequence genomes. Number two, describe to you how exactly those techniques were used in the Human Genome Project to map and sequence the human genome. I'm then going to go on and talk about several areas that are compelling developments since the Human Genome Project, and you will see along the way I will touch upon the importance of comparative sequencing, as well as some of the new frontiers in genomics that really represent the main reasons why genomics continues to be at the cutting edge of biomedical research. So this is really a lecture about the Human Genome Project and beyond, with the first two areas here being about the Human Genome Project and the second two areas being beyond it.
But what are genomes anyway? Well, genomes can conceptually be thought of as an archive of information, if you will, just made up of G's, A's, T's, and C's. People like to make analogies to an encyclopedia volume set, because it is as if you could take a genome sequence and type it out on the pages of books, G, A, T, and C in the proper order as they appear in a given genome. As you can see from this kind of slide, genomes differ in size: mammalian genomes are on the order of three billion bases, and more classic model organisms such as fruit flies, nematodes, and yeast are substantially smaller. These represent the genomes that were under priority study when the Human Genome Project sought to fulfill its goals. They were selected in part because they represent workhorse model systems, in the case of the lower organisms, and in the case of human because of our own interest in understanding our own genetic blueprint.
But it was really these other organisms that served as a testing ground for figuring out how we were actually going to fill these volumes with the genome sequences of these different organisms. In reality, we all recognize that we don't have volumes in the nuclei of our cells. We have chromosomes, and in this case, this is the human chromosomal map, if you will, the cytogenetic map, which represents the 24 chromosomes that were the targets for systematic mapping and sequencing as part of the Human Genome Project.
When the genome project began, and I will be candid because I was part of it from the beginning, as a postdoctoral fellow then, we truly had no idea what we were doing. We just knew what the goal was and figured that if we had an audacious goal, good funding, and motivation, we'd figure it out. And indeed we did, but boy, in those early days it was not exactly clear how you went from chromosomal maps like this to the complete sequence of each of these chromosomes. What we did know how to do was clone DNA, and we were getting better at it all the time. There wasn't a mechanism at that time to immediately read those encyclopedia volumes. So instead, you had to first organize them, and then eventually be able to open them up and start to read them.
The way you organized them was to break down those big old chromosomes, either into big pieces or eventually into smaller pieces, and clone those pieces into cloning systems that allow you to manipulate the DNA in a laboratory, to organize yourself, and eventually to read all of that DNA. You do this by conceptually overlaying these pieces, by figuring out how they relate to one another. This is the process of mapping: taking pieces of DNA and figuring out which ones overlap which other ones. And this is precisely the way the first maps of the human genome were constructed.
Cloning systems were developed, one in yeast called YACs, another in bacteria called BACs, that essentially allow you to break down these encyclopedia volumes into collections of pages or into individual pages. These were substantially bigger than some of the older cloning systems, which were simply too small to make any headway in putting these encyclopedia volumes back together. So conceptually, what we did in the early 1990s was start with the human genome, studied almost chromosome by chromosome, first by ordering the chapters, if you will, collections of pages, using the first technology, yeast artificial chromosomes, which are almost antiques these days. A more contemporary cloning system then allowed us to go back and develop page-by-page maps of each chromosome using BACs, bacterial artificial chromosomes, which allow you to more easily manipulate these DNA molecules in the laboratory and are more amenable, as you will see, to DNA sequencing.
So at the end of the day, when the mapping phase of the Human Genome Project was done, every single chromosome had a map that looked like this, made of overlapping sets of BAC clones. You would simply select one BAC clone after another in a tiling path fashion, from left to right across every chromosome, and then you would have in your hand one page of the human genome, and you knew which chromosome it was on and roughly where it resided. Then you were ready to read that page of that encyclopedia volume, if you will. That was the mapping part of the genome project.
Attention then, of course, turned to DNA sequencing, which nowadays is mostly what we talk about, because mapping, as you will learn by the end of this course, is something that is barely needed anymore given the powerful technologies for DNA sequencing. There is still some mapping that's needed, but not nearly to the extent it was in the early 1990s.
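The tiling path idea described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the `Clone` type, the clone names, and the kilobase coordinates are hypothetical, and real BAC maps were built from experimentally detected overlaps rather than from known coordinates. Given clones already placed on a map, the sketch greedily picks a small set of overlapping clones that spans a region from left to right.

```python
from typing import List, NamedTuple


class Clone(NamedTuple):
    """A mapped clone; coordinates are hypothetical map positions in kb."""
    name: str
    start: int
    end: int  # exclusive


def tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Greedily choose overlapping clones that cover [region_start, region_end)."""
    ordered = sorted(clones, key=lambda c: c.start)
    path: List[Clone] = []
    covered = region_start  # everything left of this position is already covered
    i = 0
    while covered < region_end:
        best = None
        # Look at every clone that starts at or before the covered point
        # and keep the one that reaches furthest to the right.
        while i < len(ordered) and ordered[i].start <= covered:
            if best is None or ordered[i].end > best.end:
                best = ordered[i]
            i += 1
        if best is None or best.end <= covered:
            raise ValueError(f"gap in the map at position {covered}")
        path.append(best)
        covered = best.end
    return path


# Hypothetical example: five overlapping clones across a 500 kb region.
clones = [
    Clone("A", 0, 150),
    Clone("B", 100, 300),
    Clone("C", 120, 260),
    Clone("D", 250, 420),
    Clone("E", 400, 500),
]
path_names = [c.name for c in tiling_path(clones, 0, 500)]
print(path_names)
```

In this toy map, clone C is skipped because clone B overlaps the same starting point and extends further, which is the sense in which a tiling path avoids sequencing redundant pages.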
Now DNA sequencing has a rich history associated with it. I've told you about some of these milestones: Miescher, Avery, Watson and Crick. Let me fast forward to 1977. It was a monumental year because descriptions came out of two techniques for sequencing DNA, one by Fred Sanger and one by Wally Gilbert. Each of them won a Nobel Prize for the discoveries they described in 1977. For Fred Sanger, it was his second Nobel Prize, a true underachiever. He had won one previously for protein sequencing.
What was remarkable was that when the genome project began some 13 years later, Fred Sanger's method in particular, dideoxy chain termination sequencing, was regarded as a method that was never going to be used for sequencing the human genome. It was thought that for sure we would need some revolutionary new method for DNA sequencing, and that only when that came about would we be able to actually start sequencing the human genome. Perhaps the pessimism was in part because, back in 1977, one graduate student working in a laboratory for an entire year would only get 1,500 bases of new sequence. Even when the genome project began, that had gone up a little bit, but it still wasn't all that substantial. Not enough to support the sequencing of the genome.
But what transpired during the early phases of the genome project were systematic improvements in dideoxy chain termination sequencing, through automation, through better chemistries, through better physics, and so forth, so that it got to a point where it was very clear that the method could be used. Later on in this course, you're going to learn about all the new revolutionary methods for sequencing. Now, in 2010, I don't even know what the number would be for how much sequence data one graduate student working in a laboratory for a year can generate. It is truly monumental amounts, and you'll be hearing a whole lecture about that later on.
But originally, with dideoxy chain termination sequencing, it was all about tagging DNA molecules with radioisotope, separating these DNA molecules by gel electrophoresis, exposing the gels to X-ray film, and getting these classic sequencing ladders. This was certainly the method being used when the genome project began, but it was quickly going to have to be put to rest. There was no way we were going to sequence the genome using radioisotope-based methods.
Fortunately, more automated methods were developed that tagged the DNA molecules with different dyes, so that depending upon which base there was at a given position, the molecule would get a different color dye. That set up a situation where these gels could be read by a laser, and a computer could analyze what color the laser detected. As a result, you now had a more automated way to actually read the DNA sequence. I'm sure many of you have generated data like this or have sent out DNA samples to get these chromatogram-like data from fluorescence-based dideoxy chain termination methods. These were the kinds of technologies that were used for sequencing the human genome. What transpired was a series of more and more automated ways to do this, first on gels, but more importantly later on capillaries, using extremely efficient high-throughput instrumentation that allowed large amounts of data to be generated, even though we were using the classic Sanger method for generating that data.
At first, our attention turned to genes, because it was quick and easy, and because we were all very interested in what parts of the genome actually encode genes. So a lot of interest came in deploying these high-throughput sequencing methods to analyze how genes were expressed. For example, there were ways to capture little snippets of sequence from cDNA clones, called expressed sequence tags.
Later there were ways to concatemerize small bits derived from RNA and count how many of each tag you would find, which would indicate the levels of expression of individual genes. And later came a whole project here at NIH called the Mammalian Gene Collection, which recently concluded, that generated complete inserts of cDNA clones to provide us a resource and knowledge of all, or at least many, of the human genes. You're going to hear an entire lecture about gene expression methodologies, including some very exciting new ones, and that'll be by Paul Meltzer later in the course.
Of course, the real goal of the genome project was not just to look at genes, but to look at genomes and to sequence genomes. The methods and strategies that were used for sequencing genomes basically involved one fundamental paradigm that I want to describe to you. It is still a paradigm that is used today, and it is known as shotgun sequencing. While some of the methodologies might have changed, the fundamental principles behind shotgun sequencing have remained the same. Those principles involve generating ridiculously large amounts of redundant data that allow you to then put back together the puzzle of what the starting sequence actually looks like.
Let me introduce this to you. You take, for example, as we did when we sequenced the human genome, one of those BAC clones, one of those pages at a known position in one of those chromosomal volumes, and you prepare multiple copies of that DNA in the laboratory. You just make a DNA prep of that clone, and then you rip it to shreds, totally at random. You are randomly breaking it apart so as not to introduce any biases, and then you clone all those DNA fragments, and you just start picking random clones.
These are known as shotgun clones because you sort of shotgun clone the whole starting DNA, and you just start reading DNA from within these clones. You read those inserts, and you do it over and over and over again. You automate this: you have colony pickers picking the colonies, robots preparing the DNA, and all of these sequencing instruments churning away sequencing the inserts, and you feed all of that data into a computer. To get the final puzzle put together looking like this, you just generate lots and lots and lots of redundant sequence data. In fact, when we sequenced the human genome, we read every base at least 10, and in many cases 15, times on different little snippets of sequence.
And why do you do that? Well, the reason you oversample the genome relates to some statistics I'm not going to go through, called Poisson calculations. It all relates to the fact that you want to recover every last base if you can. To do that, at a statistical level, if everything is truly at random (and of course it's not perfect, but it's close), you want to sample on the order of at least eight, nine, or 10 times. In doing so, you will capture, at least statistically, 99% of all of the bases. Notice that even when you've only read it twice, you actually capture quite a bit of the sequence; we'll come back to that in a handful of slides. But the bottom line, the whole rationale, is that you oversample, and by oversampling, statistically you should recover virtually all of the sequence data that you need.
Well, what are you actually doing with that sequence data? This is shown conceptually here. You are building contigs, like I showed you earlier for clones, but now you're aligning the reads into contigs based on shared sequence.
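The oversampling argument above can be made concrete with a short calculation. This is a sketch under an idealized assumption that is not spelled out in the lecture: if reads land uniformly at random, the number of reads covering a given base is approximately Poisson distributed with mean equal to the coverage c, so the chance a base is missed entirely is e^(-c). The function names here are illustrative only.

```python
import math


def coverage_depth(num_reads: int, read_length: int, target_length: int) -> float:
    """Average number of times each base is read (fold coverage)."""
    return num_reads * read_length / target_length


def fraction_covered(coverage: float) -> float:
    """Expected fraction of bases read at least once under the Poisson model."""
    return 1.0 - math.exp(-coverage)


# How the expected recovery grows as the same clone is sampled more deeply.
for c in (1, 2, 4, 8, 10):
    print(f"{c:>2}x coverage: {fraction_covered(c):.4f} of bases expected to be read")

# Hypothetical example: 3,000 reads of 500 bases from a 150,000-base clone.
depth = coverage_depth(3000, 500, 150_000)
print(f"depth = {depth:.1f}x")
```

Under this idealized model, twofold coverage already captures roughly 86% of the bases, and the figure passes 99% between four- and fivefold. Real libraries are not perfectly random, which is one reason to sample more deeply than the idealized numbers alone would suggest.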
So here are your sequence reads, and you do this in a computer, of course, you don't do it manually. You're just overlaying one read after another, lining them up based on shared sequence, and eventually you've put this little puzzle together, so that's the actual assembled sequence. You take all of this data that was generated at random, you feed it into a computer, and boom, it organizes it all nicely for you, using one of a number of different sequence assembly programs. That typically gives you pretty good sequence. Not perfect: it might have some low-quality regions, it might even be missing some sequence, but it is what is known as working draft sequence.
What is required to perfect that sequence is a series of laboratory and computational steps that refine it, that finish it, that allow you to go in and get better quality across low-quality regions, and even capture sequence you never got the first time. It's done in a very slow, meticulous, craftsman-like way, sort of like the hard polishing you might do on a manuscript. That last 10% may take you half the time, because it's getting every word perfect, every grammatical mistake fixed, and every transition nice and smooth. But at the end of the day, you can get very high-quality finished sequence. The goal in the genome project, for example, was to generate a human sequence with no more than one error every 10,000 bases. We've gone back and looked, and we ended up with more like one error in about 100,000 bases, so we actually well exceeded the goal. Very high-quality sequence can come out of this.
But recognize that finishing is tricky. You can end up with errors if you're not careful. Let me just show you a couple of examples. Here, for example, there are three shotgun reads, and as you can tell, if I'm seeing this right, in one of them it would be very hard to tell that there's actually an A right here.
And in this read, maybe there was a hint of it, but you would have completely missed the fact that there was an A here. This is why you get redundant sequence. By reading it three times, the computer was able to figure out that, yes, there is an A here. And then in a situation like this, this was a shotgun read where you would have sworn there were just two G's, but the computer suspected something was wrong. When you go back in and do a special sequence read across that region, eventually you can detect that, yes, there are three G's here. But that required somebody going in, looking at it, reviewing it, going into the laboratory, doing an experiment, getting a new read, and proving that it's three. That is the reason why sequence finishing remains relatively expensive. We are very, very good at generating draft sequences of genomes. We have still not gotten nearly as good at being able to afford really perfect, high-quality sequence of all genomes, and that represents a challenge for the future.
But we had this methodology, we had shotgun sequencing, it was working, and we were able to automate it. We took those methodologies out for a test ride during the early and middle phases of the genome project. Some historically significant genome sequencing projects that represent milestones within the Human Genome Project included, first, the sequencing of the yeast genome; this was the first eukaryote to have its genome sequenced. It was followed by the sequencing of the nematode worm in 1998, the first animal genome sequenced. Just about that time is when capillary sequencing instruments came on the scene, and those were first really used on a very large scale for the next project, the second animal genome sequenced, when the Drosophila genome was completed. That then really set the stage. It was recognized that it was time, in roughly 2000, to really hit the accelerator and tackle the human genome.
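The shotgun assembly step described in the lecture, lining up random reads based on shared sequence until they form one contig, can be illustrated with a toy greedy assembler. This is a minimal sketch and not one of the assembly programs used in the genome project: it assumes error-free reads from one strand with no repeats, and the function names and `min_len` parameter are illustrative only. Real assemblers also handle sequencing errors, base quality, both strands, and repeats.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    # Drop reads that are wholly contained inside another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping toy reads sampled from a 15-base sequence.
reads = ["TACGTTAGC", "ATGGCGTA", "GCGTACGT"]
assembled = greedy_assemble(reads)
print(assembled)
```

Because the reads overlap by five bases each, the sketch stitches them back into a single contig regardless of the order in which they were supplied. With too little redundancy, the same procedure would stop with several separate contigs, which corresponds to the gaps left in a draft sequence.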
It was very clear these capillary instruments were working, Sanger sequencing could get us the human genome sequence, the methodologies were refined enough, and so off we went. The genome project said: we have our paradigm, this is how we're going to sequence the genome. We have these encyclopedia volumes, we have all these sequence-ready maps, page by page. We'll take the pages one at a time, shotgun sequence them, get a sequence of each page, stitch them all together, and call it a day. That was the way the Human Genome Project decided it was going to sequence the human genome.
Now I would be remiss to pretend that the genome project was the only game in town, that it was the only effort to sequence the human genome. There was this company, Celera Genomics, that came up with a different strategy. Their idea was: forget all the mapping, forget all these BACs, just take the whole encyclopedia volume, shotgun the heck out of it, and then feed it all into a computer and put it all back together. And you know what, it actually works. It doesn't work perfectly, but it does work, and they demonstrated it could be done this way. In fact, nowadays these sorts of strategies are the kind that are used for getting draft sequence. The problem is that there's not much of a path, at least not one currently defined, for taking a draft sequence generated this way and finishing it. If you have a clone-by-clone backbone for your sequencing, that represents starting material and mapping information that makes it more approachable to actually finish a sequence to very high quality, like I described earlier.
But nonetheless, two draft sequences were generated. The genome project generated its draft and published it in Nature, and at the same time, in a coordinated fashion, Celera Genomics published their whole-genome shotgun sequence, which is what their method is called, in Science.
The difference is that the genome project was committed to finishing the sequence to high quality, while the interests of Celera were different, and they were not necessarily interested in finishing the sequence to high quality. So this was published in 2001, and then it took a couple of years of finishing, a key effort to refine the sequence, and sure enough, in April 2003 the completion of the finished human genome sequence was announced. This is the cover of a DVD that actually has on it the sequence of the human genome. Then one year later all of that was analyzed a bit more in the final publication describing the Human Genome Project's effort to sequence the human genome.
I think it is very interesting that this was an essentially finished sequence. It's not 100%. Not every base is covered. Places like the centromeres are still too difficult to sequence, and there are even a few hundred really, really hard parts of the genome that even by today's methods can't be fully recovered. Some of you may have seen that in Nature, I think at the end of 2009, just a month or two ago, there was a whole article about this: there still are people, including some people on this campus, who are working away trying to fill in the missing pieces of the reference human genome. This effort goes on in part because it's hard, in part because it's expensive, and in part because some of our methods still cannot recover certain parts of the human genome. But again, as a percentage, this is actually very small. The bottom line is that we continue to refine and learn a lot about the human genome.
So that's really the first phase of this historic tour that I've tried to take you on related to the Human Genome Project. Now the fact of the matter is, and I probably don't have to convince you of this because you recognize it just by taking this course, that the Human Genome Project was not the end of genomics.
The Human Genome Project actually was just the beginning of genomics. So what I want to describe for the remaining part of my talk is the journey ahead: what's now in store for genomics that you're going to hear about throughout this course, the new frontiers that are being created on the foundation of information and technologies provided by the Human Genome Project. What I would tell you is that increasingly, and this is very relevant sitting here in the NIH Clinical Center, so many of these activities relate to the application of genomics to problems in medicine and disease and to thinking about human health. In fact, even before the genome project concluded, there was heightened interest in how genomics and medicine were going to meet. That was true in both the popular press and the scientific press, and it gave birth to phrases such as genomic medicine, personalized medicine, or individualized medicine. These are all different phrases for essentially the same thing.
So what I now want to describe to you is what I think are the important points that are going to get us from the Human Genome Project along a path that will eventually yield some realization of genomic medicine. Now what exactly do I mean by genomic medicine? Clearly people interpret this in different ways. I would simply say that what I mean is healthcare that is somehow tailored to the individual based on genomic information. Now I wouldn't claim for a minute that we know exactly what this path, as I show it here, will look like, no different than we really didn't know what the journey to sequencing the human genome was going to look like. I do know it's going to involve a lot of steps, and I'm completely convinced we don't even know what those steps are going to be. We can only define the first handful of them. I also know that we were successful at doing this.
We need to be successful at doing that, and only if we are successful will we truly fulfill what we promised about the importance of sequencing the genome. That, I think, is the motivation behind trying to make this journey: we recognize the importance of changing the way we practice medicine by better understanding the human genome. So let me take you through what I think are three major steps, not all the steps, just the three major ones, especially those on which there have been major developments since the completion of the Human Genome Project.
Let me start by talking about step one. Recognize that the genome project was about mapping the genome and sequencing the genome, and I went through that in, what, 15 minutes. I reduced 13 years to 15 minutes, but that's what I guess historians have to do. The first big step we are taking in that journey involves this next phase of interpreting the human genome sequence. Note that it's something I wouldn't even claim to put a time limit on, because I have no idea how long it's going to take. I'm actually of the belief that decades from now we'll still be interpreting the human genome sequence, because we'll constantly be unraveling additional complexities that we hadn't appreciated the decade before.
This is going to be a huge undertaking, and for those of you who think it's not, I would just remind you to go stare at some DNA sequence every once in a while. Shown here is 0.0001% of the human genome sequence. All you have to do is stare at it for a while, think about all the complexities of the human body, and ask yourself how the heck you are going to figure out how it all works across those three billion ordered letters. Well, we are trying very hard to understand how it works, and you have to start in small steps.
So one of the first sets of steps being undertaken in unraveling the complexities of the human genome is just to catalog which parts of the genome are functional and which parts are not. Catalog them, annotate them, and try to understand which parts you really have to pay attention to. It's sort of like when you have a whole lot of notes in a class and you say, I just want to know what's on the test. That's the first thing we're doing: trying to figure out what the functional stuff is.
One thing that has come to the forefront from studies that have come out mostly since the genome project concluded is insight into the general demographics of the genome, general numbers that you can at least have as a benchmark for trying to interpret the human genome sequence. For example, we now know from studies that compared our genome to other species' genomes, which I'm going to describe in a few slides, that probably about 5% or so of the human genome is being held on to by evolution, is evolutionarily constrained, and therefore is presumed to be functional. So at a primary sequence level, we now have an estimate that something like 5%, maybe 6% or 7%, of our three billion letters are the functionally important ones that we really need to pay attention to. Now we don't know exactly which bases those are. We know there are about 150 million of them, but we don't necessarily know their positions. This is a bulk estimate based on a series of studies that have now been performed. It might end up being a lower bound on what's actually functional, since there might be ways that DNA can confer function other than through primary sequence, but it's a good benchmark estimate to think about.
So off we go. We want to identify the 5% of these bases that are functionally important. Let's start with what we understand, and what we fundamentally understand are genes.
We understand the central dogma of molecular biology: DNA makes RNA makes protein. There's an intermediate molecule, RNA, and there's a lot that can be studied in the RNA world, as I alluded to earlier. So we should be able to figure out something about genes. The other reason we can figure out something about the catalog of genes, and go through and annotate the set of sequences that actually encode genes, is that we've learned a lot about genes over the decades leading up to the genome project. For example, we know that genes consist of exons and introns. We know that these exons get put together into mRNA molecules that can be alternatively spliced, so through splicing and dicing you get complexities, but you can study that RNA, and as a result you can figure out which parts of the genome were actually made into RNA and therefore might be coding for proteins.
We have another huge advantage, something I alluded to earlier that was learned in the 1960s. We know the genetic code. We have the famous lookup table. We can feed all this information into computer programs that help figure out where you have open reading frames and where you might have genes. You marry all of that with information about RNA sequence and, before you know it, you can get your hands around the catalog of human genes.
So as a result of that knowledge and a tremendous amount of data that's been generated, we now know that something like one and a half percent of our three billion letters make up sequences that directly code for protein: protein-coding genes. That corresponds to something on the order of 20,000 genes, and yes, there are geeks like me who sit around in bars at meetings and argue about whether it's 19,050 or 21,050. Don't you guys worry about that. It's roughly 20,000 genes, and we'll figure out exactly what the right number is in the coming years. Turns out we're very complicated.
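The open reading frame idea mentioned above can be sketched briefly. This is only an illustration, not the gene-finding software used to annotate the human genome: it scans the three forward reading frames for a start codon (ATG) followed in frame by a stop codon (TAA, TAG, or TGA), and it ignores introns, the reverse strand, and RNA evidence. The function name and the `min_codons` threshold are assumptions made for the example.

```python
from typing import List, Tuple

START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int, int]]:
    """Return (frame, start, end) for forward-strand open reading frames.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    the codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos + 3))
                start = None  # close it and keep scanning
    return orfs


# Toy sequence with one short open reading frame in the third frame.
toy = "CCATGGCTAAAGGTTGACC"
orfs = find_orfs(toy)
for frame, start, end in orfs:
    print(frame, start, end, toy[start:end])
```

In human DNA, exons are short and separated by long introns, so a scan like this is only one ingredient; as the lecture notes, it has to be combined with RNA sequence information to build the gene catalog.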
We, as humans, make far more than 20,000 or even 22,000 proteins. We have different ways of alternatively splicing RNA, different ways of post-translationally modifying proteins. So we have great protein complexity in the face of a relatively modest set of genes. The good news is that even as of today we have a pretty darn good inventory of all the human genes, one that will be refined in coming years, but this is really not the grand challenge in genomics now because we have a fairly good catalog at hand. That is actually not the case for the other three and a half percent of the functional sequence. Remember, five percent is functionally important. One and a half percent, in yellow, is coding for genes. That leaves three and a half percent. So there's three and a half percent of these letters we need to highlight in another color. It's functional. It's being constrained through evolution, but it is functioning in ways other than directly coding for proteins. This is non-coding functional sequence. And the truth of the matter is we have a lot we need to learn about that. Now we do know some things about non-coding functional sequences. For example, we know that genes are regulated by a whole suite of elements: promoters and distal elements, such as enhancers and silencers and insulators and so forth. And we know they all get together and do all sorts of complicated things to dictate when and where genes get turned on and how they get spliced and so forth. This all exists out in the non-coding parts of the genome and it's functional. We know about some of these. We have a lot to learn about a lot of them. In addition, we know there are sequence elements that are important for packaging up our chromosomes, for segregating our chromosomes, for replicating our chromosomes. Those are all non-coding functional sequences. And then just think about it. 
Once upon a time, we thought DNA made RNA and then RNA mostly made protein, and RNA otherwise was fairly non-contributory to the grand scheme of things. But wow, we've learned a lot about non-coding RNAs in the last 10 years, all sorts of microRNAs and other non-coding RNAs. There's a whole RNA world out there that's functionally important but is non-coding in the sense of not being protein coding. And so this also represents some of that purple stuff. And then of course there must be other ways that DNA can confer function that we don't know about yet, and that would be in ways other than coding for protein. So all of this must make up this landscape, if you will, of non-coding sequences that are functional, which amount to about 3.5%. Well, we know some of it is gene regulatory elements. We don't know how much. We know some of it is chromosomal functional elements. We don't know how much. And then there's a whole bunch of stuff we don't know about. And in fact, it's not like we can pick up a textbook and read about it, because it's not even described in textbooks yet. This is a grand challenge in genomics because at present we have a very poor inventory of what turns out to be the majority of the functional sequence in the human genome. So this really represented, and continues to be, a major priority in genomics in the journey of understanding how the genome works. And it turns out we didn't have a whole lot of tools available a number of years ago, five, six, seven years ago. There were some, and they've gotten a lot better in ways I'll describe in a minute, but it was really time for a consultant to be brought in. We recognized in genomics that if we were really gonna understand parts of the genome that were functional, but where we didn't have any clues about how they function, we needed a really smart consultant to come in and guide us on the strategies we would use for studying that part of the genome. 
And we thought about consultants that had been useful in the past, and it turns out we had to go to a consultant that existed even before Mendel. And that consultant turned out to be Darwin, because what Darwin left in his principles provided many, many key insights that have led to how we have now gone about interpreting the human genome sequence. So for those of you who didn't realize this, Darwin had his 200th birthday last year, February 12th. I don't know where you've been in the last year if you didn't realize it was Darwin's 200th birthday. All you had to do was pick up a journal; it seemed like every month Darwin was being featured. In some cases he was featured multiple times. To be featured on Nature and then Science and then Nature and then Science and then Nature again, you must be a pretty smart guy. And so I think at the end of the day all you had to do was watch journals the last year to realize that this guy really left behind some pretty smart ideas. Now he said a lot of very smart things, and one thing that was attributed to him, although I've come to learn it's unclear if he actually said it or wrote it, is that it's not the strongest of the species that survives, nor the most intelligent, but the one that's most adaptable to change. Now why does that become relevant for genomics? Well, it turns out it's that change part that really represents one of the keys that have provided us insight about how the human genome works. The reason for that is that, as Darwin described, fundamentally the forces of evolution make it so your genome should change; you should constantly adapt to an environment. But fundamentally we share most of our physiological properties with the other animals across this planet, and those things can't change, because almost all species are gonna need some basic biochemistry like insulin and hemoglobin and things like that. 
And as a result, those are the parts of the genome that you don't tinker with because evolution doesn't mess with something that's really, really, really, really functional. And so one could imagine that left behind in every genome is a legacy of what parts of the genome could tolerate change and what parts of the genome could not tolerate change. And the stuff that didn't tolerate change is probably the really important functional stuff. So could you get clues if you opened up those genomes and read them as if they were evolutionary notebooks left behind cataloging all the experiments that have gone on? And a more contemporary scientist and a genomicist, Eric Lander, made the quote, which I agree with, is for the last three and a half billion years evolution has been taking notes. Those notes are in the genomes of various life forms. And so the idea behind comparative genomics, which is all about helping us in part to understand the human genome is not necessarily to compare the morphology of these various species as Darwin would have done, for example, but to open up their genome notebooks, read them and catalog what they had in common with humans. Because if there are things in common between humans and creatures such as these separated by millions and millions of years of evolutionary time, there is a reason for it. And the reason for it is because they're functionally important. So this could provide a clue. How does this get applied? Well, you use the experiments of evolution to decode the human genome by taking let's say the human genome sequence, compare it to another species, let's say the mouse, and simply line up corresponding parts of their genomes, orthologous sequences in other words, compare them in a computer and simply highlight those sequences that remain the same over the tens of millions of years of time separating these species. 
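The "line them up and highlight what stayed the same" step can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the two sequences are taken to be already aligned and of equal length (with `-` marking gaps), conservation is scored as plain percent identity in a sliding window, and the function name, window size, and threshold are all hypothetical choices rather than anything used by real comparative genomics tools.

```python
# Sliding-window percent identity over a pre-aligned pair of sequences.
# Hypothetical helper; real methods use multi-species alignments and
# statistical models of evolution rather than a simple identity cutoff.
def conserved_windows(human, other, window=10, min_identity=0.9):
    """Return start positions of windows whose identity meets the threshold."""
    if len(human) != len(other):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(human) - window + 1):
        pairs = zip(human[start:start + window], other[start:start + window])
        # A gap never counts as a match, even if both sequences have one.
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        if matches / window >= min_identity:
            hits.append(start)
    return hits

human = "ACGTACGTACTTTTTTTTTT"
mouse = "ACGTACGTACGGGGGGGGGG"
print(conserved_windows(human, mouse, window=10, min_identity=1.0))
```

As the lecture goes on to stress, windows flagged this way are only candidates: conservation enriches for function but does not prove it.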
These sequences that are in common, in other words that are conserved or undergoing evolutionary constraint, aren't definitively functional, but simply provide you an enriched set of sequences that you could then look at more carefully as candidates for being functional. So this is exactly the reason why, after we sequenced the human genome, off we went to sequence the genomes of various other animal species, carefully selected based on a whole host of criteria. That even continues now, including most recently cow, for example, but it represents sort of classic mammals, some interesting mammals such as marsupials and monotremes, and even includes some fish species and many more that are not shown here. These are just the ones whose papers were featured on the covers of these prominent journals. And what you get from this is sort of a diverse landscape of sequencing: a human sequence which is really good quality, and a mouse sequence which now is a finished sequence of very high quality. But as I alluded to earlier, we actually can't afford to get high, high, high quality sequence for all these genomes. The rest are really more like draft sequences, and that's why I'm depicting them as these dashed lines. It became very clear, though, that to really read and analyze all these various genome sequences to identify the most conserved parts of our genome, we needed many more species than those listed on this slide. In fact, what became very clear was that you need statistical power for your programs to be able to find very, very small patches of sequence that are highly conserved, because some functional elements are very small. If you don't have enough species, you can't quite separate the signal from the noise. And so one of the things that was analyzed was: well, do you really need to even have a good draft? What about a weak draft? Or can you skim read the sequence? 
And so there was a lot of debate several years ago, and some of it was informed by a study such as this one by Sean Eddy, which basically demonstrated that if you wanted to use comparative sequence analysis to identify, let's say, a 50-base segment of DNA as highly conserved, you can get away with only something like four species. But if you want to detect a very small stretch of sequence, like six bases or something like that, you really need to use many more species than that. And we know that some regulatory sequences, for example, might be as small as six or eight bases. So one might imagine we need a whole bunch of species. And so from a study that my group did with Eric Lander's group, an analysis published in 2005, it became apparent that one of the best ways to deploy large-scale sequencing, and a very selfish way to figure out how our genome worked, was to augment the species that already had been sequenced to relatively good draft quality by selectively picking additional species across the mammalian tree, but not investing in really high quality drafts for them, and instead lightly sampling them. Remember, earlier I showed you that even with a 2X shotgun sequence, upwards of 85 to 87% of the sequence will be recovered. That is why the idea of roughly a 2X skim read, if you will, or just 2X redundancy, light sampling, skim reading, whatever you want to call it, would be a very effective way to inform us about the most highly conserved parts of our own genome. As a result of that, that's what transpired. 
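The 85 to 87% figure quoted for 2X shotgun sequencing lines up with a simple back-of-the-envelope model. The sketch below assumes the idealized Lander-Waterman style picture in which reads land uniformly at random, so the number of reads covering a base is roughly Poisson with mean equal to the redundancy; real genomes, with repeats and cloning biases, will deviate from this. The function name is hypothetical.

```python
import math

def expected_fraction_covered(redundancy):
    """Idealized estimate of the fraction of bases hit by at least one read.

    Assumes reads fall uniformly at random, so the chance a given base is
    missed entirely at c-fold redundancy is about exp(-c).
    """
    return 1.0 - math.exp(-redundancy)

for fold in (1, 2, 4, 8):
    print(f"{fold}X: {expected_fraction_covered(fold):.1%} of bases expected to be covered")
```

Under these assumptions, going from 2X to 8X buys only the last slice of the genome at four times the cost, which is the economic argument behind skim reading many species instead of deeply sequencing a few.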
And this group of species was selected: 22 additional mammals, and these have been sequenced at low redundancy. In fact, later this afternoon I'm on a conference call where all this data is being analyzed and prepared for publication. I would predict, almost guarantee you, that sometime in 2010 you will see a publication describing the analyses that were done using these 22 additional mammals to help us identify the most conserved parts of the human genome, because those parts are the ones that we want to study first and foremost as being functionally important. So the idea is to take this high quality sequence, medium quality sequence, and light skim sequencing, put them all together, feed them all into computers, analyze them comparatively, and try to glean out the most highly conserved parts of the human genome. In other words, simply taking many, many species, and that should really be 24, 25, 30 species, and comparing them systematically to the human genome sequence in a way that gets you high statistical confidence to highlight, let's say, the top 5% most conserved parts of the human genome. And that is exactly what has now been accomplished through this light sampling, through sequencing and analyses that are going to be reported later this year. So that's just great. We have candidate regions across the human genome that are incredibly conserved and that we can highlight, but it raises the obvious question: what do they do? We can mostly figure out the yellow stuff, right? The genes, because we have all these rules we understand. It's the purple stuff that's hard: just because it's conserved, if you look at it, it doesn't jump out at you and say, hey, you know, I'm a region bound by a transcription factor, or hey, I'm a region that dictates that the DNA is going to be replicated starting here. 
We had a gap between knowledge of what's conserved and knowledge of what it actually is doing. And recognizing that understanding how the human genome works is more than just figuring out what's conserved, it was recognized that we needed a complementary program to help put together what these conserved sequences and other sequences are actually doing functionally. This gave birth to a project called ENCODE that NHGRI has been directing for a number of years now. ENCODE stands for Encyclopedia of DNA Elements, and the idea behind ENCODE is to develop a large consortium where everybody is staring at sequence and figuring out what it does and what is functional in the human genome. It had a lofty goal, as our projects often do: compile a comprehensive encyclopedia of all functional elements. But, and this seems to be a theme, we didn't really know what we were doing when we started. It was recognized that we shouldn't do the whole thing at once, but initially just do 1%: pick 1% of the human genome, have everybody in this international consortium study that same 1%, have everybody compare notes, figure out what was learned, and then figure out how to do the whole genome. And do this in a consortium way whereby computational methods and comparative sequencing and various laboratory methods are all brought together, with everybody doing the same set of regions initially, and then figure it all out and figure out what can scale up to the other 99%. So I'm not gonna describe this pilot phase to you, because it was published in a series of papers in 2007, a major paper in Nature and 26 companion papers in Genome Research, but needless to say we learned a lot. 
We learned a lot about how to interpret the human genome, and we recognized what would and would not scale up to the other 99%. And so, using the lessons learned from that, ENCODE has now scaled up considerably and is now tackling the whole genome. In fact, nowadays, as I'm sure we will be hearing in subsequent lectures, you can go to various websites and genome browsers, as they're called, and pull up parts of the genome where you will have genes that are very clearly indicated, and you'll also have regions that are clearly indicated as being conserved non-coding regions, shown in purple, and increasingly you will find information about what these might be doing functionally, based on data generated by ENCODE. Some of that data is the sort of thing you might predict, about known transcription factor binding and things like that, but there are other aspects of genome function that are not necessarily at the primary sequence level, or at least not according to our current knowledge. So for example there is of course this entire landscape of epigenetics, related to how DNA is modified either through methylation or through histone modification, and that's another major part of ENCODE. I'm not gonna talk about it because we have Laura Elnitski coming on February 23rd to give an entire lecture about epigenetics and how this is all very relevant to understanding how the genome works. But all this is being layered over and over and over again on the genome, and this is just sort of a representative shot, actually it's a cartoon, of the kinds of data that ENCODE is generating related to what parts of the genome are made into RNA, what parts of the genome are bound by transcription factors, what parts of the genome are sensitive to digestion by DNase I, indicating regions of chromatin that have opened up a bit, and various other methodologies related to epigenetics. 
This is actually a cartoon view, because if any of you go to a genome browser that has ENCODE data and start bringing up the huge amounts of data that are available, it looks more like this. So just unbelievable numbers of tracks of data coming out of the various laboratories and computational groups that are analyzing the genome as part of ENCODE. What's the challenge here? Well, it turns out the real challenge now is not generating the data; it's synthesizing the data and really figuring out, from the suite of high-throughput methodologies that can now be applied in the laboratory and by computers, which ones are telling the truth and what they are telling us. So it's one of these cases where you have a lot of data, and it's telling us a lot, but it's not entirely certain in all cases what is really the right answer and what might not be the right answer. This represents the huge challenge in ENCODE, and needless to say there's a lot of activity to figure out the best analytical methods to interpret all that data, and also to make it user friendly, so that folks don't have to deal with screenshots like this but get more synthetic views of what really is functional, what's the most likely important set of sequences, and what they are doing. It is also recognized that, while we really love to study our own genome, and there are reasons for that, to really understand how genomes work we have to constantly learn from organisms that are more tractable, more manipulable, in the laboratory. As a result, ENCODE has a sibling called modENCODE, which is a similar program but focused on the genomes of worms and flies, because of the more powerful systems for what you can do genetically to understand how the genome is working in these organisms, which will teach us lessons about how the human genome works and the mouse genome works and so forth. 
So to end this part, and it's just the first step in that journey, let me just point out that we are gaining richer and richer, more and more highly detailed views of the human genome. It's this first step towards interpreting the genome, and it's certainly one that is overwhelmingly important and that I'm convinced we'll be doing for many more years, but we've made major inroads already. I do wanna pause here for a moment. I just think it's interesting, and you're not gonna hear much more about comparative sequencing in this lecture series, at least not that I'm aware of, and it's something that I've done in my own group, so I just want to make a couple of comments about other applications of comparative sequencing that are not related to understanding how our genome works but have other very interesting and compelling reasons for doing them. I just wanted to touch on this to point out that there's a lot in the popular and scientific press, for example, about understanding human evolution through comparative sequencing. Again, popular press, scientific press: there's a lot of interest now in being able, with these new sequencing technologies, to sequence some of our ancestors who aren't even around anymore, because of the ability to sequence very small amounts of DNA recovered from Neanderthal bones, for example, and so forth. So there's a tremendous amount I think we will learn in the coming decade about human evolution through some of these comparative sequencing programs. And then more broadly there's just so much to be learned about evolution and about speciation and about life forms across the planet. Very interestingly, some genomic colleagues of mine, David Haussler, Steve O'Brien, and Oliver Ryder, basically started a program called Genome 10K. We had an organizational meeting last year that I attended, shown here, and here's the idea, and I just think it's of general interest. The idea is that it turns out that even for the first 40 genomes 
that we went to sequence, what actually ended up being rate limiting was not sequencing the genomes; it was actually getting the DNA. Sure, it was easy to get a mouse and a rat, but then when you start to want a platypus, a monkey from South America and so forth, all of a sudden getting those DNA samples turned out to be very challenging. And we're going to get to a point where sequencing a genome is going to be dirt cheap. One could imagine a day, and maybe that day is only five years away or 10 years away, when we could sequence any genome cheaply enough that the way our kids or grandkids will study biology down the road will be not by reading about organisms or looking at them, but by perusing their genome sequences, comparing them, and figuring out the differences between them. So the idea was to start now to collect something like 10,000 DNA samples from all the various vertebrate forms across the planet and have them available, so that when sequencing gets cheap enough we'll have the DNA in hand, instead of waiting until then and then going out and collecting and taking years and years. So the idea behind this, of course, is just being proactive. We know where sequencing is going: it's getting incredibly cheap. And so we just want to anticipate the point when sequencing is cheap enough and have the DNA in hand, sort of reminiscent of what Gretzky used to say: I skate to where the puck is going to be. The idea is get the DNA now and have it ready to go when the sequencing gets cheap enough. If you're interested in reading more about this proposal, it actually just got published at the very end of last year as a meeting report. And again, it has some biomedical implications, certainly, but it is actually of broader interest in evolutionary biology and in the future of how biology, I think, is going to be studied, and I think it's actually quite interesting. What I want to 
do now is shift gears to the second major area, the second major step in this journey from the human genome project to genomic medicine. Here I told you about how we're comparing sequences among different species, but the other really important application of genomics now is comparing sequence within our species, doing intraspecies sequence comparisons, and particularly doing this in a way that's going to empower human genetics. The rationale behind this, of course, is to fundamentally understand which genetic differences amongst any of us are relevant. We know some of these calculations: we now know that something like 99.7% of our sequence is absolutely identical, but that still leaves 0.3% or something like that, some of which are glitches, variants that confer some risk to a disease or overtly result in a disease. If you think about what it is like if you compare any one of our genomes to that of the person sitting next to you, for example, your genome is just sprinkled with variants. In fact, any two of us differ by maybe three to five million differences across the two times three billion bases in our genome. The truth is the great, great, great, great majority of those variants are completely innocent and have no phenotypic consequences, but a subset of them do somehow affect us at a phenotypic level. Some of them are bombs, as I depict here, and have a negative consequence, but there are probably some variants that are positive as well, that confer something that would be desirable. What has been very relevant is to try to understand which of these variants contribute to disease, and again that's the whole rationale for fulfilling the promise of why we wanted to sequence the genome and interpret the genome. So in thinking about disease, and to really show you one of the highlights that has taken place in genomics and also to set up some of the other speakers, let me just quickly make a simple division by saying, if you want to think about 
genetic disease, and virtually every disease is genetic, you can divide it into two very broad categories. You sort of have your rare, simple, monogenic Mendelian disorders, cystic fibrosis, sickle cell, hemochromatosis, things like that, whereby it's really a single gene and a mutation of that single gene that is the dominant risk for getting that disease. There might be some other genetic variants that contribute, there might even be an environmental component, but by and large these are single-gene disorders. They're typically rare and they're considered genetically simple. That contrasts with the more common diseases that fill hospitals and clinics every day. These are common or complex multigenic, in other words non-Mendelian, disorders, whereby you have multiple genetic variants in multiple genetic loci that all conspire together, in addition to what is typically a large environmental component, and all of that taken together results in a risk for developing that particular disease. Now the reason it's useful to break this down is because we've seen different developments for each of these classes of disease since the genome project began. In the case of rare and relatively simple genetic diseases, they're simple because it's actually easy to find those genes. It used to be really hard, but shown here is a graph of the genes associated with single-gene disorders that were discovered once the genome project began, and you can see, guess what, the genome project really accelerated our ability to do that, because we had maps and we had clones and we had sequence. And so there were strategies available for these rare genetic diseases, so much so that you could even stop plotting these things in 2005, because you had made your point: this just became very straightforward, and as a result huge numbers of rare genetic diseases have had their gene identified since the genome project began. But that's a very different picture than what took place for what turns out to be the more 
common genetic disorders, these genetically complex ones, because those are really hard. These are all subtle genetic changes that require a huge amount of statistical power to figure out which variants confer risk, and it turns out that you simply have to have a more robust set of strategies to be able to accomplish that. It also turns out you need to study a lot of people, and you need to understand genetic variation at a more global level. That is the rationale for a project that took place after the genome project called the HapMap project, which served to catalog and describe common variants and the pattern of those variants across all human chromosomes across large populations. You're gonna hear more about this, but the major rationale for the HapMap project was to empower studies of the genetic basis of complex diseases, and you're gonna hear in detail from two lectures later in the course, in two successive weeks, about human genetic variation in general and its use in what are known as genome-wide association studies. So I'm not gonna describe it; I'm just gonna give a punchline to really set them up. The idea was that if you could have a good, comprehensive HapMap of the human genome, and if you had the tools in place to study thousands of individuals with the disease and look for regions of their genome that are inherited at a statistically high frequency along with the disease, you would get clues about where variants reside in the human genome that confer risk to complex genetic diseases. All that is a mouthful, and you're gonna hear details, but needless to say the punchline is that it worked. In fact it worked really well, because the HapMap project set up a circumstance whereby you could do these genome-wide association studies and end up finding specific regions of the genome, and in some cases the actual genes, that are 
associated with diseases whose genes were originally thought to be very, very difficult to identify because they were so genetically complex. The first success story was this one, and it came out just shortly after the HapMap project started releasing data, but let me just show you the striking advance that took place. So here, in 2005, the previous paper reported that this region of the genome right there contained a variant that conferred risk for adult-onset macular degeneration. That was a huge success story, but I don't think anybody could have anticipated how straightforward it was gonna become to do these genome-wide association studies. In 2006 there were a couple more successes, in 2007 a few, by the second quarter of 2007 you can see they're adding up, and now look what happens. Every quarter we're simply cataloging papers that reported regions of the genome containing variants conferring risk to a complex genetic disease, and in 2008 it just keeps filling in and filling in and filling in, and it did not slow down in 2009: first quarter, second quarter, third quarter. We don't have the updated slide for the end of 2009 yet, but needless to say, how successful this general paradigm turned out to be was really, really unexpected, at least by some people. Maybe optimists thought so, but I don't think many people believed it was gonna be quite this successful. Now to be clear, this did not identify what the underlying gene was. With these studies, as you'll hear from Karen and Lynn, the fact of the matter is they're just telling you what regions of the genome have variants in them that confer risk, and now there's a lot of detective work to try to figure out which variant within these regions is actually the one conferring risk. But it is still remarkably spectacular how many candidate regions this has provided, and it has set up a whole set of follow-up studies to drill down much deeper and try to actually identify those variants. 
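The core statistical question in these association studies, whether a variant shows up more often in people with the disease than in people without it, can be illustrated with a basic chi-square test on allele counts. This is a deliberately simplified sketch: the counts are invented, the function name is hypothetical, and real genome-wide studies test a very large number of variants at once, so they require far stricter significance thresholds and corrections for things like population structure that are not shown here.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (statistic, p_value). For one degree of freedom the upper-tail
    p-value equals erfc(sqrt(statistic / 2)).
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs a nonzero total")
    statistic = n * (a * d - b * c) ** 2 / denominator
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Made-up counts: the variant allele is more common among cases.
stat, p = allelic_chi_square(case_alt=60, case_ref=40, control_alt=40, control_ref=60)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

A small p-value here only flags a region as associated; as the lecture notes, working out which variant in that region actually confers the risk is separate detective work.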
It was not surprising that in 2007, right in the middle of all these advances, Science magazine recognized this as a huge breakthrough. So out of everything going on in science in 2007, it was these advances in human genetic variation that were identified as the breakthrough of the year. There is one other little subtlety that I find particularly interesting, because it directly relates to what I was telling you about that first step in the journey, understanding how the non-coding parts of the genome are working: those non-coding functional elements turn out to be really, really, really important for these complex genetic diseases. I just pulled out one example from last year, where in one of these genome-wide association studies, this particular one related to autism, the hottest candidate region, the one with the greatest statistical likelihood of having a variant that confers risk to autism, was right there. But if you look carefully at that part of the genome, there's a gene there, there's a gene there, and there don't seem to be any genes within here. Is this an anomalous finding? Well no, it actually turns out that out of the hundreds and hundreds of successful genome-wide association studies, it is now looking like 70 or 80 or 90% of the time the regions conferring risk to genetic disease are not in coding regions. They seem to be in non-coding regions. So remember that 3.5% that I told you is really hard to find, that we have to color in purple, and that we have to use ENCODE to figure out? Well, we're not just doing this because we're a bunch of genome geeks. We're doing this because it turns out to probably be very medically relevant, and very relevant to what turn out to be these huge health burdens of complex genetic diseases. And so this is making the next part of the journey very hard, because we don't yet understand the non-coding parts of the genome as well as we understand the coding regions. 
And I know many labs that are working very hard trying to figure this out now, and it's the motivating factor for why we want to understand the non-coding functional landscape, because it does seem that the majority of the variants conferring risk to complex diseases are gonna be in that part of the genome. So stay tuned, you're gonna hear more about this, but I wanted to set that up because I thought it linked very nicely with what I talked about earlier. So what I've described to you at this part of the talk, in going from here to here, were really two of the steps so far. I told you about how efforts like ENCODE are helping us interpret the human genome sequence and how we're using various methods now to understand the variants that are conferring risk to human disease. We did really well with simple genetic disorders. We're getting better with complex genetic disorders, but there's still a long way to go. The third area I wanna describe is really the third and last step in this journey, and then 15 years from now I'll come back and I'll describe another three, but it relates to something that I think is gonna affect all of this, and it relates to DNA sequencing itself. It turns out that to sequence the human genome for the very first time cost something like a billion dollars, and people will argue with me whether it was 700 million or 600 million or 1.2 billion, but roughly a billion dollars to generate the first reference sequence of the human genome. But as soon as we got that reference sequence, everybody realized that this was awesome, this was powerful, and we wanna do this a lot. In fact, we wanna do it for many, many different people 'cause it's gonna be incredibly powerful for human genetics. In order to accomplish that, it was very clear we needed to figure out how to reduce the cost of sequencing substantially, and in fact we set out as a goal in the community to get to the point where we could sequence a human genome for $1,000.
A nice rounded figure, $1,000, that's a very reasonable cost for a clinical test in a hospital, for example, and one could imagine if you get it down to $1,000 you could do an awful lot of science generating genome sequences for $1,000. And so a major priority, both for funding agencies such as NHGRI and even for private industry, was to develop new technologies. Finally it was time to move beyond Sanger sequencing and basically take operations that look like this, which sequenced the human genome, and develop something new and nifty, some micro, nano, fancy, schmancy kind of technology that would get you down to sequencing a human genome for something like $1,000. The good news is that there really have been some spectacular advances in this arena with all sorts of what are known as next-gen sequencing technologies, and these are gonna affect everything I've described so far and I think are gonna really affect everything that you're gonna see in this path for the coming many years. So Elliott Margulies will be here February 9th and his whole lecture is gonna be about next-gen sequencing technologies, and so I'm not gonna describe them to you other than just describing them as fancy schmancy, but the fact is it's a very important thing to learn about because it is changing the face of genomics in ways that are surprising even people like me. What this is setting up is remarkable advances in the ability to sequence whole human genomes, featured on journal covers such as Nature, and in fact so far there have been a number of human genomes that have been sequenced.
First they were people that might be famous, such as these two individuals, but then other major publications have come out recently describing the sequencing of someone of African descent and Asian descent and Korean descent and so forth. In fact, now I've stopped making the slide, and I just noticed that Mike Metzker published just this month a review in Nature Reviews Genetics where there's a very nice table that lists all the individuals, either known or unknown by name, whose whole genomes have been sequenced and have been described or are soon to be described. But trust me, we're gonna quickly move beyond the point of having individual people who we know whose genomes are sequenced, because sequencing is getting so cheap it's gonna go on all over the place in lots of different settings, research settings and I think clinical research settings, for example, to even more powerfully catalog human genetic variation well beyond what was learned by HapMap. There's an international project called 1000 Genomes, which you really should go read about, that is developing basically the horsepower and the infrastructure for eventually sequencing a thousand, probably even a couple thousand, human genomes from very carefully selected populations in an effort to try to catalog human genetic variation worldwide. That's sequencing sort of people's somatic genomes, but of course with time we wanna deploy this on other genomes of interest, and one of the first sets of genomes that comes to mind that'll be of great interest will be the genomes of cancer specimens, because cancer is a genomic disorder. And so The Cancer Genome Atlas, which is a large NIH project being directed jointly by NCI and NHGRI, is an effort to really now full-force assault cancer through a genome sequencing approach. There've been papers reporting some of the earliest findings and there's a lot more to be found, but needless to say this is something you should certainly keep your eye on.
That same sequencing technology that is sequencing whole human genomes and sequencing cancer specimens is also being applied to sequencing other things that are relevant to us. One of the first things one can imagine being very relevant to us are other creatures that are part of each of our own individual ecosystems. Don't think for a minute that you're living alone. You're not. You are not just you; you are living with a whole host of microbes that are in you and on you, and I know everybody wants to go wash their hands, but the fact of the matter is this has given birth to a very large Roadmap project called the Human Microbiome Project, and this is all about sequencing the microbes that live in us and on us and probably are very important for health and disease. We have a whole lecture gonna be given by Julie Segre, who's really become a world expert in her particular area of microbiome research. She'll be here March 16th to describe this to you. And finally, we're thinking of other ways where one could imagine having, as a standard of care, either whole genome sequences or at least some reasonable amount of genome sequence from individuals. How will that actually be deployed in a clinical setting first? And there's lots that we could talk about there, but one of the first areas, I'm convinced, and in fact you're already seeing this in some cases, is helping guide individuals as to which medications they should get. There's a whole field of pharmacogenomics, which basically relates to the fact that all drugs that are out there that you buy, at least the legal drugs that you buy, they all work.
The problem is they just don't work for everybody, and we all have known this as physicians for a long time, but what we are learning a tremendous amount about now is the genetic basis for that variation in drug response, and the recognition that we probably can prescribe medicine in more rational ways if we first query an individual's genome, knowing what variants they have and therefore which medications are likely to work. And so we have a special visitor coming up from North Carolina on March 23rd, who is a world expert in the arena of pharmacogenomics, to tell you about this aspect of genomic medicine, because in some ways I think it's gonna be one of the first applications of genomic medicine. So these are just sort of all things to look for that you're gonna be hearing more about. The truth of the matter is I don't wanna imply for a minute that these powerful new sequencing technologies are easy. They're relatively easy to implement now, but that doesn't mean that analyzing all the data you get is necessarily easy. The fact of the matter is it is actually overwhelming. It's like this kid trying to just get a little bit of a drink of water from all the data pouring out of one of these sequencing instruments. The rate-limiting factor to deal with in genomics now is not generating data. It's analyzing data. I talk to colleague after colleague after colleague in genomics, and I actually talk to lots of colleagues who don't consider themselves genomicists, and they say these new sequencing technologies overwhelm them in terms of data analysis. And it's sort of ironic, because 15 years ago when we started this course, what I will tell you is that it was all about generating sequence data. You built big sequencing centers to generate that data, and then every sequencing center, every sequencing group, had a little bioinformatics group to analyze the data that was coming out.
That was the picture 15 years ago when we first offered this lecture series, and even then we knew we needed bioinformatics lectures. The picture has completely changed in the last 15 years, so that even a small individual machine, I mean, could do it, but certainly a small sequencing facility completely needs a large bioinformatic analysis pipeline to be able to deal with all the data coming off. This becomes the major bottleneck, and it's the bottleneck that in many ways I think we're going to be facing until we get some quantum leaps and some developments in this arena. And so what I refer to as the computational bottleneck relates to having enough hardware to be able to process and store the kind of data that's coming out of these sequencing instruments; it's developing computer programs that are uniquely suited to looking at the large onslaught of data sets that these machines are producing; and it's even just getting enough people to work on it and having the expertise. They're all limited and they're all a bottleneck. It's something we're certainly looking at, but it's a reality of what we're dealing with, greatly catalyzed by these sequencing technologies that really are forcing the issue. The good news is this is what you're going to hear about for the next three lectures. So both Andy and Tyra will be giving lectures, because we recognize that to really get your hands around genomics these days you really have to have a strong foundation in bioinformatics. And so this is why Andy said earlier on that we're going to mix computational and experimental lectures. This is exactly why: you can't separate these. You can't be successful in one arena without being successful in the other.
So that's the next three weeks for you, and what I've now attempted to describe in setting the landscape is just how all these highlights featured on Nature and Science covers have in so many ways conspired to create this massive tsunami, this wave of data, overwhelming on the one hand, exhilarating on the other. If I was going to take one image to describe the genomic landscape in 2010, this would be the image I would use, because I feel like it's euphoric on the one hand because so much has happened, but it's also daunting on the other hand because we see what the bottlenecks are, and yet we can see how compelling it is to actually be successful, because we can see the advances and how they really will inform perhaps the way we practice medicine down the road. So lastly, what I would share with you is that, if I haven't convinced you already, genomics has not gotten dull since the end of the Genome Project. I think many people thought maybe that was sort of going to be it, and the Genome Project was completed and genomics would just sort of fade into a background discipline. That has hardly been the case. The Genome Project ended in 2003 and I think it's been a spectacular seven years since. Think about how much I covered here, which I presented as historical, and yet it really was only in the last seven years. In thinking about this and preparing for this lecture, I couldn't help but notice this that came out, which I think really proves the genomic revolution continues. At the end of last year, being the end of the decade, ABC News got together with something called MedPage Today, and they got a group together to analyze what were the major medical advances in the first decade of this new century. All these people analyzed all the things that had happened, and I thought it was remarkable. Guess what?
Number one in the list was not the Genome Project but actually the human genome sequence starting to reach the bedside, sort of seeing this journey that I've been describing moving towards genomic medicine, and recognizing that not only was it just a Genome Project but it was really the rest of the last decade that continues to have that momentum associated with genomics. So I think that is the scene we have set for you, and that will be what the other lectures will be describing in greater detail. So I will stop there. I guess we do have a few minutes for questions if anybody has any questions. So thank you very much. And we also know that folks sometimes need to leave near the end of the hour, so people can quietly get up, and I'm happy to take questions, and the other lecturers will probably do the same. Yeah.
And do we have to go to the mics?
Oh, yeah, we do have to go to the mics, especially because we have people remote. So if you go to the mics, there's one on each aisle.
So may I ask you a rather frivolous question to try to poke some holes in the idea of functional genomics? One factor which we are counting on is comparative genomics to get at the functional regions of the genome, the so-called conserved regions. Is there an argument one can make that there could be conserved regions of the genome which in fact are not functional, either by argument or by experiment?
Okay, so I'll give you both examples. So the direct question you ask is, could there be conserved regions of the genome that are not functional? And absolutely, and some of it just simply might be that if we looked at enough species we would see that it really isn't all that conserved, that we simply didn't have enough statistical power. You know, we're just setting thresholds. These are all statistical arguments.
So absolutely, and there could be perhaps other reasons why they're conserved in ways where we just haven't figured out what those evolutionary forces are. The converse also, and I probably didn't say this directly, but I should, is that just because something is not conserved doesn't mean it's not functional. There are almost for certain lots of non-conserved regions of our genome, perhaps some that make us uniquely human or uniquely primate-like, that by the methods I described you would not detect as being conserved and yet they are functional. So neither is absolute. These are not absolutes, and this all relates
I was thinking more in non-conditional terms, that there are sequences which are conserved which were never functional.
Yeah, I mean, it's certainly possible. And I bet that in many cases those are gonna be because we just don't have enough statistical power to show that they're not conserved across a hundred species, for example. Yes.
Congratulations on the comprehensive review of the human genome. So what gives us our intelligence? When will we find out what genes and proteins are responsible for making us think?
So yeah, well, I wish I knew that. So I don't know the answer, obviously. There's absolutely no doubt that that is one of many things that are fueling this desire to better understand primate evolution. Some of these comparative genomic studies among primates are around brain development, even within primates, different brain morphologies and so forth, to give some clues, I don't know about smart genes or whatever, but mostly about cognitive function among different animals. There's a tremendous amount I think we can learn about brain circuitry by better understanding genomics, especially comparative genomics among primates.
Thank you.
Other questions? All right, see you all next week.