 Okay. Good morning, everyone, and thanks for coming. My name's Andy Baxevanis, and on behalf of my co-chairs, Tyra Wolfsberg and Eric Green, I'd like to welcome you to this 10th edition of Current Topics in Genome Analysis. This course is intended to provide a survey of major areas within the fields of genomics and bioinformatics, and the individual lectures are all going to be presented by colleagues who are leaders in their respective fields. For those of you who have not been with us before, we hope that this course will help bring you up to speed in areas that are becoming more and more prominent in biological research, and for those of you who are joining us once again, we hope that the lectures fill in some of the gaps in your background and update you on some important changes in genomic technologies and approaches since the last time this course was offered two years ago. Before turning the platform over to today's speaker, I'd like to spend a few minutes going over some logistical information for the course. There are 13 lectures in the series, an hour and a half each, starting today and ending on April 25th. We'll be meeting here in the Lipsett Amphitheater, promptly at 9:30, please. You'll notice, hopefully, from the course syllabus that you picked up on the way in that the laboratory-based and the computationally-based lectures are intermingled, and we hope that this serves to convey to all of you that you really need to use these kinds of approaches in concert with one another to do cutting-edge biological research in the future. Now, one of the primary ways we're going to be providing you information over the next 13 weeks is through the course's website, which you can find at genome.gov/course2012. 
From the main page, there are a series of links that will take you to the syllabus, so you'll see what all the lectures are, and to the handouts. Our intent is that the course handouts will be put up on the website a couple of days before each lecture, allowing you to download them, print them out, and read ahead. Just by way of reminder, we won't have copies of the handouts available here in the lecture hall, so please be sure to print out a hard copy before you come and join us in the hall. Of course, we hope that you'll be able to join us in person each week and have the opportunity to interact with all of the lecturers, but if you happen to miss a lecture, we've made arrangements to have each one of the lectures videotaped, and once the YouTube version is available on Genome TV, you'll be able to watch it at your desktop. We anticipate that the lectures will be available about one to two weeks after the live lecture. There's also a mailing list for the course that many of you have already subscribed to. If you haven't, I strongly encourage you to subscribe. We'll be sending out reminders of each of the upcoming lectures, as well as any information about changes in the schedule or cancellations. It is wintertime. There undoubtedly will be a snow day somewhere along the line, so we'll give you a heads up before you come to the hall if there are any changes in the schedule. With respect to continuing medication (that's good), continuing medical education credits; it could be the same thing. Here is the accreditation statement. You can earn one and a half credits per session, for a maximum of 19 and a half AMA PRA Category 1 Credits™ for the course, so any physicians who are in the hall, please make sure to sign in on the sign-up sheets at the back of the hall to get your CME credits. You actually have to be in the hall to earn the credits. You can't earn the credits by watching the videos online. 
Just by way of disclosure, none of the three of us as the planners has any financial interests or relationships with a commercial entity that is relevant to the course. One final detail: if you're carrying a mobile phone, BlackBerry, or pager, please take a moment to put it on silent, just as a courtesy to the speakers. So with that by way of introduction, it's my pleasure to introduce to you today our first speaker in the course, Dr. Eric Green, one of my fellow course organizers. He is the director of the National Human Genome Research Institute. Prior to his appointment as NHGRI Director in 2009, he served for many years as NHGRI's scientific director, beginning in 2002, and I had the pleasure of serving as his deputy during his time in that role. He was also the founding director of the NIH Intramural Sequencing Center, a state-of-the-art DNA sequencing facility that's played an important role in the advancement of genomic science, particularly in the area of comparative genomics, something that we're going to be talking about many times throughout the next 13 weeks. During the almost two decades that he spent directing his own independent research program, he and his group made major contributions towards our understanding of the human genome, having had significant involvement in the sequencing of the human genome, going back to the very beginning of the Human Genome Project, and having developed technologies and strategies for the large-scale analysis of vertebrate genomes, which really have provided us great insights into genome structure, function, and evolution. Because of his work in the field of genomics, Eric's received numerous awards and recognitions, including his induction into the American Society for Clinical Investigation and the Association of American Physicians. Today, Dr. 
Green will be presenting his perspective of the current genomic landscape, thereby setting the stage for many of the talks that will follow his over the next 13 weeks. As those of you who have had an opportunity to listen to Eric lecture in the past already know, he's a wonderful speaker, and I'm sure that you're very much going to enjoy today's talk. So with that, please join me in welcoming today's speaker, Dr. Eric Green.
Thank you, Andy. Let me start out by offering my own personal thanks to Andy and Tyra for organizing this series. I am honorifically included as one of the three organizers. I did almost nothing, essentially nothing, but the historic involvement, I think, is why my name is still left as one of the organizers. I reflect back. Andy and I started this series back in the late 90s. Our numbers are right. This is the 10th time we've done this series, and it just started one day over a discussion I had with Andy a long time ago. We said, do you think people would be interested in hearing sort of a survey set of lectures about genomics? And sure enough, it's been wildly successful, now in its 10th iteration. And by any metric, we think this outreach effort we do at NHGRI on behalf of the NIH community, and actually broader now with the reach we get by the web and by our YouTube channel, is well worth the effort we put into it. But the thanks really should go to them for organizing this series. I wanted to also start off with two disclosures. The first disclosure is that as I put this lecture together, I realized that the topic, even by title, was a little more grandiose than what was realistic for this lecture. In fact, the genomic landscape is particularly huge. So I'm gonna pretty much limit my lecture to the human genomic landscape, which you'll see after about 70 or 75 minutes is also incredibly huge, and it's enough just to deal with the human side of it. So that's disclosure number one. 
Disclosure number two is that I'm fairly boring and I have no relevant financial relationships with commercial interest, so I had to do that for the CME part. So with those disclosures in mind, let me tell you what my plans are for this lecture, and they really have evolved in the 10 times I've done this in part because there's just been so much that's happened in genomics. So I'm actually gonna start by providing a historic context for genomics, in particular human genomics, and I'm gonna run that through the Human Genome Project as if that's a historic event in the distant past, amazingly enough, because it feels like yesterday. I wanna spend actually the bulk of the time setting the stage for what has gone on in genomics since the end of the Human Genome Project and in doing so, setting us up for a lot of the other lectures you're gonna be getting in more specific areas. And then I'm gonna end the talk, giving you sort of a landscape view towards the future. So really what this lecture, which was deliberately designed as the first lecture of the series is just a big tour. It's a tour of the past, it's a tour of the present, and it's a projection into the future. It really is, and I really am, nothing more than a warm-up act here for the other 12 lectures that you're gonna hear about. They're gonna really give you a lot of meat and details. I'm really just setting the context for everything you're about to hear. Well, starting at the beginning, the genomics, there's a rich history and a lot of territory I could potentially cover. There's a lot of places that I could highlight in the history of genetics and genomics. And in thinking about what to really emphasize, I really wanted to do this to ensure that what the examples I gave in many ways set the context for what you're gonna hear in the coming weeks. 
In picking specific examples, I guess you could and probably should start with Mendel and his contributions to untangling some basic principles of genetics, which nowadays are incredibly relevant in thinking about the genetic basis of human disease. We gotta get into DNA a little bit. And so Miescher deserves credit for discovering DNA as a chemical, but then people like Avery and his colleagues deserve credit for figuring out that DNA was actually the hereditary material, figuring it out by doing all those strange experiments with bacteria and injecting them in different forms into mice and rats and seeing which was lethal and which was not. But if I had to pick one single historic accomplishment that set the stage for genomics, it would clearly have to be Watson and Crick and the discovery of the double helical structure in 1953. Arguably this publication and this accomplishment, I believe, was the single most important scientific publication, scientific discovery, of the last century, because it just set the stage for so much of what was gonna take place in biomedical research since 1953. And in many ways, even though the word genomics had certainly not been invented then, it really set the stage for genomics. Because with the insight provided by the double helical structure of DNA, it immediately became apparent how it is that DNA was the information molecule necessary for biological life. And it also then set up a series of studies that answered innumerable questions related to how it was that DNA encoded information for making the building blocks of cells, proteins in particular. Coming out of that, of course, was the central dogma of molecular biology, that DNA made RNA and RNA made protein. We now know it's a lot more complicated than that, but at the time it was an incredibly important fundamental principle to appreciate. 
And also coming out, if you fast forward to the 1960s, was of course the elucidation of the genetic code and understanding how it was that DNA encoded the information for proteins. And I can't help but point out that as I wandered here from the front of the Clinical Center and back through here towards the lobby of Lipsett Auditorium, I passed by the remarkably nice museum display about Marshall Nirenberg and his accomplishments in elucidating the genetic code, which puts the NIH firmly in the historic context of this important discovery on the way to our knowledge of how DNA works. His work here at NIH, in the Intramural Program, in elucidating the genetic code will forever be an important part of biomedical history. You can then fast forward, of course, to the late 1970s, early 1980s. With that came the molecular biology revolution. We learned how to manipulate and clone DNA and to use it in all sorts of applications for biomedical research. And the molecular biology revolution also included the development of methods for actually sequencing DNA. Because of course what we had learned along the way was that DNA was incredibly simple. It basically consists of four chemicals. We don't even have to say the chemicals, because we can just abbreviate them by their first letters: G, A, T and C. And with the ability to actually sequence long stretches of DNA, it obviously became possible to really start getting at the underpinnings of how DNA becomes an information molecule. So by about the late 1980s, ideas started bubbling up, recognizing that the whole concept of a genome, the entire genetic complement, the entire DNA complement of a cell, of an organism and so forth, is a finite problem. And that the human genome, for example, just consists of three billion of these Gs, As, Ts and Cs. 
And with methods available for being able to sequence DNA, and increasingly more sophisticated molecular biology methods available for being able to manipulate large stretches of DNA in a laboratory, the audacious idea of actually determining the complete sequence of the human genome, all three billion letters, sort of came to the forefront. And so indeed that set the stage for the next revolution, the genomic revolution, which really took place throughout the 1990s. And of course the centerpiece of the genomic revolution was this endeavor, the Human Genome Project, which began about 21 years ago. This large international effort, highly coordinated across multiple countries but with a major leadership role provided by people here at the NIH, had as a major focus just getting the complete sequence of a reference human genome. I will tell you, because I guess I sort of entered the picture a little bit here: I was a postdoctoral fellow, a recent MD-PhD graduate, and got involved through a postdoctoral project working on some of the earliest technologies that were then used in the Human Genome Project. I was involved in the project itself as part of one of the first funded centers, then at Washington University, got out of the gates on day one in the Human Genome Project, and then participated in it throughout. I'll tell you a couple things about the Genome Project when it began, from a participant's view, especially a young, impressionable postdoctoral fellow participant's view: we had no idea what we were doing. There really was this audacious goal at the end, to sequence the human genome, and there were some cursory methods, but there was fundamentally no plan for how we were actually gonna get there. There was just a compelling, incredible sense of purpose about what this was all about, and it was spiked with a little bit of fear of not being sure you were actually gonna be successful. It's a perfect mixture. 
If you bring a lot of good people together, they figure out a way, and sure enough that way was figured out more rapidly than we could have ever anticipated. A short 10 years later, of course, came the announcement that a draft sequence of the human genome had been generated, capturing lots of attention from leaders around the world, some of our current leaders even. You can see that even the White House at that time was involved, and even the popular press picked this up as a story of major historic significance. That moment was a press moment in many ways. The scientific moment comes with publications, and of course just a few months later came this historic publication in Nature, 11 years ago, reporting the generation and initial analysis of that draft sequence of the human genome. It wasn't a complete story then. It was just a draft sequence, and lots of refinement needed to be done. Back to the laboratories we all went, refined that sequence and then completed it, and in April of 2003 declared completion of a reference sequence of the human genome, and with that came an end to the Human Genome Project. So that's a rapid-paced historic review of what took place, getting us through the Human Genome Project. It's also important to take a pause and think about what the Genome Project was and what it wasn't. What it fundamentally wasn't was the completion of the field of genomics. If anything, it was really just the beginning, and there has been a tremendous amount written about the historic significance of the Genome Project. I could show you lots of different slides. I happen to like this one, which is a fairly recent one from April of this past year, where Adam Rutherford in the Guardian, who writes a lot about genomics, pointed out how the Human Genome Project was just a starting point. He wrote the following. He said the mistake that we often make, and I've heard people make it, is we say that the Human Genome Project was an endpoint. 
In fact, the human genome project was a pregnancy. 10 years later, we now have a clue what we don't yet know. The human genome project may be finished, but understanding our genome is only just beginning. And I would actually say that is a very important thing to keep in mind, and in many ways it sets the stage for this entire lecture series: we have the genome project as a starting line, and we have so much in front of it, and so much has developed. What these lectures are gonna do is teach you a little bit about some of the topical areas and then drill into detail on those, because you will quickly see, and I will show you over the next hour, all the new opportunities this is creating and all the new challenges that still remain. So with that as the starting line, the starting point if you will, since the end of the genome project around here there has been just wave after wave after wave. Every single year there seems to be accomplishment after accomplishment. What you are looking at here is actually the pullout piece from the reprint that I'm gonna talk about later, which is available in the back of the amphitheater. This is just sort of a nice view of it, but you have a version that you can pull out and put up on your refrigerator, in your laboratory, or in your office. It does illustrate the fact that the genome project as a starting point really did represent the beginning of a tremendous number of accomplishments that have taken place since then. And so that's what I wanna now transition into: telling you about some of the major accomplishments in genomics since the end of the genome project. Now, I will focus my attention again on human, and I will focus my attention on health. This is the National Institutes of Health, and needless to say that's an area that we're focusing on in particular at the National Human Genome Research Institute. And in fact, when we looked at that moment in time at where genomics was going, we were very much 
aligned with what the popular impression was as well: that the reason we did the genome project was because we saw the opportunities to understand how our genome works and to figure out how to use that knowledge for improving the way we practice medicine. And so, essentially as soon as the genome project ended, the popular press picked up on this, and even the scientific press picked up on this, marrying the idea of genomics and medicine, leading to phrases such as genomic medicine, featured here on these two covers. There are other phrases that have been applied to this, but by genomic medicine what I mean is healthcare tailored to the individual based on genomic information: not treating patients as generic individuals, but having some insights about their own unique genetic makeup that may allow you to tailor how you take care of them based on that genomic knowledge. It's largely synonymous with things like personalized medicine and individualized medicine. You'll even hear it referred to as precision medicine. You could argue around the edges about what these different phrases mean. The one we tend to gravitate to, the one I tend to use, is genomic medicine, but they all largely mean fundamentally the same thing: personalizing and individualizing, in a more precise fashion, how you take care of patients based on knowledge of their own unique genomic makeup. Well, as you might imagine, as the National Human Genome Research Institute we take this sort of thing very seriously and see as our mission, having accomplished the human genome project, figuring out how to make genomic medicine a reality. And so we think a lot about that journey that we are all now on, defining the path that's gonna get us from that starting point of the human genome project to the finish line, vaguely defined as realizing genomic medicine as I just defined it. Now, we go into this journey, which I admit will be a long, hard journey, a little optimistic. We were quite successful initially at being able to come up with a 
successful end to the human genome project, and while the number of steps might be unknown, and I wouldn't even pretend to define all of them, I remain reasonably optimistic that this is gonna be a successful journey. But we have to prove it in order to really put a check mark there. I believe that carrying out this journey is required if we're gonna truly fulfill the promise of why we sequenced the human genome in the first place. This is a very useful framework for us to think about the steps that are gonna be needed as we inch closer and closer to realizing genomic medicine. And the other thing about those steps, and this is what I'm gonna describe to you, is that as you make this journey, with every step you get a little more data, a little more knowledge, and with that come a few more insights about disease and about medicine and how you might use genomics to improve the way you take care of patients. So what I'm gonna describe to you now are five of these steps. These are not comprehensive in nature. These five steps I've just chosen as major topical areas, and you're gonna see they relate in many ways to things you're gonna hear more about in future lectures, but these five are just meant to illustrate the kinds of steps that are needed to make this journey a successful one. So let's start with something very fundamental: understanding the function of the human genome sequence. Now let me remind you once again what the genome project was about and what it wasn't about. The genome project was really about first mapping the human genome, organizing our understanding of it in a fashion that would allow us to sequence the genome, and then going through a phase where we rolled up our sleeves and actually determined the order of the three billion letters in the human genome. That was the genome project. 
That was not about understanding that sequence, because interpreting the human genome is very much an activity that's gonna go well beyond the human genome project, and it's gonna take a number of years that I couldn't even guess right now; I wouldn't dare put a specific number on it. And the reason why is just a reminder that DNA sequence is fairly complicated stuff. On the one hand it's simple, because there are only four letters. It's complicated because it just goes on and on and on and on. Shown here is real sequence of the human genome. It's only about 0.001% of the human genome, but it immediately reveals the fact that an interpretation of how it actually functions hardly comes out of that text immediately. Well, when it came time to look at the human genome sequence and start trying to interpret it to figure out its function, you had to go with what you knew at the time when the genome project ended. And indeed that's what we did. What did we know the most about at that time? Well, the thing we knew the most about when the genome project ended was genes. We knew about coding sequences. We actually were fairly sophisticated, in part because of Marshall Nirenberg's contributions to understanding how it is that DNA was able to encode information for making proteins. And we were even more sophisticated because of the great knowledge that we had gained about intermediate molecules such as RNA. We even knew that genes consisted of exons, such as these colored boxes, that had the exact nucleotides that encoded specific amino acids, and that they were broken up every once in a while by blocks of DNA that were introns. And we even knew that when RNA got made, you would then splice together all your exons to make them adjacent. And there was even alternate splicing, whereby some exons got put into messages and others did not. 
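As an illustration of the exon, intron, and alternate-splicing picture just described, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy `gene` string, the `exons` coordinates, and the helper names `splice` and `translate` are invented, and the codon table holds only the six standard-code codons this example needs.

```python
# Toy illustration only: the sequence, coordinates, and names are invented.
# A six-codon subset of the standard genetic code ("*" marks a stop codon).
CODON_TABLE = {
    "ATG": "M", "GCT": "A", "GAA": "E", "AAA": "K", "TGG": "W", "TAA": "*",
}

def splice(gene, exons):
    """Join exon slices (0-based, end-exclusive) in order, dropping introns."""
    return "".join(gene[start:end] for start, end in exons)

def translate(cds):
    """Read a coding sequence three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# exon 1 + intron + exon 2 + intron + exon 3
gene = "ATGGCT" + "GTAAGTCCAG" + "GAAAAA" + "GTTTGTTTAG" + "TGGTAA"
exons = [(0, 6), (16, 22), (32, 38)]

full_message = splice(gene, exons)                    # all three exons
skipped_message = splice(gene, [exons[0], exons[2]])  # exon 2 left out

print(translate(full_message))     # MAEKW
print(translate(skipped_message))  # MAW
```

Leaving out the middle exon gives a shorter message and a different protein from the same stretch of DNA, which is the point of alternate splicing; real gene models are far messier than this toy.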
So we had some knowledge about that but even more important we sort of fundamentally understood the language of DNA when it came for encoding information about proteins because we had the famous genetic code lookup table that you see right outside this auditorium. And so with that became quite a bit of information about being able to go through sequence and be able to use various tools both knowledge about RNA sequences that we were able to generate and have large data sets but also predictive tools, computer tools that allow us to go through and systematically review the human genome sequence and now just start highlighting all the coding sequences that we were aware of. And so this was actually a situation of where we got a lot out of the gates fairly soon because of our knowledge of how genes work. Now, I don't mean to imply for a minute that we fully understand the full repertoire of genes nor do I mean to imply for a minute that the complexity associated with gene expression which genes get expressed where and all the alternate forms and what are all the different issues associated with gene expression incredibly large set of complicated topics and one lecture you're gonna hear from is from Paul Meltzer who later in this course described for you some of the genomic approaches that are being used for better understanding gene expression. But beyond these yellow highlighted sequences that represented coding sequences of course came tremendous desire to understand what else is out there in DNA sequence besides DNA directly coding for protein. And here, lots of clues had come up that there was a lot of important choreography being orchestrated by non-coding DNA sequences, non-coding meaning they didn't code for proteins but how are we gonna find those? We didn't have a genetic code, we didn't have knowledge about many of these things and we knew lurking in DNA sequence were lots of surprises about how DNA might function. 
Well, here we actually needed a consultant, because we needed help. We didn't really have computer tools available, we didn't have a lot of knowledge, and so we sort of surveyed the available consultants. Ironically, the consultant that had taught us the most about what we needed at that moment in time even predated Mendel. In fact, the consultant we needed was Darwin, who laid the intellectual foundation for what was gonna be needed to take on the next challenge in genomics. Darwin said a lot of things, and this is a quote that is attributed to him, although recently I've been told it's unclear whether he really said it or not. But it sets the stage for what his contributions were. The quote says it's not the strongest of the species that survives, nor the most intelligent that survives, it's the one that's most adaptable to change. Because what Darwin taught us was that species are able to somehow adapt to changing environments. He didn't know about genomes, he didn't know about DNA, but he knew something was going on. And what was going on was that the DNA was changing, and that there were evolutionary processes in play that scrambled up the DNA, and the individuals within species that adapted well to the environment because of those genetic changes were the ones that thrived and the ones that were able to survive. As a result, they adapted the best, but all this was being kept track of in their genomes. And so a contemporary genomicist in this case wrote that for the last three and a half billion years, evolution has been taking notes, and those notes are all kept in the genome sequences. 
And so what was very clear was that we could learn a lot about our own genome by reviewing the laboratory notebooks of species that have undergone various biological innovations, to look for things that have changed and things that have not changed. In particular, the things that have not changed, that are common across many, many, many species, are likely to be those that are the most biologically important; otherwise evolution would have stepped in and changed them. So the whole notion of comparative genomics became a very important area of research immediately following the human genome project, and it came with the realization that we as a species are just this small, teeny little insignificant twig on a very sophisticated, complicated tree of biological innovation, just across the mammals and the vertebrates. The genome project recognized this, and in parallel with sequencing the human genome, other species had been selected for genome sequencing for exactly that purpose. But these had mostly been biological systems, such as mouse and rat and so forth, that had been used as laboratory models. We saw the importance of those, but we recognized that to fully harness the power of evolution's notebooks, and to truly take advantage of these lessons that Darwin has taught us, we needed the statistical power to be able to go in and look at lots and lots and lots of species' genomes and figure out what has and has not changed over tens of millions of years of evolutionary time. And to do so you wanted to sample many different branches, not just the ones you were interested in. You actually wanted all different branches across the phylogenetic tree of mammals, and in fact that's exactly what has taken place. 
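The core idea above, looking across many species for positions that have not changed, can be sketched in a few lines of Python. This is only a toy under stated assumptions: the four aligned sequences are made up, the function name `conserved_columns` is hypothetical, and real comparative analyses rely on whole-genome alignments and statistical models of evolution rather than simple identity.

```python
def conserved_columns(alignment):
    """Return 0-based column indices where every species has the same base.

    `alignment` maps a species name to an already-aligned sequence.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")
    return [i for i in range(length) if len({s[i] for s in seqs}) == 1]

# Invented toy alignment; not real genomic data.
alignment = {
    "human": "ACGTTGCA",
    "mouse": "ACGTAGCA",
    "dog":   "ACGTTGTA",
    "cow":   "ACCTTGCA",
}

unchanged = conserved_columns(alignment)
print(unchanged)                               # [0, 1, 3, 5, 7]
print(len(unchanged) / len(alignment["human"]))  # 0.625
```

With only four species, many columns match by chance; adding more species from more branches of the tree is what supplies the statistical power the lecture refers to.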
And fast forwarding, lots of studies have been done in order to accomplish the kinds of comparative genomic studies that were of interest. Upwards of 30 mammalian species have now had their complete genomes sequenced. Among a huge literature that you could find, I would just point to this recent paper that came out in Nature describing what I think is the most robust analysis so far across as many species, in this case 29 mammalian species. And with that have come tremendous insights about the most highly conserved parts of the human genome, based upon analyses of many different mammals. What has that taught us in terms of just sheer numbers? If you look at the human genome, what is this now teaching us? Well, what we now have tremendously good data for is that something like 5% of the human genome is constrained, evolutionarily conserved across virtually all mammals. So about 5% of our three billion letters are constrained to such a high degree across so many different species that are widely separated in evolution that they're almost for certain gonna be biologically important. Evolution just would have never tolerated keeping them the same if it wasn't for the fact that they were evolutionarily important. Now, that's still a lot of bases. That's about 150 million bases as a minimum that we're going to need to really understand at a biological level. And it's probably a lower bound, and there are lots of reasons I could give you. It's probably not 5%, it's probably higher than that, but it's on that sort of order of magnitude, so keep that in mind. Well, what have we learned? What's consuming that 5%? Is most of it protein coding and a small part noncoding, or is it just the opposite? Well, we have a pretty good inventory of the protein coding parts of our genome, the yellow stuff. And we now know that only about one and a half percent of our genome encodes protein. So out of that 5% that's incredibly conserved, only about one and a half percent. 
In other words, of the 5%, one and a half percent across the whole genome is protein coding. Now, that corresponds to something on the order of 20,000 genes. You can argue a little about what the exact number is, but that's about what it's looking like. But of course we make many more than 20,000 proteins because of all the different alternative splicing that goes on across different mRNAs. And meanwhile there's lots of different ways we decorate our proteins in terms of post-translational modifications. Where we as a species get our complexity is not in our gene number, it's in what we do with every gene. That's why we're more complicated than a lot of other organisms that have smaller genomes and similar gene counts but are just not as complicated. They don't know PowerPoint, for example, and we know PowerPoint, and so we do a lot more with every gene than they do. But wait a second: if 5% is important but only one and a half percent is protein coding, do the math. That's three and a half percent that remains, that is not protein coding but is important. In fact, it's so important that it's evolutionarily constrained to the same degree as protein-coding regions. So we need to color in an additional three and a half percent of the genome with another color, for sequence that is functionally important but in ways other than directly coding for proteins. Well, what are those non-coding functional sequences? What are they doing? Well, we know about quite a bit of it as broad classes. We know, for example, that there's this incredibly complicated choreography of gene regulation involving lots of different kinds of non-coding elements: promoters, enhancers, silencers, insulators and so forth. A huge amount has been learned, and lots of information is now available about the complicated circuitry involved in regulating genes, and all of that circuitry is non-coding functional sequence.
We also know that there's a lot of important functional sequences that are involved in packaging up our chromosomes, and that a lot of these sequences are also involved in segregating chromosomes and in replicating chromosomes, and so amongst the non-coding sequences are elements that are relevant for all of these. And meanwhile, wow, have we learned a lot about RNAs. Remember, the Central Dogma taught us that DNA made RNA, and virtually all that RNA went on to make protein, with few exceptions like ribosomal RNA. Well, we now know that RNA molecules can do all sorts of things, and they can function in biologically important ways, and that field is just exploding. These are all non-coding RNAs in this particular case; they fall in that category. And of course I have to add a question mark, because there is no way, I believe, that today we've discovered all the ways that DNA can confer function, and I'm sure this slide will expand in the coming decades. So what does that leave us? Simply take 5% minus 1.5%: we believe about 3.5% of our genome is functional non-coding sequence. These are the gene regulatory elements I described, the chromosomal functional elements I described, the non-coding RNAs, and of course down there is the question mark, because I believe there are undiscovered functional elements. You're just not reading about them yet in textbooks, but they're out there, and we're gonna find them, we're gonna characterize them, and then we're gonna catalog them.
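As a rough check on the arithmetic in this part of the talk, here is a minimal Python sketch. The three billion letters and the 5% and 1.5% estimates are the figures quoted above; the constant and function names are illustrative assumptions, not part of any real tool.

```python
# Figures quoted in the talk (estimates, not exact measurements).
GENOME_SIZE_BP = 3_000_000_000   # roughly three billion letters
CONSTRAINED_PCT = 5.0            # conserved across virtually all mammals
CODING_PCT = 1.5                 # protein-coding portion

def bases(pct, genome_size=GENOME_SIZE_BP):
    """Convert a percentage of the genome into a number of bases."""
    return round(genome_size * pct / 100)

# "Do the math": what is constrained but not protein coding.
noncoding_functional_pct = CONSTRAINED_PCT - CODING_PCT

constrained_bp = bases(CONSTRAINED_PCT)
coding_bp = bases(CODING_PCT)
noncoding_functional_bp = bases(noncoding_functional_pct)

print(f"constrained:          {constrained_bp:,} bases")
print(f"protein coding:       {coding_bp:,} bases")
print(f"functional noncoding: {noncoding_functional_bp:,} bases "
      f"({noncoding_functional_pct}%)")
```

Under these estimates, the functional non-coding portion works out to more than twice the protein-coding portion.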
Well, of course it gets more complicated than that, because what's transpired since the end of the genome project is a massive upturn in our knowledge about ways that DNA can confer function beyond its primary sequence. Everything I've been talking about here is the primary sequence of DNA. We are now learning more and more about this other language of DNA, the epigenomic language, where DNA gets decorated: it gets decorated with methyl groups, it gets methylated, and it is associated with histone proteins. This is now coming to the fore because of knowledge that epigenomic changes are very relevant in disease processes and in all sorts of developmental processes, and therefore very, very relevant for biology more broadly. Laura Elnitski will be coming next month and describing the epigenomic landscape of the human genome; she'll also be describing some of these gene regulatory elements and their importance in understanding non-coding parts of the genome. But we recognized as a community of genomicists that this was really important stuff, that to start interpreting the human genome sequence, making that publicly available, and helping the biological community understand the sequence, we needed to know it at the primary sequence level, the epigenomic level, and so forth. That is the reason why, and I'm sure Laura will mention this in her talk, our institute, for example, launched major projects revolving around cataloging functional elements in genomes. The major one that we launched is called the ENCODE Project, the Encyclopedia of DNA Elements, which focused on the human genome. But we also kicked off a complementary project called modENCODE, for model organism ENCODE, which focused on similar studies but looked at the much smaller genomes of laboratory models, specifically Drosophila and the nematode worm, and there are also some projects with mouse as well.
And these projects, and let me emphasize again especially the human aspects of ENCODE, have now published a significant pilot effort, and later this year you'll be reading a major paper that'll come out of ENCODE describing its accomplishments. But what this means for any of you is that nowadays, if you have a particular genomic region of interest and you want to know what has been established about its functional significance, we will overwhelm you. That's what I will tell you. You'll go to a browser, you'll open up that browser, you'll dial in that region, or let's say a couple of regions, and you'll see things like this, which you will find overwhelming. Which is fine, because all of this represents laboratory and computational data about where there are gene models, where there are RNA molecules being made across that stretch of DNA, where there are transcription factors binding, where there are regions of open chromatin, and where there are various epigenomic marks. Every one of these tracks reflects that, and from that you could try to interpret what it all means. More and more, this will become an issue of interpreting the massive data sets that have been generated by ENCODE and other efforts. Earlier this year, or I guess last year at this point, the consortium put out a user's guide, basically a manual for how to interpret the ENCODE data, and I would point you to that if you're interested in actually navigating and using ENCODE data for your own purposes. And of course ENCODE is not the only project involved. There are other projects, even here at NIH, such as a major Roadmap, or Common Fund, project looking specifically at epigenomics, very complementary to what ENCODE is doing, and more and more data is getting generated along these lines.
Oh, and by the way, it's not just about the primary DNA sequence, and it's not just about epigenomic marks on DNA. We are learning increasingly that there's yet a whole additional level of complexity, because we're learning, and we already knew, that DNA is a three-dimensional molecule that exists within the nucleus, and there's probably a lot of stuff going on there in the nucleus that might be very relevant to genome function. In fact, more and more, as described in this review article last year in Nature, the genome has a three-dimensional structure, and with it come some interactions that also become very relevant for genome function. So that is a quick whirlwind view of just that first step, where all I'm emphasizing is interpreting the human genome sequence. This is an effort where, X number of years out from the Human Genome Project, we've gone about this far. It's like a great novel. Decades from now we'll still be interpreting the human genome sequence. This is gonna be an effort that I'm sure will take place for all of our lifetimes, and even then we'll be refining it more and more. At best, right now we're sort of at the CliffsNotes stage of this. We're just understanding the fundamentals, but trust me, there's a lot more to be learned. Before I move on past this first step, I should at least emphasize the fact that there are other interesting things coming out of some of the studies I just quickly reviewed for you, especially in comparative genomics, that may not be directly on a trajectory towards human health but are indirectly relevant through understanding more about fundamental biological principles.
For example, there's certainly been a major upswing in understanding human evolution by using genomic tools. As featured both in the popular press and the scientific press, there certainly is great interest in our evolutionary origins, including species that are no longer here but that the tools of DNA sequencing allow us to explore, and increasingly, I think, there is excitement around understanding fundamental principles of evolution, first with our own species but then more broadly across all animal species. Just as an example, an effort known as Genome 10K is attempting to collect DNA samples from 10,000 vertebrate species and have them available, so that when the cost of sequencing drops sufficiently, we can actually just generate complete genome sequences of every available vertebrate. With that, one could imagine, comes a rich set of information for exploring biological innovation and fundamental principles of evolution. And one could imagine that for the next generation, either our kids or our grandkids, the way to learn evolution is not from textbooks but by sitting at computer screens and surveying the complete genome sequences of thousands and thousands of vertebrates, understanding the innovations that took place. It would just be a far more robust, sophisticated way to understand evolutionary processes.
Okay, so that was the first step. What's the next step along this journey? Well, the next step is not just understanding how a hypothetical human genome sequence operates, because we're not just interested in some hypothetical reference sequence. We're interested in our sequences, we're interested in our patients' sequences, and we wanna know how we differ, probably more than anything, because those are the underlying issues associated with understanding how to better treat patients down the road. So understanding human genomic variation became a very major priority shortly after the genome project ended.
The fundamental idea, of course, is that each of us has two genomes in us. We don't have just three billion letters, we have six billion letters: we got three billion from mom and three billion from dad. And across those six billion letters we vary every once in a while. Compared to the person sitting next to you, across your six billion letters there are probably about three to five million places in your genome where a single nucleotide is different. So that's about three to five million single-nucleotide differences between you and the person sitting next to you. There's probably also tens of thousands of places where there are larger structural variants: things that have come in, or things that are gone, or places that have been duplicated, where you carry multiple copies and the person sitting next to you only has one copy, that sort of thing. Those are the structural variants. The fact of the matter is we know that these variants, indicated here by V, are sprinkled throughout, but we also know that the great, great, great, great, great majority of them have really no phenotypic consequences whatsoever. They're completely innocent, but a subset are very relevant. A small subset of them might be one of these metaphorical time bombs that might influence your getting a particular disease and give you increased risk for disease, or there might be other variants that are good variants, that might be attributable to positive phenotypic features. We would like to know which of these are completely phenotypically neutral and which ones are phenotypically consequential. And then the other thing we also believe is that there's a lot of variants we all share in common. It's not like you've got three to five million and it's a completely different set compared to the person next to you; there's probably a lot that are in common.
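To put those numbers in proportion, here is a small Python sketch of the arithmetic. The six billion letters and the three to five million differences are the estimates quoted above; the names are illustrative assumptions.

```python
# Estimates quoted in the talk; names are illustrative assumptions.
DIPLOID_LETTERS = 6_000_000_000           # three billion from each parent
SNV_DIFFERENCES = (3_000_000, 5_000_000)  # low and high estimates

def letters_per_difference(differences, total=DIPLOID_LETTERS):
    """Average number of letters per single-nucleotide difference."""
    return total // differences

def percent_different(differences, total=DIPLOID_LETTERS):
    """Share of letters that differ, as a percentage."""
    return 100 * differences / total

low, high = SNV_DIFFERENCES
print(f"roughly one difference every {letters_per_difference(high):,} "
      f"to {letters_per_difference(low):,} letters")
print(f"about {percent_different(low):.3f}% to "
      f"{percent_different(high):.3f}% of letters differ")
```

On these estimates, two people differ at well under a tenth of a percent of their letters, which is why cataloging the variants, rather than whole genomes, was a tractable first goal.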
So the idea was, could we catalog a lot of these variants, and find out at a large scale what all the variants are that are at least common above some threshold, and then study them and figure out which of those we can sort of ignore and which ones we might be really interested in because they might have a disease or other phenotypic consequence. This was the rationale for launching another international project that began shortly after the genome project ended, called the International HapMap Project. Its goals were not only to develop very deep catalogs of genomic variation but also to understand a little bit about the relationship of those variants across stretches of human chromosomes. We now know that all these variants across a stretch of chromosome are not completely random in how they move from one generation to the next. Rather, they are clustered together in what are called haplotype blocks, whereby within a given stretch of DNA a series of variants tend to be inherited as a block from one generation to the next, and knowing that structural relationship across variants would be very valuable. And so through a series of studies by this large international effort, published in '05, '07 and then in '10, millions and millions of common variants across different human populations were cataloged and made available publicly, along with additional information about their relationship to one another across these haplotype blocks. When better technologies became available, a more ambitious endeavor was launched called the 1000 Genomes Project, which attempted to use new sequencing technologies, which I'll be telling you about in a minute, to get deeper and deeper catalogs of genomic variation, again across different human populations.
This is going so well that its name is actually sort of outdated: the project involves several thousand genomes that are now having their complete sequence established, or at least the sequence across parts of their genomes, to get to the rarer and rarer variants. A pilot phase of this was reported in 2010, in the very last issue of Nature, that you can read about, and you'll be reading more about this, as many publications will be coming out from 1000 Genomes in the next year or two. Lynn Jorde's gonna be here on March 7th and will really dig much deeper into population genomics and all of this about human genetic variation, and I'm sure he'll be talking about the HapMap Project and the 1000 Genomes Project as well. So we now have lots of information about functional sequences in the human genome, and lots of catalogs of common variants and, increasingly, rare variants across the human genome and across different human populations. With that comes the opportunity for the third step along this journey, the third step being attempting to understand the genomic basis for human disease: which of those variants play a role in human disease? In describing what has been accomplished in genomics since the end of the genome project in the area of human disease, and the genomic applications that have advanced the field, it is actually very useful to once again describe a framework that summarizes what I call the genomic architecture of genetic diseases. This is an oversimplified view of human disease, but it's a useful one for what I'm gonna describe to you.
There really are two classes of diseases to think about. All diseases have a genetic component associated with them, some to a greater degree, some to a lesser degree, but all diseases have a genetic influence. There's one class of diseases that are fundamentally rare, rare across the human population, but these are genetically simple. They're simple because they really involve one gene. Gregor Mendel gets his name associated with them; they're also referred to as Mendelian disorders. So these are diseases where the predominant risk is a change, or mutation, in a single gene. Yes, there might be other genetic variants that influence the severity of disease, and yes, there might be some environmental contributions that influence the disease, but fundamentally it's mutations in a single gene that cause that particular disease. These are diseases like sickle cell disease, cystic fibrosis, Huntington's disease and so forth. But these are rare. These are not what fill hospitals and clinics around the world. They don't represent the major healthcare burden. They're important, but they pale by comparison, in terms of overall health burden worldwide, to these diseases, the common diseases. These are diseases that all of us have, or all of us have family members that have: it's hypertension, it's diabetes, it's heart disease, mental illness, it's different kinds of cancer and so forth. These are the more common diseases. Unfortunately, they're more complicated because they involve multiple genes. They're non-Mendelian because they're not single-gene disorders. Instead it's usually a series of genes that are involved, each with a genetic variant that confers risk, that all conspire together with what is typically a larger influence of the environment to confer overall risk for getting that disease. So these are sort of the two major classes. Now I wanna point out, 'cause people often, when I give talks like this, will say, oh, you're only talking about the genetic
contributions of disease, and there are these important environmental contributions of disease. So I just wanna emphasize, before I get that criticism, that there is absolutely a role for both the genome and the environment in human disease. That's why I represent the pie charts the way I do. The fact of the matter is, on the genetic side there have been remarkable genome analysis technologies that have evolved in the past 10 years, and the past five in particular. There certainly have been significant advances in environmental monitoring technologies, but nobody could argue that the last decade has not brought significantly more advances in the technologies for analyzing genomes than for environmental monitoring. So I'm gonna emphasize the genomic side of this equation, but it's not out of disrespect for the environmental contributions. That's critically important; it's just that I don't particularly have much expertise in that, and I also don't have as much to report based on technology advances in the last decade. So what's happened with rare diseases and common diseases since the end of the genome project? Well, what I can tell you is that there has been an explosion in our ability to identify the genetic basis of single-gene disorders since the end of the genome project, actually even since the beginning of the genome project. So here's a cumulative graph that shows the number of genes that have been identified that, when mutated, cause a single-gene disorder. Note that the genome project began here. There were only a handful of successful examples before the data from the genome project became available, the earliest maps, clones and eventually sequence, and then it's just taken off ever since. I didn't update the slide yet for 2011, but it absolutely continues to trend upwards. It has been remarkable, and it was not predictable, that nowadays it's a relatively simple path to go from having individuals with rare genetic diseases to being able to figure out the genetic basis, not in every case but
in general you can see what the trend has been. What this has resulted in is a fairly impressive accumulation of knowledge, because we now know the molecular basis of something like 3,500 rare Mendelian diseases and traits. You can see that before the genome project it was like five, so that's pretty impressive. Now, that is absolutely the glass half full. There is a glass-half-empty side of this pie chart, and that is that there still remain about a couple thousand where we know the disease but we don't yet know the molecular basis for it, and then there's another couple of thousand where we think it's a single-gene disorder or trait but we don't yet know the genetic basis. So this is the glass half empty. Remember this slide; you will see it later in my talk. So that's success in many ways, and with that comes tremendous knowledge about gene function. We now can attribute specific functions to individual genes because we have individuals with defects in that gene and we can see what it causes when mutated. What about common genetic diseases?
Well, what I will tell you is that there was a lot of skepticism about whether we would ever be able to line up enough analytical and laboratory-based horsepower to unravel the complexities of common diseases, with all their minor contributions from a whole lot of individual variants. But the idea behind the HapMap Project was to simplify that process. I'm just gonna give you a very quick review of what happened, but one of our speakers is gonna give it to us in greater detail. With knowledge about these haplotype blocks across every human chromosome, the fundamental idea was, could we line up individuals with common genetic diseases, such as hypertension, take 1,000 people with hypertension and 1,000 people without hypertension, scan across the genomes of all those individuals, and figure out whether there are particular variants that are inherited more often in those with hypertension than in those without? Would that then give clues of where to look to see where there might be genetic variants that confer greater risk for hypertension? But doing that across millions and millions of variants was simply not approachable at the cost of doing genotyping. All these little black lines up here correspond to individual places of variation across this particular human chromosome, and this inverted red triangle is a given haplotype block on a given chromosome. Since you knew about these haplotype blocks, could you imagine not taking all the markers, but instead of taking all, let's say, hundreds of markers across this particular block, just picking one or two and having those one or two variants be proxies for the entire haplotype block? And could you do that systematically, so that you simplify the process to just having to look at hundreds of thousands of markers and having those serve as proxies for their original haplotype blocks? So what do I mean by that? The simple experiment is you do this: you take an individual and let's say you
take a marker that, for simplicity, we'll say comes in a green flavor and a purple flavor, and you do it from this haplotype block. Those with hypertension are here, those without hypertension are there, and just by eyeball you can see there's no correlation. You would rule out this block as being relevant to the disease. But what happens if you looked at this block, and now the variant you picked happened to come in an orange flavor and a blue flavor? Wow, those with hypertension tended to get the orange marker more than the blue marker. Perhaps, therefore, somewhere within this haplotype block might be a variant that ends up conferring risk for hypertension. It may not be the orange one, it may be one a little way over, but it's just basically correlating the inheritance of this block with getting hypertension. So you'd rule out regions like this, and you'd rule in regions like that. If you did this across the entire genome, this is genome-wide, and what we're basically doing here is an association study, associating this haplotype block with being hypertensive and ruling out that haplotype block. So it's a genome-wide association study; that's called a GWAS. What I will tell you is I just gave you the one-minute version. It is really complicated, what goes on; it's not a simple PowerPoint like this. So we're gonna bring Karen Mohlke up here from the University of North Carolina, and she's gonna explain it to you in a far more sophisticated way than I just did. What I'm gonna do is tell you that this has been an impressive success. PowerPoint slides like this are easy to make, but actually demonstrating scientifically that the strategy works was a question mark. The good news is that it did work. In fact, this was the first example of it, and it sort of became the poster child for GWAS studies: age-related macular degeneration, a genetically complex disorder for which some of the earliest HapMap data was
used, and it demonstrated that in fact a region on chromosome one had a gene with a variant in it that conferred risk for getting this particular disorder. At NHGRI we actually started cataloging this. We were very interested in monitoring this field as it evolved, so what we started doing is, every time a successful genome-wide association study was published in the literature, we would survey it, our Office of Population Genomics would curate it, and we would mark the place in the genome where that association had been demonstrated, sticking a little lollipop on the particular region of the chromosome. That was the success story in 2005. In 2006 there were a couple more. By 2007 it became quite crazy. It seemed like every single time you would open an issue of Nature Genetics or Science or Nature or, increasingly, Human Molecular Genetics or PLoS Genetics, and this continued throughout 2008, you'd find paper after paper after paper reporting successful genome-wide association studies, in each case sticking one or more lollipops in discrete regions of the genome. Now, it's important to emphasize that they didn't necessarily know the exact genetic cause. What they are doing is basically getting it down to an individual neighborhood of a chromosome that would need to be searched in greater and greater detail to actually figure out what the causative variant might be. And the phenotypes that are associated with these lollipops are all these common diseases that are filling hospitals and clinics around the world. This trend absolutely continued throughout 2010 and also 2011. Oh, and I should just pause there so you can see that our genome is littered with lollipops, with all these regions being successfully demonstrated to perhaps have a variant relevant for an important human disease. Where once upon a time there were essentially no successful genome-wide association studies, you can see that already the threshold of a thousand successful
publications was crossed last year, and that has left behind a tremendous amount of work to be done, because you still don't know the genetic basis, but you now have a much more limited search to try to figure out what's going on. Now, this is once again the glass half full: lots of successful genome-wide association studies. There is a glass-half-empty side of the story, actually a couple of glass-half-empty sides, with new challenges. The other thing we've learned, which is sort of interesting, is that as we've made successful forays into understanding the genomic basis of rare diseases and common diseases, there's a pattern that's emerging. That pattern is that when it comes to rare genetic diseases, single-gene disorders, the great majority, again not all of them but the great majority, turn out to be coding mutations. They're changes in the protein-coding portions of genes. But the exact opposite is turning out to be true in these common complex diseases, where again it's not exclusive, but the majority of the variants seem to be out in non-coding portions of the genome. Remember that purple stuff, which I told you we barely understand and have a lot more to learn about, regulatory regions and so forth? That seems to be where the variants associated with this very important class of common diseases are residing. Now, there is another glass-half-empty side of the story, because despite having a thousand successful genome-wide association studies and lots of knowledge of where to look, it's still not accounting for all the heritability associated with these common diseases.
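The single-marker comparison described earlier, the green-versus-purple and orange-versus-blue example, can be sketched as a simple case-control test. This is a minimal illustration only: the allele counts below are made up (2,000 alleles from 1,000 people in each group), the function names are assumptions, and real genome-wide association studies involve far more than this, as the talk itself stresses.

```python
import math

def chi_square_2x2(case_a, case_b, control_a, control_b):
    """Pearson chi-square statistic for a 2x2 table of allele counts."""
    n = case_a + case_b + control_a + control_b
    numerator = n * (case_a * control_b - case_b * control_a) ** 2
    denominator = ((case_a + case_b) * (control_a + control_b)
                   * (case_a + control_a) * (case_b + control_b))
    return numerator / denominator

def p_value_1df(chi2):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical allele counts, invented for illustration.
markers = {
    # name: (cases allele 1, cases allele 2, controls allele 1, controls allele 2)
    "green/purple": (1000, 1000, 1010, 990),   # no visible difference
    "orange/blue": (1200, 800, 1000, 1000),    # orange enriched in cases
}

results = {}
for name, counts in markers.items():
    chi2 = chi_square_2x2(*counts)
    results[name] = (chi2, p_value_1df(chi2))
    print(f"{name}: chi-square = {chi2:.2f}, p = {results[name][1]:.2g}")
```

With these invented counts, the green/purple marker shows no association and its block would be ruled out, while the orange/blue marker shows a strong one and its block would be ruled in. As in the talk, that flags a neighborhood to search, not the causative variant itself.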
So there is still a lot of mystery about where all of that heritability lies, which Karen Mohlke will describe. It could be that it's just not the common variants that we're familiar with working with. Increasingly, there's lots of people who believe that a lot of the variants that are conferring risk for complex diseases are very, very rare variants, that across the population each of us harbors some very rare variants that have not turned up yet in any of these variation studies, and that those are the ones that are conferring risk. And what is all that pointing to? Whether it's doing the next set of analyses for genome-wide association studies, to drill down into these neighborhoods, find all the variants and figure out which ones are causative, or the recognition that indeed you need to go in deeper and get more and more rare variants from all those people with hypertension, either way what it's pointing to is that we need to sequence a lot of people's genomes. We need to go through those thousand people with hypertension and sequence their whole genomes, or at least sequence all their coding regions. And so this leads us to the fourth major step along this journey, which now has really become the dominant force in genomics. And that is, we need technologies to routinely sequence whole genomes. We knew it was necessary. We knew it was necessary back when the genome project ended. But what I will tell you is we never thought we were gonna be as successful as we've turned out to be. What do I mean by that? Well, when the genome project ended in April 2003, NHGRI put out a new publication that described a vision for the future of genomics research, and we said, wow, we have the sequence in hand. What do we need to do with that? And we described all sorts of crazy things we wanted to do. Some were even more crazy than others.
And one of the craziest things we said, and I was one of the authors on this so I can really make fun of it, because at the time I couldn't believe we really put this into press, but we actually put it in print, in Nature of all places, was that we absolutely needed technological leaps that seemed so far off as to be almost fictional, but which, if they could be achieved, would revolutionize biomedical research and clinical practice. And we didn't just stop there. We really had to go to the next level and be even more audacious, because we said, as an example, we need the ability to sequence DNA at costs that are lower by four to five orders of magnitude than the current cost, allowing a human genome to be sequenced for $1,000 or less. This was the first time the idea was put into print of getting the cost of sequencing a genome down to something that was quite affordable. $1,000 was the marker we put in the sand. $1,000 seemed like a very reasonable price for a clinical test, and that's the reason why we picked that number. But why were we a little exuberant and a little overambitious? Oh, because genome sequencing at the time we wrote that was still quite expensive. For example, sequencing that first human genome by the Human Genome Project cost something on the order of a billion dollars. And what we were putting into print was basically the idea that, having now done this one time, one time, one human genome sequence, $1 billion, somehow in the not too distant future we were gonna develop fancy technologies that would lop a lot of zeros off that billion and deliver a genome sequence for $1,000. Well, this became a bit of a rallying cry in the community. In fact, the phrase "the $1,000 genome" sort of became the battle cry for technology development. Our institute put out lots of grants to try to stimulate this field, and they were actually quite successful. Fortunately, the private sector got quite involved in this.
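To make the "lopping off zeros" arithmetic concrete, here is a short Python sketch. The $1 billion and $1,000 figures and the "four to five orders of magnitude" phrase come from the talk; the implied 2003 per-genome cost range is only an arithmetic consequence of that phrase, not a figure stated in the talk, and the names are illustrative assumptions.

```python
import math

FIRST_GENOME_COST = 1_000_000_000   # "on the order of a billion dollars"
TARGET_COST = 1_000                 # the $1,000 genome

def orders_of_magnitude(high_cost, low_cost):
    """How many factors of ten separate two costs."""
    return math.log10(high_cost / low_cost)

# Gap between the first genome and the $1,000 target.
gap = orders_of_magnitude(FIRST_GENOME_COST, TARGET_COST)
print(f"first genome to target: {gap:.0f} orders of magnitude")

# "Lower by four to five orders of magnitude than the current cost"
# implies a then-current cost in this range (derived, not quoted).
implied_low = TARGET_COST * 10 ** 4
implied_high = TARGET_COST * 10 ** 5
print(f"implied cost at the time: ${implied_low:,} to ${implied_high:,}")
```

So going from the first genome to the target means removing six zeros, while the stated goal of four to five orders of magnitude was measured against the cost at the time the paper was written.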
Many companies sprouted up, and an incredibly intense effort to develop newer and newer, better and better technologies came to the forefront. Because the idea was to just get rid of these factories that had generated that first reference sequence as part of the Human Genome Project and develop something really fancy, shown here in icon form, some nano this, some micro that, some mini channel, whatever. Something that would be so efficient and so scalable that it would allow you to sequence an individual patient's genome, an individual clinical subject's genome, for something like $1,000. Well, I can just tell you, and anybody who tells you otherwise is just not telling you the truth: nobody expected things to happen as well as they've happened over the past eight or nine years. Because it's not just one or two or three or four different new technologies, but it's really more like five, six, seven, eight, nine new technologies. Shown here are just some of the platforms that you can now purchase or get in your own laboratory. These are what are referred to as next-gen or next-generation DNA sequencing technologies, and I'm not even necessarily talking about any one of them. In fact, I'm not gonna talk about these at all. That's why we're bringing Elaine Mardis here from the WashU sequencing center to talk about this and describe it in great detail. These technologies are fast evolving. They're incredibly sophisticated and they're remarkably efficient. As an example, a couple of these machines on here, one in particular, can generate the sequence of a human genome in one week. That's something that took 10 years and thousands of people to do as part of the Human Genome Project. It is now routine in many places around the world, including even here at NIH.
By the way, the reason it's particularly exciting is that this slide will not be used the next time I give a lecture in the series; it'll have to be a new slide, because there's new technologies that are coming. It is like sitting in an airport looking out on the horizon. Yes, you have 10 planes on the ground, but you know what, in six months there'll be another plane. About a year later another one, maybe two years, three years. Just yesterday there was a flurry of email; in fact, the Wall Street Journal wrote an article, and I think other newspapers wrote articles, because one of the companies came out with a new technology that they're commercializing, and they say they're gonna cross the thousand dollar threshold this year. And there are just many more technologies. I've heard more and more about nanopores, and this was just featured, as one example, on the cover of Nature; maybe three, four years from now that will be commercialized, and again we'll just continue to step down the cost of sequencing. Well, has it materialized? Do we have a thousand dollar genome? Have costs gone down? Where are we at? Well, we know about this a lot, because at least at NHGRI we fund three very big centers that do a lot of sequencing. They did for the genome project, they still do now, and we give them money and then they give us data, and every three months they tell us how many genomes they sequenced or how much DNA they sequenced and how much money they spent, and we've tracked that for over a decade. So let me show you what real data looks like. So, costs for sequencing a human genome, and before I tell you that, let me tell you about Moore's law. So Moore's law is a law of the computer industry that basically says that compute power doubles every 24 months or so, and nobody keeps up with Moore's law, say technology development people, except for the computer industry. So they're your benchmark; you try to keep up with them if you can. So here's our data.
So, notice the Y axis is logarithmic; shown in white is Moore's law, and in orange is the data provided by our sequencing centers dating back to 2001 or so. So from here to here they were using that old-fashioned method of dideoxy chain termination sequencing developed by Fred Sanger in 1977. This was the method that was used for sequencing the genome in the Human Genome Project, and they used it up until this point. Remarkably, while they were using it they were actually keeping up with Moore's law. So that was pretty impressive in and of itself. But right here they switched to next-generation sequencing platforms, and ever since then and up to the present time they've blown Moore's law out of the water. So we exceed Moore's law, which in many ways was unprecedented and in many ways is actually incredibly impressive. And if you want to continue to follow this trend, we're gonna continue to update this slide on our website, and we continue to believe that it will go down and down. So where are we right now in our quest towards a thousand dollar genome? Well, if you had to tell me right today where we are, we're somewhere around there. So not quite at a thousand dollars, but really close to it. There are actually shortcuts where you could just sequence the exome, just the coding sequences; that's below a thousand dollars now, pretty much. A whole genome depends who you ask, depends on the accuracy, depends on if they're telling you the truth. It's three, four, five thousand, but rapidly heading towards a thousand. It's not a big deal anymore. This is not what I stay up at night worrying about. We will get to a thousand dollar genome and it's just not the big problem. Before I tell you about the big problem, because we have big problems, let me tell you one other thing to think about, because I think it's very relevant, including to an audience like this. How are we gonna be generating genome sequences over the next decade?
So is everybody gonna buy one of these instruments and put it in their lab? Are we gonna have centers set up to do this? So I don't know. I actually can't predict that completely. What I do know is that market forces will step in when this becomes a commodity. And in fact it really is a commodity. At this point genome sequencing can be obtained through companies. I just show a couple here. All you have to do is open the journals and read the advertisements, and there's other companies. By the way, look at this company here and notice their price, because I took this maybe a few months ago. We're gonna come back to that point in a few slides. And here's what I was telling you about. You can get whole exome sequencing done commercially for just under a thousand dollars. So I don't know what it's gonna look like in the future in terms of whether we will be sequencing genomes in our lab or whether we'll be outsourcing it. It's becoming a commodity, and that's actually a good thing, because we have far more important things to worry about than generating data; we have a big challenge of what to do with that information. So that is actually a great segue into our fifth step, and the last one I'm gonna describe before I start describing the future. Because the fifth step is sort of a little bit of a cold-water-in-the-face kind of thing. It's actually a little deceptive for me to tell you that sequencing a genome is getting close to a thousand dollars, because that's just getting you the data. The real bottleneck nowadays in genomics is not getting the data. The real bottleneck is dealing with the onslaught of the data that comes flying out of these machines, such as sort of shown here in a humorous fashion. These sequencing instruments are able to generate data far faster than we could possibly assimilate it, and it has put genomics right in the middle of a situation we really were never in for a long time, and that's big data. When the genome project was going on we didn't have big data yet.
We were just trying to generate data, but now all of a sudden we find ourselves smack in a significant set of issues associated with having a big data circumstance that has created a pretty substantial bottleneck. I refer to this as a computational bottleneck. That bottleneck has several elements associated with it. There's hardware issues, just having enough storage capacity, enough processors to analyze that data. There's lots of issues around software, being able to deal with the onslaught of data and interpret it, and of course there's all sorts of issues around workforce, just having enough people trained to deal with this. The reason why Andy and Tyra are giving three lectures total is to deal with all of these issues around the computational analysis of data, because this has become sort of the biggest issue right now in genomics. So you'll be hearing two lectures from Andy and one from Tyra, and they sort of address this. So that's sort of a computational bottleneck. Hand in hand with that, and which also overlaps with what Andy and Tyra will talk about, is just a sheer informational bottleneck. The fact is that, let's say we get you past the idea of generating the data: you can assimilate the data, you can analyze the data, you can even get, using these fancy technologies, the sequence of an individual genome, of an individual patient, an individual subject. And let's even say you get to the point of being able to analyze it and filter it and get to the point where you just have your list of three to five million variants in that particular person who's sitting across the room from you, for example. What do those variants mean? I mean, you see these changes. Are they detrimental variants? Are they innocent variants? And if you did it on a patient, for example, here in the Clinical Center, and you had that genome sequence, and you had that list of three to five million variants and you rounded on that patient in the morning, is this how you'd feel?
Would you just sort of stare at that list and wonder what it all means? Probably you would, at least right now. So there's also that informational bottleneck, simply knowing what the sequence means when you have individual variants in individual patients. I can't help under this circumstance but quote Harold Varmus, known, I'm sure, to all of you, former director of NIH, current director of NCI, who wrote a commemorative article about the genome sequence at 10 years where he said, physicians are still a long way from submitting their patients' full genomes for sequencing, not because the price is high, but because the data are difficult to interpret. So that's the circumstance we find ourselves in, and this is where, just last week, I saw this advertisement. I couldn't help but throw it in. Same company; notice the price has come down since the last time I took their ad. But they talked about Ben, Ben Franklin, obviously, and he didn't have an informatics bottleneck, making fun of the fact that we do. And of course, they're a company that wants you to give them money and they'll help you solve that bottleneck, and we'll see. But we need to solve this bottleneck, and it is interesting that even their price went down since the last time I scanned one of their ads. So those are the five steps I wanted to tell you about in going from the Genome Project until today. Now, there are other steps that I could be starting to talk about that are certainly relevant for the future: developing new diagnostics, much more relevant to genomic medicine than anything I've talked about. Obviously therapeutics, preventative measures based on genomic information. Of course, there's probably other steps that we're gonna have to journey our way through to eventually realize genomic medicine. What we have at the present time is just a tremendous amount of data. We have great technologies for analyzing genomes, and we have, such as shown here, for the first time in many ways,
incredible opportunities to apply these data, apply these technologies to clinical circumstances. Clinical research immediately, hopefully eventually clinical care. This makes us remarkably well poised for a revolution that brings about genomic medicine, but with this comes just inordinate numbers of challenges that we all have to face. This is why Bruce Korf is gonna come up here and just specifically talk about genomic medicine in the series later in April. But what I wanna do now, and I'm sure it will complement much of what Bruce is gonna talk about, is to spend the last 20 minutes or so gazing into the future. Because what I've described at first was up through the Genome Project, then since the Genome Project; now it's really all about the future. And what we believe the future is gonna bring has come about from a strategic planning process that NHGRI did on behalf of the field of genomics that went on for several years, and then just about 11 months ago it was published in the issue of Nature commemorating the 10th anniversary of having the sequence of the human genome in hand. And this was the reprint that was available to all of you, and if you didn't pick it up on your way in, please pick it up on your way out, and if there's extras back there, take them to other people in your lab or take them home; they're great stocking stuffers for next Christmas if you want. But we don't want any more of them, so you just take them, because we don't wanna take them back to the offices. Oh, also, if you want a PDF version you could go to this website, and in fact you could read all about our strategic planning process that went on. This is very much about the future, and Nature was kind in giving us the headline on the front cover that the future is bright, and in many ways we do think the future is bright.
So let me describe to you what we sort of derived, based on consultation with hundreds of people around the world in the field of genetics and genomics and beyond, in trying to formulate this 2011 vision for the future of genomics. It was based on lots of workshops and consultation and iterative processes of writing documents. And it's all described in great detail in the reprint, which you're crazy not to read from end to end if you're gonna participate in this lecture series, because so much of what you're gonna hear about in the other lectures is described at least superficially in that document. What we heard from the strategic planning process was that it was an exciting time in genomics to be even more specific and more sophisticated in describing the journey from base pairs to bedside or, if you prefer the metaphor, from helix to health. But in doing so we can now start to divide this work into a series of domains that both reflect our history but importantly also reflect our future. For example, you could think, at the more proximal side of this, of a domain of research activity that involves understanding the structure of genomes. Sounds familiar, makes sense; that's what we've done for a while. Also a set of research activities that get you to the biology of genomes, to understand how genomes work, and then increasingly start to apply that knowledge to use genomics to understand the biology of disease. That makes a lot of sense based on what we've done, but what becomes more ambitious, thinking about the future more and more, is using that knowledge to advance medical science, the science of medicine, and also being cognizant that just because you have some great medical advance doesn't mean you change the practice of medicine, because you also have to do research that will actually demonstrate that you improve the effectiveness of healthcare based on those genomic advances.
And so this became sort of five domains of research activity that provide an outline, if you will, for our strategic plan. I will tell you it actually provides an outline for basically everything that our institute is doing as we think about our genomics program. Now, it's not the only thing we're doing, it's not the only thing in genomics, because there are important cross-cutting elements that are also very relevant. You heard about one of them, and you'll be lectured on some of these. Obviously computational biology and bioinformatics are pervasive and important for all these domains of activity, as is education and training; this lecture series, if nothing else, is an example of an outreach education effort to educate people across all these domains. And then there's lots of genomics and society issues; historically we describe them as our ethical, legal and social implications research program, but it includes lots of other things, including behavioral research and other areas that fall under the general umbrella of genomics and society. What was very useful, though, thinking about these cross-cutting elements but first returning to these five domains of activity, is that it's very helpful in planning and projecting to think about these five domains and think about what has been accomplished over the last 20 years and then what's gonna happen over the next 20 years as we predict it. What do I mean by that? Well, we find a useful way to represent this is by hypothetical genomic accomplishments that are graphed as density plots, such as shown here, with each blue dot representing a hypothetical genomic accomplishment, and then when they pile up on each other they change color until they get red. So what do I mean by that? Well, take the time interval of the genome project, which I told you about. Well, basically it was all about this first domain; it was all about understanding the structure of genomes.
Yeah, we learned a little bit about how genomes work and maybe even a couple of things about disease, but really the real density was right here, smack on that first domain. I led you through five steps of what's taken place since the end of the genome project, and that's reflected here, because we continued to learn a lot about the structure of genomes, but since the genome project we mostly were spending our time learning about the biology of genomes and starting to dabble in the biology of disease: rare diseases, increasingly common diseases. Yeah, maybe there are even a few home runs out here in the more clinically oriented domains. The center of gravity, though, was firmly placed on the first two domains. But people wanna know about the future. What's the next decade gonna bring? Within the next decade it's gonna look something like this. We believe the center of gravity will shift more, and if anything the next decade is gonna be about refining our knowledge about how genomes work but increasingly applying that knowledge to understand the genomic basis of disease. With that will come many more opportunities for advancing medical science and even more home runs than previously seen for improving the effectiveness of healthcare. But being realistic, the center of gravity is gonna remain on domains two and three. We're optimistic that beyond 2020 you'll see the change in the practice of medicine, first by advancing medical science and eventually improving the effectiveness of healthcare. But this is gonna take decades, realistically. It's not gonna happen in the next five years or 10 years. But we really believe we're on a trajectory where we will see the center of gravity of these accomplishments shift rightward over time. Now, these are huge research areas. This is not just about NHGRI. This is not just about NIH. Let's be frank, this is not just about the United States.
What we describe in this document, and use as an organizing framework in this figure, which is figure two, is a far more expansive view of genomics. It is absolutely a world view of genomics that I don't claim for a minute to just be about one institute or one agency or one country. But with that said, and while we absolutely believe that there's gonna be significant accomplishments that contribute to those five domains of research activity coming from across the world, at the same time we think a lot about what we wanna do here at NHGRI and what we wanna help have happen here at NIH. And so I thought I would just spend the last few minutes glimpsing a little further into the future, and specifically telling you what I think are some of the most compelling opportunities in genomic medicine and things that we are doing to try to accelerate them. We actually have this outlined in one of the text boxes. It's actually text box number two. We call it imperatives for genomic medicine. The subtext to that was no-brainers. These are the things that are so obvious that we absolutely wanna support them. And they really represent the future. We think the immediate future, because we think these are opportunities that we absolutely could facilitate over the next decade. What I will tell you about this glimpsing of the future is that technology drives it. And maybe that's not a surprise. I think the history of science has shown that technology advances drive science. I think we're gonna see that more than ever over the next decade. Just like the telescope drove astronomy and the microscope drove cell biology, and then the different imaging technologies drove radiology, absolutely, these sequencing instruments are driving genomics and they're driving the field forward, and we're gonna continue to see these technology advances. And with that will come the sequencing, no, not of hundreds of people, no, not of thousands of people.
It's really tens of thousands, hundreds of thousands of people we can imagine over the next decade, a million or more, maybe many more than that, that are gonna be sequenced. And when you start to deal with those sorts of numbers, you start thinking about how that'll be done in clinical research contexts, such as here at the Clinical Center. It'll be done, perhaps, over the next decade in some ways as part of clinical care, and we need to do research to understand that. I was gonna point you to an additional article to read about this general vision for the future, specifically around clinical applications. I was asked, and partnered with Teri Manolio from our institute, to write a perspective about this in Cell last year, talking about how genomics is gonna reach the clinic and how these basic discoveries are gonna drive that forward through technology advances. Let me give you some of the specific areas you're gonna see this in, things that we are directly supporting and that you can absolutely expect to see in the coming handful of years, all again driven by these fancy sequencing technologies. For example, I told you you'd see this pie chart again. We wanna fill in this other half of the glass. I just spent the last day and a half meeting with a new consortium of centers that we've now put together, sequencing groups whose charge is simply gonna be to use genomic sequencing technologies to identify the genomic basis of the remaining Mendelian disorders for which the gene is not known. We think we can industrialize this and start to identify these genes. This consortium was just formed over the last couple of months, and we've now just spent a couple of days strategizing with them to move this forward aggressively.
Similarly, earlier this week I met with our large sequencing centers, who are gonna have a major hand in sequencing tens of thousands of genomes in the coming years, probably hundreds of thousands perhaps, and they are gonna be, among many things, tackling the challenges of moving from information about regions of the genome that might confer risk for complex diseases to actually getting down to the causative variants, and doing this by industrializing the sequencing of individuals with particular phenotypic features and diseases and so forth. So I would expect major strides. Then the other major area, specifically focusing on one disease area which is absolutely a no-brainer and which I'm sure you've heard a lot about, is cancer. And here, cancer is fundamentally a disease of the genome. We've gotten out of the gates already here at NIH through The Cancer Genome Atlas, a joint venture between our institute and the Cancer Institute, which really is a prototype for applying genome sequencing to, in this case, a perfectly appropriate target of different kinds of cancers. And of course this is such a no-brainer that it's not just the NIH that's doing this; TCGA, or The Cancer Genome Atlas, is just one of many projects that now fall under an international effort. And in fact an entire consortium of groups has formed; many countries are now involved in tackling different tumor types and using genomics to sort of develop catalogs of changes that take place in tumors and use that knowledge to better guide diagnostic and therapeutic development. And so what is that future gonna look like? I think a lot about diagnostics in particular, probably because I was trained as a pathologist, but right now, thinking about how we look at and deal with cancer, it's mostly histopathology, looking under a microscope and looking at tumors.
And in the future, sure, we'll be doing that, no question, but with it will come an augmented set of knowledge from individual genome analyses of the specific tumor that you're looking at and its rearrangements. And I am convinced, and we already have data to show it, that that will provide a much more robust diagnostic tool for predicting the nature of the cancer, the prognosis for that cancer, and perhaps treatment options for that particular cancer. We're gonna have a very special lecture that's sort of a special part of this course, because Bert Vogelstein will be lecturing in this time slot on February 29th, actually as part of a separate lecture for our institute, an annual lecture that we give, but Bert's gonna come down from Hopkins to give this. And I guarantee you the vision he will articulate, as a real pioneer in the area of cancer genomics, will align very much with this slide. By the way, these sequencing technologies, I've mostly spoken about how they can be used for sequencing human DNA, but we shouldn't forget the fact that these technologies can also be used to sequence other DNA. And the DNA in particular that I'm thinking about is the DNA of microbes that live in us and on us. And it turns out the whole community of microbes that live in us and on us is known as the microbiome, and here are just two quick statistics to make you a little uneasy. You're outnumbered by microbes, in terms of cells, 10 to one. So your little body ecosystem is only 10% human, 90% microbes. And another thing that should make you feel uneasy is that of those microbes, only about 10% have ever been isolated and studied in a laboratory. 90% of it we've been blind to. Why is that relevant to genomics? Well, we can sequence those microbes now; we can sequence the microbiome using these fancy technologies. And we can catalog that community and learn about that community and figure out if that community has any role in health and disease.
And so where once upon a time we were blind to our microbes, or many of them, now all of a sudden we can monitor them, we can study them. And this has led to a whole area of research of microbiome analysis, and NIH has a Common Fund project called the Human Microbiome Project. And once again, it's just one component of international efforts; our one microbiome project interdigitates with a whole international consortium of investigators and countries involved in doing human microbiome research. And we look out to the future, especially through a pathologist's eyes. Nowadays we pretty much deal with petri dishes and Gram stains and try our best to diagnose the microbes that are associated with disease, even though we know we can only culture a small fraction of them. And in the future we'll be doing the same thing, but wow, if we can get some sequence data on samples and see microbes by sequence data that we can never see in the laboratory otherwise, you've gotta believe it's gonna bring insights about their role in health and disease. And so Julie Segre from our institute will come and describe microbiome analysis and some of the work she's done and the community has done, and I think you're gonna be amazed at the idea that these new technologies are just sort of changing the face of clinical microbiology. But there's other things beyond that that are gonna be brought by these new technologies. One could certainly imagine the idea of genome sequencing of newborns. I mean, all newborns born in the United States get genetic tests done, just a small number of genetic tests. Could you imagine doing a more comprehensive survey by sequencing their genomes or some part of them? Wow, there's lots of questions to think about, and we're thinking about research to sort of help answer those questions, and science to consider. Of course, the whole idea of the interplay of genetics and drugs and the genetic basis of drug response becomes very important. Why is that?
Well, we don't all respond to drugs the same. Just like a lot of things in life, everybody responds a little bit differently to lots of things, and all of us respond differently to medications. All the medications that come out to market, they all work; they just don't work in everyone. And the whole notion of pharmacogenomics, understanding the genomic basis of drug response, has really now taken on a very exciting phase where we're getting at the genetic basis of drug response, and that perhaps might lead to better ways of managing medications for individuals. So Howard McLeod, another individual from North Carolina, we keep bringing them up from UNC, he's gonna come and give a lecture as a real world expert on pharmacogenomics, and I think you will enjoy that lecture tremendously. But there's other challenges that come with this technology. Lots of genome sequences are being generated, but these are gonna eventually be generated on patients. We're gonna have to communicate that information to patients. And it's pretty complicated stuff; we don't understand it ourselves yet, and yet we're gonna have to communicate this and have healthcare professionals try to describe what a patient's unique genome might bring with respect to disease, with respect to drug response, with respect to their children. So there's a lot of issues associated with communicating that information, and there's a lot of science behind trying to think about how to sort of create that future in a very productive way, and Colleen McBride from our institute will come and give a lecture that'll be very relevant to this particular area. And I think that'll be a very important one for you to consider as well.
And then meanwhile we need to communicate, one, to patients, but we also need to disseminate this knowledge as it accumulates out to the healthcare professionals, and I can tell you that as I talk about this a lot, individuals are very concerned about whether we will have robust enough clinical genomic information systems that will allow healthcare professionals, physicians, nurses, genetic counselors, pharmacists, physician assistants and so forth, to be able to interpret this tsunami of new information as it becomes available. And at least in the United States, and we're playing catch-up compared to some countries, this will all interdigitate, likely, with acceleration in the use of electronic health records, which in some ways might be good, because genomic information might flow nicely into health records if we organize it properly, but maybe it won't; we have to deal with that. But of course we also just need tools that are gonna be readily available to healthcare professionals that allow them to look at the three to five million variants of a given patient and figure out which ones of those we should do something about and which ones should be ignored, and that's just gonna be a lot of information. So we're very much involved in trying to think through what should be developed to try to help professionals deal at a practical level with this onslaught of new clinical data. So I'm sort of at the end of this journey. I knew it would take me the full hour and a half, I had a feeling. These five domains I think you'll see in various forms in the coming lectures. I just wanna remind you that some of this is actually new for the field of genomics, and that's why some of the last things I talked about are actually very new areas for us.
Sort of think about these domains. The first one, two, two and a half are really basic science endeavors, and I'm sure many of you regard yourselves as basic scientists. But as we're sort of thinking about moving more and more into medical science, it starts to deal with what's called translational science, really thinking about the application of genomics to medical problems and medical circumstances. But even some of the last things I was talking about, with getting this information out to healthcare professionals, starts to become implementation science, which I think is very new to many of us: actually demonstrating the effectiveness of healthcare. It's really about implementing things and changing the practice of healthcare. So I think among the lectures that you hear, you're gonna hear samplings across this full spectrum of different scientific endeavors: basic, translational, and implementation.
And finally, what I would say is I hope I've given you the impression that there's been incredible accomplishments, incredible optimism, and just remarkable successes in genomics. But at the same time, there's a lot of big challenges, especially some of the ones I alluded to earlier. This is gonna require just a herculean effort by not just people in genomics, but actually more broadly as it gets disseminated, and there's no reason to believe this is gonna be a simple journey. I didn't mean to imply that, even when I was telling you about some of the good successes. I can't help but provide, in closing, sort of a quote I found from Winston Churchill that I thought was very appropriate: "A pessimist sees the difficulty in every opportunity, but an optimist sees the opportunity in every difficulty." One thing I will tell you is that Tira and Andy in particular have enriched this lecture series for optimists, so you're gonna hear about some incredibly exciting opportunities that are coming with all the difficulties that are moving this rightward. But with that said, at a practical level, we should all, and I hope you find it inspiring yourself, recognize that this is gonna require a community of scientists and healthcare professionals to really see this vision through. And we've gotta stay optimistic, but we also have to realize there's significant heavy lifting ahead of us. So, up against the end, I will just stop there. At a practical level, I know people have appointments to go to. I'll stick around and just take questions from the platform. And thank you for your attention.