Okay. I've been asked today to talk about genomics and population studies, and I think we're all interested in understanding the link between the two. We now have the complete genome sequence. We have a large amount of variation in the genome sequence, the so-called single nucleotide substitutions or polymorphisms, SNPs ("snips") as they're called, and we're interested in correlations between genotype and phenotype. But we know that this does not happen in the absence of environmental modifiers, and this is perhaps one of the most important challenges we face: how to have enough individuals to study genotype-phenotype correlations. I will say that many people are performing studies of a thousand cases and a thousand controls. I think we need more on the order of 10,000 cases and 10,000 controls to get there, and we really don't have significant power yet, but we're finding lots of things, which is really very important and is clearly stimulating the development of more of these types of studies. Now, I'd like to say a few words about genomics and the fact that we do large-scale projects, and most often technology drives the development of the project and makes new things feasible. For example, finding variation in the human genome made it possible to genotype millions of variants and to develop new maps that are now being applied and used in these genome-wide association studies. These are collaborative projects, very similar to epidemiological projects, where many independent groups contribute to the effort. Genomicists meet frequently. We all contribute to the work, and we're all authors. We often don't pay attention to who's first and last because all the authors are considered full participants in the study. Data sharing: we always drop our data in the database. That's a must to be a genomicist, and I think it's a must to be a scientist of the future. I think this has benefits to all.
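The point about a thousand versus ten thousand cases and controls can be illustrated with a rough power calculation. This is only a sketch under assumptions that are not in the talk: a control allele frequency of 0.20, an allelic odds ratio of 1.2, a genome-wide significance threshold of 5e-8, and a simple normal-approximation test on allele frequencies. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Allele frequency in cases implied by an allelic odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def allelic_test_power(n_cases, n_controls, control_freq, odds_ratio, alpha):
    """Approximate power of a two-sided two-proportion z-test on allele counts."""
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    # Each individual contributes two alleles.
    alleles_cases, alleles_controls = 2 * n_cases, 2 * n_controls
    se = sqrt(p1 * (1 - p1) / alleles_cases + p0 * (1 - p0) / alleles_controls)
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(p1 - p0) / se
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Assumed illustrative values (not from the talk).
ALPHA = 5e-8
CONTROL_FREQ = 0.20
ODDS_RATIO = 1.2

for n in (1_000, 10_000):
    power = allelic_test_power(n, n, CONTROL_FREQ, ODDS_RATIO, ALPHA)
    print(f"{n} cases / {n} controls: power = {power:.3f}")
```

Under these assumed values the smaller design has almost no power at a genome-wide threshold while the larger one has high power. Different effect sizes or allele frequencies would shift the numbers considerably.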
If we didn't have the human genome database that we have today, if we didn't have all the variants, we wouldn't be able to do these GWAS studies. I think that having databases and being able to mine new information is extremely important. People are doing in silico computer experiments now that they wouldn't even have dreamed were possible five years ago, because of all the data in the database. So I would encourage data sharing. I can say, as somebody who's dropped information in the database for years, that people are extremely respectful of this type of activity. I often get contacted by colleagues who say, "I'm going to work on your data set and this is what I'm going to do. Is this okay?" And I always say yes, even if I'm working on it; it doesn't matter. I mean, everyone has access to the data and can mine new information out of it. It's just really important. This brings new analysis tools and insights. We now have an idea of the genes in the genome. And I will say that genes are just one part of the function of the genome. There are many non-coding sequences. In fact, the human genome is mostly non-coding sequences that control the expression of the genes and their translation. And I think we need to know more about these non-coding sequences. You'll see that many of the studies, if you go and read these GWAS studies, these genome-wide association studies, end up in non-coding regions of the genome. There are no known genes there. It doesn't mean that there aren't genes there. We just haven't been able to find them yet. We don't know what the function is or what it does. Also, the variants that are actually associated with the disease or phenotype are not necessarily the SNPs that are functional. They are associated with other SNPs, perhaps nearby, that actually have function. So understanding the genes, the variation, and the function is important to genome scientists on a broad scale.
And this is extremely useful for population studies because they do give rise to the basic information that can be mined by all. I'm going to talk a little bit about genetic analysis strategies, not much; about what we know about sequence variation in humans, because it is somewhat biased and you should know that; and something about the HapMap and its impact. Obviously, the big wave is true. Terry has told you about all the various associations, and we're going to talk today about replication. I'm just going to bring it up a little bit, but I also want to talk about translational impact, because I'm not sure that's going to be highlighted today: the fact that we can often make diagnoses and predictions about disease susceptibility or phenotypes, but that's actually different from developing treatments, and the data that we need for identifying function is not necessarily there and will take time. And I'll close with a little bit on whole genome sequencing and what might be available in the near future. Obviously, genetic strategies have been applied to human diseases for many years, and linkage was the way that people started, looking at simple inheritance where traits segregate in families. There were single genes with major effects and rare variants in the population, so you had to go out and identify a person with disease. Today we're very interested in common diseases that have complex inheritance. They don't show clear Mendelian patterns of inheritance. They do aggregate in families, though. There are multiple genes with small contributions, and there are environmental contexts that modify the expression. The variants associated with disease are common, we believe, and we'll talk a little bit about that. But to look at populations, rather than looking at families, we need large numbers of markers, and we're talking on the order of a million SNPs, these single nucleotide polymorphisms, that are carefully chosen from the human genome.
This is not just a million that we go out and randomly find, and that's what we've learned. I want to talk a little bit about that. But how much variation is there in humans? I think that people talk about 10 million SNPs as if there are only 10 million variants in the human genome, and that's not really true, actually. If you think about how much variation there really is in the human population, if you take into account the population size, mutation rate, and the number of hits, actually every base has been hit or mutated 240 times in the last few generations. So that means that every variant that is compatible with life exists, if we were to sequence everybody across the planet. But what we need to remember is that most of them are vanishingly rare, only found in a couple of individuals; they're not common among human populations. So I think when we say something about how much variation there is in the human genome, we're often quoting what's common in the population. I want to put this in context so you can remember this. If your average size gene is 20 kb, and that's about what it is for the human genome, what we'd expect to see, based on what we know, is about 100 SNPs. And this translates into about 15 million SNPs across the whole genome. And these are estimates. There are about 40 common SNPs inside of a gene, and that translates into about 6 million common SNPs. These are SNPs that have minor allele frequencies greater than 5% in the population. And we expect to see about 5 coding SNPs, half of which change the amino acid sequence. So this is your typical gene. This is what your expectation is. And you can multiply this by however much sequence you like, because genes are not that different from intergenic regions in the genome. So this gives us an idea of what we have to work with. Well, how did we find the actual SNPs that we're using right now? This is the way we actually mined genomic data.
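The per-gene figures above scale to the genome-wide figures by simple multiplication. The sketch below checks that arithmetic; the genome size of about 3 billion bases is an assumption added here (the talk does not state it), and the constant and function names are hypothetical.

```python
# Back-of-envelope scaling of per-gene SNP counts to the whole genome.
GENOME_BP = 3_000_000_000   # assumption: ~3 billion bases, not stated in the talk
GENE_BP = 20_000            # "average size gene is 20 kb"
ALL_SNPS_PER_GENE = 100     # expected SNPs in a typical gene
COMMON_SNPS_PER_GENE = 40   # minor allele frequency > 5%
CODING_SNPS_PER_GENE = 5    # about half change the amino acid


def scale_to_genome(count_per_gene, gene_bp=GENE_BP, genome_bp=GENOME_BP):
    """Multiply a per-gene count by the number of gene-sized windows."""
    windows = genome_bp // gene_bp
    return count_per_gene * windows


all_snps = scale_to_genome(ALL_SNPS_PER_GENE)
common_snps = scale_to_genome(COMMON_SNPS_PER_GENE)
amino_acid_changing_per_gene = CODING_SNPS_PER_GENE / 2

print(f"gene-sized windows:        {GENOME_BP // GENE_BP:,}")
print(f"all SNPs genome-wide:      {all_snps:,}")
print(f"common SNPs genome-wide:   {common_snps:,}")
print(f"amino-acid changes / gene: {amino_acid_changing_per_gene}")
```

The scaling treats the genome as uniform, which follows the remark that genes are not that different from intergenic regions.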
We got overlaps where we found differences: where we sequenced the clone resources, where we overlapped shotgun sequences, or where we obtained sequences from cDNA libraries or mRNAs over time. And with that, we've identified more than 11 million SNPs. Some of those are very common and some of those are really rare. And the way that we determined what was common and rare is that if we see more than one example of each allele, these so-called validated SNPs, these tended to be more frequent in the population. Not always; just by chance, you would get some infrequent ones showing up twice, and we have seen that. What you should know is that if you just compare two chromosomes (this axis is the minor allele frequency, this is the fraction of SNPs discovered), you will find a lot of common SNPs. You will also find some rare things. But if you go to the level of what we think we have now for the human genome, eight chromosomes or so covered on average across the genome, you're finding a large fraction of the common variation in the human genome, and this was used for the HapMap. You're not finding the very rare variants in the genome, or learning a lot about them. And I just want to make this clear because it is very important. Now, with these six million SNPs that we had that were validated, there was an international project. This is a typical genome-type study, where people participate not only from the United States but from other countries to bring new insights into a project, and it was to genotype the six million SNPs on a common set of samples. In genomics, we all contribute to a common set of samples. That's quite different in epi, but we can think that over time, if everyone in epidemiology would work together with their samples, that would become the common set of samples. And that's actually what people would like to work towards, and I think will work towards over time.
This population represents individuals of European, African, and Asian ancestry. And at the time that this project was carried out, we had no idea how to type a thousand SNPs, let alone how to do the million SNPs that we can do today. So by engendering a project like this, new technologies were developed that allow us to do the genome-wide studies that we're doing today. Now, if we've typed six million SNPs, we obviously are not typing all six million in these studies; we're typing a subset of them. How do we determine that subset? I just want to tell you something about that. You can just pick them randomly. If you pick enough randomly, you will do it. And that's one of the approaches that Affymetrix uses, but you can also look for correlations amongst the genotypes that can help you pick. And I want to give you an example. This is just a visual genotype, so it's the genotypes at the sites in a gene, interleukin one, for the individuals that were sequenced. Homozygous for the common allele is blue, heterozygous is red, and homozygous for the rare allele, the alternative allele, is yellow. Now, this doesn't look like it has very much pattern to it, actually. And the reason I'm showing you this is that it's a very interesting gene. Most of the SNPs in here are common. It's a gene that shows natural selection over time for common variation in the human genome. But rather than arranging the SNPs in the order that they're found on the chromosome, starting with the promoter SNPs down to the SNPs in the untranslated region at the end of the gene, if I order them based on associations, I get a brand new pattern, and this is it. You can see right away, although you didn't see a pattern just in the visual genotype, that many of the SNPs in this gene are highly correlated, right? And what you can notice is that there were actually only three patterns in this gene, and I would really only have to type three SNPs to capture the information in this gene.
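The idea of grouping correlated SNPs and typing one per group can be sketched as a greedy binning on pairwise r². This is an illustration only, not the method used for the gene on the slide or by any genotyping vendor: the genotype data are made up, the 0.8 threshold is an assumed value, r² is computed on unphased 0/1/2 genotype codes as an approximation, and the function names are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (coded 0/1/2)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) * (a - mx) for a in x)
    syy = sum((b - my) * (b - my) for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic site correlates with nothing
    return sxy * sxy / (sxx * syy)


def greedy_tag_bins(genotypes, threshold=0.8):
    """Repeatedly pick the SNP that captures the most remaining SNPs at r^2 >= threshold."""
    remaining = set(genotypes)
    bins = []
    while remaining:
        captured = {
            s: {t for t in remaining
                if t == s or r_squared(genotypes[s], genotypes[t]) >= threshold}
            for s in remaining
        }
        # sorted() makes tie-breaking deterministic (alphabetical).
        tag = max(sorted(remaining), key=lambda s: len(captured[s]))
        bins.append((tag, sorted(captured[tag])))
        remaining -= captured[tag]
    return bins


# Made-up genotypes for 8 individuals at 6 sites, built from three patterns.
pattern_a = [0, 0, 1, 1, 2, 2, 0, 1]
pattern_b = [2, 1, 0, 0, 0, 1, 2, 0]
pattern_c = [0, 1, 0, 2, 0, 1, 1, 0]
example = {
    "snp1": pattern_a, "snp2": pattern_a, "snp3": pattern_a,
    "snp4": pattern_b, "snp5": pattern_b,
    "snp6": pattern_c,
}

for tag, members in greedy_tag_bins(example):
    print(f"type {tag} to capture {members}")
```

With three underlying patterns the sketch returns three bins, so three typed SNPs stand in for all six sites. Real data would have imperfect correlations, which is where the threshold choice matters.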
Well, think about this across the whole human genome, and think about simplifying the patterns so you just type a handful of SNPs and get most of the information. And that's also what's been done to capture SNPs in the human genome: to tag. Now, do I need all of the SNPs in the human genome to be able to tag? No, I don't, because obviously here there's a huge number of sites that are correlated. If I just capture one of them, I've captured everybody else in that particular correlation bin, and the same here and here. And so there are things being typed that are parts of big bins, where there are lots of SNPs associated, and there are things that are parts of smaller bins, and as we go up in the number of SNPs that we type, we also increase the amount of genetic diversity we capture from the human genome. This is just to make you realize that there are differences in human populations in terms of what's common and what's not. There are a number of things that are similar and in common, and a number of things that are different. In order to type multiple populations, we have to capture all those fractions. And that's another important thing that's changed as we've increased the amount of information in the genome-wide association studies we carry out today: we're carrying them out for multiple populations. There are many formats, and you'll hear more about them today. As a genomicist, I knew that 100K and 300,000 SNPs captured maybe only 30% of the variation in the human genome. What amazed me is that we got hits off of that 30%. And we continue to get hits, and what I want you to realize is that when you apply these, they're not capturing 100% of the variation in the human genome; they're capturing a fraction of it. And what is phenomenal is that we are finding so many new insights into associations with common human diseases when we know our tools are not perfect and will only get better over time.
This is very, very exciting, and it just shows that we're at the tip of the iceberg of what could come with getting better and better tools to type for common variation. Obviously it works when you apply it; Teri showed you all of the new studies that are coming out. What's interesting is that we are also getting misses; a big, big one in the Wellcome Trust study was hypertension. I was just talking about that. Is this because we don't have complete coverage of variation in the genome? Is this because it's environmental? Is it because of the phenotype? We obviously need to continue to look at these things. Is it all rare variants and not common variants? Because these GWAS studies are only covering common variation. I think there are a lot more hits in these data sets. You're seeing a big hit on chromosome 9 in cardiovascular disease. Is that the only hit? No. It's the one that everybody's focusing on, but we don't know how to get at the other hits yet. And it's only by combining data and looking for new hits that we'll get at that analysis. We have no idea how to analyze these data sets. These are billions of data points. We'll tell you that we think we know how, but these are just the first steps. And there are single markers, one marker against many different phenotypes. Marker-marker interactions, we have no idea how to do that. We're thinking about pathways, but we don't really understand all the pathways in the human genome, so we're kind of getting that information as we apply it. What is important is that we need replication. When we get a hit, it's usually with SNPs and with haplotypes in that region. And replication tells us what's important about this, whether there may be environmental effects, whether there are differences in the populations that are being typed. By replicating a number of times, we can look at what replicates and what doesn't and get an idea of what matters. And this is extremely important in human genetic studies.
And there was recently a publication in Nature looking at what the levels of replication should be in whole-genome association studies, and Terry contributed to that. But replication is a must. When we do get replication, we go back to some region of the genome that we're really interested in, and we end up coming back to genes, or maybe not genes; maybe we'll be in some desert, some place that obviously has function, but we have no idea what that is. And then we're going to be back at the sequencing level again, because we know that the SNPs that we have on these chips are not necessarily the functional SNPs. We're very interested in looking at those regions of the genome. I just want to show you an example, because I want to talk about translating from a hit, and I'll show you a hit. And that's vitamin K epoxide reductase, which is a target gene for warfarin. And what was found in this particular study, and this is in 2005, is that you could predict warfarin dosing with either SNPs or haplotypes; it didn't matter which way you did it. SNPs were associated, and haplotypes were associated. And it explained at least 25% of the variation in the dose required to treat an individual. This is one of the largest pharmacogenetics hits that's known. Can we use this to predict warfarin dosing in individuals? Yes, we can. Is it being used yet? No, it's in prospective testing. A lot of the things that we're going to be finding are going to be like this. This is from a few years back, and yet it's still not in standard routine clinical treatment. People are typing it to see if it can be used prospectively in a study to actually help with warfarin dosing. The other thing is that we have no idea of the mechanism. None of the SNPs are in coding regions. They're all in non-coding regions. They could be in the gene or outside the gene, but obviously there are some regions that are evolutionarily conserved.
Although we can clearly genotype this region and predict the dose of warfarin somebody should go on, we have no idea how this translates functionally, and that's the difference between knowing something that's predictive or diagnostic versus knowing the information that actually identifies the functional allele that maybe could lead to new treatments. Now, when we get these regions of the genome that are in these non-coding regions, and this is what I want to end with, how do we learn more about what's going on? This is a new study that came out actually just a couple of weeks ago in Nature, called the ENCODE project, and it took 1% of the human genome and threw every single thing we could think of at it to understand both the coding and non-coding regions of the genome. These are the large-scale types of tools that we need going forward, to look at transcription. We know that many areas of the non-coding sequence are transcribed, and we have no idea what they do. We know about chromatin structure and replication, we know where there is conservation, and so forth, and all of these are going to be placed across the entire genome. This is really important, because if you're going to be doing these studies, you want to know, when you get to a non-coding region, is anything known about this region? Is there anything of potential interest here, or am I going into new territory where no one's found a way to predict function before? I mean, it's a big genome, and for 95% of it we have no idea what it does, so it's possible that we will get hits in these regions. The other thing, and this is another project that's coming down the pike, is the structural variation project. What we found when we typed lots of SNPs in the genome, and what we found by looking at the structure of the genome over the years, is indels, insertions and deletions. We don't actually know all of the sequence; we still have holes in the human genome sequence. There are inversions, duplications, translocations.
Some of these are quite large, on the order of up to a million base pairs that are inserted or deleted. There is an intermediate scale, on the order of 500 bases to 100 kb in size, and a fine scale, from one base in or out up to 500 bases. The structural variation project is focused on getting at what the structure is. I will say that this is also known as the copy number variation project. It's being folded into structural variation to cover all the different types of variation that exist. But we are getting maps across the genome. We know that more than 10% of the human sequence is involved in these structural differences. We don't really have good ways to type this, although we can infer it when we genotype the human genome. Some of those SNPs that don't come out in Hardy-Weinberg equilibrium are actually a part of some of these regions in the genome. We're beginning to know something about frequency, but for the most part we can't even determine their frequency yet, because they're not easy to type. So we're trying to get that information and put it on the genome. But what we do know is that they do involve genes. There are genes in and out in individuals. Is this really important? We know from knockout mice that we don't have to have all the genes, but we do know that some, when knocked out, are more important than others. And so having an idea of the structural variation in the human genome and documenting it in more detail is really important. So you should be looking not only for SNPs, but thinking about structural variants in the human genome. We have many examples of this impacting phenotype. Well, obviously we've done a lot of linkage analysis. We're doing a lot of association analysis, but there are a lot of people who believe that common disease is associated with many rare variants. And this would be in this area where there are weak effects and low allele frequencies.
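The remark that SNPs failing Hardy-Weinberg can point to structural variation can be illustrated with a standard one-degree-of-freedom chi-square test on genotype counts. This is a sketch only: the genotype counts are made up, the function name is hypothetical, and a failed test can have other causes (for example genotyping error), so it is a flag rather than a diagnosis. One plausible mechanism, assumed here, is that a deletion makes carriers look homozygous and so produces a deficit of heterozygotes.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions.

    Takes counts of the three genotypes (at least one individual in total)
    and returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: one site close to Hardy-Weinberg proportions,
# one with a strong deficit of heterozygotes.
for label, counts in (("near HWE", (298, 489, 213)),
                      ("het deficit", (400, 200, 400))):
    chi2, p_value = hwe_chi_square(*counts)
    print(f"{label}: chi2 = {chi2:.2f}, p = {p_value:.3g}")
```

In practice such sites are often filtered out as quality-control failures, which is one reason structural variants can go unnoticed in SNP data.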
And we're not testing that now with these genome-wide association studies, but we could test this, and people are testing it by looking at people at the extremes, the tails, of the population distribution. And there are many examples emerging, particularly from the group in Dallas, Helen Hobbs and Jonathan Cohen, showing that by sequencing the tails of a distribution they're able to find new variants in the human genome that explain heritability of, in this example, high-density lipoprotein and low-density lipoprotein levels. What they found by sequencing at the tails for genes known to be involved in HDL, and there are a number of papers in the literature on this now, is a higher number of non-synonymous, or protein-coding, changes in low HDL versus high HDL, and they were able to actually show some functional analysis in cell culture. I think that our ability to do this right now is by looking at the tails. But in the near future, and I think you've all heard this, there are new sequencing machines on the way, and the $1,000 genome is down the pike. I'm not really sure how far down the pike it is, but a few years ago I would never have said that today I would be typing a million SNPs either. So I'm willing to bet that within a few years that $1,000 genome is a reality. What I can tell you, and this is just an example of a technology, Solexa, that's available and that people are using, is that genome scientists usually never complain about data, and all I've heard people do is complain, because for $3,000 you can get a gigabase of sequence data in a couple of days, but it generates a terabyte of data, more than a terabyte. They have no idea even how to get it off the machine fast enough so they can start their next experiment.
But I think that this is the kind of thing that will allow us to look at the genomics of both common and rare variation at the same time, and it's possible that these chips that we're applying today are not going to be what we're applying five years from now, or even three years from now. We may be sequencing everybody and getting at both the common and the rare. What an analysis nightmare that will be. And I will just say, to summarize, that obviously genomics can give new insights into the variation. It's not the complete picture, but it seems to be pretty good. What's really important, and what we need a bigger handle on, is function and how to know more about function, so we can do that. Obviously new technologies are emerging. But I think what's really important about genomics is the sharing, and that exists in epidemiology too; there is sharing, for sure. But I think common interactive projects are what matter: all of the projects in an area, and we're beginning to see this in cancer projects, for example, all the people working in breast cancer, all the people working in colon cancer, getting together, meeting, and pooling data. I think this is the kind of thing that is really important to do. I understand the barriers that exist, but I will say, from being a long-term genomicist, that people are very respectful about sharing. So I would encourage this, and I thank you.