Good morning, everyone. Welcome to Week 7 of Current Topics in Genome Analysis. This week, we're honored to have with us Dr. Lynn Jorde from the University of Utah School of Medicine, where he serves as the H.A. and Edna Benning Presidential Endowed Professor and also the Chair of the Department of Human Genetics. Dr. Jorde received both his BA and PhD from the University of New Mexico. His lab has been involved in studies of human genetic variation, mobile element evolution, and the genetic basis of human malformations, as well as the genetics of common diseases such as hypertension and inflammatory bowel disease. He has served on several advisory panels for the NSF as well as the NIH, and has also been an expert witness in a number of court cases involving DNA evidence. Finally, Dr. Jorde has received 12 teaching awards from the University of Utah, as well as one from the American Society of Human Genetics. I'm pleased to be bringing his excellent teaching style here to NIH this morning, and I'm sure you'll enjoy and learn a lot from this morning's talk, which is intended to provide you with an overview of the field of population genetics. Please join me in welcoming Dr. Jorde to NIH this morning.

Well, thanks very much, Tyra. It's a pleasure to be here again with you this morning. One thing I like to do in my presentations is to allow questions during the presentation. I think it helps to break things up a little bit, so if you have a question as we're going along, don't hesitate to ask. I have to disclose that I have no financial relationships relevant to the content of today's lecture. As for what I will talk with you about this morning, we'll start with a discussion of patterns of human genetic variation. We'll talk about variation among populations, as well as variation among individuals. We'll talk about some applications of those studies.
One is, I think, to enhance our understanding, to illuminate our understanding of a very controversial topic, that of race, and what the biomedical implications of our enhanced understanding are. Finally, we'll talk toward the end about linkage disequilibrium, the HAPMAP project, and now the 1000 Genomes project. How does the new sequence data in particular illuminate our understanding of human genetic variation and biomedical applications? So that's what we'll be going over. I would like to stress that there are several areas in which human genetic variation can be applied. First of all, we use it to decipher human history. We'll talk a little bit about that. We can infer individual ancestry. I'll show you some examples of that. We use it in forensic applications. We won't really talk about that today, but more than 25,000 criminal cases each year in the United States now involve forensic evidence. So it's routinely applied in that context. And finally, and I think here at NIH, this is perhaps the most important application, finding and understanding disease-causing genes. And we'll talk about how our studies of population genetics help us to do that. So of course, the fundamental source of human genetic variation is the process of mutation. We estimate the human mutation rate to be somewhere between one and two and a half times 10 to the minus eight, or one in 100 million, per base pair, per generation. So what that translates to is that we, each of us, every time we reproduce, transmit somewhere between 30 and perhaps as many as 75 new variants with each gamete. Now, the upper estimate, two and a half, comes from phylogenetic comparisons of human variation with variation in the chimp. More recently, we've been able to estimate directly the human mutation rate by simply comparing DNA sequence in parents and offspring. 
And in a paper that we were involved in a couple of years ago, that's the Roach paper, we estimated the human mutation rate directly to be about 1.1 times 10 to the minus eight. So on the lower end, and when the 1000 Genomes estimate came out a little bit later in Nature Genetics, it was exactly the same as ours. So we have multiple estimates now based on direct evaluation of sequence in families that suggest the mutation rate is somewhat lower than the previously estimated phylogenetic rate. Now, here's a quote from Lewis Thomas that I especially appreciate: "The capacity to blunder slightly is the real marvel of DNA. Without this special attribute, we would still be anaerobic bacteria and there would be no music." So I think we should appreciate our mutations, because it's from those mutations that the genetic diversity that we see among us is derived. But of course occasionally those mutations also cause disease. And understanding the processes that give rise to variation naturally is going to help us to understand the basis of genetic disease. So given that mutations are being transmitted in every generation, the natural question to ask is, well, how much do we actually differ? So if we look at sequence difference in terms of aligned DNA base differences, of course identical twins for all intents and purposes differ at none of their base pairs. That's not completely true, but close enough. For unrelated humans, we differ at about one in a thousand base pairs. And I think this is a very important statistic. The fact that we are 99.9% genetically identical at the level of DNA, the most fundamental unit of our biology, I think says some very important things about us. Now if we compare ourselves to our nearest relative, the chimp, we differ at about one in a hundred base pairs. In other words, we are at the DNA level about 99% chimp for aligned base differences. The difference is several times more than that if we include copy number variation.
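As a rough check on the mutation-rate arithmetic quoted above, the sketch below multiplies each per-base-pair rate by a haploid genome size. The round figure of three billion base pairs and the function name are assumptions made for illustration.

```python
# Back-of-the-envelope check of the mutation-rate figures quoted in the lecture.
# GENOME_SIZE_BP is an assumed round number for a haploid human genome.
GENOME_SIZE_BP = 3_000_000_000


def new_variants_per_gamete(rate_per_bp: float, genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Expected number of new mutations carried by one gamete."""
    return rate_per_bp * genome_size_bp


low = new_variants_per_gamete(1.0e-8)     # lower bound quoted in the lecture
direct = new_variants_per_gamete(1.1e-8)  # direct parent-offspring estimate
high = new_variants_per_gamete(2.5e-8)    # upper, phylogenetic estimate

print(f"low: {low:.0f}, direct: {direct:.0f}, high: {high:.0f}")
```

Under that assumed genome size, the two ends of the range reproduce the "between 30 and perhaps as many as 75" new variants per gamete, and the direct estimate lands near the low end.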
If we compare ourselves to the mouse where there's about a 70 million year divergence, we're substantially more different. And thankfully if we compare ourselves to broccoli, we are mostly different from broccoli at the DNA level. So even though we are very similar at the DNA level, 99.9%, that still means that given our three billion DNA base pairs, there will be about three million differences, single nucleotide polymorphisms, between any pair of haploid human DNA sequences. So there's still a substantial reservoir of diversity in human populations and it's that diversity that we're especially interested in. Now if we look at copy number variants, we see that they, and we're defining them here as variants greater than a thousand base pairs or so, they do account for about as much variation as do SNPs. And so by copy number variants, we simply mean that we have extra copies of chunks of DNA or missing copies of chunks of DNA, and generally 500 or 1,000 bases or larger in size. And what we found in the last couple of years is that each human is heterozygous for at least 100 of these copy number variants. And that accounts for about another three million bases of variation. So about the same amount of variation as is accounted for by SNPs, just much larger chunks of DNA, smaller numbers, but affecting about the same amount of actual overall DNA. So another important source of genetic variation. So one of the questions that we can ask is, well, how much do populations differ? We've seen how much individuals differ, but if we look at populations, how do they vary genetically from one another? So I'm going to show you data from a large series of populations. For some reason, it's showing up there. You can see the major continents. This is a series of populations that we've collected over time, representing quite a quite a diversity of human genetic variation. So how do we do this? 
Well, one way that we look at genetic variation in populations is by tabulating the frequencies of SNPs in populations. So here we're showing three human populations. These could be continental populations, say Africa, Asia, Europe. And we're looking at allele frequencies for the major alleles of three SNPs. We see that there is variation in the frequencies of these SNPs from one population to another. And what we want to do is to assess that variation. This is one way that we do it. This is a statistic that's used widely in population genetics called FST. And by the way, I think I'm going to show you only two equations today, and they will involve nothing more complicated than addition, multiplication, subtraction, and division. So I think it will be quite non-intimidating. But FST, we can think of it simply as the total heterozygosity, or total variation, in our sample minus the average variation within each population, taken as a fraction of the total. So if you've got several populations, what we're asking is: is there more variation overall, if we look at the entire sample, than the average within each population? So HT is the total heterozygosity, and HS is the average heterozygosity within each population, say within each continent. So you can see that if there is just as much variation within populations as in the total, then FST will be zero. This will be equal to this, and FST will be zero. And what that says is that all of the variation occurs within populations. There's no variation between populations. They don't really differ from one another. On the other hand, if FST is at the other extreme, one, then that says that all variation exists between populations, and there's no variation within populations. So that's a measure of the extent to which population differences contribute to overall genetic diversity.
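The FST calculation described above can be sketched for a single SNP as follows. This is a minimal illustration, not the exact estimator behind the results in the lecture: it assumes a two-allele SNP, equal-sized populations, and heterozygosity computed as 2p(1 - p). The allele frequencies in the last example are hypothetical, since the values on the slide are not in the transcript.

```python
from statistics import mean


def heterozygosity(p: float) -> float:
    """Expected heterozygosity of a two-allele SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(freqs: list[float]) -> float:
    """FST = (HT - HS) / HT for one SNP, assuming equal-sized populations."""
    h_s = mean(heterozygosity(p) for p in freqs)  # average within-population value
    h_t = heterozygosity(mean(freqs))             # value for the pooled sample
    if h_t == 0.0:
        return 0.0  # the SNP does not vary anywhere, so there is nothing to apportion
    return (h_t - h_s) / h_t


same_everywhere = fst([0.3, 0.3, 0.3])  # identical frequencies in every population
fixed_difference = fst([0.0, 1.0])      # allele absent in one population, fixed in the other
modest_difference = fst([0.6, 0.7, 0.8])  # hypothetical frequencies

print(same_everywhere, fixed_difference, round(modest_difference, 3))
```

The first two calls correspond to the two extremes in the lecture: identical frequencies give an FST of zero, and a fixed difference gives one. In practice the statistic is averaged over many SNPs and corrected for sample size, which this sketch leaves out.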
And if we look at FST for human populations, and these are a number of different kinds of systems, different kinds of loci that have been assessed, short tandem repeats, ALU insertion polymorphisms, L1 insertions, a 250K SNP chip in the populations I showed you, what we see is that very consistently for all of these different kinds of loci, roughly 90% of variation is found between individuals within the major continents. And only about an extra 10% or so is seen between continents. So one way to think about this is that if we assessed the diversity in Europe, we would have 90% of the diversity in the entire world. We only get another 10% by looking at the rest of the world because there isn't that much difference between populations. Most of the variation occurs between individuals within populations. And we see this very consistently for various kinds of genetic systems. Now we can compare that with FST for an observable trait, a phenotype, skin color. We can estimate FST for skin color in different continents. And there we see that 90% of the variation occurs between continents because there is substantial difference between continents in terms of this trait, skin color. It's been highly selected in human populations in different climates and latitudes. So we see substantial difference among populations for this observable trait. So it's kind of interesting to look at the contrast between variation for this observable trait and variation for actual genes and to see that it's much less for large series of genes, large series of DNA sequence variation. So what that implies is that most variation that we see is likely to be shared among populations. And here's a tabulation based on the 250K chip results looking at four population groups, Africa, Europe, East Asia, and India. And what we find is that the minor allele of each SNP is present 79% of the time in all four groups. In other words, nearly 80% of the time a given SNP allele is shared among all four groups. 
88% of the time in at least three groups, 92% of the time in at least two groups. About 7% we see just in Africa. So there are some polymorphisms, some SNPs, where we see the allele only in Africa. Only a tiny proportion is found only outside Africa. So there's more unique variation in Africa than outside of Africa. And we'll get back to reasons for that in just a moment. In that analysis, no SNPs were fixed present in one population and fixed absent in another. And so what that means is that there was no SNP out of those 250,000, not even a single one, that you could look at and say, well, if you have this variant, you must be from this continent, you can't be from that continent. And what that's telling us, again, is that most of these variants, the vast majority, are shared among multiple populations. Part of the reason for that is that they're relatively common SNPs. And the more common they are, the higher their frequency, the older they tend to be in populations, and therefore the more likely they are to be shared. This is just a diagrammatic representation of the same idea, showing that for the most part, we see overlap among these continental populations for SNP alleles. But now, what if we look at sequence variation? What if we look at less common SNPs? So here we see common SNPs that had been identified in dbSNP. We see the overlap again, now with millions of variants, and more that are unique to Africa; that's YRI, an African population from HapMap. This is an Asian sample. This is a European-derived sample. But you can see that most of the variants are shared. But if we look at new, rarer SNPs, lower frequency SNPs, most of them are not. And this makes sense: if a variant arose just in the last few thousand years, it's likely to be less common. It hasn't had as much time to attain a higher frequency. It's also more likely to be population specific. So for rare SNPs, we see much, much less sharing, a lot more difference between populations, as we would expect.
For these common SNPs, the average allele frequency difference between populations is roughly 15%. But for the rare alleles, it's less than 5%, because these are very low frequency variants, and they can't differ in frequency that much from one population to another. They're usually population specific, usually low frequency. So in fact, they don't affect the overall FST value very much. Even though there are many of them, they're so low in frequency that they account for a relatively small proportion of overall genetic diversity. But we think it's an interesting part of that diversity. And to the extent that rare alleles contribute to disease, including common disease, that suggests that there should be some substantial population differences in terms of those alleles and their risk effects. So how do we actually quantify these genetic differences? Well, I'll show you just a simple genetic distance measure. And again, this is very, very simple mathematics. We're estimating a distance statistic, D, and it's simply the difference in frequencies of alleles in two populations that we label I and J. So if we go back to our little table that I showed you earlier of SNP frequencies in populations one, two, and three, we can estimate a distance between populations one and two as the difference between these two SNP frequencies. So just a very simple subtraction. And then we would average that over all of our SNPs. So this pair, this pair, this pair, take the average, and we have an estimate of the genetic distance between the two populations. If we have a million SNPs, we would just do this process a million times. So it's very straightforward. There are lots of variations on this theme, but they all, in one way or another, involve looking at this kind of a difference. So we can use those distances to build a display, a population network, showing how similar populations are to each other. So let's take just one SNP in three populations.
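The distance statistic D described above might be sketched like this. Taking the absolute value of each frequency difference is an assumption (the lecture only says "difference"), and the three-SNP frequency table is hypothetical, standing in for the slide that is not reproduced in the transcript.

```python
def genetic_distance(freqs_i: list[float], freqs_j: list[float]) -> float:
    """Average absolute allele-frequency difference between populations i and j."""
    if len(freqs_i) != len(freqs_j) or not freqs_i:
        raise ValueError("both populations need frequencies for the same, non-empty SNP set")
    return sum(abs(a - b) for a, b in zip(freqs_i, freqs_j)) / len(freqs_i)


# Hypothetical allele frequencies at three SNPs in three populations.
pop1 = [0.60, 0.30, 0.80]
pop2 = [0.70, 0.40, 0.70]
pop3 = [0.40, 0.60, 0.50]

d12 = genetic_distance(pop1, pop2)
d13 = genetic_distance(pop1, pop3)
d23 = genetic_distance(pop2, pop3)

print(f"D(1,2)={d12:.3f}  D(1,3)={d13:.3f}  D(2,3)={d23:.3f}")
```

With these made-up values, populations one and two come out closest, which is the kind of pattern a population network is meant to display. With a million SNPs the lists simply get longer; the calculation does not change.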
So we take our difference in frequencies here and we connect the two populations at a level corresponding to that difference. And then we take the frequencies of these two populations, P1 and P2, and use their average to represent this node, and then we take the difference between that and this frequency, telling us in this little diagram that population three is a little bit more different from populations one and two than they are from each other. And of course, here you can see that just by looking at the allele frequencies themselves. But imagine if you have a million of these SNPs; the display becomes very useful. So that's basically how we go about making these displays of population similarity. So here's an example in which we did this for 100 ALU insertion polymorphisms some time ago. And we have a network showing the similarity of populations in Africa, Asia, Europe, and South India. And what patterns do you notice in a diagram like this? What kinds of things sort of jump out at you? Sorry? Yeah, well, the African samples here tend to be more similar to each other. The European samples are more similar, the Asian samples more similar. So geographic location does affect genetic similarity, which is not a real surprise. If people live only a few kilometers apart, they're more likely to mate, especially historically, than if they live 5,000 kilometers apart. So we see a correlation between genetic distance and geographic location. Any other patterns that you notice in this diagram? Yes, sir? Yeah, so we see a strong indication of greater variation in the African populations than in really the rest of the world. And we'll come back to that. Also, we see that the ancestral population is closer to the African populations, again suggesting that these would be the parental population, the source population for the rest of the world. These are what we call bootstrap support values; they're percentages.
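The joining procedure walked through above, for one SNP in three populations, can be sketched as a small loop: join the two closest groups, represent the new node by the average of their frequencies, and repeat. This is only an illustration of that walkthrough, not the neighbor-joining method used for the published networks. The frequencies are hypothetical, and averaging the two nodes without weighting by group size is an assumption.

```python
def build_network(freqs: dict[str, float]) -> list[tuple[str, str, float]]:
    """Join the two closest groups repeatedly; return (left, right, distance) per join."""
    groups = dict(freqs)  # group name -> representative allele frequency
    joins = []
    while len(groups) > 1:
        names = sorted(groups)
        # Find the pair of groups with the smallest frequency difference.
        dist, left, right = min(
            (abs(groups[a] - groups[b]), a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
        )
        # The new node is represented by the simple average of the two frequencies.
        merged_freq = (groups[left] + groups[right]) / 2
        del groups[left], groups[right]
        groups[f"({left},{right})"] = merged_freq
        joins.append((left, right, dist))
    return joins


# Hypothetical frequencies of one SNP in three populations.
joins = build_network({"P1": 0.60, "P2": 0.70, "P3": 0.40})
for left, right, dist in joins:
    print(f"join {left} + {right} at distance {dist:.2f}")
```

Here P1 and P2 are joined first, and P3 then joins their shared node at a larger distance, matching the picture of population three being a bit more different from the other two than they are from each other.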
And what they tell us is that statistically, we can have 100% confidence in this grouping, 97% here, 97% here. So we have basically enough data that we can be statistically quite confident in these results, in these patterns. Now here we're looking at 250,000 SNPs in a similar collection of populations, just a larger collection of populations. Again, we see African populations here, European populations here. Here's a sample of the Iraqi Kurds. These are South Asian populations, Nepal, East Asia, the South Pacific, and the New World. So again, the same kind of pattern, where there's a strong correlation between geographic location and genetic similarity at the population level. With even more populations, when we added in the Human Genome Diversity Project populations, we again see the same pattern. Now here, this is a different collection of populations with a different SNP panel, as well as a CNV assessment. And again, we see the same general patterns. So there's a reassuring degree of consistency among different studies with different samples using different collections of loci, but all giving us essentially the same patterns. Yes. Okay, so I showed you that for the ALU insertion polymorphisms, and there we can unambiguously determine the ancestral state, because ALUs are mobile elements that insert into the genome. So the ancestral state is absence of the ALU. What we call the derived state would be having the ALU. So we can essentially construct the ancestral population as being one with no insertions. That's one of the advantages of those markers for this kind of analysis. Good question. Okay, so similar patterns. Here we're looking at diversity, and specifically haplotype diversity. By a haplotype, we mean a series of SNPs that are very closely linked together on the same chromosome. So we're looking at the decline of haplotype diversity with geographic distance from Africa. So we have a suit from actually East Africa.
So we have the highest diversity, as you can see, in African populations, lower diversity in Central and West Asia, then Europe, then East Asia, then Polynesia, and finally, Native American populations. So there's this general pattern that we see over and over again. The further you get from Africa, the less diversity that we see in populations. So African populations clearly having the greatest degree of genetic diversity, and this actually has interesting implications for things like donor matching for transplants. Because of that diversity, there's a greater genetic diversity among donors in those populations, making transplant matching somewhat more challenging. Now, if we look at sequence data, what I've been showing you is SNP data, but we can also look now directly at sequence data. We're just starting to do this now as it's as it's becoming more possible to get whole genome sequences from large series of individuals in populations. And this overcomes some problems that we have with micro array SNPs. Those SNPs generally are selected for high frequency and diversity in European populations, because most of the mapping efforts have been undertaken in those populations. In contrast, if we look at complete DNA sequences, we've got an unbiased assessment of genetic diversity in each population, and we get information not just about the common variants that are on SNPs, but about rare variants. So this is a paper that just came out of the 1000 Genomes Project. What it shows is the essentially estimates of size of African populations. These are the European sample, European and East Asian populations. This is time on this axis. And what this event represents is the out of Africa movement about 50,000 years ago where a portion of the African population came out, underwent a substantial population bottleneck. So this is the effective size, only about 2000, and then populated the rest of the world with expansion of populations. 
And also migration between those populations and the African source population. But what's, I think, very important about this diagram is that for the first time, we're able to estimate quite accurately these dates, these events, the extent of the bottleneck, because we now have unbiased sequence data rather than SNP data. So we're able to really test models of human population movements and population sizes. And so in general, the model is consistent with the notion that modern humans, anatomically modern humans, people who look just like us, first arose in Africa roughly a couple hundred thousand years ago. They accumulated genetic variation as a result of mutation, drift, and selection. And then something like 50,000 years ago, a small subset of that population went out to colonize the rest of the world. And eventually, something like 20,000 years ago, they got to the new world. Polynesia just a few thousand years ago. But we can now make these kinds of estimates with some real confidence. And of course one of the interesting questions is as humans, as anatomically modern humans came out of Africa, they encountered archaic humans, people like Neanderthals. And what happened as those populations met? Well, we now have evidence that there was mixture between modern humans and archaic Neanderthals as they met. Some years ago, it was possible to look at mitochondrial DNA sequences from a number of Neanderthals' skeletons. That showed no evidence for mixture, for shared polymorphisms. But mitochondrial lineages tend to go extinct very rapidly. But just a couple of years ago, there was a major paper in Science in which Neanderthal skeletons, several of them, were sequenced at low coverage, but providing good evidence that our modern human DNA contains about one to four percent Neanderthal DNA. 
Now it's interesting that only non-Africans share DNA with Neanderthals, suggesting that this event, this mixture event, took place as anatomically modern humans went out of Africa and to the rest of the world. So we see that sharing only in non-African populations. And of course one of the questions, one of the interesting questions that hopefully we will have answers to at some point is whether any of those shared sequences, any of those genes, actually had adaptive significance as humans moved into new climates. Did they derive certain adaptive alleles from Neanderthals, alleles that many of us still have today? So I think some very interesting questions about our history that can now be addressed with sequence data. Of course we may have to throw all of this out. This is something I ran across in a supermarket a few years ago and I was surprised actually to hear that Adam and Eve skeletons had been stolen. I didn't realize they'd even been found. But because there were more amazing photos inside, well I decided I had to buy it. So I bought the Weekly World News and I learned that all that was left was Eve's leg and that the identity of the perpetrator appears to have been established. So some of the things you can learn from supermarket tabloids. Well these kinds of discussions I think very almost inevitably bring us to the question what does genetic variation, what do these patterns tell us about what many people call race? I put that term in quotes as you see because I don't actually use the term race in my own writings. I think it tends to generate more heat than light. But it's still as we all know used in many contexts. So this is an opinion piece from the New England Journal about ten years ago concluding that race is biologically meaningless. This was a responding piece by Sally Satel who's a psychiatrist in the New York Times who said I am, and I think this is deliberately controversial, I am a racially profiling doctor. 
And what she was saying was that she uses a self-identified race to help establish dosages of medications and so forth based on empirical experience. This is a statement from the American Anthropological Association back in 1997 who said that genetic data show that any two individuals within a particular population are as different genetically as any two people selected from any two populations in the world. In other words, that there's really no patterning. So with this real diversity of opinions, I think we need some data to help address the question. A few years ago in Scientific American, this cover appeared, does race exist? And I love this, science has the answer. We're all scientists and I think we appreciate how seldom we really have the answer, but I think we have some data that increase our understanding. So one way that we can do this is to simply tabulate DNA sequence differences among individuals and ask what do the patterns look like. And I thought it was appropriate to use some familiar faces especially this time of year. So let's imagine that we've sequenced Rick Santorum, Mitt Romney, Hillary Clinton, and of all people John Edwards. What we start with is a matrix of differences among these individuals. So we just tabulate how many sequence differences do we see, for example, between Santorum and Romney, and we see two. So we put that two in our matrix. We ask how many differences do we see between Santorum and Clinton, we see more. We see six or five. We put that in our matrix and so on. So now we have a distance matrix, DNA dissimilarity among these four individuals. And here with just four people, we can pretty much look at this directly and see the pattern. It looks like this. This is a hypothetical DNA sequence, but it just turns out that this pair is more similar at the DNA level than this pair. 
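The matrix of pairwise sequence differences described above can be tabulated as follows. The four aligned sequences and the labels ind1 to ind4 are hypothetical, chosen only so that some pairs are visibly more similar than others. They are not data from any real person.

```python
from itertools import combinations


def count_differences(seq_a: str, seq_b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def distance_matrix(seqs: dict[str, str]) -> dict[tuple[str, str], int]:
    """Pairwise difference counts, keyed by alphabetically ordered name pairs."""
    return {
        (a, b): count_differences(seqs[a], seqs[b])
        for a, b in combinations(sorted(seqs), 2)
    }


# Hypothetical aligned sequences for four individuals.
sequences = {
    "ind1": "ACGTACGTACGT",
    "ind2": "ACGTACGAACGA",
    "ind3": "TCGAACCTAGGT",
    "ind4": "TCGAACCTAGGA",
}

matrix = distance_matrix(sequences)
for pair, n_diff in matrix.items():
    print(pair, n_diff)
```

In this toy example ind1 and ind2 differ at two sites and ind3 and ind4 at one, while pairs drawn across those two groups differ at four to six sites. That is the sort of pattern that can be read straight off the matrix with four individuals but needs a network display once there are many more.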
I'm not sure if anyone would want to compare their DNA to John Edwards right now, but we see that in our hypothetical example we have a clear pattern representing the DNA sequence differences between each pair of individuals. Now you can do this for any kind of data. And this is an example that a pediatric gastroenterologist who works with us and became interested in population genetics came up with a few years ago. Steve Guthrie ran across an article in the New York Times that showed essentially a distance matrix that is the number of disagreements on major decisions between each pair of members of the U.S. Supreme Court at that time. And so again, you can see some of the patterns here just by looking at them. For example, Scalia and Thomas almost never disagree, whereas Thomas and Stevens typically do. But there are more data here. It's harder to see the pattern by just looking at the numbers. But you can make one of these networks, we call it a neighbor joining network, from this matrix. And Steve was learning to do population genetics at that time. So he made this network and the patterns become extremely clear. We have the conservative wing of the court here, the less conservative wing here, and Justice Kennedy sort of in the middle. So this is a very convenient way of seeing patterns in what can be very complex data. So now let's apply this to sequence data. This was a 14 KB sequence from the angiotensinogen gene, part of the angiotensin pathway. We tabulated sequence differences, sorry, in Asian, European, and African populations, individuals. And one of the things that you see in a display like this is that for this 14 KB sequence, sometimes an individual from Africa is more similar to people from Asia or Europe than to other Africans. And we see this typically if we look at single genes or single DNA sequences, that there is evidence of sharing and mixture. 
And in fact, this is a very important feature of human history that throughout our history there have been migrations, there's been mixing of populations. So all human populations, at the DNA level, when we look at genes, we tend to see evidence of that historical mixture. I think it's important that there is no such thing as a genetically, quote, pure population, whatever that's supposed to mean. We have a history of mixture. And this is something actually that Charles Darwin was aware of a long time ago. He said it may be doubted whether any character can be named, which is distinctive of a race and is constant. So Darwin, in his observations of phenotypes in populations, was well aware of this. But now let's take those same individuals. So each of these tips, by the way, represents one individual. And now we're looking at a couple hundred polymorphisms. And we see that once we look at a large number of polymorphisms, and these are neutral polymorphisms, people do tend to fall into groups that correspond to their continental origin, Asia, Europe, and Africa. Now the fact that these are very long branches, tell us again that most of the variation is found between individuals within Asia, between individuals within Europe, between individuals within Africa. But with a larger number of variants, we get enough information to learn something about the geographic ancestry of these individuals. So the analogy I like to use is when we look at a trait like height in females and males, and if we look at just one trait, well of course there will be a fair amount of overlap between the two samples. You can't determine someone's sex just by looking at their height. But if we add another variable, waist-hip ratio, well there's less overlap, there's still some. But we have a better discrimination between the two groups. And that's all we're doing here. We're adding more variables, more loci that give us information about ancestry. Here we're looking at 11,000 SNPs. 
This is another neighbor-joining network, and you can see that there appear to be groupings here. And if we put labels in, those groupings correspond to a series of specific human populations. So with that much information, thousands of SNPs, we do get information about ancestry in each of these individuals, corresponding generally to their population of origin. Here are some complete sequence data just published last year. These are ten complete human whole genome sequences, doing the same kind of exercise. Again, we see European, Asian, and African individuals forming groups, as we would expect, based on their geographic origin. Now this was kind of interesting, because these two sequences are actually the same individual, just sequenced on two different platforms. There were about 500,000 differences in the sequences from that same sample, sequenced on Illumina versus an ABI SOLiD. So about the same amount of difference as between these two different individuals sequenced on the same platform. So kind of a cautionary note that we can see a surprising number of differences depending on which sequencing platform we're using. Now here's another way of looking at variation. This is called a principal components analysis, and you see these in population genetic studies a lot. And basically what we're trying to do here is to account for as much of the variation in individual differences as we can. So each of these dots is an individual. This is a color key of the populations over here. And we're displaying that variation actually here in three dimensions. There's a first dimension, a second one, and then actually vertically a third one. And the point here is that for these 467 individuals, if we're looking at just 10 SNPs, there isn't enough variation to really tell us how similar pairs of individuals are to each other. If we look at 100 SNPs in the same individuals, well, we start to see some suggestion of patterning, but not that much. Now if we look at 1000 SNPs, if we have more data, we start to see groupings.
And these groupings correspond essentially to continental origin. And here we're looking at 250,000 SNPs. And we actually can see in three dimensions, so here's the first dimension, the second dimension, and then the third dimension, that individuals tend to cluster together depending on their population of origin. Although there's still a fair amount of overlap, especially for closely related populations. But what it's telling us is that there is information about our ancestry if we have a large collection of SNPs. Yes, sir? Yeah, so we're just taking 10 random SNPs. And what we're seeing is that with 10 SNPs that's not enough information really to decipher anything about ancestry. Yeah, it depends on, so these are called ancestry informative markers. And these are SNPs that have extreme differences among populations. And if you look at a million SNPs, you can find hundreds of SNPs that differ widely. Instead of that average 15% frequency difference, you may see 40 to 50% frequency differences. And with a panel of just a couple hundred of those ancestry informative markers, you can get certainly continental discrimination quite easily. And in some cases with panels of a thousand or so, even within Europe, you can start to discriminate one population from another at least to some extent. Yeah, so we can take subsets of these that are especially informative about ancestry. And those are the ones actually that some of the companies use to try to estimate ancestry. We'll talk about that. Okay, and by the way, these we included the HapMap samples in this analysis and we see that the European, African and East Asian HapMap samples plot where we would expect them to. This is a two-dimensional plot of just Eurasian populations and what you start to see is that really this becomes a map of Eurasia. Here's Northern Europe, Southern Europe, East Asia, here's a Nepalese sample, South East Asia and South Asia. 
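The point about ancestry informative markers can be sketched with a small simulation. This is a toy model, not the data discussed in the talk: it assumes two populations, independent SNPs, and the same allele-frequency difference at every SNP, and the function name `assignment_accuracy` is hypothetical. Each simulated individual is assigned to whichever population makes their genotypes more likely.

```python
import math
import random


def assignment_accuracy(n_snps, freq_diff, n_per_pop=500, seed=0):
    """Fraction of simulated individuals assigned to their true population.

    Toy assumptions: two populations (A and B), independent SNPs, and an
    allele frequency of 0.5 - freq_diff/2 in A and 0.5 + freq_diff/2 in B.
    """
    rng = random.Random(seed)
    p_a = 0.5 - freq_diff / 2
    p_b = 0.5 + freq_diff / 2
    # Log-likelihood-ratio contribution (A versus B) of each allele copy.
    llr_allele = math.log(p_a / p_b)
    llr_other = math.log((1 - p_a) / (1 - p_b))

    correct = 0
    for true_pop, p in (("A", p_a), ("B", p_b)):
        for _individual in range(n_per_pop):
            llr = 0.0
            for _snp in range(n_snps):
                # Diploid genotype: 0, 1 or 2 copies of the allele.
                copies = (rng.random() < p) + (rng.random() < p)
                llr += copies * llr_allele + (2 - copies) * llr_other
            guess = "A" if llr > 0 else "B"
            correct += guess == true_pop
    return correct / (2 * n_per_pop)


if __name__ == "__main__":
    print("10 SNPs, 15% difference: ", assignment_accuracy(10, 0.15))
    print("200 SNPs, 45% difference:", assignment_accuracy(200, 0.45))
```

Under these assumptions, ten SNPs with a typical 15% frequency difference misassign a sizable share of individuals, while a couple hundred markers with 40 to 50% differences separate the two groups almost perfectly. Real populations overlap more than this toy model suggests, since real SNPs are not independent and real individuals are often admixed.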
So really a map of the world, given by these SNP frequency differences now among individuals. But again, we see that there's overlap, where certain individuals here from Nepal are more similar to Pakistanis, others more similar to Thai. So an important point here is that although we get information about ancestry, there's also quite a lot of sharing and overlap among individuals from different populations. So what this tells us is that if we look at multiple polymorphisms, we can learn something about population affiliation in many individuals. We learn about the parts of these circles that don't overlap, that distinguish one population from another. But, and I think this is a real take-home lesson, we can't really go in the other direction. In other words, based on self-identified population, we can't infer which allele a given individual is going to have. So the inference goes just in one direction, because these alleles tend to vary just in frequency from one population to another. So that leads to the question: can we classify everybody if we have enough genetic information? Let's go back to that 11K SNP diagram that I showed you earlier. What we've done now is to add African-American individuals to the sample, and some plot closest to African samples, others closer to European samples, reflecting the complex history of that population. Here's a network in which Puerto Ricans have been added. Some plot with people from Spain, others with people from Africa. Again, a complex history, and many human populations, as we know, have this kind of complex history, where we can't fit individuals neatly into a given population box.
This is another principal components analysis based on 134,000 SNPs for American populations, African-American, European-American, Asian-American and Hispanic, and what this shows you is that for some of the African individuals each dot is an individual, they're more similar to members of other populations than they are to other Africans. So in this case, self-identified population groupings can sometimes be quite misleading. And that leads us to what I think is a real fallacy the fallacy of thinking of humans as falling into specific types of typological thinking because there is really so much overlap among populations that rather than discrete groups, what we're looking at is very highly overlapping groups of individuals, especially for any given gene. So this is an interesting example. This is a man named Wayne Joseph. He was a high school principal in California. He was raised in Louisiana as an African-American. He became interested in his ancestry, sent a saliva kit off to a direct-to-consumer testing company and got these results back that at least according to their testing, he was 57% European, 39% Native American and perhaps a little bit East Asian, but no trace of African ancestry. So in his case, his self-identified race would have been, as far as we know, completely incorrect. Now he chose, of course, to maintain the same cultural identity because that's what really mattered to him. But you can see that if his self-identified race were used in a biomedical setting, it could lead to some mistakes. And I think that really looking at individual ancestry rather than what is called race can be considerably more accurate and valuable. Consider that someone with this ancestry would self-identify almost certainly as African-American, but someone with this ancestry might also self-identify as African-American even though genetically their ancestry would be very different. We now have the tools to assess ancestry in each individual. 
I think we should use those rather than self-identified ancestry or population affiliation. So I got interested in my own ancestry. I sent a saliva kit off to one of the companies just to see what I could learn and it was kind of interesting. I'll share the results with you. One of the things that they type is your Y chromosome and you learn about your Y chromosome haplogroups. So I happen to have this one and it is especially common in northern Europe especially Scandinavia so I think my two grandfathers were correct after all they did come from Norway. The other thing I learned from this is that this haplogroup is shared I share this haplogroup with Jimmy Buffett and Warren Buffett. It hasn't done anything for my singing ability or my investment prowess but it's kind of interesting to know that. You also learn about your mitochondrial ancestry, your maternal ancestry. Again, I have a haplogroup that is most common in northwest Europe so consistent with what little I know about my own ancestry. They also use ancestry informative markers, a genome wide panel, to infer your autosomal ancestry what they call ancestry painting and I was a little disappointed to see how boring my ancestry is here. It appears to derive entirely from Europe but again that's consistent with what I know about my own ancestry. Now here's a little bit more interesting ancestry pattern. This is one that is publicly available. This is a Berber woman the Berbers are a population in North Africa so they're Africans but this woman as you can see at least from the ancestry estimation is 86% European, 12% African, a couple of percent Asian but she is an African but with mostly European ancestry again telling us that that continental designations in some cases could be misleading. 
Here's a self-identified African-American male, and you can see that if we go chromosome by chromosome, his ancestry is mostly European, 64%, and 33% African, again portraying a complex history of mixture in the ancestry of this person. But he self-identified as African-American. Now imagine that an important gene that affects a phenotype of interest is located right here, where he has European ancestry. Well, for that segment, for that gene, his ancestry is European, even though he would be self-identified as African-American. And to the extent that the trait influenced by that gene is, let's say, response to a given drug, and to the extent that self-identified race would be used for drug prescription, well, this person right here is European, not African. So I think that the ancestry can tell us important and biomedically significant information about individuals. So what, generally, do these findings imply? Well, we've seen that large numbers of independent polymorphisms do inform us, at least approximately, about population history and about individual ancestry. But I think it's important to realize that responses to a lot of therapeutic drugs may involve variation in just a few genes, and those alleles tend to be shared across populations. And we can't predict very effectively which allele you will have based on population affiliation, because of the substantial overlap that we see among populations.
Here's a good example. This is a paper published on response to ACE inhibitors a few years ago, with a very large sample, many thousands of individuals. The question was, is there a difference in the decrease in blood pressure in response to ACE inhibitors in African-Americans versus European-Americans? And there is a small difference, in millimeters of mercury, between the two groups, so a little bit less response in the African-American group than in the European-American group in terms of systolic blood pressure. But I think the important point here is that there's a tremendous amount of overlap. There would be many, many African-American patients that would respond better to an ACE inhibitor than would European-American patients. Here's another example. These are epidermal growth factor receptor inhibitors, used sometimes in the treatment of non-small cell lung cancer, so drugs like gefitinib or erlotinib. They inhibit tyrosine kinase activity in the EGFR, and they've been seen to be effective in about 10% of Europeans and about 30% of Asians, so three times more effective in Asian populations than in European populations. That suggested that perhaps population affiliation could be useful in deciding who gets these drugs. But if you look at the gene itself, if you look at EGFR, there are somatic gain-of-function mutations seen much more frequently in Asian individuals than Europeans. And what's been seen is that 70 to 80% of people with somatic mutations respond to gefitinib; fewer than 10% without the mutations respond. So the important point here is that looking directly at the gene of interest, at somatic mutations in that gene, is much more informative than simply looking at population affiliation. One other example is seen here. This is the calibrated warfarin dose using the standard clinical algorithm, where there are population differences. But here is the calibrated, or the estimated, warfarin dosage in the three populations if there are no variants predisposing to rapid metabolism in these
two genes, VKORC1 and CYP2C9. And you can see that much more variation is accounted for by looking at these two genes than by using population variation, and in fact most of the population variation goes away when we're looking directly at those two genes. So again, looking directly at genes gives us, I think, a much better prediction of response than using population affiliation. So I think what we can say about genetic variation and race is that we see a clear pattern in which genetic variation is correlated with geography, but it tends to be distributed continuously across space. That means it's hard to designate specific boundaries between populations. So race as conventionally defined may not be biologically meaningless, but may be very imprecise. I think that looking at individuals, at their ancestry, is going to give us much more medically useful, actionable information than a self-identified category. And what I would like to do at this point, the reason I'm showing you this pretty picture, this is from one of my favorite hikes in Utah, so very close to where I live. I know it's a long time for all of us to sit still, so I'd like you to think about this pretty place and stand up for about a minute, just stretch, and then we'll do the last 30 minutes of the talk. So stand up, stretch, enjoy yourself, and we'll resume in about one minute. Sure. So that's 14 kb across the entire gene; that includes both introns and exons. No, not really, because really you're just looking at information all the way across the gene, both introns and exons. There tends to be a little bit more between-population variation for intronic SNPs than exonic SNPs, probably as a result of selection against variants in the exons, what we call purifying selection. Does that answer your question?
Sure. Most of these are just individuals, so it would be one individual, their sequence variation, and how that compares to other individuals. Okay, well, with that little refresher, we'll move on now. What I want to talk about next, and this is the last part of the talk, is some of the ways in which our studies of genetic variation, population genetics, have informed our understanding of linkage disequilibrium, commonly used in genome-wide association studies. And we'll also talk a little bit about how whole-genome sequence variation is now being used in the identification of disease-causing genes, again informed by our understanding of population genetics. So let me ask, how many of you are familiar with the concept of linkage disequilibrium? Maybe a third or so. So I think it's probably worth going over what we mean by linkage disequilibrium. We define it as the non-random association of alleles at linked loci. Well, what do we mean by equilibrium? If we imagine two loci, A and B, with alleles big A and little a and big B and little b, at equilibrium we're going to tend to see all of these combinations in a population as we look at copies of alleles in genomes. And if we imagine the frequencies of the alleles at these two loci to be these, so that the frequency of big A is 60%, the alternative allele little a is 40%, and then big B and little b are 70% and 30%, under linkage equilibrium the haplotypes that contain these alleles should have frequencies given by multiplying the respective allele frequencies together. So 42% of the time, if we're looking at, say, chromosome 5 in an individual, we expect to see a haplotype that has big A and big B together. It's 42% because that's the frequency in our population of big A, 60%, times the frequency of big B, 70%. And if they're independent, if these loci are independent of each other, if they're in equilibrium, then we can estimate accurately the haplotype frequency simply by multiplying the two allele frequencies together. That's our
principle of independence. Similarly, a haplotype containing big A and little b, that's going to be 60% times 30%, or 18%. So we're seeing all of these combinations at the frequencies that would be predicted if the two loci are independent of each other. But what if we see this distribution of haplotype frequencies? We see big A and big B much more commonly than we would expect based on these frequencies; we see little a and little b much more commonly than we would expect based on these frequencies. And you can see that in the diagram here: when we see big A, we tend to see big B. A and B are in linkage disequilibrium; the alleles at these two linked loci are non-randomly associated with each other. So how does this happen? Well, if two loci are very close together, like B and C here, through time there isn't much opportunity for recombination to break up the combinations, so that big B and big C are seen together most of the time, and little b and little c are seen together most of the time. But for this pair of loci, A and B, they're further apart. Recombination, crossover during meiosis, has more opportunities to break up the combination, to put big A in combination with little b. So what that's saying is that over time, with many generations for recombination to occur, we're going to see this combination together more often than this combination, A and B. In other words, there's more linkage disequilibrium between these two loci than between those two loci. So that's what we mean by linkage disequilibrium, and it's going to be a result of two things: time and the process of recombination. Now, linkage disequilibrium gives us some advantages in mapping genes, because first of all we don't necessarily need family data. We can assess linkage disequilibrium in population data. We can assess it using microarrays, with SNPs every 3 kb or even denser than that. So association studies in which we employ linkage disequilibrium effectively incorporate many, many generations of past recombination, and that allows us to narrow a
candidate region potentially very effectively. So if we compare the situation with traditional linkage analysis, where usually at best we're going to have three-generation families, we assess recombination directly by looking at affected individuals and families and simply counting recombinants. But the limitation here is that we can really only look at recombination in two or three generations. In contrast, with linkage disequilibrium we're looking effectively at all the recombinations that have occurred since a disease-causing mutation or variant originally happened in a common ancestor many generations ago. So we're looking at correlations between alleles at linked loci to assess how much recombination there's been, in other words, how far apart these two loci are. So essentially we think of populations as one big, complicated pedigree. For a given variant, ultimately we're going to trace back to a common ancestor some generations ago. This is what is sometimes called the coalescent in population genetics. Now, it's kind of interesting: this is just a graph of linkage disequilibrium articles from 1981 through 2008, and you can see that back in the early '80s, and this is when I first became interested in linkage disequilibrium, there were only about 20 papers a year published on this topic. So you could read a paper every couple of weeks and you knew everything there was to know about linkage disequilibrium. By 2008 that figure had gone to about 1,600 papers per year, so something like 30 papers a week, and it has held steadily at that number since then. So this has become a very popular topic because of its many applications in human genetics. Now, the challenges come with the fact that there are a lot of different things that can affect linkage disequilibrium patterns. It is a population genetic process. There are also genomic factors: chromosome location, so that close to telomeres, where there's more recombination, you tend to see less linkage
disequilibrium. You tend to see less linkage disequilibrium outside of genes than within genes, and that's a pattern that's been seen now quite regularly as a result of the HapMap project. There are also DNA sequence patterns, things like GC content, that can influence recombination and therefore disequilibrium. We now know that the human genome is peppered with recombination hotspots. Every 50,000 to 100,000 bases we see regions where recombination is elevated 10-fold or so relative to the rest of the genome. And we now know about specific sequences, a degenerate 13-mer that's bound by the product of this gene, that are associated with at least 40% of hotspot activity, and recombination among individuals varies as a result of variation in this gene. So we're learning some interesting things about the distribution of recombination across the genome. And then there are evolutionary factors that cause linkage disequilibrium to vary among populations: things like natural selection for specific combinations; gene flow can generate disequilibrium; mutation and gene conversion can affect disequilibrium patterns, as can genetic drift. So it's a complex process, one that requires really quite a lot of study and inference. So things like population age, that is, how long ago a given population was founded, can affect haplotype structure and therefore disequilibrium. In a population founded a long time ago, say the African populations, well, there have been many generations for recombination to occur. We tend to see much smaller blocks of haplotypes, because combinations have been broken up over a long period of time. In contrast, in a population that was founded more recently, let's take some of the isolated Finnish populations, well, there have been fewer generations for these recombinations to occur, so we see much larger haplotype blocks. So each of these is a SNP allele, and we tend to see a larger number of SNPs that are associated together because recombination hasn't had as much time to break up
the combinations. So that means if a disease-causing mutation occurs here, later on in time it's going to be found in association with a large number of other SNP alleles in a younger population. But if a disease mutation occurs in the same place in this population long ago, we'll see it in combination with many different SNP alleles, because the haplotypes tend to be smaller; they've been more broken up by recombination. So we expect to see quite different patterns of linkage disequilibrium in these different populations. And in fact, if we look at linkage disequilibrium data, here we're going back to the angiotensinogen locus. This is a plot of pairwise linkage disequilibrium. Each of these units up here is a SNP, a single nucleotide polymorphism, and what we're looking at is the pattern of disequilibrium for all possible pairs of SNPs. The analogy I like to use is that this is kind of like a mileage chart. So if this were New York and this were San Francisco, then this would be the distance between them. Well, for a given pair of SNPs, for this SNP versus this SNP, this is the linkage disequilibrium between them. And in these plots, red indicates high linkage disequilibrium; technically it's a correlation, or r-squared value, greater than 0.8. So we can see that for SNPs that are close together in a series of Africans, we do find blocks of linkage disequilibrium, but they're relatively small. In contrast, in Eurasian samples we see much larger chunks in linkage disequilibrium. That's because these are more recently founded populations; as you saw in that early diagram, they underwent a bottleneck, reducing haplotype diversity. So this is really the pattern that we would expect. So a question we can ask is, well, how general are these kinds of patterns? Because up until a few years ago we were looking at specific regions and specific populations, and what we really want to know is, well, are the kinds of patterns that we've seen really general throughout the genome and throughout the world? And I would say that
about 10 years ago the knowledge of linkage disequilibrium patterns in human populations was kind of like this map of the world in 1544. And you can see that people knew about some of the major continents at that time: Asia, Africa, some of Europe, South America. North America wasn't even on the map. Well, this was kind of like our understanding of linkage disequilibrium about 10 years ago. We knew about patterns in some populations, in some regions of the genome, but we needed a much more general understanding of linkage disequilibrium patterns throughout the genome. So this was really the basis of the HapMap project. In the original project, 600,000 SNPs, about 1 every 5 kb, were genotyped in 270 individuals from 3 different populations: the CEPH Utah samples, which represented Northwest Europe, 30 trios; the Yorubans from Nigeria, 30 trios; and then 90 East Asians. So this was by no means a complete sampling of human diversity, but it gave us some idea of diversity in 3 major populations. And then the idea was to evaluate patterns of linkage disequilibrium and haplotype structure, to see how much variation there is in different genomic regions and how much variation there is among different populations. Linkage disequilibrium was to be used as a gene mapping tool. And I think the map improved a lot. It looked probably more like this map of the world in the late 1600s, where now you can see the Old World continents are pretty well mapped out, as are most of the New World continental regions, although for some reason California was missing from this map. But a much better knowledge of general linkage disequilibrium patterns, and that allowed us to understand human genetic haplotype diversity much more effectively; also the definition of recombination hotspots; the detection of genes that have experienced strong natural selection, and I'll show you an example of that, because it helped us to understand gene function; and finally, and perhaps most importantly, the detection of disease causing
mutations. So, I mentioned hotspots earlier. The basic notion here is that there are regions where recombination is elevated at least tenfold in a restricted one to two kb region, so that we would expect much, much less linkage disequilibrium for pairs on either side of this hotspot. So this is an illustration here, using again that linkage disequilibrium chart, where we have essentially haplotypes here that are strongly associated. In other words, if you have a G here you have a C here; if you have an A here you have a C here. But on the other side of the hotspot there's really no association; there's no association between these combinations and these combinations. So that represents that recombination hotspot, where a lot of crossover is resulting in these combinations showing no association with these. And so when we plot that out, we see strong correlations among these, and no correlation between these and those. So how frequent are these hotspots? Well, as we saw, they're located about one in every 50 to 100 kb, and in fact about 60% of crossovers in the human genome occur in only 6% of our DNA sequence. So these hotspots really contain most of the crossover activity in our genome. And one of the interesting things that came out of a comparison of hotspot location in human and chimp is that even though, for aligned sequence, we're 99% the same as chimps, our recombination hotspots are completely different. So these evolve very rapidly, and in fact the PRDM9 gene I mentioned earlier, which is associated with recombination activity, is not active in chimps. So very, very different patterns of recombination, even in these two quite closely related species. Now another thing that linkage disequilibrium allows us to assess is evidence for positive selection in certain regions of the genome. So the idea is shown here. If you imagine a mutation occurring, a new mutation, it's going to occur on a haplotype background. So each of these blue stars is a SNP, and initially of course that mutation is going to be
found in strong linkage disequilibrium with a whole series of SNPs spanning some distance across a chromosome. That is, every time you see this mutation, you see specific SNP alleles. But through time, if that mutation increases in frequency, well, recombinations are going to occur, redistributing the more distantly located SNPs, so the background haplotype associated with this mutation becomes smaller and smaller as the frequency of the mutation increases through drift. But what if there's strong selection for that mutation? What if it has an adaptive advantage? Then it's going to reach high frequency quite quickly, and it's going to maintain disequilibrium with these nearby SNPs, so that we'll see regions of elevated linkage disequilibrium associated with a selectively advantageous allele. And that's evidence for recent positive selection. When people have scanned the genome, they've found a number of such regions. We won't go through these in detail, but things like G6PD and malaria protection; one of the CYP genes and sodium retention; an enhancer element associated with hereditary lactase persistence; skin pigmentation loci; and, this is some work that we've been involved in, members of the HIF pathway, the hypoxia-inducible factor pathway, and response to hypoxia in high-altitude populations. So what people have been looking at, then, is regions in the genome where there is elevated linkage disequilibrium as a result of natural selection for specific adaptive variants. I'll give you one other example. This is very recent work that we published just a few months ago, and this is evidence for recent positive selection in a region associated with Crohn's disease. This disease, as you probably know, is an inflammatory disorder of the intestinal tract. It affects around one in a thousand individuals, maybe a little higher. A number of regions have been associated with susceptibility to Crohn's disease, including one called IBD-5, and this is a 250,000-base haplotype seen in Europeans and associated with
disease susceptibility. And a number of genome-wide association studies have shown an association between Crohn's disease and SNPs in a specific gene called OCTN-1 in the IBD-5 region. And in fact a specific variant, 503F, in OCTN-1 is very common in parts of Europe, especially northern Europe, where it attains frequencies greater than 50%. We estimated that this variant arose about 12,000 years ago, about the time that agriculture came to Europe. It has a high frequency in Europe; you don't see it outside of Europe. And the variant is a gain-of-function mutation that increases the substrate efficiency for a substance called ergothioneine by several fold. Here's the evidence for positive selection: highly statistically significant values in different European populations. So we have good evidence that natural selection occurred here; that's indicated by disequilibrium patterns. But this is what's interesting about the OCTN-1 association. OCTN-1, by the way, transports ergothioneine, and ergothioneine has neuroprotective effects as well as antioxidant effects, but there was no clear functional association between OCTN-1 itself and Crohn's disease. So what we hypothesized is that, because there's a lot of linkage disequilibrium here, there could be a genetic hitchhiking effect, where what is selected is this variant in OCTN-1, because of dietary differences as humans developed agriculture in Europe, but it carried along with it variants at IRF-1, a gene that's a couple hundred kb away but that's involved in innate immunity and the clearance of intracellular bacteria, which we know are important in Crohn's disease. So if you look at haplotypes that contain both the selected OCTN-1 variant and specific IRF-1 variants, those have a strong association with Crohn's disease. If you look at haplotypes that have only the OCTN-1 variant and not the IRF-1 variants, there's no association, telling us that the disease association is due to this gene that has hitchhiked in frequency; alleles have hitchhiked in
frequency along with OCTN-1, which was selected strongly in these populations. So essentially, susceptibility to Crohn's disease is a side effect, in this region, of selection for this gene that has nothing to do with Crohn's disease. We did expression studies showing that IRF-1 is expressed much more highly in intestinal tissue from Crohn's disease patients than in control tissue, and none of the other genes in this region show that difference in expression patterns. So here we've been able to use linkage disequilibrium to identify a pattern in which selection for one trait, at OCTN-1, has actually carried along with it alleles that are associated with disease. And we think that this may in some cases be responsible for genetic susceptibility to other common diseases as well. Now, another very important thing that came out of the genome-wide disequilibrium studies is the fact that if you have SNPs that are in disequilibrium with one another, of course many of them are redundant, and that means you don't have to type all of them. So if we have this SNP here that is associated in a given individual with these other SNPs, and across the population in other individuals, well, that means that all we have to do is type this one, a tag SNP. We don't need to type these others. You'll hear more about this in Karen's lecture, but the bottom line is that for these genome-wide association studies we can get complete coverage with a reduced number of SNPs because of this pattern of linkage disequilibrium. But those patterns vary from one population to another. This is something I'm sure Karen will show you, but this is, I think, one of the successes of those efforts: the many published GWAS, now 1,500 or so as of the middle of last year, for hundreds of different traits. And you'll hear more about that from her. But what I want to get across is that it is our understanding of population genetics which allowed us to apply linkage disequilibrium effectively in these kinds of studies. So many, many different
traits for which significant associations have been established. Now, finally, I want to mention that population genetics is also really helping us to develop new resources, new tools, for the analysis of whole-genome sequence data. If we look at the 1000 Genomes Project, in which 1,500 people have been sequenced at about 4x coverage, one of the things that this does is provide control sequences when we're doing variant analysis, which many of us are now doing: sequencing patients to try to find rare variants associated with disease. Well, one of the things we need is good control data. If this appears to be a rare disease-associated variant, well, how rare is it? Projects like the 1000 Genomes give us background frequencies so that we can assess whether a given variant is actually rare or not. And remember that these rare variants vary a lot among populations; that is, they often are specific to certain populations. That's why it's important to do this kind of sequencing not just in a few populations but in as many as we can, because those background variants, the rare ones, are going to differ from one population to another. And we saw that in one of the 1000 Genomes papers: those rare alleles typically are not shared among populations. Population genetic theory also allows us to evaluate when a variant is functionally significant. This is always an issue. If you have 10,000 non-synonymous variants in a whole genome sequence, the question is which of those would actually be contributing functionally to disease. Well, one of the things that we can look at to help assess functionality is evidence of purifying selection in a given region; that is, has natural selection been conscientiously eliminating deleterious variants? That means that's a region of functional significance. And we've incorporated that in software that we recently developed and published, called VAAST, and this is a tool for analyzing whole-genome sequence variation. We've used it to successfully identify disease-causing genes. We incorporate
population genetics in our estimates. We look for purifying selection as an indication that a sequence or a variant is functionally significant, and of course we also look at evolutionary conservation among species. Any region that is highly conserved, very similar across many species, is again more likely to be functionally significant, and of course this is especially useful when we're looking at non-coding DNA and we can't otherwise directly assess function. So I think that now, as we're starting to analyze hundreds and even thousands of whole-genome sequences and exome sequences, we're going to see that population genetics again will make very significant contributions to our understanding. So finally, this is my wrap-up slide. What I've told you this morning is that we can learn, I think, some very interesting things about population history, our origins, our similarities and differences, from our studies of genetic variation. I think that genetic variation studies give us a much more nuanced view of the topic of race and population affiliation by showing us how similar we all are to one another and how much overlap there is among populations for any given relevant gene. This kind of analysis, as I've shown you, has also been very important in our understanding of linkage disequilibrium and its application in genome-wide association studies. And finally, I think we'll see that population genetics will become even more important as we begin to analyze whole-genome sequences and rare variants and their role in disease. And I hope that I've convinced you that population genetics is even fun, that it's interesting, that it has relevance to many, many different areas of biomedicine. And with that, I will acknowledge the people at the University of Utah who contributed to some of the work I showed you, and also my colleague at LSU, Mark Batzer, who's done a lot of work with us on mobile elements. And finally, this is my backyard, the lovely Wasatch Mountains, and I really couldn't leave without acknowledging those
beautiful mountains. I hope you have a chance to come out and share your attention, and if there are any quick final questions, I would be happy to answer them. Yes, sir? So that was a very good question: what about allelic heterogeneity, basically? So if you have multiple mutations that are associated with disease that have happened on different haplotype backgrounds, does that make it more difficult? Or even the same mutation at the same spot, let's say you have a mutation hotspot, but occurring at different times on different backgrounds? Absolutely, because what you're looking at is those SNPs in the background. So you could imagine that one SNP allele would occur in association with one mutation event, a different allele with a different mutation event. So both allelic heterogeneity and locus heterogeneity make it more difficult to identify genes using a genome-wide association study. Now, the method I mentioned right at the end, VAAST, actually looks at a whole gene, looks at all variants within that gene, and essentially sums their effects together, so that if you have multiple mutational events in a gene, it will recognize all of them and essentially lump them together. So you really reduce the allelic heterogeneity problem by doing that. And what we're able to show, and we showed in the paper we published last year, is that it has much higher power to detect susceptibility loci for things like Crohn's disease, because you're essentially summing the effects of all potentially disease-causing mutations within a locus. That's a great question.
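
[Editor's illustration] The idea in that last answer, summing the effects of all variants within a gene so that allelic heterogeneity hurts less, can be sketched with a simple gene-level burden test. This is only a minimal sketch under stated assumptions: it is a generic collapsing test with a permutation p-value, not the statistic that VAAST itself computes, and the function names (`burden_scores`, `burden_permutation_test`) and the toy genotypes are hypothetical, invented for illustration.

```python
import random

def burden_scores(genotypes):
    """Sum rare-allele counts across all variant sites in one gene, per person."""
    return [sum(person) for person in genotypes]

def burden_permutation_test(case_genotypes, control_genotypes,
                            n_permutations=10_000, seed=0):
    """One-sided permutation test: do cases carry more rare alleles in the
    gene than controls? Returns (difference in mean burden, p-value)."""
    case_scores = burden_scores(case_genotypes)
    control_scores = burden_scores(control_genotypes)
    n_cases = len(case_scores)
    observed_case_total = sum(case_scores)
    observed_diff = (observed_case_total / n_cases
                     - sum(control_scores) / len(control_scores))

    # Shuffle case/control labels. Group sizes and the pooled total are fixed,
    # so comparing case totals is equivalent to comparing mean differences.
    pooled = case_scores + control_scores
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if sum(pooled[:n_cases]) >= observed_case_total:
            hits += 1
    p_value = (hits + 1) / (n_permutations + 1)  # add-one correction
    return observed_diff, p_value

# Hypothetical toy data: rows are people, columns are rare variant sites in
# one gene (1 = carries the rare allele). Each affected carrier has a
# DIFFERENT mutation, which is the allelic heterogeneity scenario.
cases = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
controls = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
]

# Site by site, no variant is seen in more than one case.
case_site_counts = [sum(site) for site in zip(*cases)]
control_site_counts = [sum(site) for site in zip(*controls)]
observed_diff, p_value = burden_permutation_test(cases, controls)

print("case carriers per site:   ", case_site_counts)
print("control carriers per site:", control_site_counts)
print("mean burden difference:", round(observed_diff, 3))
print("gene-level permutation p-value:", round(p_value, 4))
```

In this toy data no single site has more than one carrier among the cases, so a site-by-site comparison has essentially nothing to work with, while the gene-level sum separates five carriers among six cases from one carrier among six controls. Real tools also weight variants, for example by allele frequency or predicted functional effect, which this sketch deliberately leaves out.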