Okay, well thanks very much, Andy. It's always a pleasure to be here at the NIH. I'll first disclose that I have no financial relationships. What I will be talking with you about this morning are several general areas pertinent to population genetics. First of all, we'll talk about patterns of human genetic variation, how we assess them among populations and also at the individual level. And as we go through, we'll be talking about how various evolutionary factors influence human genetic variation. One of the interesting questions that we can now address with genetic data is the issue of race, which I put in quotes, and the implications of what we now understand about human variation biomedically. And then we'll talk about another application of our studies of human genetic variation, and that's in the area of linkage disequilibrium and disease gene identification, and how our knowledge of human genetic variation can really inform us more effectively in our search for disease-causing genes. So a long time ago, 1956, Sewall Wright, one of the great geneticists of the last century, identified what he called the four major factors of evolution. And I think it's useful to keep those in mind as we think about what influences patterns of genetic variation. First of all, as we know, mutation can be considered the author of variation. That's where genetic variation ultimately comes from. Natural selection can be regarded as the editor of variation, selecting in favor of variants that are beneficial, selecting against variants that are harmful. We can think of genetic drift, the third major factor, as the randomizer, the stochastic element in evolution. Populations that are very small can experience substantial changes over time in gene frequency as a result of this random effect. And finally, we think of gene flow, the transmission of genetic material from one population to another, as the homogenizer in evolution.
So as we go through the talk today, we'll talk about various examples of each of these processes. So we've now been able to directly estimate the human mutation rate. As of about six years ago, that was the first estimate, one that we were privileged to be involved in, looking at whole genome sequence in a human family, comparing parents and offspring, and finding that the human mutation rate from generation to generation is on the order of about one in a hundred million base pairs per generation. So we transmit, with each gamete, about 30 to 35 new DNA variants. That estimate has now been confirmed in a number of subsequent studies, so we feel that we have a pretty good estimate now of the rate at which new variation enters the genomes of humans. And this is a quote that I've always enjoyed from Lewis Thomas, referring to genetic variation and mutation. He said, the capacity to blunder slightly is the real marvel of DNA. Without this special attribute, we would still be anaerobic bacteria and there would be no music. So this is really sort of a testament to the value of mutation, of genetic variation as we adapt to a changing environment. Now one of the interesting things in studies of mutation that's come to light is that most new mutations occur in the male germline. This helps to explain a pattern we've known about for a long time, the fact that with advanced paternal age, the risk of having children with various autosomal dominant conditions increases several fold. By looking directly at mutation rates in families, we know now that each year, an additional two or so mutations are transmitted beyond age 30, probably as a result of mitotic division of spermatogonia over and over again as fathers age. So now we know that males are really the cause of most single gene mutations in humans and in many other species. So we males can take credit for that. So one question that we can address looking across individuals and across species is how much do we differ?
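As a rough check on the figures just quoted, the per-gamete count follows from multiplying the per-base mutation rate by the size of a haploid genome. The sketch below assumes a round genome size of three billion bases, and the linear paternal-age function is only a hypothetical reading of "two or so per year beyond age 30"; both function names are illustrative.

```python
# Back-of-the-envelope arithmetic for the mutation figures in the talk.
MUTATION_RATE_PER_BP = 1e-8          # about one in a hundred million per generation
HAPLOID_GENOME_BP = 3_000_000_000    # assumption: ~3 billion bases per haploid genome


def new_variants_per_gamete(rate=MUTATION_RATE_PER_BP, genome_bp=HAPLOID_GENOME_BP):
    """Expected number of new variants carried by one gamete."""
    return rate * genome_bp


def extra_paternal_variants(paternal_age, per_year=2, baseline_age=30):
    """Hypothetical linear model: about two extra variants per year past age 30."""
    return max(0, paternal_age - baseline_age) * per_year


print(round(new_variants_per_gamete()))   # 30
print(extra_paternal_variants(40))        # 20
```

With these round inputs the product lands at the low end of the 30 to 35 range mentioned in the talk.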
If we look at aligned DNA bases, how much do individuals, how much do species differ from one another? Now identical twins, as nature's clones, differ essentially at none of their DNA base pairs, at least at conception. There are somatic mutations that can take place later, but we can say that for all intents and purposes, identical twins have zero DNA base differences. We know as a result of our sequencing studies that unrelated humans, any pair of unrelated humans, differ at about one in a thousand base pairs. And I think this is an important take-home message from our studies of human genetic variation, the fact that we humans are about 99.9% identical at the DNA level, at this most fundamental unit of our biology. We are all really quite similar to each other. If we compare ourselves to our nearest relative, evolutionarily, the chimp, we are about 99% similar to the chimp for aligned DNA bases. If we include structural variants, that figure goes down to about 95% similarity. Now if we go out a little bit further evolutionarily, we differ by about one in six to one in three base pairs from mice. And finally, if we compare ourselves to broccoli, thankfully we're mostly different from broccoli. But with three billion DNA bases, even though we differ at only about one in a thousand base pairs, that means that for any haploid sequence, there are about three million differences. So that's a lot of genetic variation, what we refer to as single nucleotide variants or SNVs. If we compare humans to other species, other great apes, we see that we have two to three times less variation than the other major great ape species. So humans, compared to other great apes, are relatively lacking in genetic diversity. Now another category of variation, something that we've become really quite interested in now over the last decade or so, are structural variations or structural variants.
These are deletions, duplications, sometimes duplicated multiple times, more than 50 base pairs or so. So that the idea is illustrated here, that whereas typically we have two copies of any given gene, in some cases we can have more than two, in some cases we can have only one. So these structural variants, which are more difficult to identify than single nucleotide variants, but we now estimate, and this is from some of the recent work from the Human Genome, from the Thousand Genomes Project, that in the average haploid human sequence, at least nine megabases are affected by structural variants. About three and a half megabases in the average genome are affected by single nucleotide variants. So what that says is that these structural variants, even though they occur less frequently in the genome because they're much larger, account for more differences than do single nucleotide variants. And if we look at copy number variants, where the genetic segments can differ by multiple copies, each human is heterozygous for about 150 of those CNVs. So a substantial amount of variation at the structural level, something we're beginning to understand better and better. So we can address the question, how much do human populations differ? We've talked about how much individuals and species differ, but we can look at population variation throughout the world. So a set of samples I'll be talking about for the next few minutes are shown here, geographic locations distributed throughout the world, representing 800 individuals, 40 different populations. Here you see some of the phenotypes, some of the phenotypic variation observable in these different populations. So we can look at variation at the population level in terms of allele frequencies. And these are frequently used in population genetics. So here we have an array of single nucleotide variants, one, two, and three, assessed in three populations. So we simply count the number of alleles in each population. 
And I'll make a distinction here. We're using the more general term single nucleotide variants. We also have, and I'm sure you're familiar with this term, single nucleotide polymorphisms or SNPs. The distinction is that conventionally, the polymorphism has a minor allele frequency, that is the frequency of the less common allele, greater than 1%. So the kinds of variants that we assess using microarrays tend to be more common. They're usually termed SNPs, but the more general term, including variants of all frequencies, would be single nucleotide variants. So for each of these loci, we can assess the heterozygosity, that is the proportion of heterozygous individuals in each population, simply by counting alleles. That is, how often do we see heterozygotes, how often do we see homozygotes, and we can then average that across loci. So if one in a thousand base pairs varies on average between a pair of individuals, how is this variation distributed between continents, a unit that we often use to group populations into? So this gives us some idea of how much variation there is between major populations. And to assess this, we use this statistic FST, which has been around for a long time. FST is the amount of genetic variation in the entire population that is attributable to population differences, attributable to subdivision. So one simple way to measure this is to take the total heterozygosity, the total variation in our sample, so that's this quantity, and then we subtract from that the average heterozygosity within each population, in this case, continents, and express the difference as a fraction of the total. So you can imagine that if there were as much variation within each continent as there is in the total sample, then this difference would be zero. In other words, subdivision creates no additional variation. If the average heterozygosity within populations, let's say, were zero, then FST would be one. So FST then measures the proportion of variation in our population due to population differences or subdivision.
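The FST calculation just described can be sketched in a few lines. This is a minimal illustration rather than the method used in the studies discussed: it assumes a single biallelic locus, equally sized populations, and expected heterozygosity 2p(1 - p) in place of directly counted heterozygotes. The function names and example frequencies are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(freqs):
    """FST = (H_total - H_within) / H_total for one locus.

    freqs holds the allele frequency in each population; populations are
    assumed to be the same size, so the pooled frequency is a plain mean.
    """
    pooled = sum(freqs) / len(freqs)
    h_total = heterozygosity(pooled)
    h_within = sum(heterozygosity(p) for p in freqs) / len(freqs)
    if h_total == 0:
        return 0.0  # no variation anywhere, so nothing to apportion
    return (h_total - h_within) / h_total


print(fst([0.5, 0.5, 0.5]))  # identical populations: subdivision adds nothing
print(fst([0.0, 1.0]))       # fixed differences: all variation is between populations
print(fst([0.4, 0.6]))       # a modest frequency difference
```

The first two calls correspond to the two limiting cases in the talk: FST of zero when each population holds as much variation as the total, and FST of one when there is no variation within populations.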
So here's a table summarizing for those populations that I mentioned. FST values among continental populations for a variety of different kinds of genetic systems, and some of these go back in time to when we typically looked at fewer than 100 polymorphisms, and then more recently, larger numbers of SNP array polymorphisms. The take-home point here is that for any of these kinds of genetic systems with different mutational mechanisms and so forth, in all cases, our FST value is about 10 to 15%. That is, the great majority of variation that we see in these populations can be seen within populations. And only an additional 10 to 15% occurs between populations, telling us that in general, human populations tend to be fairly similar to one another. Now it's interesting to compare these systems with an FST measure based on skin pigmentation, because we can do this for quantitative phenotype measures as well. And of course, skin pigmentation has been used in classification of populations for a long time. And it's interesting that we see the flip, the opposite situation for skin pigmentation, where 90% of the variation is actually seen between continents, only 10% within. So our genetically based measures show much, much less variation between continents, something like skin pigmentation, which has been under intense natural selection in human evolution. We see much greater variation. So if we ask the question, what percentage of these SNPs are shared among major regions of the world, Africa, Europe, East Asia, and India, and this is for about a thousand samples using a 250K chip. We see that about 80% of these relatively common SNPs, so these are SNPs with minor allele frequencies greater than five or 10%, about 80% are seen in all four of these groups. 88% are seen in at least three groups, more than 90% shared in at least two groups. About 7% are seen only in the African subset, only 0.5% seen in any non-African group. 
So more variation in the African samples, a commonly observed feature. But the important point here is that for these common SNPs, which are relatively old, because they have to have a certain age to get to a frequency of five to 10%, those tend to be shared widely across the world. And we see that same pattern here. These are SNPs from the 1000 Genomes Project. Here are the European ancestry populations, East Asians and Africans, the Yorubans from Nigeria. We see that the great majority of relatively common SNPs are shared among all three continental groups. And the average allele frequency difference between these major populations is about 15%. But more recently with whole genome sequencing, it's been possible to look at less common variants, the single nucleotide variants. For these, we see that there's substantially less overlap among populations. In fact, for these three major continental populations, the great majority of single nucleotide variants are in fact population specific. For those with frequencies less than 2%, fewer than 5% of alleles are actually shared between any two pairs of continents. And intuitively this makes sense because these rare variants arose relatively recently. They arose typically out of the, later than the human migration out of Africa. So they're more likely to be population specific. And as we analyze rare alleles in our genetic studies and our disease related studies, we have to keep in mind that many of these, most of these rare alleles will tend to be population specific, rather than shared across populations. The same thing is true of structural variants. So this is a figure from the final 1000 Genomes paper, one of the final 1000 Genomes papers. Looking at allele counts, so that for structural variants, these would be variants that are seen only once, variants that are seen 10 times, 100 times, 1000 times. 
And you can see for the very rare ones, they tend to be specific to one of these major populations, Africa, America, East Asia, Europe, or South Asia. For more common variants, we tend to see more frequently that they're shared. And once we get into the relatively common variants, that is those with more than 100 or so counts, they tend to be shared in two or more populations. So that same pattern is seen for structural variants as we see for single nucleotide variants. So how do we actually measure differences between populations? Well, I'll show you a simple genetic distance measure to give you an idea of how we assess these differences. So these values, p sub i and p sub j are allele frequencies in two different populations, i and j. The distance between i and j, between those two populations, is simply the difference between those allele frequencies and we can take the absolute value. So going back to our little array of single nucleotide variant frequencies in three populations that I showed you earlier, our distance between populations one and two can be measured first by subtracting these two frequencies from one another. That gives us a distance measure of 0.08 between these two populations and then we would just average that same difference across all of our single nucleotide variants. So if we have a million of them, we would have an average of a million of these differences between populations. So really quite straightforward measure of genetic distance. We can then display these distances in a network. So if we take one of our single nucleotide variants, we can array our three populations like this. We can look at the first pair of populations one and two. The distance between them, which is given by the difference between these two numbers, can be graphed like this. We can then average their allele frequencies because they're the closest to each other. 
We average them, we subtract that from this allele frequency and that gives us the joining point for this part of the network. So we've now displayed the relationship among these three populations in this simple network. And then of course, once again, we would average these distances across all of our single nucleotide variants. But this gives us a handy graphical display of the genetic relationships among populations. And we call this a neighbor joining tree. Now we can also do this at the individual level. And I like this example. It's directly analogous to allele sharing, but in this case what we're looking at is how often members of the Supreme Court agree on decisions. Every so often the New York Times publishes this. This matrix of percent agreement was published in 2014. And so it shows for each pair of Supreme Court justices how often they agree on decisions. And if we stare at this matrix for a while, we can start to see a pattern. We see that Justice Ginsburg agrees with Justice Sotomayor most of the time, less so with justices Alito and Thomas. But even with just nine people, you have to stare at this for a while to get the idea. Well, you can do what I just showed you using percent agreement and graph a neighbor joining network like this, and you immediately see the pattern of voting in the U.S. Supreme Court. We've even kind of color coded it from red to blue, appropriately. And we see that one wing of the court, these justices tend to agree with one another most of the time. These justices tend to agree with one another most of the time. Justice Kennedy is a little bit out in the middle. So this gives you a very convenient graphical display. Now imagine if we have a thousand individuals and we're looking at percent alleles shared for all pairs of those thousand. If we had a thousand by thousand matrix, you would have to stare at that for days to really get the pattern. You can do one of these networks very quickly and immediately see patterns. 
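The allele-frequency distance described earlier, the absolute difference between two populations' frequencies averaged over all variants, is what feeds a network like these. A minimal sketch follows; the population names and frequencies are made up for illustration, and only the first step of tree building (finding the closest pair) is shown.

```python
from itertools import combinations


def genetic_distance(freqs_i, freqs_j):
    """Mean absolute allele-frequency difference across variants."""
    if len(freqs_i) != len(freqs_j):
        raise ValueError("both populations need a frequency for every variant")
    return sum(abs(a - b) for a, b in zip(freqs_i, freqs_j)) / len(freqs_i)


# Hypothetical allele frequencies at three variants in three populations.
freqs = {
    "pop1": [0.10, 0.50, 0.30],
    "pop2": [0.18, 0.45, 0.35],
    "pop3": [0.60, 0.20, 0.70],
}

# Distance for every pair of populations.
distances = {
    (a, b): genetic_distance(freqs[a], freqs[b])
    for a, b in combinations(freqs, 2)
}

# The closest pair is the first one joined when the network is drawn.
closest_pair = min(distances, key=distances.get)
print(closest_pair)
```

With a thousand individuals instead of three populations, the same loop would produce the large pairwise matrix that the network then summarizes.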
So this is a convenient way of distilling a lot of information into two-dimensional form. Now, these neighbor joining networks require a little bit of interpretation. One of the things that they don't tell us, especially when we're looking at human populations, is when populations actually split. It's sometimes misinterpreted that way, but because human populations have gene flow continuously, these branches, unlike species networks, don't tell us anything necessarily about divergence times. Now another way that we can represent these kinds of differences at the individual or population level is through something called principal components analysis. And in population genetics, we use principal components analysis all the time. If you look at the recent 1000 Genomes papers, they always have these PCA plots. So I want to briefly explain what these actually mean. So let's imagine that we have a series of individuals. We have a graph of their similarities. And what we're trying to do with principal components analysis is to define the major axis of variation in this collection. So it's essentially a regression technique. Our first principal component is a line that goes through that series of points, like our Supreme Court agreements, and tries to account for as much variation as possible along a single line. And then each individual here has a score along that principal component, either down here or up here. That gives us one axis of variation. We can then subtract out the effect of that. We then get what statisticians call the residuals. And then we can run a second principal component through our collection of data to try to account for the second greatest proportion of variation in the data. And this component is statistically independent of this one. And again, each individual will have a score on this second principal component. And that's the basic idea behind principal components analysis.
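The two steps just described, fit the line that captures the most variation, subtract it out, and fit again to the residuals, can be sketched in plain Python. This is an illustration under stated assumptions, not the software used for the published plots: the genotype matrix is hypothetical (rows are individuals, columns are variants, entries are copies of one allele), and the leading axis is found by simple power iteration on the covariance matrix.

```python
import math


def center(rows):
    """Subtract each column's mean so variation is measured around the average."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]


def covariance(rows):
    """Variant-by-variant covariance matrix of already centered data."""
    n, m = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(m)]
            for a in range(m)]


def leading_component(cov, iterations=200):
    """Power iteration: direction accounting for the most variance."""
    m = len(cov)
    v = [1.0] + [0.0] * (m - 1)  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def scores(rows, v):
    """Each individual's position along the component v."""
    return [sum(x * y for x, y in zip(r, v)) for r in rows]


def remove_component(rows, v):
    """Residuals after subtracting each individual's part along v."""
    return [[x - s * y for x, y in zip(r, v)]
            for r, s in zip(rows, scores(rows, v))]


# Hypothetical genotypes: three individuals from each of two groups.
genotypes = [
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 1],
    [1, 0, 0, 1, 2],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 1],
    [1, 2, 2, 1, 0],
]

data = center(genotypes)
pc1 = leading_component(covariance(data))
pc1_scores = scores(data, pc1)

residuals = remove_component(data, pc1)
pc2 = leading_component(covariance(residuals))
pc2_scores = scores(residuals, pc2)

for s1, s2 in zip(pc1_scores, pc2_scores):
    print(f"{s1:+.2f} {s2:+.2f}")
```

In this toy data the first component separates the two groups of three, which is the kind of pattern the plots in the talk display for real samples.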
Try to account for as much of the variation in your sample as you can in a limited number of dimensions. Now, if we go back to our differences between individuals, if you think about it, the distance, the genetic distance between two individuals can be described with just a simple line. Just one principal component, if you've got two people, that describes all of the variation in your data. So the percentage of alleles shared here, let's say is 90%, a line will tell us that. Now, if we have three people in our sample, we need an extra dimension. We can describe now all of the variation with a plane. So three lines defining a two-dimensional surface, showing that the distance here is smallest because the percentage of alleles shared is largest. If we had four individuals, we need three dimensions to account for all of the variation. Five individuals, we need four dimensions, four principal components, and so forth. But that gives you an idea of what we're doing in principal components analysis, reducing our data down to a limited number of dimensions so that we can see patterns more readily. And if we do a principal components analysis, one of my graduate students did this for fun just a few weeks ago on our Supreme Court decisions, we again see really pretty much the same pattern that we saw on that neighbor joining tree. One wing of the court over here, Justice Breyer is a little bit more separated on dimension two, and then the other wing over here and Scalia way down on this component. So again, giving us a convenient display of similarity among these nine individuals. So going back to human populations, here is a population tree that we put together using autosomal Alu polymorphisms. So Alus, mobile elements, are really convenient genetic markers because they insert various places in the genome. If two people share an Alu insertion in the same place, they must share a common ancestor in whom that insertion first occurred. So they're very convenient evolutionary markers.
Here we grouped our samples into various populations coded according to region of origin. And we see some interesting patterns here. We see first of all, that populations do tend to group according to their geographic origin, reflecting the fact that through most of history, you are more likely to mate with somebody five or 10 kilometers away than somebody 5,000 kilometers away. We see more population variation in our African collection of samples. We can also, because Alu systems have an ancestral state and a derived state, either absence or presence of the insertion, denote an ancestral node falling closest to the African group. That's one of the pieces of evidence for an African origin of all modern humans. And we also can, in an exercise like this, assess statistical significance. And we see that these bootstrap support levels, 100%, 97%, are really quite stable, telling us that these major divisions are supported statistically. Now, expanding that to a larger number of single nucleotide variants in 40 populations, we see, again, very much the same pattern with just a larger set of populations. So populations basically grouping according to their geographic location. This is a completely different set of samples assessed a few years ago, both for SNPs and for copy number variants. And again, with a completely different set of samples, we see very much the same general pattern, where geography is correlated with genetic similarity. Now, here's a principal components plot done on those samples. So now we're looking not at populations, but at individuals. We have enough data, a million SNPs, so that for these 800 individuals, we can plot each one. We see that for the first principal component, the biggest source of variation is African versus other populations. The second source of variation going up and down here is basically a west to east cline.
And we can see that, again, populations, individuals from populations, although there's overlap, generally are arrayed according to their geographic location. And here we took a subset of about 500 of those individuals just for Eurasia. And what you see is basically a map of Eurasia. So here we have Northwest Europe, here we have East Asia, here we have South Asia. And you can see that these individuals tend to be arrayed according to their geographic origin, telling us that geographic distance does have an effect on historical mating patterns. But also telling us that for any of these populations, there is overlap where individuals from one population overlap with those from another. Now another pattern we see when we look at these data is that the diversity of haplotypes that is linked groups of SNPs, SNPs close together on the same chromosome when we look at them together and form a haplotype, diversity is highest in Africa, lower in Asia and Europe, still lower in Polynesia, still lower in the Americas. So basically, as we proceed out of Africa, the amount of diversity tends to become smaller and smaller. And this is consistent with what we call a serial founder effect, a form of genetic drift, that randomizing component I mentioned earlier, so that with distance from Africa, there is increasing genetic drift, because as populations came out of Africa, they were a subset with smaller population size, there's greater genetic drift, less diversity. And as we proceed further out of Africa, subsets of that subset colonized other parts of the world. So we refer to this as a serial founder effect, after founder effect, after founder effect, as we go across the world. And all of this is consistent with this hypothesis of a recent African origin of anatomically modern humans, which is now pretty well accepted in the population genetics arena. 
The idea is that about 40 to 80,000 years ago, anatomically modern humans, people who looked pretty much like us, came out of Africa, colonized Eurasia, and ultimately later colonized the New World, and even later, Polynesia. And as I'll mention after a little bit, there is evidence of mixture with other more archaic populations as humans came out of Africa. But the basic idea is anatomically modern humans arising in Africa about 200,000 years ago, accumulating genetic variation, then a subset of that population going out to colonize the rest of the world. Now, this is an alternative model, one that I ran across in the supermarket, oh, a decade or so ago. I saw this headline that Adam and Eve's skeletons had been stolen. Now, I didn't even know that they had been found, but that aroused my curiosity. So I actually bought this copy of the Weekly World News, because as it says here, there are more amazing photos inside. What I learned was that all that's left is Eve's leg, and the identity of the perpetrator may have been established. So as I said, an alternative model, but one that our data don't support very well. So we can use principal components to actually finely distinguish among populations, even populations that are relatively closely related. This is from a recent analysis we did that includes a couple of populations from Tibet, two different linguistic groups from Tibet, located just a few hundred miles from each other, but with principal components analysis, we can actually distinguish them from each other pretty well. Again, each dot represents an individual. Here are two different populations of Mongolians, one high altitude, one low altitude. So with enough data, we can distinguish individuals from various populations with some degree of accuracy. This is a similar analysis that was published a few years ago in Nature by John Novembre of 3,000 Europeans, doing again a principal components analysis.
So here's the first principal component that essentially gives us a northwest to southeast cline, and then a second independent principal component. And what's interesting about this is that these individuals are to a large extent arrayed according to country of origin. Now, there was a stipulation that three of four grandparents had to come from the same country, so that essentially limited the effects of recent gene flow, but you can see that essentially what this gives you back is a map of Europe. Again, with overlap among populations for these individuals, but in general, a fairly good map of Europe. In fact, on average, people could be traced back to their place of origin within about 300 kilometers. Now, I just have to show this slide. This is a principal components analysis that we published more than 30 years ago, showing essentially the same thing using only 15 loci. This is at the population level. Had we used individuals, we wouldn't have been able to see much of a pattern, but by looking at allele frequencies in these populations for just 15 loci, and these were old-fashioned blood groups and protein polymorphisms, we were able to essentially recreate that map of Europe. Since we were in Utah, we also looked at the Utah Mormon or LDS population and showed that they're actually quite similar to the populations from which they were derived, indicating a lack of genetic drift in that population. But the bottom line here, genetic distances for the most part recapitulate ancestry, geographic location, and history. Now, the data that I've been showing you so far have been primarily SNP array data and other systems. Now that we can do whole genome sequencing, we can learn a lot more about population history. One of the problems with most microarrays is that the polymorphisms were selected initially for higher frequency and diversity, primarily in studied European populations.
There are some microarrays that attempt to get around that, but for the most part, this is the case. In contrast, with complete DNA sequences, we have an unbiased representation of each genome, and we get not just common variants, but also rare ones. And we can use techniques like the coalescence method to infer things like population sizes in the past, and I'll show you a simple example of that. So this is a paper published by Andy Clark more than a decade ago, showing the effect of ascertainment bias on allele frequencies when using microarrays. So from the HapMap samples, if we look at what we call the allele frequency spectrum, so basically we say what proportion of SNPs have minor allele counts of one, two, three, four, and so on. So this represents the proportion of SNPs that would be very rare in our data set, and you can see that for HapMap, the proportion of rare SNPs is relatively small. It's underrepresented compared to what we would expect at equilibrium between mutation and drift, and it's substantially less than what we see in complete sequence datasets, like one from Perlegen and one from the NIEHS. So bottom line, with microarray data, there's an important part of the spectrum of variation that is substantially underestimated. So more recently, we've been able to look at exome data, this paper published several years ago in Science, and now what we see, for both African-American and European-American individuals that they looked at, is that in this bin where minor allele frequencies are 0.5% or less, there's actually a substantial excess of variation, and what that excess tells us is that human populations have undergone a massive recent expansion. So how does that work? Well, and in fact, if we look at the percentages, what that study indicated was that 73% of all protein coding variants and even more, 86% of deleterious SNVs, have arisen in just the last five or 10,000 years as a result of massive human expansions.
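For reference, the equilibrium expectation that the HapMap and sequence spectra are compared against can be written down directly. The sketch below assumes the standard neutral model (constant population size, each mutation at a new site), under which the expected number of sites with derived allele count i in a sample of n chromosomes is proportional to 1/i; folding to minor allele counts combines count i with count n - i. The function name is illustrative and the sample size is arbitrary.

```python
def neutral_folded_spectrum(n_chromosomes):
    """Expected proportion of variable sites at each minor allele count.

    Assumes the standard neutral model at mutation-drift equilibrium.
    Returns a list whose first entry is the proportion of singletons.
    """
    n = n_chromosomes
    weights = []
    for i in range(1, n // 2 + 1):
        w = 1.0 / i + 1.0 / (n - i)
        if i == n - i:
            w /= 2.0  # the middle class would otherwise be counted twice
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


spectrum = neutral_folded_spectrum(20)
for count, proportion in enumerate(spectrum, start=1):
    print(count, round(proportion, 3))
```

A data set with fewer singletons than this, as in the array-based samples, is missing rare variants, while one with more, as in the exome data, points to the recent expansion discussed next.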
So to understand why expansion gives us an excess of rare alleles, think about a family. If we have this small family, a quartet, and a new variant arises on this chromosome copy, there's actually a good chance that that variant will simply go extinct. There are only two offspring, so the chance that neither of them will get the variant is one quarter. So even though a new variant has occurred, it's immediately lost. Now in contrast, if this were a very large family, that variant occurs, the extinction probability is only one half to the 10th, that is one in a thousand. Chances are at least one offspring will inherit that new variant. And if they have a lot of offspring, chances are it will be transmitted again. So families like this would be seen in a rapidly expanding population. In a population like that, we're going to see an excess of rare alleles that would in a constant population tend to be lost due to drift. So that signature that we commonly see in these human allele frequency spectra is a good reflection of this large expansion in human population that took place largely after the advent of agriculture. So the most complete summary we have now of sequence data like this comes from the 1000 Genomes Project. The final paper was published in Nature just last year, based on 2,500 individuals from 26 different populations. Here you see how they're distributed across the world. And this has turned out to be a very, very useful reference for human genetic variation. I won't go through this very large table in any detail, but I think it's a very useful summary of human genetic variation for a whole series of different kinds of polymorphisms. You can see SNPs, indels, copy number variants, mobile elements, and so forth. One of the patterns that emerges here is that in African populations, there is about 20% more variation than in others. In Native American populations, the variation tends to be somewhat less. 
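The family example earlier in this passage comes down to one line of arithmetic. The sketch assumes that each child independently inherits the chromosome copy carrying the new variant with probability one half; the function name is illustrative.

```python
def loss_probability(n_offspring):
    """Chance that a new variant on one parental chromosome reaches no child."""
    return 0.5 ** n_offspring


for n in (2, 10):
    print(n, loss_probability(n))
```

Two children give one quarter, and ten children give 1 in 1,024, which is the "one in a thousand" quoted above.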
This graph summarizes for each individual the number of variant sites per genome. So here we see kind of at the lower end members of this population. So this is a Finnish population, Great Britain. This is the CEPH from Utah. And you see that they tend to be at the lower end of the spectrum of variation. These are Native American populations that tend to be mixed from various sources. So some of them have relatively low variation, others more. And then these are African populations where we tend to see the highest level of variation, and for African American populations, depending on the degree of African contribution to the genome for any individual, there may be more or less variation. But this is a very convenient display, I think, of human genetic variation across the world from the 1000 Genomes data set. Now this is another big study of human genetic variation, designed very differently from 1000 Genomes. And we've been involved in this one with David Reich at Harvard, the Simons Genome Diversity Project, where 300 people from 142 different populations across the world have been sequenced, and at a fairly high depth, at 40x sequencing. The average depth for 1000 Genomes was about 7x. As you can see, these populations are very, very widely distributed across the world. So we think that this gives us a very good indication of genetic diversity in a much broader sample of human populations. These papers are just starting to come out. This one came out in Science last year. There's another one on single nucleotide variation, currently under review. But the Science paper looked at copy number variation in these samples. And these are principal components plots showing essentially the patterns similar to what we've seen in studying other kinds of systems. This plot looks at heterozygosity in copy number variants for deletions and duplications. The basic pattern here, greater variation in Africa than elsewhere.
We also see that there's a quite strong correlation in heterozygosity for single nucleotide variants versus copy number variants. So the two different kinds of systems, even though they have different mutational mechanisms, give us quite a similar pattern of variation across populations. So I mentioned that we can use the coalescence method with sequence data to estimate important parameters of population history. So the basic idea behind coalescence is that we can look at a sample of individuals in the present day and we can assess for any particular allele. Here we have three copies of an allele in this sample. We can estimate the coalescence time, that is how far back in time a common ancestor for these individuals would be found in whom that variant arose. So for these two, we have a coalescence here and then for the other one, we have a coalescence further back in time here. But what we're looking for is where in time we can find the common ancestor for a given allelic variant. So all three of them would coalesce back some number of generations in the past. Now if you think about this intuitively, if we have a very large population with a long history, these coalescence events will tend to occur many, many generations back in time. Whereas if we have a small population with little variation, coalescences will tend to occur relatively recently. So we can analyze the pattern of coalescences in a series of genetic data to infer previous population sizes. And we can also infer exchanges, that is gene flow between populations by looking at shared coalescence events and shared genomic segments. So this allows us to make a model of human population history that looks like this. So here we have a demographic model of the history of our species. The line width here corresponds to the effective population size. And this goes back 150,000 years ago. And essentially what we're seeing here is a larger African founding population. 
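The intuition that larger populations have deeper coalescences can be illustrated with the textbook neutral coalescent for a constant-size diploid population, in which the expected time spent with k lineages is 4N/(k(k-1)) generations. That model is an assumption brought in here for illustration; it is not the demographic inference method behind the figure, which allows population size to change over time. The function names and the population sizes below are hypothetical.

```python
def expected_coalescence_intervals(n_samples, effective_size):
    """Expected generations spent with k lineages, for k = n down to 2,
    under a constant-size neutral model with diploid effective size N."""
    return {
        k: 4 * effective_size / (k * (k - 1))
        for k in range(n_samples, 1, -1)
    }

def expected_tmrca(n_samples, effective_size):
    """Expected generations back to the common ancestor of the whole sample."""
    return sum(expected_coalescence_intervals(n_samples, effective_size).values())

# Three sampled copies, as in the diagram, in a small and a large population.
for size in (1_000, 10_000):
    print(size, round(expected_tmrca(3, size)))
```

Under this model the expected depth scales directly with population size, so a tenfold larger population pushes the common ancestor about ten times further back.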
And then a small subset with a size of only 1,800 or so, going off to colonize the rest of the world, undergoing a substantial bottleneck in size, a bottleneck for European populations of about 1,000, for Asian populations of about 500, and then subsequent rapid expansion of these populations after 20,000 years or so ago. So with sequence data, we can actually infer these parameters with substantial accuracy. How large was the previous human population? What were the major patterns of gene flow among populations? So this really helps us to understand and interpret patterns of genetic variation in populations, including the variation that gives rise to disease. So the population bottleneck that we observe here explains the reduction in human genetic diversity that I showed you earlier, humans compared to other great apes. And the recent expansion explains the excess of rare alleles, some of which are disease causing, that we see in humans. Now another thing that some of my colleagues have been able to do is to compare ancient Neanderthal sequences with those of humans. Neanderthals and anatomically modern humans diverged about a half a million years ago, but then as I showed you in that graph earlier, as modern humans came out of Africa, they intermixed with Neanderthals, and now by comparing human and Neanderthal sequences, we know that on average, non-Africans have about 1 to 4% Neanderthal DNA. And some of that DNA is involved with things like skin pigmentation, there are some immune response genes. So these are things that some of our ancestors probably got from Neanderthals that may have actually had adaptive significance. And if you send your DNA off to a direct-to-consumer testing company, they will actually estimate your own proportion of Neanderthal DNA, which is kind of amusing. Now we can also do this kind of analysis if we have sequence data from specific populations. So one of the heavily studied human populations is the Ashkenazi Jewish population.
And so this is a diagram going back in time, showing essentially the effective size of ancestral populations here at 20,000, expanding to the Ashkenazi population, which recently received a lot of gene flow from European populations, estimated at about 50% of that population, that is 50% of Ashkenazi DNA coming from nearby European populations. This is work that Harry Ostrer recently published. And then very recently, about 700 AD, the Ashkenazi population is estimated to have undergone a bottleneck, reducing the population to only about 300, the effective population size. Now this is pretty remarkable that the subsequent population, which has then expanded substantially, would be derived from a founding population effectively of about 300 people. That helps to explain the high frequency of several disease-causing variants in the Ashkenazi population. For example, about 1 in 40 Ashkenazi individuals has one of three founder mutations in the BRCA1 or BRCA2 genes, compared with one in 200 in the general population. There is an APC mutation causing colorectal cancer seen in 6% of that population. And of course, everyone is familiar with the lysosomal storage disorders like Tay-Sachs, Niemann-Pick, and Gaucher, that are again relatively common in that population. All of this can be ascribed to that extreme bottleneck that occurred about 1,300 years ago in that population. So we have, I think, a good explanation for the variant frequencies of these conditions in that specific population now that we've been able to look at whole genome sequences. And conversely, there are a number of diseases common in other populations, rare in this one, because drift, of course, works in both directions. So I wanted to talk a little bit about what genetics can tell us about this, I think, always controversial concept of human race. I put race in quotes because I don't actually use the term myself in my own writing, but it is used and it is debated.
And there have been a whole series of articles debating the utility of the concept. This was a paper in the New England Journal now about 15 years ago, asserting that race is biologically meaningless. This was a response in the New York Times from a psychiatrist who uses racial categories in helping to decide dosages of psychotropic drugs. Very recently, there was a nice piece in Science advocating taking race out of human genetics. And then this piece in Scientific American by my former trainee, Mike Bamshad, and Steve Olson, a science writer, asked the question, does race exist? Now the thing I found amusing about this was that here it says, science has the answer. I'm always a little skeptical when it's claimed that science has the answer. Usually we have more than one, but I think that science can tell us something about this concept and can illuminate our understanding of the concept of human race. So this is an exercise where we looked at sequence variation in a single human gene, the angiotensinogen gene involved in the renin angiotensin pathway that regulates blood pressure. So what we did was to sequence just that gene and then compare individuals in Asia, Europe and Africa. And what we found was for that single gene, and each of these tips represents one individual, sometimes you can see that someone from Africa is actually genetically more similar to someone from Asia, or someone from Europe is more similar to someone from Africa than to other Europeans. So for a single gene, and we see this often for individual genes, people from completely different continents can be more similar to each other than people from the same continent. And this actually, essentially, rediscovers something that Darwin said more than a century ago. It can be doubted whether any character can be named, which is distinctive of a race and is constant.
In other words, there is no single character that we can use, a gene or anything else, that is always present in one population, always absent in another. And this reflects the shared history of humans and the fact that no human population has been completely isolated for a long period of time. We are a complex mixture of populations going back through time. Now, we repeated this exercise using, at that time, 190 polymorphisms. And now what you see in this neighbor joining network is that individuals from East Asia, from Europe and from sub-Saharan Africa do fall into three groups. Now these branch lengths are very long. Again, repeating what we said earlier that most variation is found within these populations, but there is enough detectable variation between populations, that 10 or 15% that we talked about, so that we can see three groups. Now these are, I think it's important to point out, geographically separated sub-Saharan Africa, Europe, and then East Asia. So that tends to make them fall into groups. But what we see is that with more markers, with more variation, we do see some reflection of geographic distance and ancestral history. And to, I think, clarify this, if we use a simple example, we can look at height in females and in males. And if we simply look at one character, there's going to be a substantial overlap between males and females. If we add another character, like waist-hip ratio, that overlap is going to tend to decrease. So looking at more characters, we learn more about population history, and we tend to see that reflected in our genetic data. Here's another exercise that I think brings the point home. Here we looked at about 500 people. If we just use 10 SNPs and then do a principal components plot, the kind that I told you about earlier, here we're doing it actually in three dimensions. So we have a third dimension that kind of comes up out of the paper. We really don't see any pattern here with just 10 SNPs. 
It's really very little information. If we use 100 SNPs, we start to see some organization by population affiliation. If we look at 1,000 SNPs, we can actually see, again, these major continental groups, individuals essentially grouping together. And with 10,000 SNPs, the pattern is even more clear. So if we have enough information, we can begin to discern something about ancestry. So with multiple polymorphisms, we can to some extent predict population affiliation because there is enough distinct variation to allow us to do that, but only with a lot of data. Now I think a very important point here, and this really gets back to the controversy, is that population affiliation can't in turn predict individual genotypes or traits. So we can go in one direction, but we really can't go back in the other. And that's because these traits, genotypes, do tend to be primarily shared across populations. They differ in frequencies, but there are very few that would be present in all members of one population and absent in all members of another population. So I think this is one of the fallacies that has to be avoided: if we have a self-described population affiliation, we can't necessarily make inferences about genotypes. This is a principal components analysis that we did just very recently using the 1000 Genomes data. So here we have a series of populations from Europe, from Africa, from Asia, and then we also have the African-American individuals included in the 1000 Genomes Project. The important point here is that there are a number of African-Americans in this plot that would be more similar genetically to the Asian or European populations than to African populations because of the complex history of mixture in that population.
And so that tells us that there is a lot of genetic variation in the African-American population and that you really can't necessarily ascribe individuals to a specific population group and that individual ancestry would be really much more informative here. And that really, I think, underscores the fallacy of thinking typologically when we think about human populations. If we think about humans as belonging to types or quote races, we tend to put them into discrete boxes when in fact, our studies of genetic variation tell us that for the most part, variation is overlapping among populations. And I think it's more informative really to think of each of us in terms of our individual ancestry. For example, here, an individual who has a genetic constitution that is 90% African, 10% European would probably self-identify as African-American but somebody with a more complex ancestral legacy would possibly also identify as African-American but genetically, they're very different and I think that illustrates why individual ancestry rather than the traditional concept of race is going to be more informative, especially as we deal with individual people, individual patients. I wanted to just give you an example of individual ancestry using my own ancestry. I sent my DNA off to one of the direct-to-consumer companies just for fun a few years ago and it was interesting to get the results back. How many of you have sent your DNA to one of these companies? Well, a few people have done it. Okay, well, it is interesting. They call it recreational genomics so it has to be taken with a grain of salt but it is kind of fun to look at the results. These are my Y chromosome results so I have a haplogroup that is seen with greatest frequency in Northwest Europe. That's where my grandfather said they were from. They said they were from Norway so this agrees with my own family history as far as I know it. 
Now, one of the interesting things is that this Y haplogroup, I share it with Jimmy Buffett and Warren Buffett. Hasn't done anything for my singing or my investing ability, but a little interesting factoid. Now, my mitochondrial genome was also examined. Highest frequency for that again in primarily Western Europe, but you can see that that mitochondrial haplogroup is fairly generally distributed across Europe and into Asia and Africa. Now, another thing that is done, and we can do this with our DNA sequences, is something that they refer to as ancestry painting. So essentially for each chromosome segment in a person, in this case in me, we look at which alleles are present and then we ask the question, in which continent is that allele most frequent? And so that gives you sort of a paint across the chromosomes. I was sort of disappointed to see that I have a pretty boring genome with, at least according to this, all of my ancestry from Europe and, on a finer level, almost all of it from Scandinavia. But as far as I know my history, that is consistent. And it tells us that our DNA, reflecting all of the events that have occurred in our past, migrations, bottlenecks and so forth, can tell us something about our ancestral origin, though as I said we have to take this with a grain of salt because the reference samples are somewhat limited and populations have moved over hundreds and thousands of years. But it is sort of interesting to see the pattern. Now we can contrast this with the pattern seen for a self-identified African American male. So we see the ancestry painting, and one half of the chromosome would be paternal, the other half maternal, and we see that for this person about 33% of ancestry is traced to Africa, about 64% to Europe.
Now the important point here is that for, let's say, a medically relevant locus, let's say one pertaining to hypertension, this person at the individual level may well be European rather than African, and what that tells us is that we should really be looking at individual ancestry rather than self-described population affiliation to assess more accurately someone's genetic inheritance. And so that's one of the implications of these findings, I think, for biomedical research. Certainly if we look at a large number of DNA polymorphisms we can learn something about ancestry, about population history, though it can be rather approximate, but the variants that we're looking at typically just differ in their frequency across populations and there is, as we've seen, substantial overlap among populations. So this is one of the implications. This is an interesting published meta-analysis of blood pressure response to ACE inhibitors, and here we see the decrease in systolic blood pressure in two populations, European American and African American, and we see that there is on average about a five-millimeter difference between the two groups in the amount of blood pressure decrease after the administration of an ACE inhibitor. But the important point here is that there is substantial overlap between these two distributions, so many persons in this population could benefit more from an ACE inhibitor than persons in this population. Again, stressing the importance of treating each patient as an individual rather than a member of a self-described population. Here's another example, EGFR inhibitors used in the treatment of non-small cell lung cancer. So both gefitinib and erlotinib are small molecule inhibitors of EGFR tyrosine kinase activity.
It's interesting that they have been found to be effective in about 10% of Europeans with non-small cell lung cancer and about 30% of Asians, so there is a population difference in response to EGFR inhibitors. But if we look at the gene directly, at somatic gain of function mutations in EGFR, we see that those gain of function mutations, for reasons that aren't well understood, are more common in Asians than in Europeans. And in fact, about 70 to 80% of patients who have those mutations respond to gefitinib, while fewer than 10% of those without the mutations respond to that drug. So looking at individuals and looking at their own sequence differences at EGFR is much more predictive of response to this drug than looking at population affiliation. So I think this is again a good example of individualized, personalized medicine, looking directly at genes rather than using population categories. So I think for the issue of genetic variation and race, we see that genetic variation is indeed correlated with geographic origin, but it tends to be distributed often continuously across space. That means it's hard to define precise borders between populations. So I think what it says is that race, while it may not be completely meaningless biologically, since we can see differences at the DNA level among populations, is biologically very imprecise. It's a blunt tool. We can do better with genetic analysis, and by looking at individual ancestry I think we can get ultimately much more useful medical information. So I think in that way genetics has enhanced our understanding of differences and similarities among populations, and of course there's nothing in our genetic results that would suggest that one population is in any way superior or inferior to another.
Now the last topic I want to mention today, another application of our studies of genetic variation, pertains to the use of linkage disequilibrium in disease gene mapping, and let me just ask the audience here, how many of you are familiar with the concept of linkage disequilibrium? Okay, about maybe a third. So let's go through a quick definition. Basically linkage disequilibrium refers to the non-random association of alleles at linked loci. So if we imagine in a population that we have two loci, we'll call them A and B, and they both have alleles, big A and little A, big B and little B. At equilibrium we're going to see pretty much a random assortment of haplotypes containing either big A and big B, little A and little B, big A and little B and so forth. Whereas under disequilibrium we see a preferential assortment of haplotypes, here big A and big B, little A and little B. And we can actually quantify this using allele frequencies. If the frequencies of big A and little A in our population are 60 and 40%, and of big B and little B 70 and 30%, then we can make a prediction under equilibrium, that is if there is no preferential assortment of these linked alleles on chromosomes. We would expect that if we surveyed a population, in 42% of our chromosome copies we would see big A and big B together on the same copy of a chromosome, because that's simply the product of their allele frequencies, 60% times 70%. We would expect to see big A and little B 18% of the time, 60% times 30%, and so on. That's at equilibrium. There our population frequencies of haplotypes are predicted exactly by the individual allele frequencies. If they are independent we can simply multiply them together. But let's suppose we see this pattern instead, where instead of 42% of our haplotypes having big A and big B it's 60%, and instead of 12% of our haplotypes having little A and little B it's 30%. Well that's a substantial deviation from what we would expect under independence.
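The haplotype arithmetic in this example can be written out as a short Python sketch. The expected frequency is the product of the allele frequencies (0.6 × 0.7 = 0.42 for big A with big B), and the deviation of the observed frequency from that product is the disequilibrium coefficient D. The normalized measures D' and r² computed below are standard summaries that are not named in the talk; they are included as an assumption about how one might report the result, and the function name is hypothetical.

```python
def ld_statistics(p_A, p_B, observed_AB):
    """Compare an observed A-B haplotype frequency with its expectation
    under independence and summarize the deviation."""
    p_a, p_b = 1 - p_A, 1 - p_B
    expected_AB = p_A * p_B          # product of allele frequencies
    d = observed_AB - expected_AB    # disequilibrium coefficient D
    # Largest |D| that these allele frequencies allow, given the sign of D.
    if d >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_A * p_a * p_B * p_b
    r_squared = d * d / denom if denom > 0 else 0.0
    return {"expected_AB": expected_AB, "D": d,
            "D_prime": d_prime, "r_squared": r_squared}

# Frequencies from the example: A = 60%, B = 70%, observed A-B haplotype = 60%.
stats = ld_statistics(0.6, 0.7, 0.60)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

For these numbers D is 0.18 and D' is 1, which reflects the fact that the big A, little B haplotype is entirely absent when 60% of haplotypes carry big A with big B.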
That would be an instance of linkage disequilibrium, the preferential association of these two alleles. And what that typically reflects is the distance between loci, because if you think about it over time, over many generations, loci that are further apart like A and B have had more opportunity for recombinations to occur and shuffle the combinations. These two loci, being very close together, have had less opportunity for recombination to occur. So ultimately we're more likely to see an association of alleles between these two very closely linked loci than between these two more distantly linked loci. So linkage disequilibrium essentially reflects this pattern of recombination occurring over many, many hundreds of generations, especially for closely linked loci. So we can use it to infer the distance between closely linked loci. Now there are a number of factors that affect these patterns, chromosome location, for example. We know that within genes there tends to be less linkage disequilibrium than outside of genes. We know that DNA sequence patterns, things like GC content, influence linkage disequilibrium: less disequilibrium, more recombination where we have a higher GC content. Also, Alu elements have been shown to increase local recombination rates. And we know now that every 50 or 100 kb or so there are recombination hotspots where recombination activity is elevated about 10 fold over the general level of recombination in the genome. One of the factors involved in that is a zinc finger protein called PRDM9 that's associated with close to half of hotspots and actually varies among populations and accounts for some of the variation in recombination that we see among human populations. And of course evolutionary factors affect linkage disequilibrium.
All of the factors that we mentioned earlier, selection, gene flow, mutation, genetic drift, can affect linkage disequilibrium through the history of populations, and also the time that has elapsed since a population was founded. Populations that were founded a long time ago, more time for recombination to occur, in general less linkage disequilibrium. And this is borne out in the 1000 Genomes data. This is the most recent version, showing that this group of populations has a more rapid decay of linkage disequilibrium between SNP pairs. So here is the distance in kb between each pair of SNPs in the population and then this is a measure of linkage disequilibrium between each pair. And we see that for all of the populations from Africa there is a more rapid decline of disequilibrium with physical distance than in other populations. And for example, the Finnish population shows a somewhat less rapid decline, reflecting the more recent founding of that population. So this is a nice illustration of how essentially the age of populations influences patterns of linkage disequilibrium. But the bottom line from these studies is that because many SNPs in the genome are in linkage disequilibrium, they're redundant. So in our genotyping studies we only need to type a subset. So for example, if this person has this allele C at this position they're more likely to have T and A at these positions. Whereas if somebody has G at this position they're more likely to have C and C at those positions. So the alleles are in linkage disequilibrium and that means that we really only have to type this one in order to know what these most likely are.
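One way to picture this redundancy is a small Python sketch that picks a subset of SNPs such that every other SNP is strongly correlated with at least one chosen SNP. The greedy strategy, the r² threshold of 0.8, the function names, and the toy haplotypes are all assumptions made for illustration; this is not the procedure used to design any particular genotyping array.

```python
def r_squared(col_x, col_y):
    """Squared correlation between two SNPs coded 0/1 across haplotypes."""
    n = len(col_x)
    p_x = sum(col_x) / n
    p_y = sum(col_y) / n
    p_xy = sum(1 for x, y in zip(col_x, col_y) if x == 1 and y == 1) / n
    denom = p_x * (1 - p_x) * p_y * (1 - p_y)
    if denom == 0:  # one of the SNPs does not vary in this sample
        return 0.0
    d = p_xy - p_x * p_y
    return d * d / denom

def greedy_tag_snps(haplotypes, threshold=0.8):
    """Repeatedly pick the untagged SNP that tags the most untagged SNPs."""
    n_snps = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(n_snps)]
    covers = {
        j: {k for k in range(n_snps)
            if k == j or r_squared(columns[j], columns[k]) >= threshold}
        for j in range(n_snps)
    }
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Ties are broken in favor of the lowest SNP index.
        best = max(sorted(untagged), key=lambda j: len(covers[j] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

# Hypothetical haplotypes (rows) at four SNPs (columns): the first three
# SNPs always travel together, the fourth is independent of them.
toy_haplotypes = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
print(greedy_tag_snps(toy_haplotypes))
```

In the toy data, typing the first SNP is enough to know the second and third, so only two of the four SNPs need to be genotyped.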
So we can designate those as tag SNPs and something that our studies of genetic variation have told us is that we can get relatively complete coverage of a genome in a genome-wide association study with something like one and a half million SNPs for African derived populations because there's less linkage disequilibrium there and something like a half million to a million SNPs for non-African populations because there's more linkage disequilibrium. So our studies of population history then can inform, help to inform our design of these genome-wide association studies our design of SNP microarrays. And that successful design has led to these kinds of findings. I think you'll hear more about this later but the many thousands of significant associations now seen between various SNPs and traits in human populations. Now recombination hotspots are also, we've been informed about these through studies of linkage disequilibrium. As I mentioned, there's one every 50 to 100,000 base pairs in the human genome. And we estimate now that about 60% of all recombination occurs in just 6% of the genome at these hotspots. And very interestingly, hotspots aren't actually congruent in humans and chimps. They're very different from one another indicating rapid evolution of hotspot activity in primate species. Now natural selection creates regions of strong linkage disequilibrium. So we can use linkage disequilibrium actually to learn something about natural selection in populations. And this diagram illustrates the principle. If we imagine that a new variant arises here on a chromosome background, there will be SNPs nearby that are going to be in strong linkage disequilibrium because whenever we see this variant, we're going to see these SNPs. But through time, recombination is going to shuffle these so that a smaller and smaller haplotype through time is going to be associated with that variant. 
So under neutrality, that is where there is no selection, this variant will increase in frequency only very slowly so that by the time it attains a frequency of say 10%, it's on a relatively small associated SNP haplotype background. But if there's been rapid positive selection for that variant, it will essentially drag the nearby SNPs along with it and you will see a region of high linkage disequilibrium, because this variant has risen to high frequency so quickly that recombination hasn't had time to reshuffle these nearby SNPs. So this is one of the signatures that we look at in genomes to indicate that that region has been under recent rapid selection. And we now have some good examples in humans where there are extended regions of disequilibrium and homozygosity that are the result of recent rapid selection: G6PD for malaria, one of the cytochrome P450s for sodium retention, the lactase enhancer for hereditary lactase persistence, several skin pigmentation loci, and this one that I'll just talk about for a couple of minutes here at the end, a high altitude hypoxia response. Two genes in the hypoxia inducible factor pathway. So if we look in Tibetan populations, and this is work that we've published over the last few years, we see that Tibetans have regions of elevated linkage disequilibrium, extended homozygosity for these two genes that are both in the hypoxia inducible factor pathway and also for oxygen sensing genes, in particular this heme oxygenase two gene. So the yellow indicates the ancestral allele, the red is the more recent derived selected allele. And what we see here is that for individuals, so each row here is an individual and each column is a SNP, we have large regions of extended homozygosity, of extended linkage disequilibrium. This is a signature of recent positive selection, in this case for genes that affect response to hypoxia.
And these populations, Tibetan populations, live at an average elevation of 14,000 feet, where they have about 30% less oxygen than we have here at sea level. These selected haplotypes are associated in Tibetan populations with reduced hemoglobin levels. Now that might seem paradoxical because you would think at high altitude you would want to make more red blood cells, more hemoglobin in response to low oxygen, and that in fact is what we do. Our acute response to high altitude is to increase erythropoiesis, make more red blood cells. The problem with that is that that makes us susceptible to high altitude pulmonary edema, high altitude cerebral edema. It clogs up our circulation. Tibetans have evolved so that they can live at high altitude with reduced hemoglobin levels, protecting them against polycythemia, against increased numbers of red blood cells. And this is an experiment that we did, putting the Tibetan specific mutations in the prolyl hydroxylase gene into erythroid progenitor cells. This is the wild type under normoxia and under hypoxia in these in vitro systems. The wild type actually increases activity. In other words, erythropoiesis is going up. This is what we would do at high altitude. Here are the cells with the Tibetan variants in prolyl hydroxylase: at normoxia, and then under hypoxia, activity decreases. So recapitulating the Tibetan phenotype in a cell culture system. So I think this is a nice example of how we can go back, look at sequence variation in populations to better understand how they have adapted to interesting environmental conditions. In this case, very, very low oxygen availability. And this is from a paper just under review now. So this is a genome wide selection experiment taking advantage now of whole sequence data. And what we see is that these two genes in the hypoxia inducible factor pathway are by far the genes under strongest selection in the Tibetan population.
And by the way, this one, EPAS1, was contributed to that population by an ancient population called the Denisovans, which were a sister group of the Neanderthals. So this was one of those genetic adaptations, in this case to high altitude, that came into this population from a completely different source. So I think a very interesting story. So population genetics, to sum up, is I think helping to guide the development of new sequence analysis resources. The 1000 Genomes Project is a great example where those data are used as reference sequences or control sequences in thousands of analyses. We've learned among other things from those data that rare variants tend to be population specific, as we saw. We're also learning a lot about the functional significance of these genetic variants because we know that functional regions in the genome, whether coding or non-coding, tend to show more evidence of purifying selection. And we can actually use those to more effectively identify functional regions of the genome. So to sum up what I've told you about today, genetic variation does contain, I think, useful information about population history, about individual ancestry. Our studies of individual variation, I think, do give us a more informed, a more sophisticated view of the concept of race and its relevance or lack of relevance to medicine. Population genetic analysis has informed us, I think, in very important ways about linkage disequilibrium, the effect of evolutionary factors on it, and how we can most effectively apply it in disease gene mapping. And I think that our analyses of population genetics are going to become even more important now as we come to understand the role of rare variants in disease, now that we can relatively cheaply obtain whole genome sequences. And finally, I hope that you've seen, I hope you agree with me, that population genetics actually can be quite a lot of fun.
We can learn some really interesting things from it, and with the avalanche of data now coming in, I think over the next decade or two, we're going to learn even more. So thank you all very much. So, given the hour, we'll just take questions at the podium. Thank you all for coming. We'll see you all again next week.