Okay, I was asked to address the topic of ancestral diversity, so this will be a bit of a change of pace from the last couple of talks. What I'll be addressing this afternoon is, first, a little background on human genetic variation: essentially how we've been thinking about it over the last several decades and what the general patterns look like. Then I'll talk about how sequencing, just in the last couple of years, has changed some of these views, and how that in turn pertains to some of the questions we've been discussing over the last day.
One of the most basic questions we can ask is: how is genetic variation distributed among populations? Here we'll use a convenient unit of subdivision, not always the best or most appropriate one, and that's the major continents. What we see, for a variety of different kinds of genetic systems, whether they're short tandem repeats, the old restriction site polymorphisms, mobile elements, or SNP data, is that if we look at the apportionment of variation, most of it occurs between individuals within these major populations. Only a relatively small proportion, typically 10 to 12%, is due to differences between populations at that level, and that's very consistent across different kinds of genetic systems. To put that in perspective, we can compare it with, say, skin pigmentation, which varies a great deal among continents and has been subjected to a lot of natural selection, where the great majority of variation actually is found between continental populations.
Another way of looking at this is to take SNPs from a 250K chip and ask what proportion of the minor alleles are shared among populations. We can see that the great majority typically are, and the only continent that has very many population-specific SNPs, 7%, is Africa.
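The apportionment of variation described above is commonly summarized with a statistic such as Wright's F_ST, the share of total expected heterozygosity that is due to differences between populations. The talk does not say which estimator was used, so the following is only a minimal sketch for a single biallelic SNP, assuming equally weighted populations; the function name and the allele frequencies are hypothetical and made up for illustration.

```python
def fst_biallelic(freqs):
    """Wright's F_ST for one biallelic SNP.

    freqs: allele frequency of the same allele in each population.
    Assumption: populations are weighted equally (a simplification).
    """
    n = len(freqs)
    p_bar = sum(freqs) / n
    # Expected heterozygosity if all populations were pooled.
    h_total = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within populations.
    h_within = sum(2 * p * (1 - p) for p in freqs) / n
    if h_total == 0:
        return 0.0  # allele is fixed or absent everywhere: no variation to apportion
    return (h_total - h_within) / h_total


# Made-up allele frequencies in three continental groups (illustrative only).
example_freqs = [0.30, 0.50, 0.65]
print(f"between-population share of variation: {fst_biallelic(example_freqs):.3f}")
```

A value near 0.10 to 0.12 would correspond to the "10 to 12%" between-population share mentioned in the talk; real analyses average over many loci and correct for sample size.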
Now another way of looking at this variation is to map it out on a tree that looks like this. Of course these trees can be misleading to some people, because they're sometimes interpreted as meaning that populations split and then had no subsequent contact, and of course with humans there's always been subsequent contact. But it's a convenient way of showing how similar populations are to one another. What we see is that populations do tend to group according to their geographic location, and that's really quite consistent across many different kinds of studies. In fact, with a completely different data set, the Human Genome Diversity Project data set, we see the same pattern again, both for SNPs and for a smaller number of copy number variants.
I can't resist pointing out that this has been known for a while. This is a review published 20 years ago, based on data generated in the 70s and 80s, blood groups and protein polymorphisms, and it's interesting that with just 29 of those loci you see very much the same pattern. So you really don't need that many loci to get the basic pattern of interpopulation genetic variation.
But one thing that our high-density SNP data allow us to do is to construct haplotypes, groups of closely linked polymorphisms, and to ask how haplotype diversity changes as we go from Africa to Central Asia to Europe to Polynesia and to the Americas. You can see that there's a substantial reduction of haplotype diversity as we sample populations further and further away from Africa, and that's an indication of a serial founder effect as human populations left Africa. All of the data I've shown you are consistent with a recent African origin of modern human populations.
Now, one interesting question I can't help but at least mention is: what happened as humans went out of Africa and encountered other groups? Well, they acted like typical humans: they exchanged DNA.
A consequence of that is that for non-Africans today we see about a 1 to 4 percent contribution in their DNA from their Neanderthal ancestors.
Another thing we can do with the high-density data is to look not at populations, which are always somewhat artificial in one way or another, but at individuals. On this two-dimensional display of genetic relationships, which is a principal components analysis, each dot is an individual. What's rather remarkable is that it more or less reconstructs a map of Eurasia, again showing the correlation between genetic similarity and geographic location. But another thing it shows is that there is overlap among populations in terms of their membership. So populations, while they do tell us something about people, are not by any means perfect categories.
The microarray-based SNPs, as we see, do portray population relationships pretty accurately, but they are biased. They were typically selected for high frequency and diversity in Europeans, whereas complete DNA sequences are essentially unbiased, and of course they include a lot of information about not just common but also rare variants. This plot, which Andy Clark generated a few years ago from some of the early sequence data, shows the effect of that bias on allele counts. This is the proportion of SNPs at each allele count, and what you can see is that for the SNP-based HapMap results, there's a real deficiency of low-frequency alleles relative to sequence data. It's also interesting that, compared to what's expected at equilibrium, we see an excess of rare alleles in the sequence data from Perlegen and NIEHS. That is a signature of fairly recent rapid human population growth, and it was mentioned in the Tennessen paper that was handed out to everyone.
And the way to think about this is that when rare alleles arise, if you have a stationary population or a small population, a lot of the time they're lost due to genetic drift. But if you have a rapidly growing population, they tend to be retained. If you're transmitting an allele to, let's say, 10 offspring, you're more likely to pass it on than if you're transmitting to only two offspring.
So let's take a look at some of the sequence data and how they are displayed, again, in a network diagram. This is some analysis that a graduate student of mine, Wilfred Wu, did. These are individuals who have been completely sequenced by Complete Genomics, 54 individuals, and each of these tips is an individual from a specific location. Here are the 1000 Genomes Yorubans, the Luhya from Kenya, Chinese, Japanese. These are samples from Mexico. These are the Utah CEPH samples, Tuscans, Puerto Ricans, and Gujarati Indians. So again, with the sequence data, as with the SNP data, we can see groupings of individuals. Now, not everyone falls easily into those groups. Most of the samples that fall outside this cluster, sort of in between the two groups, non-African and African, are African American, reflecting their population history.
Another thing that Wilfred did was to compare sequencing results: 34 of the individuals in the Complete Genomics data were also sequenced as part of the 1000 Genomes Project. So you can see that there are two tips for each of those individuals, representing the two different sequencing results, and there are some differences. The individuals do group with themselves, as you can see here, but on average the between-platform difference is about 348,000 variants, reflecting in part differences in depth of coverage as well as differences inherent to the platforms. So there is some interesting interplatform heterogeneity.
So here's a way in which sequencing data really differ from the earlier SNP data, that is, the earlier microarray data: the SNPs found by sequencing are much more likely to be population specific. Here's a plot showing overlap among major populations for mostly common SNPs identified earlier in dbSNP, where you see, as I mentioned before, a lot of overlap among populations; that is, most variation is shared. For the rarer SNPs now identified by sequencing, we see much, much less overlap. Most of these SNPs are in fact population specific rather than shared. In part that reflects the fact that here, the average allele frequency difference is around 15%. There's a lot of latitude for variation and sharing.
We can understand this by looking at this graph. I won't take you through this equation, but it's a simple way of estimating the age of a SNP allele based on its frequency and an estimate of effective population size. You can see that as the frequency gets larger, we estimate the age of the allele to be greater. For an allele whose frequency is about 5%, the age is estimated to be roughly 150,000 years. So it's been around for a long time, since before the out-of-Africa event, and is likely to be shared among many populations. Whereas for an allele that has a frequency of 1%, the age is much smaller, post out-of-Africa, and it is more likely to be population specific. So this gives us a way of understanding those results.
Now, with the many rare variants that we see in sequencing data, we can ask the question: each time we sequence a genome, how many new variants do we see? This is work published a couple of years ago from David Goldstein's group showing that the number declines down to something over 100,000.
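The allele-age equation mentioned above was on a slide and is not reproduced in the transcript. One standard estimator with the properties described (age depends only on allele frequency and effective population size) is the Kimura-Ohta expected age of a neutral allele, t = -4·Ne·p/(1-p)·ln(p) generations. The sketch below assumes that formula, together with an effective population size of 10,000 and 25-year generations; all three are assumptions, chosen because they roughly reproduce the figure of about 150,000 years quoted for a 5% allele.

```python
import math


def allele_age_years(p, ne=10_000, generation_years=25):
    """Expected age of a neutral allele at frequency p (Kimura-Ohta formula).

    Assumptions (not stated in the talk): ne = 10,000 and 25-year generations.
    """
    if not 0 < p < 1:
        raise ValueError("frequency must be strictly between 0 and 1")
    generations = -4 * ne * (p / (1 - p)) * math.log(p)
    return generations * generation_years


for freq in (0.01, 0.05, 0.20):
    print(f"frequency {freq:.2f}: about {allele_age_years(freq):,.0f} years")
```

Under these assumptions the estimated age rises with frequency, so a 1% allele comes out far younger than a 5% allele, in line with the argument that rarer alleles are more likely to postdate the out-of-Africa event and be population specific.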
We've extended that with 1000 Genomes data to 200 individuals, and you can see that it continues to decline, but very slowly, to around 100,000 novel variants for whole-genome data each time a new person is sequenced.
Now we can also compare this between populations, which is kind of interesting. European versus European is shown here, and we see a lot more novel variants when we're comparing individuals between populations, like European versus African, as opposed to within. That tells us we're going to see a lot more of what appear to be novel variants if we're comparing two populations with different histories. It's less so for populations from the same continent, so these are Tuscans, Great Britain, Finns, and so forth, but there is still some difference in the rate of decay, the rate of fall-off, for within-population versus between-population comparisons. What that suggests is that if we use the wrong reference population, an inappropriately matched reference population, we're going to see a lot of potentially false positives.
This is an exercise that we published in our VAAST paper last year, outlining a methodology for finding disease-causing loci. What we did was compare 30 individuals of European ancestry, who were our disease cases, with controls that were mixtures of Europeans and Africans. What you see is that if the control population is mostly matched, we find very few false positives. They are all false positives because there was no actual disease here, so we shouldn't see any signals. But once we start getting a larger proportion of non-European individuals in our control sample, we get hundreds, even over a thousand, false positive results. So this rather extreme example is telling us that if we don't properly match our case and control samples when we're looking for rare variants, we're going to have hundreds and hundreds, if not more, of false positives.
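The effect of control mismatch described above can be illustrated with a toy simulation. This is not the VAAST method or the published analysis: it is a deliberately simplified sketch in which every variant is private to the cases' population, there is no disease effect at all, and a variant is flagged when it is seen at least twice in cases and never in controls. The function name, the variant frequency, and the sample sizes are all hypothetical.

```python
import random


def count_false_positives(matched_fraction, n_variants=1000, freq=0.02,
                          n_cases=30, n_controls=200, seed=0):
    """Count variants seen >= 2 times in cases and 0 times in controls.

    Toy assumptions: each variant has frequency `freq` in the cases'
    population and is absent elsewhere; no variant affects disease,
    so every flagged variant is a false positive.
    """
    rng = random.Random(seed)
    case_chroms = 2 * n_cases
    # Only controls drawn from the cases' population can carry these variants.
    matched_chroms = 2 * round(n_controls * matched_fraction)
    flagged = 0
    for _ in range(n_variants):
        in_cases = sum(rng.random() < freq for _ in range(case_chroms))
        in_controls = sum(rng.random() < freq for _ in range(matched_chroms))
        if in_cases >= 2 and in_controls == 0:
            flagged += 1
    return flagged


for fraction in (1.0, 0.5, 0.0):
    hits = count_false_positives(fraction)
    print(f"matched fraction {fraction:.1f}: {hits} spurious case-only variants")
```

As the matched fraction of controls falls, fewer controls are able to show that a population-specific variant is simply background variation, so more variants look spuriously case-specific.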
Now, the last thing that I wanted to mention, just briefly, because the issue of power has come up a number of times here, is a comparison that Chad Huff has done. This work is about to be published, where we are looking for a disease-causing gene in a multi-generational pedigree using VAAST. What this shows is that the power to get genome-wide significance using sequence data, so this is the genome-wide significance level, is about the same as with a traditional LOD score of three in a family if we're using 450 control genomes. But if we simply increase that, and this is a simulation, to 4,500 control genomes, we can get genome-wide significance with sequence data, where we incorporate functional data and other things, with a LOD score of less than two. In other words, having a large sample of controls, which helps us to identify which alleles are truly rare, substantially increases our power.
So to conclude, and I see I'm right at 15 minutes here, what we do see from sequence data now is that there are a lot more population-specific rare alleles, and that certainly implies that we'll need more sequencing in more populations. But I think to help guide that, one of the things we need to do is to conduct experiments, essentially, to determine how closely we need to match case and control populations to minimize false positives. What FST level, for example, would be small enough that we can minimize the number of false positives without sequencing too many different populations? I think this is work that needs to be done now that we're getting more and more sequence data from more and more populations. Thanks.
Great. Thank you. Discussion? Go ahead, Derek.
Large cohorts, thank you, sorry. With large cohorts, we're not going to be identifying cases a priori, you know, and be able to match them to controls.
So do you think, in the absence of the ability to truly match, and I'm taking you literally, we'll be able to adjust for differences in substructure in a post-hoc way, using some method, as opposed to matching a priori?
Well, I'm sure that some of the same kinds of methods that have been used to do genome-wide control in GWAS could be applied as well with sequence data. I haven't thought about that too much, but it should be possible to do those kinds of adjustments. There's always some loss of power when you do that, but I should think that would be possible.
Just for local structure, or adjust just on average for this person, sort of an average, either PCs or FST?
By local structure?
Around the gene. Just in that local genomic region.
I would think you'd still want to do it genome-wide, but I'm curious as to why you're thinking you might restrict to just around the gene.
Maybe I could just ask the question a different way. If I'm studying a rare adverse drug reaction in Tuscany, can I use Estonians as my controls? Or do I have to use Tuscans as my controls?
I think that's one of the things that we need to establish. That's why I said I think we need to do these experiments, to figure out what level of differentiation will produce a given level of false positives. I don't think we know that yet for sequence data.
But I think it's pretty clear you don't want to use Estonians as controls.
I'm trying to stay within the Caucasian super family, but Tuscans and mid France or something.
There'll be lots of rare sequence variants at which Tuscans are different from Estonians. Actually, there'll be reasonable levels of sequence variants at which they'll be different from people who live in southern Italy. I think you need to figure it out.
The question is really what the local context is.
I think that question was just asked of you a second ago, Lynn, and I would raise that question: if you were interested in a particular drug reaction, could you look in the Estonians locally at that part of chromosome 4, and if you saw that there were shared haplotypes and similar structure there, would that permit you to do it? So there may be a bit of gradation there, but I think there is the question of the difference between doing this globally and looking more locally, because there may be some regions in which you do have more conservation and less admixture. There are some groups that are looking at this with respect to p53 mutations and the like, which have been very interesting in being able to capture and see a consistency across what otherwise looked like very disparate populations within the European set.
So if you look at cases from one region, Tuscany, and controls from Estonia, or I suspect even southern Italy, and the signal you're looking for is a rare variant that differs between the cases and the controls, you'll never know whether it's because they're cases and controls or because they're from different regions. I think we need to be really careful.
Yeah, I think we're probably thinking in terms of the haplotypes we can look at with, say, SNP data, where you can sort of trace the origin of that region of a chromosome in an individual. But with rare variants, that variant may substantially postdate the rest of the variation on the haplotype.
The follow-up of that is that it then becomes all the more difficult to use rare variants in any way to adjust, like you would use eigenvectors and the like for genomic control. Is that what I'm hearing? Because there are people who are advocating, and others refuting, that particular position, asking whether you could use rare variants to develop eigenvectors.
I'm not sure you can globally. The local questions are different, but globally I think that's really going to be very problematic. So, Judy.
Yeah, just a couple of points. One is about the correction for local differences. Wouldn't that be more relevant for a recently admixed population, as opposed to comparing different clades of European-ancestry individuals? That's one point or question. And a second point is that, a priori, the point you made earlier of using the same approaches we used in GWAS to correct for population substructure, that is, using common variants, but then using more principal component axes than just one or two or three or four, would probably serve as well as anything we can do to try to correct for population substructure. Just your comments on that.
Certainly you could use the additional principal components out to as many dimensions as makes sense given the structure of the data. For recently admixed populations, it's an interesting question how you would construct appropriate controls, but I could imagine constructing control datasets comprising the ancestral groups, say two groups for that admixed population. And that could be, I think, determined fairly well. Then you've got rare variants from both of those sources, so that you would have at least an approximately appropriate background.
I think Rory had one, and then Stephen.
You make a very urgent case for having the cases and the controls from the same population when you're trying to look at rare variants. But if you have the cases and the controls from the same population, do you need to have studies in many different populations in order to find rare variants that are related to disease, or would having just a few actually be enough? So should we be focusing on a few large studies rather than having lots of studies in order to cover different populations?
I think the sample size issue is going to be important.
And so I think my own inclination would be more toward fewer, larger samples rather than many, many small samples.
Stephen and then Chris.
I just wanted to ask a question of Lynn. It's sort of come up, and that's the question of a MALD-like analysis, I mean, raising this issue of recent admixture. You know, people have successfully done this in selected places with microsatellites, and to a degree with SNPs. The question is, as we get to these rare variants, is this an opportunity that's going to go away, in your mind? Or could it become a powerful way to identify these things, if you had a sufficient number of cases and controls from two different populations that had a difference in incidence of a particular disease? Do you think there is space for that, or is that going to go away?
Yeah, that's an interesting question. I mean, there's a literature going back some time on the power of rare variants to essentially trace migration movements, and that I think would include admixture events. So I think rare variants might be especially powerful in that context.
Quick question. So there are rare variants, and there are, you know, singletons that in aggregate can lead to disease, or so we're hypothesizing. Is there evidence for population stratification of a burden of variants in a gene? And would that potentially be a protection against this phenomenon of different populations, if in fact that burden is what is leading to diseases or disease traits?
It's an interesting question. I'd have to think about that, but one thing that could have an influence on that pattern would be positive selection with hitchhiking of nearby variants. You could get clusters of variants that elsewhere would be rare but could be selected to rather high frequency in a specific population, just as a result of a selective event. So that's one way in which that could occur.
So you'd have a suite of variants that would be in strong linkage disequilibrium with each other as a result of genetic hitchhiking.
Is that Maynard? And then Daniel.
I think this is going to be an extremely important issue for a variety of reasons. There's been a sort of focus on the ability to extract genotype-phenotype correlations without error. But there are, of course, larger social issues on the table. And I'd just like to emphasize the point that, particularly in the United States, as really a kind of point of attitude and policy, we should embrace our sort of mongrel background and not attempt to finesse this problem by injecting notions of ancestral purity into our basic way of sampling the population and so forth. You know, we're going to be dealing with a highly admixed population. It will have some advantages and some disadvantages compared to Iceland or wherever. But it's our population, and I think this is the way we should tackle it. The problems are quite difficult, but I think they will look different when we start having whole-genome data on a million people. And I would just like to encourage young theoreticians to tackle them. I mean, in principle, each little segment of the genome (and there are, of course, major difficulties defining the boundaries of these segments, which differ from every genome to every other genome) nonetheless has its own phylogenetic tree. I think the long-term theoretical path here is going to be to construct those trees, and the way the reference genome problem will ultimately be attacked is that every little segment of my genome, or of each of my genomes, will have its own reference genome. I think this is a solvable problem with enough data. And for the variety of reasons I mentioned, I think it should stay high on the agenda.
Well, Maynard, I couldn't agree with you more about the importance of disabusing people of any notion of any population having any kind of, quote, purity. One thing that genetics teaches us is that there is no such thing.
Okay, one last question, Daniel.
One of the things we see in looking at our exomes from Finnish individuals is the effect of the bottleneck, which has taken a number of variants that are very rare in the European population in general and bumped them up to a higher frequency in Finland. I was wondering if you could comment on the potential benefits of looking in these bottlenecked populations, given that many of these rare, potentially very deleterious variants may be found at a higher frequency in populations like the Finns or the Amish or other populations here in the US.
Well, I think you've stated the advantage very well, Dan: that increases the potential signal, not to mention at least some decrease, although I think it's fairly marginal, in general heterogeneity in the population. So, you know, I see those as potentially quite advantageous populations for these kinds of analyses.
Thank you.