Welcome, everyone. Thanks for coming. I'm Eric Green, Director of the National Human Genome Research Institute. And today we kick off the Genomics and Health Disparities Lecture Series. Now, this is a new series that we are implementing. It's designed to enhance opportunities for dialogue about how innovations in genomics research and technology can impact health disparities. And the lecture topics you'll be hearing about in this series will range from basic science to translational research. And I should point out that while NHGRI is giving the introduction to this, the lecture series is actually co-sponsored by our Institute, NIMHD, NIDDK, NHLBI, and also the FDA Office of Minority Health. And we thank our partners for making this series a reality. Well, I am pleased to welcome actually a very good friend of mine and also the NIH's, Dr. Carlos Bustamante, who's going to give the inaugural lecture in this series. For those of you who don't know Carlos, he is a professor in the Department of Genetics at Stanford University, as well as co-founding director of the Stanford Center for Computational, Evolutionary, and Human Genomics. He also serves as director of informatics at the Stanford Center for Genomics and Personalized Medicine. Dr. Bustamante is a well-known population geneticist whose research focuses on analyzing genome-wide patterns of variation both within and between species to address fundamental questions in biology, anthropology, and medicine. Currently, he conducts research on human population genomics and global health, including developing statistical, computational, and genomic resources that are enabling trans- and multi-ethnic genome-wide association and medical sequencing studies of complex biomedical traits. In terms of background, Dr. Bustamante received his undergraduate degree in biology and statistics, as well as his Ph.D. in biology, from Harvard University.
His many honors have included receiving a MacArthur Fellowship, the Provost Award for Distinguished Research from Cornell University, where he was once on faculty, and also a Sloan Research Fellowship in Molecular Biology. He's also a member of the National Advisory Council for Human Genome Research, which is our institute's advisory council, and he is a valued and productive member of that important group. And finally, I should point out that you'll realize within about five minutes that Carlos is a terrific communicator and an engaging, enthusiastic speaker. In fact, I'll predict you will not notice that he's actually come off a red-eye from San Francisco, because his energy level is perhaps only matched by mine, but the two of us are pretty dangerous, actually, especially when we're on the same platform. And I personally could think of no one better to kick off this lecture series. So please join me in welcoming Dr. Carlos Bustamante.
Thank you, Eric. It is really such a pleasure and privilege to be with you today. This is an area that's really near and dear to my heart. For those of you who don't know much about my research program, I began life as a statistical geneticist. My thesis was really on Markov chain Monte Carlo methods for sampling in high-dimensional spaces, so something that most people would, you know, snooze at. And after I sort of rose through the ranks at Cornell and got tenure, I decided that I really wanted to focus my research program on problems that to me were of fundamental interest and really sort of engaged my passion. And that's really led to an evolution where we've gone from basically a computational biology lab to really trying to be far more vertically integrated.
We've now started doing our own sampling in the field, developing sequencing technologies, and even building functional assays for testing polymorphisms and genetic variation of large effect, if you want to think about it that way, in organisms of human importance. And that includes people, primates, pets and pathogens. And really, this foray began partly with our involvement in the 1000 Genomes Project. We were part of the group that led analysis of the 1000 Genomes Project. And at the point that we were thinking about how to structure the project, it became really clear that, you know, the group was at a bit of a schism, right? Do we focus on the set of populations that we've studied really well, or do we think about broadening representation and bringing in other populations that to date haven't been included? And part of that is because the goal of the project was to develop the next generation of resources for medical genetics. And so we really advocated hard for inclusion of populations from the Americas, and what I want to tell you a little bit about today is what we've learned in being able to do that work. So when I think about why I do what I do, right, you know, and my kids ask me why do we care about DNA variation? You know, to me it's really very fundamental. On the one hand, we can use DNA sequence variation to unravel mysteries about genes and how they relate to biomedical traits of interest. And I'll tell you a story about that regarding, of all traits, blond hair. Actually, it turns out to have a fun evolutionary story. We also want to understand in a very serious way the global distribution of genetic variants that are associated with disease and their public health consequences, including, for example, pharmacogenomic variants and really clinically actionable variants.
And then lastly, and this is for me really again sort of why I got into this business in the first place, it really helps us to understand and unravel critically interesting questions about the origins of humans and how we as a species were so successful in colonizing so much of the world. And in particular, the real revolution that is ancient DNA sequencing has opened up this window into the past that I want to tell you about, and how we can even think about that as an opportunity for broadening representation. Okay, so we've talked a little bit about 1000 Genomes for those of you who know of this project. It began now eight years ago, and the finishing touches on the last manuscript that describes the full project are being made now, and we hope to have that in submission at Nature, or really at a journal published by Macmillan that we can't mention. And initially, as I said, the project had focused on a design with four or five populations from three continental regions: Europe, Africa, and Asia. And then a group of us said, well, you know, it's probably not a good idea politically or scientifically to have a project called the 1000 Genomes Project that does not include samples from the Americas. And that really led to a very invigorating and interesting debate about, well, how do we engage populations from the Americas? What's the right sort of sampling scheme? How do we really think about that problem? And the argument many of us made was, you know, don't let the perfect be the enemy of the good. If we really want to develop this as a resource for mapping complex traits, let's focus on those populations that we're likely to be able to engage in large-scale medical genetic studies. Let's think about admixed populations. Let's think about, you know, minority populations within the United States.
Let's think about populations in Colombia and Peru and Mexico that are likely to be the sort of next wave of countries that take these kinds of projects on. And so we've gotten 500 samples added, and I'll tell you a little bit about how we've used those in a related project called PAGE, which set out to be really the largest trans-ethnic and multi-ethnic mapping project in the U.S., 50,000 people funded by NHGRI. And I'll tell you about how we've sort of leveraged those resources in the design of a new array that we feel is really going to catapult forward this field of trans- and multi-ethnic mapping. What was interesting is that after we got the samples from the Americas added, our colleagues and friends who work on the Indian subcontinent said, what about the Indian subcontinent? And we said, yeah, what about the Indian subcontinent? It's a billion people. How do we not have them represented in the project? So this is one of these cases where we woefully under-advertised. It's called the 1000 Genomes Project, but it's got 2,500 genomes in it. But partly it's because of that desire to really broaden representation and create a resource that's relevant to mapping in many of the world's great populations. So one of the first results that kind of came out, even at the pilot phase of this project, is that when we think about human genetic variation, it's critical to distinguish variants on frequency. And one of the sort of truisms that's come about is that common genetic variants, those that, say, reach 15% frequency or 20% frequency in most worldwide populations, tend to be extraordinarily shared. So if we think about the exchangeability of populations, a pair of European populations are quite exchangeable for a variant that's at 15% frequency, but even a European and a Chinese or a European and an African population are going to have similar patterns of variation because these are quite ancient polymorphisms.
These are genetic variants that have been in the human gene pool since our species arose in the great human diasporas. As it turns out, common genetic variants are actually pretty rare. Most genetic variation in the human genome is extraordinarily rare. So rare variants are common and common variants are rare. And the pattern for rare variants is that they tend to be extraordinarily population private. So if we're now talking about something that's at half a percent or 1% frequency, then we have a priori expectations that these are variants that will be at appreciable frequencies not in all populations, but in a subset of populations. So now if we think about the next generation of large-scale studies, it really matters who we're studying, right? If we want to talk about 15, 20%ers, then sure, we can study the Icelanders and generalize. But if we're talking about the 1%ers, which is the bulk of genetic variation, then we really do need a concerted effort to make sure we're not broadening health disparities by only focusing on a subset of populations. And the statistics are still a little glaring. The vast majority of studies have really only included populations of European descent, but it's getting better. And I'm reminded of that famous quip. How do you herd cats? You move the food, right? And so that's what Eric and others have been doing. They've been moving the food. They've been moving the motivation toward more trans- and multi-ethnic studies so that investigators say, yeah, well, if that's where we need to go, that's where we're going to go. OK. So several years ago, we began this effort to really broaden representation. In green here, you can see the populations that got sampled by the 1000 Genomes Project. In yellow are the populations that come from a project called the Human Genome Diversity Project that was spearheaded out of Stanford by Cavalli-Sforza and colleagues over the last 30 to 40 years. And then in red are places where we've tried to fill in the gaps.
So we've focused a lot on Sub-Saharan Africa and particularly West Africa and its relations to the transatlantic slave trade. And I'll tell you about that. We've focused on populations up and down the Americas, North Africa, and the South Pacific. And of course, there's still a ton of work that gets done in Europe. And in many ways, that remains one of our development laboratories, if you will, because we can do such fine-scale mapping in Europe, partly because of all the resources. But then what we want to be able to do is broaden out and show you what's possible in other populations. OK. So here's one example of why we need to study diverse human populations. This is a genetic variant in the HLA known as B*5701. The reason we and others care so much about this particular polymorphism is that it underlies what's known as abacavir hypersensitivity. So if you happen to be HIV positive and you go in for your triple cocktail, one of the drugs you'll get is abacavir. And it turns out that a couple of percent of the time, you will have an adverse drug reaction. And if you get prescribed abacavir twice, you could die. OK. It's a pretty severe reaction that has to do with whether or not you carry this genetic variant. Now, it turns out my dad's an HIV doc, and when I ask my dad, do you genotype your patients when you go out and prescribe abacavir? He tells me, well, it's so rare that we often don't, and we look at populations where this isn't normally segregating. But if you look at a global scale, right, if my dad were practicing in India as opposed to in Miami, then he should definitely be genotyping, right? Because this is a mutation that reaches close to 20% frequency in the Gujarati Indians, right? So this is very clinically relevant in this group. And then the next most frequent is in the Maasai. Again, this is all data coming largely from HapMap 3. I can tell you there's no population genetic first principle that predicts this, right?
It's really kind of a very empirical thing that we need to go out and catalog. But as you can see, there's tremendous variation from population to population even for a highly clinically relevant variant; you know, everyone would put this on their bucket list of variants to genotype in routine screening, right? The other thing that we don't know is what other variants in other populations may have associations that we haven't found yet, because we've only really studied a subset of populations. So that's sort of the overall message that I want to leave you with or begin to think about as we go through the talk. Okay, so my first vignette is a project that began several years ago that for us really illustrates this idea of what we might call local variants, regionally common but, you know, globally rare. So this guy named Sean Myles was a postdoc in my lab, and he had done his PhD with Mark Stoneking in a set of islands off the coast of Papua New Guinea known as the Solomon Islands. The Solomon Islands are a sovereign nation consisting of about a thousand different islands. And Sean came to me with this photo and he said, Carlos, I want to study the genetic basis of blond hair. And I said, Sean, look at this kid's US serviceman's jacket. I can tell you the genetic base was wrapped in cuckoo leal that got left behind, you know. And he said, no, no, no, I think it's something different. And I said, all right, you know, Sean, if you get the money to sample then we can talk about it. Figuring that, like all postdocs, he'd move on to the project he was supposed to. And of course, like all postdocs, he didn't; you know, he got really infatuated with this pet project. He got money from the Wenner-Gren Foundation. He went out in the field with a field assistant named Nic Timpson, and they collected over a thousand samples and did spectrophotometric measurements of hair and skin pigmentation and then came back to Stanford. And he said, all right, I'm ready to roll.
I got my thousand samples, you know, let me at the sequencer. I said, whoa, whoa, hold on, hold on. This is still expensive, you know. How do we know that we're going to be able to find anything? And he said, you know what, why don't we just look at the extremes of the distribution? I said, wait, what? You want to genotype just a hundred people? Like, have you read Nature Genetics? Do you know the sample sizes that you need to do a GWAS? He said, no, let's just focus on the extremes. We'll take the kids with the blondest hair and the kids with the darkest hair. And I was very skeptical. And I said, okay, well, you know, when this doesn't work, you're really going to have to work on that postdoc project that we had agreed on. So he ran the association, and by that time Eimear Kenny was a postdoc in the lab. Some of you may know Eimear. She's been very much involved with PAGE. And this was her first week in the lab, which we kind of call the first best week ever. And, you know, so some of you know in GWAS these kinds of plots are called Manhattan plots. They're supposed to remind you of the Manhattan skyline. We like to cheekily call this our Dubai plot because there was one and only one strong signal of association. And when you dig in there, you know, it's not just a huge chromosomal region that's associated, but rather it's this narrowly circumscribed region with an odds ratio of 30. I'll put that up against anybody's GWAS results there. An odds ratio of 30 and only one real gene. Okay. And it's not some FLJ predicted blah, blah, blah. It's a real honest-to-goodness gene that people know a lot about called TYRP1. And TYRP1 is very interesting because it's expressed in melanocytes. It's involved in maintenance of the melanosomal structure. It affects the proliferation of melanocytes and cell death. And there are human knockouts that have a form of albinism known as OCA3 or albinos of dark skin.
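To make the "odds ratio of 30" concrete, here is a minimal sketch of how an allelic odds ratio is computed from a 2x2 table of allele counts in cases and controls. The function name and the counts are hypothetical; the counts were chosen only so the arithmetic is easy to follow and are not taken from the study.

```python
def allelic_odds_ratio(case_alt: int, case_ref: int,
                       control_alt: int, control_ref: int) -> float:
    """Odds of carrying the alternate allele in cases divided by the odds in controls."""
    if min(case_alt, case_ref, control_alt, control_ref) <= 0:
        # A zero cell makes the plain odds ratio undefined; real analyses
        # usually apply a continuity correction or a regression model instead.
        raise ValueError("all four cells must be positive")
    return (case_alt / case_ref) / (control_alt / control_ref)


if __name__ == "__main__":
    # Hypothetical allele counts (NOT the study's data):
    # blond-hair extreme: 30 alternate, 10 reference alleles
    # dark-hair extreme:  10 alternate, 100 reference alleles
    print(allelic_odds_ratio(30, 10, 10, 100))
```

With these made-up counts the odds in cases are 3 to 1 and the odds in controls are 1 to 10, so the ratio is 30. Sampling only the two extremes of the phenotype distribution, as described above, concentrates the contrast between those two rows of the table.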
As a good PI, I then became very interested in this project and was very glad we had green-lighted this. So we went through and sequenced the exons and the whole region, figuring, okay, you know, we're going to exclude any polymorphisms in the coding regions, because if there's one thing we've learned from GWAS, it's that it's all regulatory. Well, you know, as luck would have it, there was one and only one change among cases and controls, and it was an arginine-to-cysteine mutation at a highly conserved amino acid residue. You know, just read out the alignment here. Human, rhesus, mouse, out to zebrafish. They all have arginine, and our kids are walking around with a cysteine mutation. And because we had also genotyped 1,000 of them, we could just sort of replicate in our own cohort here. And here's the distribution of standardized hair pigmentation. Here are the kids who are homozygous arginine. Here are the kids who are heterozygous arginine and cysteine. And then here are the kids who are homozygous cysteine. They're two standard deviations out from the distribution of hair pigmentation. These two are nearly identical. So it's really consistent with a recessive model for blondism where, you know, this mutation has reached appreciable frequency. In fact, the mutation is at 30% frequency in the Solomon Islands and blond hair is at 10%, right? So it's like a nice textbook example. What then became really fascinating for us is that it is really geographically restricted. So this is what one may call chemically induced blond hair variation. But if you happen to be, you know, European and have blond hair, you'll have an OCA2 mutation. You don't have this TYRP1 mutation, right? So here's blond hair. You don't need to be a Stanford-trained or any trained dermatologist to phenotype blond hair. You need to be about six years old, realize people have different color hair, and then ask, why do people have different color hair and is that different in different parts of the world?
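The "textbook example" remark above (allele at about 30%, blond hair at about 10%) can be checked with a small sketch. It assumes Hardy-Weinberg proportions (random mating) and a fully penetrant recessive allele; both are simplifying assumptions added here for illustration, and the function names are hypothetical.

```python
def genotype_frequencies(q: float) -> dict:
    """Hardy-Weinberg genotype frequencies for an allele at frequency q.

    'RR' = homozygous arginine, 'RC' = heterozygous, 'CC' = homozygous cysteine.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    p = 1.0 - q
    return {"RR": p * p, "RC": 2.0 * p * q, "CC": q * q}


def recessive_trait_frequency(q: float) -> float:
    """Expected trait frequency if only CC homozygotes show the trait."""
    return genotype_frequencies(q)["CC"]


if __name__ == "__main__":
    freqs = genotype_frequencies(0.30)
    for genotype, freq in freqs.items():
        print(f"{genotype}: {freq:.2f}")
    print(f"expected blond fraction: {recessive_trait_frequency(0.30):.2f}")
```

Under these assumptions an allele frequency of 0.30 gives a homozygote frequency of 0.30 squared, or 0.09, which is close to the roughly 10% blond hair quoted in the talk.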
And when you look at the second major population that has blond hair, we realize they've got blond hair for a totally different reason than Europeans have blond hair. So why are we going to then suppose that the genetic basis of diabetes is identical in all human populations, right? It doesn't really stand to reason, right? We need to go out and test that hypothesis. And in particular, we need to really sort of develop the right well-powered system in order to do this. The other cool thing about this gene is that it underlies a classical light brown allele in mice and dogs, which we also studied. A knockout leads to brown versus black pelage. And then this is that South African family that has an OCA3 mutation: here, mom and dad, and here are the kiddos. So really, for me, this study illustrated, number one, that model systems were going to be really key to disentangling this. The reason we were able to move so quickly was because someone had done the mouse work and someone had done the biochemistry to really kind of nail the functional mutation. Secondly, even for highly heritable traits, like skin and hair pigmentation, we do not have anywhere near a complete catalog of even the common alleles that are associated, right? So we need to do a far better job. And this is really, right now, phenotype limited, right? So it's not, you know, the cost of running this; even on arrays, right, it's pretty cheap. What we need are the right set of phenotyped samples. Likely, other traits are going to be amenable to this kind of dissection where we focus on populations that have interesting evolutionary histories. So for example, we've got a study right now in Puno, Peru, where we're studying preeclampsia at 12,000 feet in a population that's largely Aymara-Quechua. You know, 99% of the people have more than 97% Aymara-Quechua ancestry, and 23% of the population has preeclampsia, right?
So, you know, that's a great pool in which to try to fish for these large-effect alleles. Okay, so with that, we really started to put our heads together and say, okay, what do we really need to do? And I'd like to argue that we're not really yet in a post-GWAS world, because there are many populations in which we haven't even done the first GWAS, right? And so here are just some examples. My postdoc actually pulled these up. That's why there are several from our group. But, you know, here's a beautiful example from Altshuler and the group at the Broad where they were able to find this common risk factor for type 2 diabetes in Mexico that explains like 25% of the health disparities, right? Here's a mutation that's at 25%, 30% frequency in Mexico and in the rest of the Americas, but largely absent from other populations, okay? So, this is really kind of the motivation behind PAGE and thinking about designing a new chip that will really be properly powered for multi- and trans-ethnic mapping. This is Chris Gignoux, who's sort of led this project for us. PAGE is funded by NHGRI. The goal is to genotype 50,000 people. And, you know, in a good, strong-arm negotiation with Illumina, we got this down to 55 bucks a sample with the ability for us to help design what that array would look like. And so, you know, how did we do this? Well, we really focused on prior knowledge and thinking about polymorphisms relevant to a real kind of clinical setting. We also collaborated. So, it turns out that NHLBI and others, through ARRA funding, had funded Kathleen Barnes to do a wonderful project on African descent populations and their diversity called CAAPA. And, you know, Kathleen was a friend of ours, and so we talked to Kathleen and said, look, let's join forces. And so, you know, we started working with her data and were able to roll it into the design of the project.
And as a result, this array that we designed has better imputation and mapping accuracy in African populations than it does in other populations, which is really quite a feat given that African populations are the most, you know, genetically diverse populations. And we'll talk a little bit about why that is. You know, the other thing that we did is really try to hit clinical relevance, and so we focused on 2,800 genes that are of particular clinical relevance, both in terms of designation by ACMG as well as, you know, looking at where it is that docs order up genetic testing. We scoured databases including ClinVar and ClinGen, which we're involved with, and others to really come up with, you know, about 200,000 polymorphisms that are either pathogenic or likely pathogenic that we could put directly on the array and have them genotyped in hundreds of thousands of people. Okay, so that for us will be a really phenomenal way, particularly if you think about linking this ultimately with EHR, to get phenotypic distributions for a given genotype, which is really one of the things that we've yet to really do well, right? We tend to do a good job of finding associations between a genetic variant and a trait, but, you know, given how it's found, right, you can't really believe the estimates of the effect sizes, right? You've got winner's curse and all of these things. If you really want to understand the phenotypic distribution for a given variant, you need to go out and characterize the phenotypic distribution for a given variant, right? And, you know, this is related to kind of like the APOE effect and so on. So to make a long story short, there are 1.7 million variants on this array, genotyped at 55 bucks a sample, including a ton of new exomic variants with variation from Africa, the Americas, Asia. We've got this sort of 700k African power content and a GWAS scaffold that I'll briefly describe because it's really what's going to make this hum.
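The winner's curse point above can be illustrated with a small simulation. This is a sketch under stated assumptions, not anything from the talk: every study estimates the same true effect with normally distributed noise, and an estimate is "discovered" only if it clears a significance cutoff. All parameter values and the function name are hypothetical.

```python
import random
from statistics import mean


def simulate_winners_curse(true_effect: float, std_error: float,
                           z_threshold: float, n_studies: int, seed: int = 0):
    """Return (mean of all estimates, mean of significant estimates, count significant)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    cutoff = z_threshold * std_error  # estimate must exceed this to be "found"
    significant = [b for b in estimates if b > cutoff]
    sig_mean = mean(significant) if significant else float("nan")
    return mean(estimates), sig_mean, len(significant)


if __name__ == "__main__":
    all_mean, sig_mean, n_sig = simulate_winners_curse(
        true_effect=0.10, std_error=0.05, z_threshold=4.0, n_studies=20_000)
    print(f"true effect:                  0.10")
    print(f"mean of all estimates:        {all_mean:.3f}")
    print(f"mean of significant estimates: {sig_mean:.3f} (n = {n_sig})")
```

Averaged over all simulated studies the estimate is close to the true effect, but the subset that passes the cutoff must by construction sit above it, so its average overstates the effect. That is why the effect size at discovery is not a trustworthy guide to the phenotypic distribution for a given genotype.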
So this is really the work of Gen Wojcik, a postdoc in my lab, who came up with a new way to prioritize the sort of tagging SNPs, and it really kind of goes to this idea that common variants are rare and rare variants are common. So what we really want are tag SNPs that will tag across multiple populations, right? That's one thing that for us was really, really critical so that you can properly test whether an association that you see in one population has the same effect or similar effect in another population, because currently we don't do that, right? We take arrays that were designed for Europeans and then ask, okay, is the effect size the same for a given associated SNP in another population without really understanding whether the pattern of linkage disequilibrium is comparable, right? You may be really underpowered in that second population, so if you don't get the same effect size or you get a higher effect size, it's very difficult to interpret. So that was really kind of the goal here, and our workhorse was the 1000 Genomes Project, right? So if we had not had all the data from 1000 Genomes, it would have been very difficult to really properly design and power this. And this is probably the geekiest bit of the talk. So what we wanted to do is sort of compare two tagging strategies. One is to go through and, for every SNP, calculate the linkage disequilibrium, which is our sort of way of figuring out the local genomic neighborhood that we're trying to tag. Look across all the superpopulations and then prioritize on the number of SNPs that a given SNP tags. So here's what this sort of looks like. So, okay, so here's the set of SNPs. Here's the first SNP, and you can see that in the African population it tags 184 other SNPs. In the Americas populations it tags 198. It doesn't do a good job of tagging in Asians and Europeans and so on. So you can sum this all up. It's, okay, how many SNPs are being tagged here? 826 SNPs.
For this second one it's 801. For this third one it's 659 and so on. So now you'd say, okay, this is my number one SNP. I need to make sure that's on the array, then this guy, then that guy. Does that make sense? Okay, so now one problem with this is that differential investment in different populations, right, could lead to you choosing tag SNPs that do really, really well in a subset of populations but don't generalize. So what Gen came up with was a strategy that would take these data and then sort within populations so that now we could ask, well, how many populations are we really tagging with this SNP? At the end of the day we're really only tagging four, whereas in these other SNPs we're tagging all six populations, and so we sort of reorder the prioritization so that we begin with the SNPs that tag the most across different populations and we put those first in our priority list and begin to build the scaffold in that way. Well, what is the end result? The end result is that we're now able to dig far deeper into the allele frequency spectrum than we could before. So in yellow here is what the sort of tagging was as a function of minor allele frequency pre-MEGA array, PAGE array, whatever we want to call it, for Africa, the American descent populations, the Americas, Europe, South Asia and Asia. You can see here in blue what the allele frequency is for the new set of tag SNPs. The other thing that's important to note is that we're now starting to delve into an area of the frequency spectrum where recent human population growth becomes very, very relevant. So the fact that there are two and a half billion people living in Asia shouldn't surprise us in that the allele frequency spectrum now becomes highly, highly, highly skewed. These are populations that have grown so quickly since the advent of agriculture that their allele frequency spectrum looks markedly different than other populations throughout the world.
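The two tag-SNP prioritization strategies described above can be sketched as follows. This is an illustration only, not the actual array-design pipeline: the SNP names, population labels, per-population counts, and the `min_tagged` cutoff for deciding that a SNP "tags" a population are all hypothetical. Only the three totals (826, 801, 659) and the "four versus six populations" contrast come from the talk.

```python
from typing import Dict, List

# counts[snp][population] = number of other SNPs this SNP tags in that population
TagCounts = Dict[str, Dict[str, int]]


def total_tagged(per_pop: Dict[str, int]) -> int:
    return sum(per_pop.values())


def populations_covered(per_pop: Dict[str, int], min_tagged: int) -> int:
    """Number of populations in which the SNP tags at least min_tagged SNPs."""
    return sum(1 for n in per_pop.values() if n >= min_tagged)


def rank_by_total(counts: TagCounts) -> List[str]:
    """Strategy 1: prioritize by total SNPs tagged, summed over populations."""
    return sorted(counts, key=lambda s: total_tagged(counts[s]), reverse=True)


def rank_by_coverage(counts: TagCounts, min_tagged: int = 50) -> List[str]:
    """Strategy 2: prioritize by populations covered, then by total tagged."""
    return sorted(
        counts,
        key=lambda s: (populations_covered(counts[s], min_tagged),
                       total_tagged(counts[s])),
        reverse=True,
    )


# Hypothetical per-population counts; only the row totals echo the talk.
EXAMPLE: TagCounts = {
    "snp_a": {"P1": 300, "P2": 250, "P3": 200, "P4": 60, "P5": 10, "P6": 6},      # total 826
    "snp_b": {"P1": 140, "P2": 135, "P3": 133, "P4": 131, "P5": 131, "P6": 131},  # total 801
    "snp_c": {"P1": 110, "P2": 110, "P3": 110, "P4": 110, "P5": 110, "P6": 109},  # total 659
}

if __name__ == "__main__":
    print("by total:   ", rank_by_total(EXAMPLE))
    print("by coverage:", rank_by_coverage(EXAMPLE))
```

In this toy data the SNP with the largest total tags well in only four of six populations, so the coverage-first ordering moves it behind the two SNPs that tag in all six. That is the reordering the talk describes for building a scaffold that generalizes across populations.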
The upshot of this is that now, as we try to impute and tag genetic variants down in this low end of the frequency spectrum, we are no longer doing as well in some of these populations as we actually are in parts of Africa where, even though there's more genetic diversity, the structure of that diversity becomes eminently taggable. So it's actually pretty interesting. And you can't really read this very well, but the point here is that this is our average r-squared. So how well are we doing in tagging as a function of common, low-frequency, and rare variants? And so we're always sort of above 95% for all of these different populations until we start getting into the really rare end of the frequency spectrum. And I'll just draw your attention to the fact that the populations that we're now doing the best in are populations of African descent and of admixed ancestry from the Americas, exactly who we wanted to target for PAGE and other projects. So we're now really in a realm where these populations are getting equal footing in our ability to map. The other thing that's very interesting is you can break up taggability and our ability to do so as a function of local ancestry. And again, I don't want you to focus too much on the details. This is Colombia, Mexico, Puerto Rico, Peru, African-Americans and individuals from Barbados. And where you can see that we're not doing as well is in these sort of double Native American segments, right? And the reason for that is that our array was based on whatever exomic content we could get our hands on, right? And so the amount of exomic content from Native Americans and Native American descent populations was really small. We had like 200 Native American exomes that we put into the pot. And then there were some others that came from 1000 Genomes. Whereas for Europeans, there were like 80,000, 100,000 exomes that one could tap into, right?
And so that kind of difference in even just the basic fodder that goes into your array design ends up impacting you. But again, we're far better off now than we were before having designed this array. And then this is now getting sold and commercialized by Illumina. They're planning to sell somewhere between 200,000 and 300,000 of these this year. All right, so conclusions so far. I should have added here again, if you want to herd cats, move the food, right? It was because PAGE sort of put this out for a large multi- and trans-ethnic mapping effort that this got designed in the first place. It looks like this kitchen sink design is starting to work. We've got good imputation working. 50,000 individuals in PAGE should be genotyped by sometime this summer. And what I've told the PAGE folks is that you should really be billing yourself as a pilot project for the Precision Medicine Initiative, right? If you really want to build a large national multi-ethnic cohort, this is probably about the best one to look at because it's really got an overrepresentation of African-Americans and Hispanics. We've got Native Hawaiians. I mean, it's a very, very diverse cohort. And so we're facing a lot of the analytical issues that we really hope the Precision Medicine Initiative will face if it's really designed to be representative of the country. Okay, in the last sort of remaining time that we have, I really wanted to switch gears a little bit and talk about ancient DNA, not an area that you think about as being particularly contentious in terms of disparities research, but one, in fact, which I hope to convince you can actually benefit a lot from what we're learning in medical genetics and actually can yield insights into incredibly important and fascinating mysteries. So as I like to say, you can't open up an issue of Science or Nature, or even the New York Times, without seeing a new ancient genome. So this is the set of bones from the Neanderthal genomes.
This is a guy named Ötzi the Iceman, who is the one who got us into all this trouble in the first place, at least my lab. And it's been just absolutely amazing to be part of this field and just see how what we thought we understood about the great human diaspora is really overturned when you begin to sequence individuals who lived 50,000, 100,000 years ago and see who they interbred with, what their patterns of variation look like compared to modern-day populations. One huge limit, though, is that this stuff is expensive, right? If you look at the Neanderthal genome, it's 1% human DNA or human-like DNA, right? Which means it's literally 100 times more expensive to do a Neanderthal genome than it is to do a modern genome. And so you basically need to have the good backing of the German government or the Danish government in order to do this work. That's why Eske Willerslev and Svante are the leaders in this, because they really have the sort of backing to do that. I don't think, and Eric could prove me wrong, I don't think the NIH would spend 15 to 20 million dollars to sequence a Neanderthal genome. It's just not something that we would sort of prioritize in the way that it does get prioritized in other places. So for us, if we wanted to get into this game, we had to come up with a better solution. And that's where being at a place like Stanford can be really fun, because I went to go talk to one of my kind of crazy, well, not crazy, one of my great colleagues who had some great ideas and we were able to come up with a better approach for this. So the problem was that less than 1% of the DNA is endogenous human DNA, it's impractically expensive to sequence, and so we wanted to come up with a better approach. So as I said, the guy who got me into all this is a guy named Ötzi the Iceman. He was found in 1991 on the border of Austria and Italy, literally frozen in a crevasse. When they found him, in fact, they didn't even know he was an ancient caveman. 
They thought he was some hiker who had fallen in the crevasse, so they started taking the body out, and they're like, huh, wonder why this hiker has a copper axe. They're like, wait, why does this hiker have a bow and arrow? Well, this isn't going to fit in the truck. And then they pull him out and they go, oh my God, it's a Copper Age individual. Now we know more about Ötzi than almost anybody. Ötzi's had like his last meal analyzed. It turns out it was wild ibex and einkorn wheat. So he was living part of the paleo lifestyle with the ibex, but still having some carbs with the wheat. They even did scanning tomography. It turned out he had the beginning of atherosclerosis, so even with the paleo lifestyle, hunting all the time, you're still at risk for that. And when we sequenced the genome, sort of as a funny aside, when we submitted it to Nature, it didn't get in. And then one of our colleagues said, you know, I talked to the editor at the New England Journal, and they're potentially interested. And I said, okay. And so he sent it to me, and it read like the world's weirdest case report, you know, 5,408 year old patient presents with an arrowhead lodged in his back. And so we sequenced his genome to figure out what risks he had. Needless to say, it didn't get in the New England Journal, but it came out in a Nature family journal, as we like to say. What was actually very fascinating for us is that we'd spent a lot of time trying to understand patterns of European genome variation. So here's Switzerland, here's Spain, here's Italy. This is a principal component analysis of individuals that have four grandparents from Europe. And when we place Ötzi on the map of Europe, he clusters with these five individuals that sort of seem to break from the rest. And it turns out these aren't just any random five Italians, they are Sardinians. And when we looked at Ötzi's Y chromosome, he carried a Y chromosome that today is restricted to Corsica and Sardinia, right? 
So here's a 5,000 year old dead guy. And because we've invested so much in the genetics of Europe, we can give him pinpoint precision about his ancestry. We cannot do that for the majority of people alive today, right? And so that's what I like to call the Ötzi rule, right? We should be able to do for people alive today what we can do for good old Ötzi. And that was one of the main motivators for us. So to make a long story short, we developed something called whole-genome in-solution capture. This takes shredded DNA, adds a T7 promoter, and produces a ton of RNA that we can then use as bait. This was developed by Meredith Carpenter, a postdoc in my lab, as well as Will Greenleaf, my colleague at Stanford, and his student, Jason. Meredith then got an NRSA to extend this work and is now the CSO of a company we've spun out to commercialize this for lots of potential settings. As I said, the way this works is that we produce boatloads of RNA by transcribing randomly sheared human DNA. We then biotinylate these probes and so basically steal a page from the exome capture world. You then have your usual workflow: your ancient tooth, you extract DNA. Less than 1% is human. We make our sequencing libraries in a clean lab, and then you just use array hybridization, or in this case, in-solution hybridization, to pull out the human bits. The first set of samples that we applied this to went into the first paper that had more than a couple of genomes in it, and included seven pre-Columbian mummies, four Iron Age individuals from Bulgaria, and this particular individual from Denmark. We saw huge enrichment in our ability to map reads to the human genome. In blue here is what you get pre-capture in terms of the proportion of reads mapping to the human genome. In red is what you get post-capture, and you can see that capture works. In fact, it enriches in some cases so well that you're going from 1% to 2% human DNA upwards to 60% human DNA. 
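To put rough numbers on that enrichment, here is a back-of-the-envelope sketch in Python. It is only an illustration, not the lab's pipeline: it assumes sequencing cost scales linearly with the total number of reads and that the human fraction of reads is the only thing that changes, and it ignores real-world effects such as duplicate reads and library complexity. The function names and the target of 10 million human reads are hypothetical.

```python
def total_reads_needed(target_human_reads, endogenous_fraction):
    """Total reads to sequence so that, on average, `target_human_reads`
    of them come from human DNA (assumes reads are drawn in proportion
    to the human fraction of the library)."""
    if not 0.0 < endogenous_fraction <= 1.0:
        raise ValueError("endogenous_fraction must be in (0, 1]")
    return target_human_reads / endogenous_fraction


def fold_reduction(pre_capture_fraction, post_capture_fraction):
    """How many times fewer total reads are needed after enrichment."""
    return (total_reads_needed(1, pre_capture_fraction)
            / total_reads_needed(1, post_capture_fraction))


if __name__ == "__main__":
    target = 10_000_000  # hypothetical number of human reads wanted
    for label, frac in [("pre-capture", 0.01), ("post-capture", 0.60)]:
        total = total_reads_needed(target, frac)
        print(f"{label}: about {total:,.0f} total reads")
    print(f"fold reduction: about {fold_reduction(0.01, 0.60):.0f}x")
```

Under those assumptions, a library that is 1% human needs about 100 times as many reads as a pure human sample, which is the "100 times more expensive" figure quoted for the Neanderthal genome, and moving from 1% to 60% human cuts the sequencing needed by roughly sixty-fold.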
In this way, it's really democratizing this field and allowing other individuals to enter. Here's an example of the improved resolution. This is what you would get. Here's a principal component analysis. Here's Africa, Asia, Europe. Here's a Bulgarian tooth, and then if you look at what you would get pre-capture, you get about 1,000 SNPs, and you can say this individual is European, which is not surprising given that they were found in Bulgaria. Now, if we go to the post-capture, so it costs us the same amount of money to generate 10,000 SNPs post-capture as it does 1,000 SNPs pre-capture, we're now beginning to get very nice subcontinental ancestry resolution similar to what we had with the sort of whole human genome sequencing of Ötzi. Another example is our Peruvian mummy. This Peruvian mummy is horrified, perhaps, to find out here that they may be Asian and not Native American. This is actually one of the hypotheses because this mummy actually has blonde hair. You can't tell because it doesn't have any hair, but it comes from the Chachapoya culture, which has blonde hair, and when we go up to the post-capture, we see that in fact it clusters very cleanly, thank you very much, with the Aymara and Quechua Indians that live in that part of the world today. So this was our first sort of example of how this could work, and sort of just to briefly review, we're going from about 1% human upward to 60% human. It's really opening up a whole new set of samples, and so I want to close with two such samples that for us have been particularly important, and we hope representative of where this is going. So one of the areas that we've been very involved with is the transatlantic slave trade and the use of DNA to retrace aspects of the transatlantic slave trade. This is really one of the huge horrors of human history. 12 million people forcibly enslaved and moved from different parts of Africa to the Americas. 
About half of them came into the Caribbean; the vast majority of those who didn't go into the Caribbean went into Brazil; about 200,000 went straight into the Port of Charleston. If we want to understand the history of African descent populations, a lot of that history is written in the Caribbean, and in fact it's probably the best place to start, because you have very good slave ship records that have now been digitized that can begin to give us an anchoring point. This is one such individual. This is a skull that was recovered in St. Martin in 2010. It was excavated as part of a redevelopment program. The skull dates to the 17th century. We believe this individual was part of the Middle Passage. They were born in Africa and brought to the Americas. We know this from both isotope analysis, as well as the sort of dental modifications that are characteristic of certain tribal groups. We sequenced the genome and used the enrichment capture technology. In this particular case, it's a male. One of the first things we can do is look at his Y chromosome. Initially, we were a little concerned because he has an R1b, which is the most prevalent and common haplogroup in Europe. However, we could play the same Ötzi trick because he carried a mutation that is actually pretty rare. When you look at the distribution of that mutation, it tells us that there was sort of a lineage of R1b that made it back to Africa into the Sahel, probably as part of the colonization of North Africa close to 12,000 years ago by Berbers and so on. So the autochthonous North African people. Today, it has its highest frequency in Cameroon. Of course, we have not only the Y chromosome; we also have the autosomes of this individual. The other thing we could do is place him on a very detailed map we now have of the Y chromosome. We've sequenced about 3,000 Y chromosomes in my lab. 
These were actually sequenced by Francesco Cucca as part of his study of, ironically, Sardinia, and there are a group of Y chromosomes from Sardinia that were probably part of the Roman slave trade, and STM1 clusters with them exactly where you'd predict, about 8,000 to 9,000 years ago when that lineage sort of radiated in Africa. When we look at the autosomes, we see that STM1 clusters with... So these are different tribal groups in Africa. The samples were collected as part of an NIGMS project with Sarah Tishkoff, and what we can see is that STM1 clusters with the Bamoun, who are a tribal group in present-day Cameroon. So for us, this was incredibly gratifying and sort of the first African individual who we can map back to Africa and really begin to break that Ötzi rule. The other slave trade that we're studying that's really far less studied is the Indian Ocean slave trade. So a colleague of mine at Stanford is from Madagascar. Sorry, he's from Mauritius, and he's had a long-standing project out there, and so we started collaborating with him. It's very interesting to think about Mauritius. Mauritius was largely not colonized by humans until the modern era, and it began with an importation of slaves from Madagascar, East Africa, and West Africa around the 18th century. That lasted for about 80 years, and then indentured servants were brought in from Asia as well as European colonizers. So this is an island that's got very complex, multi-ethnic admixture, and today has one of the highest prevalences of diabetes in the world. And so one of the things they'd love to be able to do is go in and enable multi- and trans-ethnic mapping of diabetes. We need to understand the population structure, and we now have an opportunity to link this back to history. So Rosa Fregel, who's a postdoc in my lab, began really what's been an incredibly interesting project for us, looking at two sites in Mauritius. The first is a place called the Le Morne Peninsula. 
This is a very dramatic cave formation and a basaltic monolith that sort of jumps up 2,000 feet at the edge of the island. During the slave period, this is where runaway slaves would go and hide, and after they'd been freed and the police and army came to inform them, they were so petrified that many of them jumped to their deaths. So there's actually a cemetery at the bottom here that has these individuals as well as others that perished during that period, and we were given access to this incredibly important place, as well as another cemetery known as Bois Marchand, which is the cemetery for the indentured servants. And so remember when we think about ancient DNA, we always think about the ideal climate for preservation, which is like Siberia, cold and dry. It's not supposed to be hot and wet, so the fact that we're getting DNA out of the tropics in the first place is actually a pretty good achievement. And then secondly now we're looking at this complex issue of admixture. Again, to sort of make a long story short, we have four samples from the slave cemetery, four samples from the indentured servant cemetery. Here you can see the amount of DNA we're able to get after enrichment. For those that came from the slave cemetery, all of the mitochondrial haplogroups are sub-Saharan African, which is very good, at least in terms of the history. It also probably speaks to the sex-biased nature of some of this admixture, and then we'll talk about what the autosomes look like in a PCA in the next slide. Likewise, from the Bois Marchand site, all of the haplogroups are either Indian or Southeast Indian, what you'd expect. So to me, this is really one of the nicest slides we've produced in a long time, because it puts modern-day diversity all in one figure. So here is West Africa, East Africa. Here's Southeast Africa, so Madagascar and so on. Here is East Asia and Southeast Asia, and here's Europe. 
And so you see, for example, here are individuals from South Asia on this cline, and here are two samples from the Bois Marchand indentured servant cemetery who are right in line with that cline. So these individuals are largely what we would today consider South Asian individuals. On the other hand, here's an individual that's kind of equidistant from this cluster and this cluster, suggesting that they are a first-generation admixture of individuals coming from these two groups. Likewise, here's an individual who's largely Southeast African. And then those from the Le Morne Peninsula, which is the slave cemetery, they are much more on this sort of African edge of the diversity space. So again, it's sort of early days in doing this, but for us it's a way to really begin to reclaim this history. We've lost a tremendous amount of this history. Individuals, both academic and non-academic, the general public, want to understand this. The number of people I've talked to who are of African descent or multiethnic descent that want to understand their ancestry, and they turn to ancestry.com or 23andMe or other services to kind of get a foothold on that. Well, how do we know that's accurate? What are good ways of understanding that? How do we also link this to these incredibly painful but also important aspects of our history? And to me, I do view that as part of the mission in broadening representation in biomedical research, because these are questions that ultimately are important for understanding the structure of human populations, but also important to our understanding of ourselves, of our history, of the public's understanding of genetics. And so I think it's actually a very good opportunity to use this to engage the public as to why this is relevant and important. Okay, with that I want to conclude. It's possible to obtain genome-wide ancient DNA from the tropics and reclaim these detailed genetic pictures. 
We're getting closer and closer to realizing the Ötzi rule, and that makes me happy. And I think it's really the way to go. It's important that we not just invest, even in something like ancient DNA, which seems somewhat esoteric, in one subset of the world, because we're leaving a lot of interesting biology on the table and it's just not the right thing to do. In terms of the samples I've talked about, STM1 was likely from a Bantu-speaking population from northern Cameroon. There were two other samples that I didn't talk about. They were found at the same time and came from a totally different part of Africa. So it tells you a little bit about how that all unfolded. And then we're seeing that even in complex admixtures we can deconvolve the ancestry. And so that makes us very optimistic that this technology will help us push further back into the sands of history. With that I want to acknowledge the folks in my lab who've contributed to this. All the ancient DNA work was largely led by Meredith. David is the guy who runs all our Y chromosome work. Alex runs my lab. Maria ran the African Slave Project. This is done in collaboration with Eske Willerslev in Denmark and our shared postdoc Morten Rasmussen. And I always like to put up this Far Side cartoon where the guy's pointing at this guy. So if you have any questions, I'm happy to take them back to my postdocs and see what they'll tell you. All right, thank you. We certainly have time for questions and we have two microphones here since we're videotaping this. If you could walk to a microphone to ask a question, that would be appreciated. Yes. Hi, interesting talk. I had a question about the, I don't know what the definition of ancient DNA is, but in the Mauritian population, and I can't remember where it was in the Caribbean that you sampled the slave skull. St. Martin. St. Martin. What about the modern-day populations in those? Yeah. 
I mean, would there be a discernible difference between modern-day Mauritians versus this same one from 300 years ago? That's a great question. So my answer there is that if we think about, for example, modern-day African-Americans, and you hear Oprah go and get her genetic results and they tell Oprah she's Zulu, right? The one thing I can assure you of is that Oprah ain't Zulu. All of her ancestors aren't Zulu. The odds of that are very, very low. She's got ancestors that come from lots of different places, and her mitochondrial haplotype, let's say, may be at highest frequency in present-day Zulu populations. What we can see in these Mauritian populations, for example, are the beginnings of those admixture events. So you find somebody who's a first-generation admixture event. When we look at present-day African-Americans and Hispanic Latinos, what we see is that this is a process that's been going on since the beginning, right? In Mexico, for example, there's a canonical painting of Cortés and La Malinche, who is his concubine, right? And that's the kind of national image of where the country comes from, right? It's that admixture event. In African descent populations from the US, what we've seen is that there were sort of two pulses, and you can see sort of admixture that happened before the kind of antebellum South and then something that really kind of almost became industrialized at the point of peak slavery, right? Where a quarter of the genome is coming from a European gene pool, right? That's a huge amount when you think about how that must have occurred back then. So it's giving us insights into those aspects. One of the neatest things that's probably happened, and it's totally unrelated to this work, but there's a Romanian skull that got sequenced that is human. It's anatomically modern human. But when you look at it, they've got more Neanderthal admixture than present-day humans, right? Something like 12%. 
And there's one chromosome that even looks like it's really Neanderthal. So that somebody had like a Neanderthal grandmother, right? So that kind of signature we're going to be able to discern in this way. Excellent talk. When you spoke about creating basically a gene panel that can be applied through all seven continents, basically, you talked about the linkage strategy. One question I had is, when you looked at each gene across each continent, did you get a chance to substratify by gender or other confounding variables? That's a good question. So I would say that, well, so the autosomes, right, shouldn't have an average difference between males and females, but for the X chromosome, we probably need to have a somewhat different strategy, particularly in admixed populations. If you think about admixture in the Americas, the X chromosomes tend to be largely Native American and African, whereas the Y chromosomes are disproportionately European, right? And so what we gingerly call sex-biased migration needs to be taken into account, particularly in doing more population genetic analysis. So it's a good question in that regard. So one of your roles that Eric didn't mention is the Independent Expert Committee for Human Heredity and Health in Africa. And as such, sort of looking forward, where do you think that H3Africa has the potential to contribute to this big picture and the story and where do we put the food next? Yeah, that's a great question. So, you know, I think the, I don't want to say the most important thing, but one of the really important things that H3Africa has done is bring together African investigators to tackle these questions in a highly functioning network, right? Which is, you know, it's like very, very non-trivial, right? Like that's a very, very difficult thing to do. If we think about the culture shifts that have happened in medical genetics in the U.S., right? Like in 1996 people weren't sharing data all the time, right? 
Today it's hard to get people to share data, right? And so in building that sort of cohort of investigators, you've really got an opportunity to move the needle in a way that we probably aren't going to have if we don't think about continuing that project. I think that the, what I would love to see is, you know, a closer marriage with something like PAGE where we bring in some of that content into the, you know, the MEGA 2.0 or 2.5 or whatever we want to call it, because I think, you know, there we'll have much better purchasing power and we can negotiate better with Illumina and other providers. And I would also love to see, you know, H3Americas, you know, because I think it, you know, it really needs much better organization. One of the things that I've gotten involved with is the Global Alliance. And it's really incredible to see when you talk to investigators who do research in genetics in places that are totally understaffed and underfunded, right? Like they've read all the papers, right? Like they're much better at reading the literature than we are, right? And they know what the interesting questions are. It's just really kind of a lack of resources that's preventing them from, you know, being superstars in the field. And so for me, it's always a pleasure to travel to developing countries and work together with them because it honestly totally reinvigorates me. Like whenever I get jaded, I'm like, okay, I need to go down to Peru or something because like then you go talk to people who are like, you know, where all of the lab benches are stacked like this. And, you know, investigators are making $1,000 a month and, you know, using part of their own resources to fund their research, and you're like, okay, like this is for real, right? Like this is what a life's passion is about. And so I think it's really, really important to do that because if we don't engage as a kind of global community, then you get bad outcomes, right? It's not the right way of doing things. 
Hi, Carlos. Again, just to follow up on that comment, I'm a little bit concerned about the placement of variation, especially in the context of Africa, given all the geopolitical history. For example, Cameroon: at one time Cameroon was part of Nigeria. And so how do we begin to tell the story, especially when we are trying to present a picture where maybe people in the Americas want to actually trace, you know, in a geographical sense where they are from or where their ancestors are from? How do we marry these geopolitical boundaries that were mostly products of colonization, you know, with what was really real and where people actually lived? Yeah. No, thank you, Charles. An incredibly important point. So I try to be careful and say, you know, they cluster with the Bamoun people who live in present-day Cameroon. Because I feel it's like about as accurate and as neutral a statement as one can make. But I think it's a very important point. And particularly if we think about, you know, so we've worked, for example, with the San and the Khoisan. And, you know, that is not the same population that's lived there for 100,000 years, right? They're not Stone Age people that have been, you know, frozen in glass for you, right? So they themselves have had a dynamic population history. And particularly when you're talking about populations that are traditionally hunter-gatherers and had a much, you know, bigger distribution. I don't think there's an easy answer. I mean, I think we have to negotiate as a community how we talk about it and how we communicate this to the public. And that's why many of these debates aren't for scientists alone to spearhead because they involve so many other groups, right? You know, we've had the same kind of discussion when we talk about, for example, the work in Peru. And I'll say, you know, we study this cohort. They're 100% Native American, right? They don't have a lot of admixture. And then I rightly get called out, well, what do you mean? 
What does that mean? What if somebody, you know, is from a Native American group and they've got European ancestry? Like, they're not 100% Native American? Like, no, of course they are, you know, it's different definitions of how we identify, either culturally or genetically or biogeographically, ancestrally. And the more precise we are, probably the better, right? So if we say, you know, they share 100% membership in a clustering with Aymara and Quechua, well, then you kind of know what you mean by 100% Native American versus, you know, starting a debate about, well, you know, what's the difference between cultural identity versus genetic identity? We had a similar, I think, very good debate when we were working in Puerto Rico. And one of the reasons that Puerto Rico was chosen for 1000 Genomes, you know, Charles helped us really lead the charge on why we needed to do this in 1000 Genomes, was because there isn't a present-day population that you would say, okay, they are the modern-day descendants. And so when we started working on that and presented it at ASHG, the idea was that Puerto Ricans carry about 12% Native American ancestry that is coming from the Taino ancestors. So if you could stitch this together across many Puerto Ricans, you could reconstruct the genome of the Taino or reconstruct Taino diversity at the time of contact. And so Nature News wrote this article that was called Reconstructing an Extinct Ethnicity, right? And that set off this really powerful debate because there are people, not so much in Puerto Rico, actually, interestingly, in New York, who self-identify as Taino. And they're like, well, who the hell are you to tell us that the Tainos are extinct? Right? Like, this is just another kind of recolonization event. And we said, okay, well, we never... That's how... First, extinct is not accurate because extinct means left no descendants. These people clearly left descendants, right? 
And it's a very, you know, it's a very passionate debate, you know, but one that, you know, shouldn't just be had by scientists. So I totally agree. I don't have an easy answer, but I think it's one that we really need to think about. Well, Carlos, as we expected, this was a phenomenal way to start the series. We've touched on many issues, scientific, social, and other, and we're going to continue with this theme as we go through some of the other speakers. So please join me in thanking Carlos.