Okay. Good morning, everyone, and thank you for joining us on this absolutely beautiful day here on the Bethesda campus. Our lecture today is devoted to genome-wide association studies, and as you all know, these kinds of studies really help us separate genetic variations that are biologically insignificant from those that do produce some sort of change that might ultimately be detrimental or advantageous to a particular individual. The study of these variations is also critical to identifying which genes are responsible for a particular genetic or genomic disorder, as you heard about during last week's lecture by Lynn Jorde. There's also a much more practical reason to study these genetic variations, particularly the single nucleotide polymorphisms, or SNPs, that give rise to all of those subtle differences between each and every one of us in this hall, since a very thorough understanding of these variations might provide a way for us to know in advance how well someone will respond to a particular drug or to a particular treatment regimen. And we'll hear much more about the pharmacogenomic implications of having this kind of knowledge in next week's lecture in this hall by Howard McLeod. This week, I'm very pleased to introduce to you Dr. Karen Mohlke, who will be presenting today's lecture on genome-wide association studies. Dr. Mohlke is an NHGRI alumna, having done her postdoctoral work in Francis Collins' lab, where she used genome-wide approaches to localize diabetes susceptibility genes. She is currently an associate professor in the Department of Genetics at the University of North Carolina, a member of the Carolina Center for Genome Sciences, and a member of the Lineberger Comprehensive Cancer Center at UNC. Her lab studies complex traits with complex inheritance patterns, using many of the approaches that she will be describing to you today to study conditions such as type 2 diabetes and obesity. 
As always, it's a pleasure to have you here with us today, Karen, so please join me in welcoming Dr. Karen Mohlke back to the NIH campus. All right, thank you very much. It's always a pleasure to be here. So as Andy said, I'm going to be talking today about genome-wide association studies, and these are especially relevant for complex traits. I have no relevant financial relationships to disclose. So complex traits are traits that have both genetic and environmental contributions. There may be many genetic factors and many environmental factors, and these factors may interact; that is, there's not necessarily a single gene responsible for these traits, and some of the genetic factors have rather subtle effects. Genome-wide association studies are especially good at identifying common genetic factors that may be responsible for common variation in complex traits. By common factors, I mean that when looking at several copies of a stretch of DNA sequence, of course, most of the nucleotides are identical between those sequences. But sometimes there are differences. For example, here's a T, but in some copies of the sequence there's an A. That's a relatively common variant: 3 out of 10 times in that representation it's an A allele, so an allele frequency of 30%. There are also DNA variants that are less common, or rare. So for example, later in the sequence, there's only one copy of a G allele where there may be 100 or 1,000 other copies that are the C allele. When we think about the genetic architecture of genes influencing common complex traits, we can consider the different power of various approaches to identify the underlying genetic variation. 
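The allele frequency arithmetic just described can be sketched in a few lines of Python. The list of sequence copies below is hypothetical, mirroring the slide's 3-out-of-10 example.

```python
def allele_frequency(alleles, allele):
    """Fraction of observed sequence copies carrying the given allele."""
    return alleles.count(allele) / len(alleles)

# Hypothetical copies of one sequence position, as in the lecture's example:
# 3 of 10 copies carry the A allele, the rest carry T.
copies = ["T", "A", "T", "T", "A", "T", "T", "A", "T", "T"]
freq_a = allele_frequency(copies, "A")  # 0.3, i.e. an allele frequency of 30%
```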
If we consider the frequency of the variants, up here being more common variants, and common is often defined as an allele frequency greater than about 5%, moving on down to the very rare alleles, those that might be present in only one person or one family. And consider the effect of the allele, how strongly that variant acts to cause disease or to increase risk of disease: a very strong effect allele is shown high on the Y axis compared to ones that have a relatively modest effect, low on this axis. So genome-wide association studies are especially well suited to identifying common variants implicated in common diseases, in contrast to, say, the rare alleles causing Mendelian disease that were more easily identified using linkage or candidate gene approaches. There have been relatively few examples of high-effect common variants that influence common diseases. And as genomic technologies advance, we're moving from the common variants into the lower frequency variants, so lower frequencies maybe from 5% down to about half a percent, and as sequencing technologies develop and more individuals are sequenced, we're moving toward identifying more of the rare variants that play a role in both common and Mendelian-type disorders. So today, as we talk about genome-wide association studies, I'm going to talk first about what the goals of these studies are, how the studies are performed, what can be learned from the associated regions that are identified by the studies, and then what the findings tell us about disease. So the first genome-wide association studies were done perhaps seven years ago now; many more were done around three to five years ago, and they continue today. 
The benefits of doing a genome-wide association study compared to classical approaches such as linkage analysis or candidate gene association studies are that genome-wide association studies are more powerful than linkage for identifying common, low-penetrance variants; they provide better resolution than linkage, so that the variants identified are closer to the underlying causal genes and/or variants; and they can be performed in an unbiased way. There's no need to select candidate genes or know the underlying biology ahead of time, so these studies can discover completely novel pathways involved in a disease or trait that were not previously known. Now, why were they only started several years ago? There were requirements to perform a genome-wide association study. We needed to know the catalogue of human genetic variants, so the genome needed to be sequenced and genetic variants across the genome identified. There was a need for low-cost, accurate methods of genotyping, and technology advances have made this possible, so now hundreds of thousands or millions of variants can be genotyped in a single reaction. We needed large studies of people, large numbers of informative samples, and, along the way, efficient statistical design and analysis methods to handle the large number of variants being analyzed. So the goals of a genome-wide association study are to test a large proportion of the common single nucleotide variants for association with a disease or with variation in a quantitative trait, and to do all of this without having to have any prior hypothesis of how the genes may act or what their functions might be. I'll talk through many of the steps in a genome-wide association study. 
So: starting with ascertainment and collection of the individuals and samples, the methods for performing genotyping, steps of quality control using the genotyping data, some of the methods of statistical analysis, and the importance of replication. As we start thinking about the phenotype being studied, this can be either a disease or a quantitative trait: a disease such as type 2 diabetes or prostate cancer, or a quantitative trait such as height or cholesterol levels, something that's not discrete but has a continuous distribution of phenotype across individuals. A disease could be rare or common, although the common disorders are perhaps more appropriate for a genome-wide association study. Quantitative traits have the advantage of being easy to measure, things like weight and height, though some of them require careful approaches to obtain an accurate measurement. Genome-wide association studies can also be performed using traits such as the expression levels of all of the genes across the genome. The accuracy with which a phenotype is assigned is an important step in analysis. The more well-defined the phenotype is, the more likely one will be able to identify the genetic variants responsible for it. The more heterogeneous the phenotype, if it's really a mixture of many different causes that create that disease, then those causes will be mixed together and the underlying factors harder to identify. When selecting the individuals for analysis, one strategy is to perform a case-control analysis, meaning ascertaining cases affected with disease and also ascertaining controls who do not have the disease. Another approach would be to do a population survey: collect many, many individuals across the population and then determine which of those are affected with disease. 
Using the population survey, only a smaller proportion of individuals will be affected with the disease, but they may be more representative of that disease in the population than if you ascertain cases that are severely affected, which might be less representative, although they might offer a greater possibility of identifying the genetic variants responsible. So in a case-control analysis, the approaches used to define the cases are relevant and important to consider when interpreting the results of a case-control association study. Were cases defined by an extreme phenotype? How were they collected? Is there some special subset of phenotype that may be especially enriched in that particular set of cases? Similarly with controls: if the controls are selected to be random members of the population who are not yet affected with disease, but some of them, for an adult-onset disorder, will perhaps become affected next month or next year, they are perhaps less good controls when seeking to have a greater difference between groups, so consideration of these approaches is important for how the results are interpreted. Potential criteria one could use when selecting cases would be to choose individuals who are more severely affected with the disease; these might be individuals with a greater genetic load, providing a greater opportunity to identify the underlying genetic factors. One could require other family members to have the disease, which is more evidence of a genetic factor being responsible, as opposed to more of an environmental contribution. For an adult-onset disorder, choosing individuals with a younger age of disease onset could also enrich for genetic factors. When considering criteria for selecting controls, one could enrich the genetic effect by choosing individuals with a lower risk of disease rather than population-based samples. 
It's important to keep the ancestry of the controls and the cases matched as well as possible, and to try to match the controls to cases based on age, sex, and other demographic factors that may influence disease. To show a bit of an example about matched ancestry: if the cases are collected from the population but have different underlying ancestry, represented here by different shadings of the symbols, and if the proportions of those ancestries are differently represented between the cases and the controls, and there are genetic variants that are more common within some of those subsets than others, then those genetic variants may appear to be associated with disease when truly they are associated with being part of that subpopulation. When performing an association study in a set of samples that has not previously been analyzed genetically, you may have inadequate ancestry information prior to performing the genotyping. Ascertaining individuals from a particular area may lead you to assume that the ancestry is similar between individuals. After performing genotyping with hundreds of thousands of markers across the genome, one can look at the frequencies of different alleles and perhaps identify subsets of individuals that create subpopulations within the sets of cases and controls. Another word for these subpopulations that I've been talking about is population stratification, the issue being that population stratification can produce false positive association results in case-control studies. In addition, individuals who are cryptically related, who you don't know are related but are, say, cousins, something not known at the collection of the individuals, can enrich for particular alleles within samples, and that can also create a false positive association. 
There are ways to account for or avoid stratification and relatedness. One is to perform genomic control. This is a correction that evaluates the average excess association identified and adjusts the results of the association study by this average measure, in effect altering the threshold used to define what a significant result is. Another approach is to use the allele frequencies of variants across the genome to identify principal components of subpopulations or substructure within the samples, and then include those principal components as covariates in the analysis to account or adjust for them. Another approach to avoid population stratification would be a family-based study design, where instead of selecting cases and controls, the association analysis is performed within families, considering the relationships between the individuals. Given a set genotyping budget, however, there's reduced power for identifying variants when individuals are related and part of those families. So, the genotyping process: genotyping panels are now available with as few as 10,000 SNPs, single nucleotide polymorphisms, and as many as 5 million SNPs. Two main companies provide a number of fixed-content panels, meaning that the genotyping arrays or chips come with a set group of SNPs being evaluated on them. As for the approaches used to select the SNPs for these panels, some of them are random SNPs, and some are selected to be haplotype tag SNPs; Lynn Jorde talked about this and I'll show a slide about it as well. Some of the nucleotides chosen to be on these panels are not nucleotides that vary, that have different alleles in the population, but ones for which the intensity of the signal differs because of a copy number variation. 
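The genomic control correction just described can be sketched as follows, under the usual formulation: estimate the inflation factor lambda as the median of the observed 1-degree-of-freedom chi-square association statistics divided by the expected median of a chi-square(1) distribution (about 0.4549), then deflate each statistic by lambda. The statistics below are invented for illustration, not from any real study.

```python
import statistics

EXPECTED_MEDIAN_CHI2_1DF = 0.4549  # median of a 1-df chi-square distribution

def genomic_control(chi2_stats):
    """Estimate the inflation factor lambda and return deflated statistics."""
    lam = max(1.0, statistics.median(chi2_stats) / EXPECTED_MEDIAN_CHI2_1DF)
    return lam, [s / lam for s in chi2_stats]

# Made-up association statistics showing mild inflation:
lam, adjusted = genomic_control([0.2, 0.5, 0.9, 1.4, 3.8, 7.1])
```

Dividing every statistic by the same lambda is what makes this an "average" correction: it shifts the whole distribution rather than targeting individual stratified markers.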
And some of the arrays now available have fixed content, but the user is allowed to add on an additional 10,000 to 50,000 single nucleotide variants. So if you were to perform a genome-wide association study today, you may choose a panel and then say, oh, but these particular variants are missing from that panel; perhaps you know of some less common or rare variants that are not on the panel, or some particular functional variants that you think really play a role, and those could be added onto the panel. Higher-density SNPs in special regions of interest could also be added onto those arrays. When I talk about selecting haplotype tag SNPs, I have an example shown here. In this example there are four copies of a particular chromosome. Again, most of the nucleotides are the same; this is representing three single nucleotide variants in this region. When combined together with variants that are both upstream and downstream, the variants can be represented as haplotypes. Given the history of human populations and the non-random recombination events that have occurred during human demographic history, there are clusters or sets of SNPs that are inherited together in most members of the population. So selecting SNPs that are representative of the variation of other SNPs is more efficient: it allows fewer SNPs to be genotyped while still representing a larger proportion of the variation. So for example, these haplotypes of 20 variants can be represented by choosing just three SNPs within this set. And there are other variants that could be chosen as well; this is just an example. The TCTC variant here could also easily be represented by this variant here, CTCT. But the set of three variants represents the variation present. 
This also means that when interpreting the results of an association study, although a single variant might be described or reported in a paper, say, as showing strong evidence of association, it's important to remember that there are other variants located nearby that are in linkage disequilibrium with that variant, inherited together in the same pattern, that would also show similar or identical evidence of association with that trait. So I'll talk through a few of the methods of allelic discrimination that are used in these genome-wide genotyping panels. One of them is the Illumina Infinium assay. In the Illumina assay, DNA is amplified to generate larger amounts of DNA, and then the DNA is captured on oligonucleotides that are bound to bead arrays. An allele-specific extension, or mini-sequencing assay, is then performed. So here is the genomic DNA target. It hybridizes to a sequence on an oligo that's bound to a bead, and a sequencing reaction happens, so that if the allele provided is a perfect match, the polymerase can continue with that sequencing reaction. If there is a mismatch at the end nucleotide, then no continuing sequencing reaction can occur. There are a few different forms of this assay that Illumina provides: an Infinium I assay and an Infinium II assay. In the first case, there are two different bead types used to represent a single SNP, and one color of detectable label. In the other form, a single base extension reaction happens, so a single bead type is used, and two different colors of detector are involved. So when Illumina describes the number of SNPs available on a panel and the number of, say, custom-designed SNPs that could be added to a panel, they talk about bead types, because some SNPs are assayed well with a single bead type and some SNPs are assayed better with two bead types. 
Affymetrix has a genotyping platform called the GeneChip Array. In this strategy, the genomic complexity of the DNA is reduced by performing restriction enzyme digestion and size selection of the fragments; adaptors are added, followed by amplification steps, fragmentation, and labeling, and the allelic discrimination happens based on hybridization of one allele to sets of oligos on the array. So in their GeneChip probe array, there are millions of copies of a specific oligo probe bound in a given region: here are DNA probes in one part of the array, with multiple copies of this same sequence carrying the same variant allele. A given SNP can be represented by many different probes. Say the variant allele may be in the center of the oligonucleotide, and there could be as many as four different sequences represented on the probes, representing all four possible alleles that could be bound there; then the variant could be offset by a nucleotide or two, not precisely in the middle but moved over, or the probe could be a little bit longer or a little bit shorter. With time, the choice of which probes are the most efficient at discriminating between the two alleles improves, and that's what allows Affymetrix to fit more variants onto an array and allows the discrimination to be optimized for given variants. Affymetrix also has a newer platform, their Axiom array. 
In this case, the DNA is amplified and fragmented enzymatically into, say, 25 to 125 base pair fragments, and then the fragmented amplicons are loaded onto the array to hybridize to oligos. After selection, a solution of labeled random nonamer oligos is hybridized to the array, such that if the alleles match, a ligation reaction can be performed. So the allelic discrimination is based on ligation, which requires the alleles of the adjacent nucleotides to be matched and to hybridize well, and that provides somewhat greater allelic discrimination than, say, hybridization alone would provide. And then the labels that are present are stained and imaged. So here's a representation of what the coverage of common variants is for a set of available arrays, and these are some of the older arrays. Coverage is calculated by looking at some defined set of common variants, and when you interpret what the coverage of a particular array is, you want to consider what that set of variants is. Often HapMap variants or 1000 Genomes variants will be the defined set; the more sequencing that happens, the more variants are identified, so knowing what that reference set is is valuable. Then the linkage disequilibrium between a given variant and the other variants present in that set is used to estimate the coverage for the given chips, and the coverage is going to differ based on the population of the individuals being assayed, because allele frequencies differ and linkage disequilibrium relationships differ between populations. So some of the newer arrays that have more variants present on them do a better job, and have higher coverage of common variants, than some of the older arrays. Now, the most recent generation of SNP arrays is improving coverage of the lower frequency variants. 
So whereas the initial arrays were covering frequencies of 5% and greater, now the frequencies covered are moving down into the less common ranges. Here's a slide from Illumina. One of the newer arrays they have available was specifically designed for the Chinese population: this particular chip was designed to select variants based on individuals of Chinese ancestry, and they show the coverage, on the y-axis here, of variants with an allele frequency greater than 5% on this particular array compared to one of their other genome-wide association arrays. So here's a more general array, and this is the one designed to be specific for the Chinese population. You can see that they're also improving the coverage of the less frequent variants: coverage of those with a minor allele frequency greater than 2.5% increases with this specific chip. To be fair, here's also a slide showing one example of an array from Affymetrix, and they too, in their latest available arrays, show that they have good coverage of the common variants and are also moving toward improved coverage of the less common variants in the somewhat lower frequencies, sort of that 2 to 5% allele frequency range. Okay. So genotyping of samples, cases and controls, or members of the population is performed, and genotyping data comes back. There are a number of quality control steps that are important to do in a genome-wide association study prior to performing the association analysis. One is to look for and detect poor quality samples: samples that had a success rate less than some level, where maybe 95% of the SNPs are successful. The more SNPs that fail, the more the SNPs that succeed are called into question as perhaps generating inaccurate genotypes. 
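The sample success-rate filter just mentioned can be sketched as below; `None` marks a failed genotype call, and the sample data and the 95% threshold (the value used in the talk) are hypothetical.

```python
def call_rate(genotypes):
    """Fraction of markers with a successful genotype call (None = failed)."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def passing_samples(samples, threshold=0.95):
    """Keep only samples whose genotyping call rate meets the threshold."""
    return {name: g for name, g in samples.items() if call_rate(g) >= threshold}

samples = {
    "S1": ["AA", "AG", "GG", "AA"],  # call rate 1.0 -> keep
    "S2": ["AA", None, None, "GG"],  # call rate 0.5 -> exclude
}
kept = passing_samples(samples)
```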
So if most of the samples are working very, very well and some of them are not working as well, then it could be that heterozygotes are being miscalled as homozygotes for particular alleles, and so identifying and excluding poor quality samples is valuable. An excess of heterozygous genotypes might suggest that a DNA sample is really a mixture of two DNA samples. One can use the genotype data to evaluate whether any sample switches have happened in the process from when the sample was collected from the individual: the tube of blood was collected, it was processed into DNA, it probably changed hands many times, it was moved from a tube onto a plate, and the plate was then genotyped, and in that whole process sample switches can happen. One way to identify whether that has happened is to look at the sex of the individual based on markers on the X and Y chromosomes and evaluate whether it matches the sex expected for that individual. If DNA samples have been around a lab for a while, then particular genotypes known from one set of genotyping reactions can be compared to those done with another assay at another time point, to see whether any sample switches have happened in the intervening time. One can also use the genetic data to look for unexpectedly related individuals. Again, when analyzing a cohort or population sample for the first time, one can use pairwise comparisons of genotype similarity and look for, say, unexpected duplicates, which might turn out to be monozygotic twins or people who participated in the sample collection more than once with different identifiers. And you can also use the allele frequencies of variants across the genome to look for individuals whose ancestry may be a little bit different from the rest of the sample, and then either exclude them or account for those differences when performing the later analysis. 
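The pairwise genotype-similarity check for unexpected duplicates or relatives could be sketched like this. The genotype vectors are hypothetical and tiny; a real pipeline would compute identity-by-state or identity-by-descent estimates over many thousands of markers.

```python
def genotype_concordance(sample_a, sample_b):
    """Fraction of markers at which two samples carry the same genotype."""
    matches = sum(a == b for a, b in zip(sample_a, sample_b))
    return matches / len(sample_a)

s1 = ["AA", "AG", "GG", "AA", "AG", "GG"]
s2 = ["AA", "AG", "GG", "AA", "AG", "GG"]  # identical: a suspected duplicate
s3 = ["AA", "GG", "GG", "AG", "AG", "AA"]  # a more typical unrelated sample
dup_score = genotype_concordance(s1, s2)        # 1.0
unrelated_score = genotype_concordance(s1, s3)  # 0.5
```

A concordance near 1.0 across the genome flags duplicate DNA or monozygotic twins; intermediate excess sharing flags cryptic relatives.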
In addition to looking for poor quality samples, one can look for poor quality SNPs. Shown here are a few examples of raw genotyping data from sets of individuals. Over on the left, the signal intensity of one allele, labeled the X marker, we'll call the A allele, and the signal intensity of the other allele, labeled the Y marker, we'll call the C allele. This is a lovely looking marker: the allele intensity is very high on the A axis for this set of samples and relatively low on the C axis, so these would be the AA homozygotes; these, similarly, are very high on the C allele axis, so these would be the CC genotype; and these would be the heterozygotes. It's an ideal genotyping plot. When doing hundreds of thousands and millions of markers, software is used to assign the genotypes to the various clusters. Occasionally, the software might not detect that two clusters are distinct and might call them together as heterozygotes, erroneously assigning heterozygous genotypes to those individuals; one looks for cases when that happens and fixes them or excludes those markers. Some assays for given SNPs don't work all that well, and the discrimination between the clusters is not clean, so the individuals that fall especially close between two clusters may be more likely to be miscalled with an incorrect genotype. Those genotypes can be excluded, or it's often most helpful to recognize the problem and perhaps exclude the entire marker, to avoid having errors in the data that might lead to false positive or false negative associations. That evaluation often happens at the genotyping level: the individuals performing the genotyping analysis are the ones looking at the raw data and evaluating those characteristics. One can also detect SNPs of poor quality by looking for a genotyping success rate less than 95%. 
So now this is a SNP that worked in less than 95% of the samples. It's a somewhat arbitrary threshold, but a commonly used one, and failing it might suggest that there's some problem in the assay: perhaps it's not discriminating well between the clusters, perhaps the genotypes that remain are inaccurate, and therefore excluding the marker would be more prudent. Often these analyses are done with a small percentage of samples duplicated, present twice within the set of samples being genotyped. The genotypes from those duplicate samples can then be compared, and finding mismatches or discrepancies between those identical samples is a bad characteristic for a SNP; one would want to exclude those particular markers. One can also do a test for Hardy-Weinberg equilibrium, looking for genotype frequencies that are not consistent with those expected from the observed allele frequencies. This also suggests that the marker perhaps has a problem, that perhaps heterozygotes are more often being incorrectly called homozygotes, and so statistical tests can be used to identify that kind of error. If there are related individuals within the samples, such as mother-father-child trios, then one can look for Mendelian inheritance of alleles from the parents to the child. Some groups will add additional quality control samples to their sets of samples to allow these kinds of SNP errors to be detected. And then it's also important, if a set of cases is going to be compared to a set of controls, that the genotyping be done as similarly as possible between those two groups. If the cases are genotyped entirely separately from the controls, then it's possible that there is different allele missingness, or different accuracy of the calls, between the cases and the controls, and this can lead to false positive associations. 
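A chi-square version of the Hardy-Weinberg check described above can be sketched as follows: estimate the allele frequency from the genotype counts, compute the genotype counts expected under equilibrium, and compare. The counts are invented, and a real QC pipeline would typically use an exact test when expected counts are small.

```python
def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts to HWE expectation."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the A allele
    q = 1.0 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_aa, n_ab, n_bb]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

stat_ok = hwe_chi2(360, 480, 160)   # counts at HWE: statistic near zero
stat_bad = hwe_chi2(500, 200, 300)  # heterozygote deficit: large statistic
```

A large statistic for `stat_bad` reflects exactly the failure mode mentioned in the talk: heterozygotes being incorrectly called as homozygotes depletes the middle genotype class.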
So it's important to try to intermingle the cases and controls as much as possible, to account for any differences in plates or arrays or any of the technical steps in the genotyping and to detect any potential errors. Okay, so once the genotype data is cleaned, meaning that poor quality samples and poor quality SNPs have been removed, then one can go test for association. In a case-control study, this means looking for differences between the cases and controls in terms of their allele frequencies or genotype frequencies. So for example, one could perform a test for trend, looking at the counts of individuals with different genotypes within the cases and controls. If there are covariates that are also associated with disease, say if the disease prevalence increases with age, or if it's more common in males than females, then covariates representing these factors should be included in the analysis to account for them and to improve the opportunity for the genetic variants' contribution to disease risk or to the quantitative trait to be identified. Often tests are done looking for an additive effect of the alleles on the trait, meaning that having one allele has an effect and having two alleles has more of an effect. Other tests can be done looking for evidence of dominant or recessive models, although the additional number of tests performed in an analysis like this would need to be considered when deciding what the threshold of significance for the overall results is at the end. So for example, in a case-control study, when looking for the effect of an allele on the risk of developing disease, one could calculate an odds ratio. 
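The test for trend over genotype counts can be sketched with a Cochran-Armitage statistic using the additive weights 0, 1, 2 (one weight per copy of the allele). The lecture names the test but not this formulation, so treat it as one standard way of carrying it out; the counts below are fabricated to show a null and a non-null pattern.

```python
import math

def trend_z(case_counts, control_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic; approximately standard normal under H0."""
    R, S = sum(case_counts), sum(control_counts)
    N = R + S
    cols = [c + d for c, d in zip(case_counts, control_counts)]
    T = sum(w * (c * S - d * R)
            for w, c, d in zip(weights, case_counts, control_counts))
    var = (R * S / N) * (
        sum(w * w * n * (N - n) for w, n in zip(weights, cols))
        - 2 * sum(weights[i] * weights[j] * cols[i] * cols[j]
                  for i in range(len(weights))
                  for j in range(i + 1, len(weights)))
    )
    return T / math.sqrt(var)

z_null = trend_z([100, 200, 100], [100, 200, 100])  # identical distributions: z = 0
z_assoc = trend_z([50, 100, 150], [150, 100, 50])   # allele enriched in cases
```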
So if these are the counts of the A and C alleles in cases and controls, then one can calculate an odds ratio as the odds of having a C allele given case status over the odds of having a C allele given control status. A value greater than one shows increased risk of disease for that particular allele, and an odds ratio significantly less than one is evidence of decreased risk of disease. When performing association analysis on a genome-wide scale, many, many tests are done. If 300,000 to 5 million SNPs are being analyzed, one would want to correct for that number of multiple tests when defining what a significant result is and what a spurious chance result could be. One approach is to take a commonly used threshold of significance, say 5%, one in 20 times that you might see a difference between cases and controls at this level of significance by chance, and divide that by the number of statistical tests being performed. A commonly used threshold assumes that the number of independent common variants being tested, and this was based on a Caucasian population, is approximately a million, and so taking a p-value threshold of 0.05 and dividing it by a million creates a new threshold of 5 times 10 to the minus 8. This is a commonly used threshold for declaring that a particular result is significant and not likely to have occurred by chance. Achieving a threshold like this requires either a large effect of that particular variant or a large sample size to detect a more modest effect. Question from the audience: are there different strategies, for example using a false discovery rate as opposed to this Bonferroni correction for multiple tests? Yes, different approaches are used. 
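The two calculations in this passage, the allelic odds ratio and the Bonferroni-style genome-wide threshold, can be written out directly; the allele counts are invented for illustration.

```python
def odds_ratio(case_c, case_other, ctrl_c, ctrl_other):
    """Odds of the C allele given case status over the odds given control status."""
    return (case_c / case_other) / (ctrl_c / ctrl_other)

# Invented counts: C allele seen 600 vs 400 times in cases, 500 vs 500 in controls.
or_example = odds_ratio(600, 400, 500, 500)  # 1.5: C allele raises the odds

# Bonferroni-style genome-wide threshold: alpha = 0.05 over ~1 million tests.
genome_wide_threshold = 0.05 / 1_000_000  # about 5e-8, as quoted in the talk
```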
I would say that declaring a threshold of 5 times 10 to the minus 8 is very commonly used within the literature, although people will argue whether it's an appropriate threshold, and often there are signals that do not reach it due to limited power; when sample size increases in the next round of a study, those variants become significant, and so it is a valuable thing to consider. So I show here an example of what results would look like from an association test. This is from an early association study for type 2 diabetes, comparing not quite 1,200 type 2 diabetes cases to not quite 1,200 normal glucose tolerant controls. This is work of the FUSION study, and the results shown here are for the genome with the chromosomes lined up end to end, so chromosome 1 on the left all the way down to chromosome 22 and then the X chromosome, with each dot representing a single nucleotide variant that was tested for association. This analysis was done using logistic regression with an additive model, adjusting for age, sex, and birth province even within Finland to account for potential stratification. Then on the y-axis is the minus log 10 of the p-value, so a p-value threshold of 0.05 would be about there, and you can see that when doing this many tests, that's not an appropriate threshold for defining what's significant; there are many, many variants that have a p-value smaller than that threshold. The threshold accounting for the number of tests done here would be in the 10 to the minus 7 or 10 to the minus 8 range, and you'll notice that the maximum of the scale here is six, so none of the results from this initial study reached that threshold of genome-wide significance. That makes it difficult to figure out which variants might represent true positives.
At the time that this study was done, before genome-wide association studies were available, there were three variants, or three loci, that had a well-established role in genetic contribution to type 2 diabetes, and so we looked for the location of those variants within this data. One of them was at the TCF7L2 locus, and it was gratifying to see those variants present within the top 10 SNPs of this association analysis, which suggested that it would be possible to identify genetic factors. Another of the variants was at the PPARG, the PPAR-gamma, locus, maybe in the top 300 variants, and another of the variants with an established role was around 3,000th on the list of 300,000 variants analyzed. One way to evaluate whether there's an excess of significant results at a given threshold is to plot the p-values that result from the test of association against the p-values from a uniform distribution. So shown here on the x-axis is minus log 10 of a uniform distribution, and on the y-axis minus log 10 of the p-value from the test of association. There's a black line showing the expected results right along the diagonal, and the blue dots represent the data that I just showed you, so you can see that there's a slight movement off of this line, but the data very much follow along it. This is good from the perspective that there's no excess of associations that might represent population stratification or some sort of excess relatedness within the individuals, but it's bad from the perspective that no variants showed strongly significant excess evidence of association in the true analysis compared to the uniform distribution. If one were doing an association analysis in a population that had evidence of substructure or stratification, then a similar plot might show that the variants, these dark blue dots, show an excess of significance sort of all the way through the scale.
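The quantile-quantile plot described above is simple to construct: sort the observed p-values and pair each with its expected quantile under the uniform null, both on the minus log 10 scale. The function name below is illustrative; the (i - 0.5)/n plotting position is one common convention.

```python
import math

def qq_points(p_values):
    """Pair each observed p-value with its expected value under the
    uniform null distribution, both as -log10, for a QQ plot.

    The i-th smallest of n p-values is expected near (i - 0.5) / n;
    points rising above the diagonal indicate an excess of small
    p-values beyond chance."""
    n = len(p_values)
    observed = sorted(p_values)                  # smallest p first
    expected = [(i + 0.5) / n for i in range(n)]
    return [(-math.log10(e), -math.log10(o))
            for e, o in zip(expected, observed)]
```

A genuinely null genome-wide scan gives points hugging the diagonal, as in the FUSION plot described above; uncorrected stratification lifts the whole cloud off the line.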
If this population stratification is adjusted for, then the p-values that result from the association test are more in line with that expected distribution. And so correcting for population stratification can reduce the excess associations that are false positives, not due to true genetic signals. After performing an association analysis, doing all that work, and not identifying significant results, a frequent next step is to try to gain statistical power by increasing sample size. Larger sample sizes will have a greater possibility of identifying genetic factors that have a more modest effect. The common way this is done is that each group performs their own genome-wide association analysis, and then the data from several studies are combined by performing a meta-analysis of the results for each genetic variant. Now, there are potential issues in performing a meta-analysis across studies. One is that different genotyping platforms and different analysis strategies may have been used in the beginning, and also the definition of cases and controls may differ. So there's some heterogeneity that's introduced by the fact that different studies are performed in different ways. Generally, the strategy that has been applied is that larger sample size is more valuable and more powerful in the face of these differences in sample collection, so results need to be considered with some caution about what heterogeneity might underlie them; but generally, larger sample sizes are identifying additional variants. To address the different genotyping platforms used by different groups, several strategies for imputing, or predicting, the missing genetic variants between platforms have been developed.
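Combining per-variant results across studies is typically done with a fixed-effect inverse-variance meta-analysis: each study contributes its effect estimate (say, a log odds ratio) weighted by the inverse of its variance. This is a minimal sketch; the function name is illustrative, and real pipelines would also handle strand flips, allele alignment, and random-effects models.

```python
import math

def fixed_effect_meta(betas, ses):
    """Fixed-effect inverse-variance meta-analysis of per-study effect
    estimates (e.g. log odds ratios) and their standard errors.

    Each study is weighted by 1/SE^2, so larger, more precise studies
    dominate. Returns (pooled_beta, pooled_se, z)."""
    weights = [1 / (se * se) for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = 1 / math.sqrt(sum(weights))
    return pooled_beta, pooled_se, pooled_beta / pooled_se
```

Two equally precise studies with the same effect pool to that effect with a standard error smaller by a factor of the square root of two, which is exactly the power gain from doubling the sample that motivates these consortia.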
So in imputation, one might have in your study sample genetic variants typed at, say, a position here, a position here, a position here, but the other genetic variants in the intervening regions were not typed; they were not selected for that genotyping platform. The study samples can be compared to some sort of dense genotyping platform or dense set of genotypes. HapMap is a commonly used set of variants; these are samples that were chosen to try to be representative of particular populations and that were analyzed at a much denser set of genetic variants. More recently, the 1000 Genomes Project has generated an even denser set of variants. And so one could take the genotyping data from a particular study and impute the variants from the 1000 Genomes Project, filling in many more of the genetic variants. So instead of analyzing, say, the 500,000 variants that were genotyped on the array, one could analyze the two to two and a half million variants present on some of these reference panels. The strategy for doing imputation is that a probabilistic search is performed for mosaics of reference chromosomes that match each individual. So for example, the top chromosome from this individual is represented by this haplotype within the reference panel. The lower chromosome of this study individual is best represented by a mosaic of, say, one portion of a chromosome and another portion someplace else, suggesting that this individual carries portions of these two different haplotypes and that a recombination event has occurred sometime in the past. The genotypes can then be filled in from those phased chromosomes. There are several different approaches to performing imputation, and often the analysis provides some evidence of the likelihood that the filled-in genotype is correct, and so thresholds for quality can be used.
If a variant is part of a haplotype that's been seen many, many times, with exactly that same set of variants present in many copies, one might have a lot of confidence filling in the intervening genotypes; whereas if it's a region with lots of recombination and it's unclear exactly which haplotypes match best, then the filled-in genotypes may have less accuracy, may be less likely to be correct. And so the analysis can choose a threshold and not include genotypes that are imputed with a low likelihood of accuracy. The advantage of doing imputation is that it allows studies done on many different genotyping platforms to be combined together. So here's an example where one array genotyped these particular markers, whereas a different array genotyped these particular markers, and when both sets of data were used to impute markers from the HapMap project, the markers shown in blue were able to be analyzed in both studies. So while the overlap in directly genotyped markers shared between one platform and the other was relatively small, the total number of markers that were able to be analyzed is a much larger set. Imputation doesn't require that the identified variants be perfectly in linkage disequilibrium with the variants that are directly tested; it's a haplotype-based approach, and so it's possible to identify variants that have a different frequency than the variants that were typed. So there are examples, at least in the early stages, where variants were identified to show association only when imputation was done, where none of the markers on the genotyping panel themselves showed association.
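A drastically simplified version of the imputation idea can be sketched as follows. Real methods such as those used with HapMap and 1000 Genomes reference panels search probabilistically over haplotype mosaics and report a per-genotype confidence; this toy sketch, with an invented function name and made-up haplotypes, just brute-forces the single best whole-haplotype pair matching the typed markers and fills in the untyped ones from it.

```python
from itertools import combinations_with_replacement

def impute(ref_haplotypes, observed, n_markers):
    """Toy genotype imputation.

    ref_haplotypes: list of 0/1 allele tuples over all n_markers positions,
                    standing in for a dense reference panel.
    observed: dict {marker_index: genotype 0/1/2} at the typed markers only.

    Finds the (first) pair of reference haplotypes whose allele sums best
    match the observed genotypes at typed markers, then fills in every
    untyped marker from that pair. Unlike real imputation, no mosaics,
    no phasing uncertainty, and no quality score."""
    best_pair, best_score = None, None
    for h1, h2 in combinations_with_replacement(ref_haplotypes, 2):
        score = sum(abs(h1[j] + h2[j] - g) for j, g in observed.items())
        if best_score is None or score < best_score:
            best_pair, best_score = (h1, h2), score
    h1, h2 = best_pair
    return {j: h1[j] + h2[j] for j in range(n_markers) if j not in observed}
```

For example, an individual typed as homozygous for the alternate allele at the first and last of five markers is matched to two copies of the all-ones reference haplotype, so the three intervening genotypes are filled in as 2.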
So in this particular plot, this is a zoomed-in region of a portion of chromosome 9, with some genes shown below, and the minus log 10 p-value for LDL cholesterol levels shown on the y-axis. The dots shown in red are the markers that were directly genotyped on the particular genotyping array, and the dots shown in blue are the ones that were imputed, using the genotypes from the Affymetrix array to impute the variants present in the HapMap sample. And so you can see that none of the red dots showed strong evidence of association in this region; however, at least one of the blue dots gets up into a more significant p-value, showing evidence of association. This is the low-density lipoprotein receptor locus associated with LDL cholesterol, a result that was known prior to this kind of analysis, but it goes to show that imputation can identify variants that were not present on the genotyping panel. Shown here is an example of the structure of a meta-analysis where seven different groups got together. Each one performed their own genome-wide association analysis using a shared analysis plan for what method, what model, and what covariates to use, and then a meta-analysis of those seven studies was performed. The top SNPs, the most strongly associated SNPs from that study, or representative ones of those results, were selected to follow up in additional samples. Some cohorts have genome-wide genotypes available; some do not, but are able to genotype, say, 50 SNPs to follow up on results. In this particular example, around 40 to 60 SNPs were selected, and different groups in these replication cohorts genotyped those variants separately using a different genotyping platform. Then the data from those replication cohorts were analyzed to determine which of the initial variants showed significant evidence of association.
So in this particular example, the genome-wide association analysis was done in around 20,000 individuals, and then some of the top variants were followed up in around 20,000 more individuals. The results of that particular analysis are shown here. There are three genome-wide association plots because three phenotypes were analyzed with that set of data: LDL cholesterol, HDL cholesterol, and triglyceride levels. All phenotypes were measured in the same people, and once the genotype data are available, looking across the full range of phenotypes present is relatively quick. So shown here are three genome-wide association plots and three quantile-quantile plots. Let me zoom in and show a portion of one of these. Here's a portion of the genome-wide association plot; these are often called Manhattan plots, because the tall buildings show up out of the background of shorter buildings. This analysis is not the first round of genome-wide association studies for these traits but a later round, so they show the results on this QQ plot here. The gray line represents the expectation if none of the variants show a significant association, and this is shown now with a 95% confidence interval on that line. Black represents the set of all variants identified for this particular trait, LDL. When removing the variants that were known previously, the blue symbols represent the data being reported in this particular study, and they still showed an excess of significant results; there are still novel signals, novel evidence of association, being identified. When they removed the effects of those variants as well, you can see that there's still a little bit of excess association present, but none of the variants in particular reached the genome-wide significance level.
So meta-analysis is useful, and follow-up and replication of initial association results, especially ones that don't yet reach genome-wide significance levels, can allow for increased power and increased opportunity to identify novel signals associated with a disease or a trait. When performing meta-analysis, however, one has to be concerned about heterogeneity between the studies. One example to demonstrate this: when the Wellcome Trust Case Control Consortium performed a genome-wide association study of type 2 diabetes, they showed strong evidence of association of variants at the FTO locus with type 2 diabetes. However, a couple of other studies doing association analyses of type 2 diabetes at the same time didn't really see evidence of association with FTO at all. It turns out that the Wellcome Trust cases were more obese than the controls in that study, whereas in the other diabetes studies, the case-control selection had been more balanced with respect to body mass index, body size. So the identification of this source of heterogeneity between the studies led to the identification of FTO as a gene that plays a strong role in obesity. Some of that data is shown here. This is a plot showing odds ratios and their 95% confidence intervals. The x-axis is the odds ratio, where 1.0 would mean no increased or decreased risk for the allele of this marker representing the FTO locus. The initial set of Wellcome Trust type 2 diabetes cases showed strong odds for obesity; here are the controls that were used in that analysis, and you can see that the effect on obesity is larger in those type 2 diabetes cases than in those type 2 diabetes controls. That's why it looked like evidence of association with type 2 diabetes at first.
When they went in and collected other sets of cases, other sets of controls, and then, valuably, samples from population-based collections, so not ascertained on disease status, and evaluated the effect of this particular allele, you can see that it consistently shows an increased risk of obesity. This odds ratio is about 1.3, and the confidence interval around it is quite narrow because it's a very large sample size, and so this was the sort of definitive evidence showing that these variants are associated with obesity. Okay, so genome-wide association studies have been performed now for at least 237 traits. These are results cataloged by the NHGRI in a catalog of genome-wide association studies. The slide shows the various chromosomes, with colored dots representing positions of some of these loci, and the most recent summary here is about 1,449 published genome-wide association signals with p-values less than 5 times 10 to the minus 8, representing 237 traits. So many genome-wide association studies have been performed, and many, many loci have been identified where genetic factors are associated with a trait or disease. As would be expected, more loci are found with larger sample sizes. In this recent review, a number of different results are summarized, with the number of cases shown on the x-axis at 1,000, 10,000, and 100,000, and the number of genome-wide association hits, or signals, on the y-axis at 1, 10, and 100, with different symbols representing studies of different sample sizes. Here's a subset of case-control studies that were done for Crohn's disease, and you can see that generally, the larger the sample size, the larger the number of cases, the more genome-wide association hits are identified, showing that many signals exist, that the effects for many of them are relatively modest, and that large sample sizes are needed to identify them.
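The kind of between-study heterogeneity that exposed the FTO story can be quantified with Cochran's Q statistic and the I-squared measure. This is a minimal sketch with an illustrative function name and made-up effect sizes, not the actual Wellcome Trust numbers.

```python
import math

def heterogeneity(betas, ses):
    """Cochran's Q and I^2 for between-study heterogeneity of effect
    estimates (e.g. log odds ratios) in a meta-analysis.

    Q sums the squared, inverse-variance-weighted deviations of each
    study's effect from the pooled effect; under homogeneity it follows
    a chi-square with k-1 degrees of freedom. I^2 is the proportion of
    variation attributable to heterogeneity rather than chance."""
    weights = [1 / (se * se) for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i_squared
```

Two studies reporting the same effect give Q near zero; two precise studies reporting clearly different effects, like FTO appearing strong in one type 2 diabetes scan and absent in others, give a large Q, prompting a look at how the samples were selected.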
So let's look at some examples of the types of results identified in genome-wide association studies. I'm going to look at a few plots of particular loci, zooming in on the genome to particular regions. Here's a portion of chromosome 19; about 400 kilobases are shown on the x-axis, with each of these representing genes in this gene-dense region, and the p-value from the test of association over here has its strongest signal here, with a p-value better than 1 times 10 to the minus 25. This is replicating a known association, one that's been known for a very long time, of a variant at the ApoE locus associated with LDL cholesterol levels. Now, the variant with the strongest result is not itself the variant that has been shown to play a functional role at this locus, but it's inherited in a similar pattern. This example also lets me highlight that, in this particular case, the variant is close enough to a known gene that this gene might be the one highlighted in a report of a genome-wide association study. However, if this were a novel signal, then the decision about what gene label to use in a report might be a little bit arbitrary, might be a little bit driven by what the biology of those underlying genes might be. It's important to know, when reading a genome-wide association study paper, that the gene label assigned is often just the nearest gene to the SNP that happens to be the top signal, and might not be the gene that is contributing to variation at that locus. Also, even though a single gene might be provided in that label, there could be genetic variants affecting more than one gene at a given locus; there could be multiple true causal underlying variants, and they could be affecting different genes at that locus. So interpret with caution. Okay, so some signals that are identified can be novel signals.
In this particular case, the strongest evidence of association was found within an intron of a gene, meaning that, shown down here, these little tiny boxes represent exons, and all of the variants showing the strongest evidence of association are localized within an intron. So perhaps underlying causal variants are not shown on the plot but are in linkage disequilibrium with variants in the plot and could be playing a role in the protein sequence, or perhaps underlying variants are influencing gene expression of this gene or of some other gene nearby. Some novel signals are found at a distance from known protein-coding genes. These are identifying possibly novel biology or possibly novel mechanisms: variants found at a distance from protein-coding genes perhaps are affecting other sequences in the genome, RNA sequences, non-protein-coding genes that may be present, not all of which are annotated in the genome yet, or they could have regulatory effects, acting say as enhancers or repressors of transcription of genes that are hundreds of kilobases away. More and more, multiple signals of association are identified in a given region. This makes sense with what's known about genetic variation and allelic heterogeneity for Mendelian disorders: there's more than one way to influence a gene, there's more than one way to alter a gene, so there's often more than one common variant or signal that can play a role in association at a given locus. Shown here is really the same data shown twice, but colored based on the relationship of the variants to one another. There are really two signals here: one localized quite close to the promoter of this particular gene, and another signal that is independent, independently inherited from the first, located tens of kilobases upstream of this particular locus.
One way to look for independent signals is to include a given single nucleotide polymorphism in a regression analysis, to adjust away the effect of one variant and then see what the results for the other variants in the region are. So in this particular case, if each dot here represents the evidence of association with the trait, and one were to include one of these variants in that test of association, at this locus the signals are independent, and so by including this variant, the evidence of association for any of these other variants would essentially go away and show no evidence of association. However, the association of these other variants remains unaffected by that other signal. So this is really strong evidence of independent signals influencing association. Now, there may be more variants that are not necessarily independent of each other; there could be two causal functional variants that share some haplotypes but not all haplotypes with each other, and so when going into the functional biology, trying to figure out what the mechanisms are, what the underlying variants are, it's not just the independent signals but all the multiple signals that might be present that can help indicate how these DNA variants are leading to changes in gene expression or function leading to disease. Here's evidence of association showing that you can obtain different results in different populations, and that populations that are older, that have more recombination events in their history and therefore narrower regions of linkage disequilibrium, can provide greater resolution of the signal, a narrower region of association than in other populations.
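The conditional analysis just described can be sketched for a quantitative trait. In practice one fits a joint regression with both SNPs (plus covariates) as predictors; the sketch below, with invented function names and toy genotypes, uses the simpler residualization approach, which gives the same answer when the two SNPs are uncorrelated and is only an approximation otherwise.

```python
def simple_ols(x, y):
    """Intercept and slope of y regressed on x (ordinary least squares)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def conditional_slope(trait, snp_adjust, snp_test):
    """Adjust the trait for one SNP's allele count, then test whether a
    second SNP still shows an effect on the residuals. A surviving slope
    suggests an independent signal; a slope near zero suggests the second
    SNP was tagging the same signal as the first."""
    intercept, slope = simple_ols(snp_adjust, trait)
    residuals = [t - (intercept + slope * g)
                 for t, g in zip(trait, snp_adjust)]
    return simple_ols(snp_test, residuals)[1]
```

With a toy trait built from two uncorrelated SNPs, conditioning on the first leaves the second SNP's slope intact (an independent signal), while conditioning a SNP on itself drives its slope to zero, mirroring the two outcomes described above.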
So shown here is some evidence of association with height for a set of variants across a region, and shown below are the pairwise linkage disequilibrium plots for sets of variants in this region from the CEU HapMap sample and the YRI HapMap sample. You can see that this evidence of association, which is from samples of European ancestry, extends sort of across this region, and that there's a relatively wide linkage disequilibrium block here; whereas in the YRI samples there are narrower sets: these variants are more inherited together, and these are more inherited together, but these and these show less association with each other. The signal from the Caucasian sample was quite broad. The signal in African American individuals was strong in this region but not in this region, suggesting that the more likely location of a potentially functional underlying variant was restricted to down in this region and not in this one. In this particular case, the variant that showed the stronger association in African Americans was also one that had been shown previously to have an effect on gene expression of one of the nearby genes, perhaps providing some support for it having a functional role. The more genome-wide association studies are done across a range of traits, the more the same variants and the same genes are being identified as associated with two or more traits. Sometimes signals are identified as associated with traits where one can recognize what the underlying mechanism might be. Sometimes the relationship among the different diseases or traits that show evidence of association helps provide some biological clues as to what the pathways might be that are responsible for a particular trait.
So there are variants being identified, for example, for both diabetes and cancer, and in at least one case, the same DNA variant was associated with increased risk of prostate cancer and decreased risk of type 2 diabetes. Examples like this perhaps suggest a role for cell cycle genes, and show that variants can end up having different sorts of effects. Looking at the collections of traits and associations might help us understand what the driving biology underlying a signal is, and which association is coming sort of as a result of that initial trait. In this analysis of genome-wide association signals, the authors took the set of SNPs that had shown evidence of association with a trait or disease and looked at annotation classes of where those variants were found in the genome, classes such as nonsynonymous sites, regions around promoters, regions in introns, and regions that are intergenic. They compared randomly selected sets of variants on genome-wide association panels to those that showed evidence of association, looking to see whether there is an excess of disease-associated variants in particular classes. In this particular analysis, here's the odds ratio of 1, so anything crossing an odds ratio of 1 is not significant at the 5% level; but these classes here, nonsynonymous variants and promoter regions at sort of 1-kb and 5-kb definitions, all showed that the trait-associated SNPs were overrepresented compared to just random variants on the genome-wide arrays. Even though many variants identified in introns and intergenic regions show evidence of association, there are also many more variants on the arrays that have those characteristics. So, taken together, genome-wide associated variants are being identified that explain some of the population variation for various traits.
Shown here is a subset of traits, a partial table from a recent review. It shows a set of traits and the heritability expected from pedigree studies for those traits. Some traits are more highly heritable than others, and the table shows, in comparison, the genome-wide association hits, the ones defined at genome-wide significance, and what proportion of this heritability they explain. In many cases, approximately 10% of the heritability is explained by the genome-wide association hits. Now, analyses are being done to evaluate what the effect of all common SNPs might be, not just the ones that have reached the threshold of significance but also the ones that maybe have not quite reached it yet, that with greater sample size and more power might reach it in the future, to estimate the heritability attributable to all SNPs being analyzed. You can see, for example, that the heritability that may be attributed to such common SNPs could increase a fair bit. It's still not likely to represent all of the variation that may be present, since genome-wide association studies are largely restricted to some of the common variants, and so this suggests that there are other genetic factors playing a role in heritability. The use of this information to predict disease is really dependent on the disease and the heritability; and I should also say that in this particular case with type 1 diabetes, they included some variants known prior to the GWAS era that had very strong effects when calculating that heritability number. One way that people are characterizing individuals is based on the number of risk alleles that they have. You can see some evidence of differences between groups of individuals: so while the variants might not be very predictive for a given person, one can count them up.
So in this particular case, there were more than eight SNPs available that had shown evidence of association. For each individual, they counted up how many height-increasing alleles that person had, and then grouped them. So here's a block of individuals that had fewer than or equal to eight height-increasing alleles, with their average height plotted, compared to the individuals over in these other regions that had at least 16 height-increasing alleles, with their average height plotted. Between the individuals with the lowest and the highest numbers of height-increasing alleles, there is a few centimeters' difference in how tall they are. However, most individuals fall in the middle of this plot: these are common SNPs, and the individual predictability of the variants is relatively low. The value in clinical translation of these genome-wide association studies, then, largely starts with the novel biological insights. The hundreds, more than a thousand, signals identified in the past few years provide hundreds and thousands of novel biological signals to go investigate and evaluate, to determine what role those variants and those genes play in disease, which would then in time lead to clinical advances, particular drugs, or biomarkers that represent the disease better, potentially leading towards prevention. There may be some improved measures of individual genetic risk, and I think you'll learn more about those, especially with respect to drug development and drug response, next week. So, in summary: when performing genome-wide association studies, or interpreting them, it's important to pay attention to the design and quality control. Large sample sizes are needed to identify signals with modest effects.
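The allele-counting exercise above is straightforward to sketch: sum the risk (here, height-increasing) alleles per person, then bin individuals by score and average the trait within bins. The function names, genotypes, and heights below are all made up for illustration; real scores often weight each allele by its estimated effect size rather than counting equally.

```python
from collections import defaultdict

def risk_allele_scores(genotypes):
    """Per-individual sum of risk-allele counts.

    genotypes: one row per individual, each entry the number of risk
    alleles (0, 1, or 2) carried at one SNP."""
    return [sum(g) for g in genotypes]

def mean_trait_by_score(scores, traits, bins):
    """Group individuals by score bins [(lo, hi), ...] and average the
    trait in each bin, mirroring the height plot described above
    (e.g. a <=8-allele group versus a >=16-allele group)."""
    groups = defaultdict(list)
    for s, t in zip(scores, traits):
        for lo, hi in bins:
            if lo <= s <= hi:
                groups[(lo, hi)].append(t)
    return {b: sum(v) / len(v) for b, v in groups.items()}
```

With toy data, three individuals carrying 1, 3, and 6 risk alleles and heights around 170 cm separate into low- and high-score groups whose mean heights differ by a few centimeters, which is the shape of the result described, even though most real individuals sit in the undistinguished middle.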
There are more than 1,400 signals and counting across the genome-wide association studies done to date, and finding a signal doesn't immediately provide information on the underlying biology or clinical utility, but it sets off lots of follow-up analyses that can lead to those discoveries. The time to changes in medical care based on some of these results might be years, but the biology is really advancing quickly. As we progress with genome-wide association studies, more and more loci are being identified. Larger meta-analyses are being done; groups are gathering together more and more sets of samples. There is deeper follow-up of genome-wide association signals, so groups are creating custom arrays of not just 50 variants but thousands of variants to follow up, to identify additional signals. Population-specific panels are being developed to increase the range of genetic variants that can be analyzed in a given study. More diverse populations are being used to identify variants. Other types of sequence variants, not just single nucleotide variants, are being incorporated. These studies are being done with multiple traits, looking at the relationships between those traits, and they are beginning to allow gene-gene and gene-environment interactions to be evaluated. And finally, the data are generating evidence and spawning much future analysis to figure out the molecular and biological mechanisms underlying the signals. So thank you very much for your attention.