Our lecture today is devoted to genome-wide association studies, and as you already know, these are the kinds of studies that help us separate genetic variations that are biologically insignificant from those that do produce a change that ultimately might be either detrimental or advantageous to an individual. And the study of these variations is also critical to identifying what genes are responsible for a particular genetic or genomic disorder, as we learned in last week's lecture by Lynn Jorde. There's also a much more practical reason to study these genetic variations, particularly the single nucleotide polymorphisms that give rise to the subtle differences between each and every one of us in this hall, since a thorough understanding of these variations might provide some sort of a key way of knowing in advance how someone might respond to a particular drug or treatment regimen. So for today's lecture, I'm very pleased to introduce you to Dr. Karen Mohlke, who will be presenting today's lecture on genome-wide association studies. Dr. Mohlke is an NHGRI alumna, having done her postdoctoral work in Francis Collins's laboratory, where she used genome-wide approaches to localize diabetes susceptibility genes. She is currently an assistant professor in the Department of Genetics at the University of North Carolina and a member of the Carolina Center for Genome Sciences. Her lab is actively studying complex traits with complex inheritance patterns, using many of the approaches that she will be describing to you in today's lecture. So please join me in welcoming Dr. Mohlke back to the NIH campus this morning.
Thank you. Thank you very much. I really enjoy the opportunity to come back to NHGRI and to the NIH and to see a lot of old friends. So today, I am going to talk about genome-wide association studies and their applications and uses.
And primarily, I'm focused on genome-wide associations as they apply to understanding the variants that lead to complex traits. And by this, I mean the traits that are the result of many genetic factors, as well as environmental factors, and perhaps their interactions, for which not a single, perhaps not a single gene is responsible for the trait, but the combination of those. This means that some of the factors that are involved have a relatively subtle effect, and so the identification of these genetic factors can be achieved using these genome-wide association studies. When we think about the role of the type of variants, the type of genetic factors that are going to be able to be identified by genome-wide association studies, we consider the difference between common and rare DNA variants. And to illustrate that, I show you here some examples of a stretch of chromosome identified in an individual. When we look at these across many different individuals, you can see that most of the nucleotides are identical. However, there are some positions where there's a single nucleotide polymorphism, for example here at T, where in another copy of that chromosome there's an A. This is a relatively common single nucleotide polymorphism. There are both T's and A's present in a countable number of times. Some DNA variants are more rare. For example, this variant here, there's a C nucleotide present in most copies of that chromosome and only much less frequently is an alternate allele, for example, a G allele present. When examining the genetic architecture, the genes that influence traits and diseases, different strategies have different potential power to identify the underlying genetic variants. So the common and rare variants that I just described are shown here sort of on the X axis with the allele frequency shown here. 
An allele frequency greater than about 5% is considered a common allele, slightly lower frequency variants fall in between, and variants observed with a frequency of .005 or less are down in the rare category. The frequency of the variants under consideration compared to the effect size helps us determine what kind of a strategy is going to be most powerful. That is, the stronger the effect of the DNA variant on causing disease, the higher it would be on this axis. So a single gene disorder, where a single variant is a definite cause of the disease, would be up here. So rare alleles causing a Mendelian disease are present here on the graph, whereas the common variants that are able to be detected by the methods we're going to talk about today, by genome-wide association studies, are shown here. So there are common disease variants that we are not going to be able to detect using these methods. And there are other rare variants that are also going to be difficult to discover through this strategy. Okay, so as we talk about genome-wide association studies today, we're going to talk about what the goal of these studies is, and then go through how the studies are performed, then discuss what we can learn from the associated regions and look through some examples of the type of data that comes out of the association scan, and then talk a little about what the findings can tell us about disease. So the goal of genome-wide association studies is to identify these genetic factors that underlie diseases or common traits. Now the benefits of this kind of strategy are relative to classical mapping strategies. By classical mapping I mean linkage analysis strategies, often genome-wide linkage analysis strategies, or an association study with a candidate gene, a particular gene identified ahead of time that is thought to play a role in disease, where one examines just the DNA variation in that specific gene.
So the benefits of doing a genome-wide association study compared to those techniques are that, first, it's more powerful than linkage to identify the common low-penetrance variants. These are variants that are inherited both by individuals affected and by individuals not affected by the disease, that's the low penetrance, and association is more powerful than linkage strategies for finding them. They also provide better resolution than linkage strategies. When a peak is identified from a genome-wide association study, the peak is likely closer to the underlying gene or genetic variant, or defines that region to a much smaller portion of the genome, than a linkage study does. And finally there's no need to select candidate genes. This means that this can be an unbiased approach; you don't have to know what the molecular functions of the genes are and what their roles are in the disease or trait prior to beginning the study. This means that the strategy allows us to identify completely novel mechanisms that may be playing a role. Now, the requirements to do genome-wide association studies. These studies have been around for about four or five years or so, and the things that were needed to make them possible include, first, a catalog of the human genetic variants. So after the human genome was sequenced, strategies to identify the DNA variants across the genome were implemented, and that catalog was important. Methods to allow low-cost and accurate genotyping of many DNA variants together were needed, and those technologies have developed. We needed large numbers of informative samples, and the numbers of samples that are needed seem to keep increasing in order to identify additional variants, and so these study samples need to be collected. And finally we needed efficient statistical design and analysis tools to deal with the very large number of statistical tests being performed in these types of studies. So why do a genome-wide association study?
The goal here is to test a large portion of the common single nucleotide genetic variation in the genome for association with either a disease or variation in a quantitative trait. As I described, the focus on the common variants is really a result of the tools that are available right now to do these analyses. In the future genome-wide association studies will be possible for less and less common variants. And the true goal is to find the disease or quantitative trait-related variants without knowing ahead of time what the genes do and how the variants might function. So we're going to talk through several of the steps involved in performing a genome-wide association study, from the start: collecting the samples, performing the actual genotyping of the DNA samples, performing quality control of the resulting genetic data, the statistical analysis of the genotypes that result, and the important steps of replication and follow-up of the initial results of the genome-wide association scan. So first, think about the phenotype that can be studied in a genome-wide association study. For a disease, we could do a study that is a case-control design, looking at individuals that are affected with a particular disease and comparing them to individuals unaffected with that disease. The disease might be relatively rare or it might be much more common. It could be something that affects only hundreds of individuals, or it could be type 2 diabetes, for example, affecting 6 to 10 percent of the population. Genome-wide association studies also work for quantitative traits where there's a continuous distribution of the trait values. For example, looking for genes that influence weight or body mass index or height. Quantitative traits can also be more focused and specific, such as coronary artery thickness. And genome-wide association studies can be performed using outcomes such as gene expression levels obtained from individual samples.
So both qualitative and quantitative traits are amenable to genome-wide association studies. When thinking about trying to identify the variants that influence risk of disease, one strategy is the case-control design, where one specifically tries to ascertain cases affected with disease and specifically tries to collect controls. An alternate strategy is to use a population cohort, where individuals representative of the entire population are ascertained and then those who are found to be affected with the disease are considered the cases whereas the others are considered the controls. The population cohort approach has the limitation that fewer individuals are going to be cases: if only 10% of the population has the disease, then only 10% of the individuals collected fit that category. However, they may be a little bit more representative of the case population. Most case-control studies that have been performed to date have ascertained the cases separately from the controls, so an important aspect to consider when looking at the results of such studies is how a case was defined and how a control was defined, because that can influence interpretation of the results. Criteria can be used to increase the potential that genetic variants are going to be identified by selecting cases that are more likely to be harboring the genetic variants. So strategies to do this include identifying more severely affected individuals, who have maybe more complications of the disease. We could require that other family members are affected with the disease, so choosing cases from families where they have a family history increases the chance that they have inherited genetic variants that are influencing disease, as opposed to getting disease entirely from environmental factors.
Another choice could be to choose individuals that became affected with disease earlier on. If it's an older age of onset disorder then individuals affected earlier may have a greater genetic load and therefore include more genetic variants able to be found. Strategies can also be used to try and increase the opportunity to find the genetic variants through the selection of the controls in an attempt to try and identify individuals that have low risk of disease rather than population based samples. So if we were to choose just members of the population as a whole to be controls for something like type 2 diabetes then some fraction of the individuals may be affected with type 2 diabetes and not yet know it or they might become affected with type 2 diabetes in 3 months and be carrying the genetic factors that might cause them to be more likely to be a case but we would be considering them a control at that point. In an attempt to try and have the study be as well-powered as possible, individuals that have the same ancestry background as the cases are important. Individuals can also be matched to the cases based on age, sex, and demographics to try and have the cases and controls be as similar as possible to each other except for the genetic factors that would lead them to develop disease. An important aspect of looking at the individuals and the ensuing analysis is that cases and controls in the end need to be matched for ancestry. That's indicated here by few different types of symbols that represent different ancestral backgrounds. If the ratio of those different, if the contribution of those different ancestral backgrounds is different between the cases and controls then any genetic factor that influences or that is represented differently in those ancestral backgrounds could appear to be associated with disease. And therefore come out as a false positive association. 
Samples that have unmatched ancestry might not be identified, might not be known to investigators prior to beginning the study. The data that comes out, the genotype data obtained from the individuals can be used subsequently after genotyping to detect that ancestry might be mismatched between cases and controls and to try to address that. So these are the mismatched ancestry, a little bit more globally. We could think of this as population stratification, meaning that there are subpopulations within the cases and controls and if those subpopulations are of different frequency that that can influence the results. Another similar issue is cryptic relatedness. If the individuals that are ascertained as cases are thought to be independent and unrelated to one another but in fact are actually relatives of one another then the assumptions of the statistical tests, the assumptions of independence that are used are then violated and this also can lead to a false positive evidence of association. And so one of the steps that's performed on the genotyping data is to try to identify whether individuals are more related than expected. One can then account for or try to avoid these population issues. One strategy is to use an average measure to adjust the results of the association study. This is a genomic control parameter that can be applied to the resulting data. One could also try to identify the principal components of population substructure from the data, from the genotype data and then use those principal components as factors that are included and adjusted for in the analysis. Another strategy to avoid population stratification would be to use family-based designs where trios of individuals, parents and an affected child are used and the entire study is performed within those trios. 
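The genomic control adjustment mentioned a moment ago can be sketched in a few lines. This is only an illustration under stated assumptions: each SNP is assumed to have a chi-square association statistic with 1 degree of freedom, the inflation factor is taken as the observed median divided by the median expected under no association, and the function names and example values are hypothetical rather than taken from any particular study or software package.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(chi2_stats):
    """Inflation factor: observed median statistic over the expected null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def adjust_for_inflation(chi2_stats):
    """Divide every statistic by lambda; a lambda below 1 is treated as 1."""
    lam = max(1.0, genomic_control_lambda(chi2_stats))
    return [stat / lam for stat in chi2_stats]


# Hypothetical per-SNP statistics, for illustration only.
example_stats = [0.1, 0.5, 0.9098728, 2.0, 12.0]
lam = genomic_control_lambda(example_stats)
adjusted = adjust_for_inflation(example_stats)
```

A lambda close to 1 suggests little inflation from stratification or relatedness; in this toy example the median statistic is twice its expected value, so every statistic is halved.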
Now a trio design is a little bit less favored in many cases, just due to the cost and the relatively reduced power given the number of individuals that need to be genotyped, but it does alleviate any issues of population stratification. Okay, so once the individuals have been collected, the samples have been ascertained, the next step is to perform the genotyping on the DNA from those individuals. Over the development of genome-wide association studies, essentially this has ended up meaning using a standard panel of markers available from companies. There are a couple of companies marketing these fixed content panels that are used the most frequently. These fixed content panels will have 10,000 to a million or more single nucleotide polymorphisms present, and so essentially your choice when performing a study like this is to determine which of these fixed content panels to use. Choosing a fixed content panel is much less expensive than going out and deciding which particular DNA variants to test, given the large number needed. So two companies that market these tests are Affymetrix and Illumina. The strategies that they use to determine which variants are present on these fixed content panels include, first, looking at random SNPs located across the genome. Second, to use haplotype tag SNPs, as Lynn Jorde mentioned last time and as I'll show in a moment. This is an attempt to select the variants that are most representative of the largest number of regions of variation in the genome most efficiently. These panels also include probes that do not contain single nucleotide polymorphisms but can be used to determine copy number through the intensity of the signal that results from just looking for presence or absence of a particular probe. This is a description of what selecting haplotype tag SNPs means.
This is a strategy that greatly reduces the number of variants that need to be tested on a chip and is really one of the strategies that made it possible to do genome-wide association studies that look at a large proportion of the common variation in the genome with a relatively limited number of SNPs, in that 300,000 to a million range. Shown here are four examples of a particular stretch of chromosome, and you can see there are three positions of SNPs that are present. If all of these SNPs and those off to the sides are brought together and shown as haplotypes, what we find in human populations is that there are a relatively limited number of haplotypes compared to the possible number that could be present given two variants at each of a number of different SNPs. So in this set of haplotypes there are many different SNPs, but the entire information to determine whether an individual has inherited haplotypes one through four can be captured by genotyping just three of these SNPs. So these three that are shown here are called tag SNPs, and there are many alternate possibilities on this slide, different SNPs that could tag the same haplotypes. By choosing this relatively limited set of variants we can represent a much greater proportion of the total human genetic variation without typing every one of those SNPs. I'm going to talk briefly through the two main strategies available from Illumina and Affymetrix and what their method of allelic discrimination is, that is, determining which alleles are present at a given SNP, whether a particular DNA sample is homozygous for one or the other allele or heterozygous. So Illumina uses a strategy that includes capture of the DNA on bead arrays and a small enzymatic allelic extension reaction, like a mini sequencing reaction. They've gone through a few different phases of technologies in precisely how that mini sequencing reaction happens.
And the most recent technology is a single base extension where a probe aligns to the stretch of DNA, so here's the stretch of DNA from the individual. A probe is designed that hybridizes and is complementary to a stretch of sequence that ends right before the variant position. And the probe can then either be filled in with a T nucleotide here, or another probe would have a G nucleotide available. That mini sequencing reaction, the incorporation of the specific allele, then allows that nucleotide to be detected through staining, through a label, to determine whether an individual has one allele or the other or both. Shown here are two of the versions of this Infinium technology. This is the Infinium I, the Infinium II, and now they have an Infinium HD assay. Shown here, if a T allele is present that would hybridize perfectly, this would be an incorporated allele. The G allele would not be incorporated because it does not match to the A. The single base extension reaction is shown down here. Because there are two colors available, not all SNPs can be represented with the same pair of labels for the alleles, and so some SNPs are represented by a single bead type and some SNPs are represented by two different bead types in the way that the assays are performed. In the Affymetrix strategy, the allele discrimination is not due to that single base extension, it's due to hybridization. The general strategy used there is that the genomic DNA is digested with a pair of restriction enzymes, actually, and then adapters are ligated onto the ends of the fragments. A single PCR primer can amplify across those fragments to produce a larger amount of total product compared to the initial genomic DNA that was used.
Then these PCR products are fragmented and labeled, and then these labeled products are hybridized to a chip, and the hybridization of the alleles is set up with specific probes that allow the two alleles to be discriminated from one another in the homozygous or the heterozygous state. Multiple copies of the probes are present on the GeneChip probe array to allow the most specific hybridization patterns to be detected and to allow there to be some redundancy of the signal detection. This increases the opportunity for more accurate calling of those alleles. In choosing which strategy and which panel to use, not only the technique, the method of allelic discrimination, but also the content of the panels can influence the results in terms of the proportion of the genome that is in the end assayed by this genome-wide strategy. Shown here are a set of different products that are available from Affymetrix and from Illumina. These are the names of those products, and the proportion of common variation that is either present on the chip or tagged through that haplotype tagging strategy on the chip is shown by this global coverage statistic. That global coverage statistic is different for samples of different ancestral populations because of the strategies used by the companies to select which variants would be present on the chip. For example, if you were performing a study of individuals with ancestry from northern or western Europe, that would perhaps be most similar to the CEU HapMap population, and then the HumanHap300 product would cover or be evaluating perhaps 77% of the common variation, whereas the SNP Array 5.0 might be covering 64% of the variation. This is a little bit independent of, or not completely dependent on, the number of variants present on the chip. It's important to think of not only the number of variants being analyzed but what their coverage is.
You can see that the products are a little bit different if you were deciding between a pair of them, say between these two. For example, in the CEU population this one has better coverage than this product, whereas in the YRI population this product has better coverage than this one. These statistics are for global coverage, so the average across the entire genome, the proportion of the common variation across the entire genome that is captured by that product. The local genomic coverage varies, however, so if there is a particular gene that you are particularly interested in testing for evidence of association, it's possible that it's not captured at all by these products. This shows a stretch of chromosome 17 along the X axis and the local coverage calculated in 1 megabase windows moved over in 200 kilobase steps. There are different colored curves that represent the different SNP chips that were evaluated. You can see that all of them vary in their coverage, and there are some regions that are very poorly covered by many of the products, whereas other regions are better covered. So just because it's a genome-wide association chip does not mean that all possible genetic variation is being analyzed, even amongst this common variation. There are regions that are better covered and worse covered by the different chips. So once that genotype data is obtained, so the DNA samples from all of those cases and controls, or members of that population cohort, have been analyzed by one of those genotyping products, it's important to perform stringent quality control. Failure to perform the quality control will likely lead to false positive associations due to incorrect genotypes or poor quality samples, which can influence the results really quite dramatically. So many steps are used to identify the potential problems and remove them. So for example, identifying poor quality samples.
Samples for which the success rate of SNPs is less than 95 percent, a bit of an arbitrary threshold, can indicate that the genotypes that are present are actually less likely to be correct. So setting a relatively stringent threshold allows poor quality genotypes to be removed. These might be samples that have some protein contamination or other qualities that make the genotyping less accurate. Samples can also be identified that have excess heterozygous genotypes. This can indicate that the DNA sample is perhaps contaminated with another DNA sample, so that there are more positions where nucleotides appear to be heterozygous, perhaps because they come from the two different original DNA samples. Looking across large numbers of SNPs, the fraction of heterozygous genotypes can be calculated and can detect sample contamination. These are large studies being performed in lots of samples, and there are lots of sample handling steps that are performed in the laboratory, or even in the steps of collecting the blood or tissue sample from the individual. And so one way of detecting whether any sample swaps or sample switches have occurred is to, for example, look for incorrect sex, where the sex determined by the X and Y chromosome markers does not match the records from the clinical ascertainment. The genotype data can also be used to identify unexpected relatives present in the sample, as I referred to earlier when thinking about cryptic relatedness. So one way to do this is to take all of the genotype data from all of the individuals and do pairwise comparisons of genotype similarity. So for example, if you found that two samples had entirely identical genotypes or almost entirely identical genotypes, well, that might be two DNA samples from the same person or it might be identical twins present in the data set, and those two samples with the same genotypes would violate the assumption of independence that the statistical tests require.
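The sample-level checks just described (genotyping success rate, excess heterozygosity, and pairwise genotype similarity) can be sketched as follows. This is a minimal illustration, not a production pipeline: genotypes are assumed to be coded as 0, 1, or 2 copies of an arbitrary allele with `None` for a failed call, and the function names, sample names, and toy data are all hypothetical. Only the 95 percent threshold comes from the discussion above.

```python
def call_rate(genotypes):
    """Fraction of SNPs with a successful genotype call for one sample."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def heterozygosity(genotypes):
    """Fraction of called genotypes that are heterozygous (coded 1)."""
    called = [g for g in genotypes if g is not None]
    return sum(1 for g in called if g == 1) / len(called)


def concordance(sample_a, sample_b):
    """Fraction of SNPs called in both samples where the genotypes are identical."""
    pairs = [(a, b) for a, b in zip(sample_a, sample_b)
             if a is not None and b is not None]
    return sum(1 for a, b in pairs if a == b) / len(pairs)


# Toy data, for illustration only.
samples = {
    "s1": [0, 1, 2, 1, 0, 2, 1, 0, 0, 1],
    "s2": [0, 1, 2, 1, 0, 2, 1, 0, 0, 1],        # same genotypes as s1
    "s3": [1, 1, 1, 1, 2, 1, 1, 1, 0, 1],        # mostly heterozygous
    "s4": [0, None, 2, None, 0, 2, 1, 0, 0, 1],  # two failed calls
}

# Samples below the 95 percent success-rate threshold.
low_call_samples = sorted(name for name, g in samples.items()
                          if call_rate(g) < 0.95)

# Pairs of samples whose genotypes are nearly identical (duplicates or twins).
names = sorted(samples)
near_identical = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
                  if concordance(samples[a], samples[b]) > 0.99]
```

In a real scan these comparisons would run over hundreds of thousands of SNPs, and the heterozygosity of each sample would be compared against the distribution across all samples rather than against a fixed cutoff.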
The genotypes then also can be used to look for those measures of population stratification, to identify individuals that may have ancestry different from the rest of the sample. So in that goal of being able to identify when substructure is present, because that substructure, especially in case-control samples, can lead to false positive results, characterizing the likely ancestry of the individuals can help. So in addition to identifying and removing bad samples, the SNPs also need to be cleaned; the poor quality SNPs need to be recognized. So shown here is a cartoon of what the raw genotyping data for a few SNPs could look like. So shown on the X axis here is the signal intensity for an arbitrary allele, say the C allele, and on the Y axis the signal intensity for the other arbitrary allele, let's call this the A allele. So looking at a particular SNP and looking at a couple hundred individuals, this is a plot that could be observed. So there are some samples where the signal intensity from the C allele is strong and the signal intensity from the A allele is relatively low. These individuals would be the CC genotype. For some individuals the C intensity is low and the A intensity is high, these would be the AA genotype individuals, and for some there are intermediate levels of both the C and the A intensity. So these would be the heterozygous CA individuals. So software exists to go through the raw intensity plots and to try to identify and assign the genotype labels to the individual samples from that assay. And so here the CC, the CA, and the AA genotypes are correctly assigned. Here's an example where the genotypes nicely cluster into three different categories, but the software has inappropriately called these two clusters as the same heterozygous genotype.
And so this could lead to incorrect results and potentially a false positive or a false negative association, and could be corrected if it could be detected that these two clusters are different. Manual review of genotype clusters when there are a million could take a very long time. So trying to identify the characteristics that the software uses, and evaluating which SNPs are most likely to have incorrect genotypes, would help in identifying genotyping errors such as this. Another category of poor quality genotypes is shown here. In this example the clusters of the genotypes overlap some, and it's difficult for the software or even for a human to decide, for samples that are in this intermediate range, whether they should be called as the CC genotype or as the CA genotype. And in practice what's likely to happen is that the genotypes would be removed for these particular individuals that are in the intervening space there. Now this can lead to some bias, because in this particular case all of the individuals being removed are likely either homozygous or heterozygous for this particular allele. So we are removing a specific subset of the alleles, and if that subset differs, say, between the cases and controls, this can lead to differences in the association statistic. So certainly the highest quality data is desired, and in these kinds of examples you need to make the decision whether to remove a few genotypes and think that most of the remaining genotypes are correct, or whether to drop that SNP completely. Other quality control statistics help identify and remove bad SNPs. So a genotyping success rate of less than 95% would identify cases like the one that I just showed, or would say that perhaps the assay to detect that SNP is not as robust and, when it shows a genotype, is potentially not showing a correct genotype.
Often duplicate samples are included in the genotyping assay so that their genotypes can be compared, and SNPs for which the duplicates show mismatches can be removed. The observed proportions of genotypes can be compared with the proportions expected from the allele frequencies. This is a test of Hardy-Weinberg equilibrium, to determine if the SNP assay is perhaps incorrectly calling homozygotes or heterozygotes. If trio samples are present, mom, dad, and a child, then the inheritance of the alleles in the child from the parents can be checked, and errors in that inheritance can suggest that that SNP is being incorrectly called. Another check is differential missingness between the cases and the controls. If a SNP is less well genotyped in the cases than in the controls, it could potentially indicate that there is some other underlying variant that is playing a role, and could lead to incorrect results. Okay, so now the good quality SNPs and the good quality samples have been identified. That data then is used for tests of association. I show here the case-control test of association, but of course quantitative traits can be tested for association as well. Here in the simplest example we look at the number of individuals with the three different genotypes at a given SNP, the number present in the cases, the number present in the controls, and test for a trend of the presence of those genotypes being different between those two groups. This analysis can be performed, say, by a test for trend, or by logistic regression, which can include other covariates that may influence the outcome. So if age and sex influence the disease of interest, then those can be adjusted for in the analysis. Other genetic models can be tested. Most often studies are performed looking for an additive effect of a risk variant.
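The case-control test for trend described above can be sketched as follows. This is an illustration under assumptions: the Cochran-Armitage form of the trend test is assumed (the lecture does not name a specific test), the genotype weights 0, 1, 2 encode the additive model, no covariates are included, and the function name and the genotype counts are hypothetical.

```python
import math


def trend_test(case_counts, control_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on genotype counts (e.g. AA, AC, CC).

    Returns the 1-degree-of-freedom chi-square statistic and its p-value.
    """
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    n_cases = sum(case_counts)
    n_controls = sum(control_counts)
    n = n_cases + n_controls

    sum_t_cases = sum(t * r for t, r in zip(weights, case_counts))
    sum_t_all = sum(t * c for t, c in zip(weights, totals))
    sum_t2_all = sum(t * t * c for t, c in zip(weights, totals))

    numerator = n * (n * sum_t_cases - n_cases * sum_t_all) ** 2
    denominator = n_cases * n_controls * (n * sum_t2_all - sum_t_all ** 2)
    chi2 = numerator / denominator

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts (AA, AC, CC), for illustration only.
chi2, p_value = trend_test(case_counts=[30, 50, 20], control_counts=[50, 40, 10])
```

Changing the weights to (0, 1, 1) or (0, 0, 1) would correspond to dominant or recessive coding of the C allele; adjusting for covariates such as age and sex would need a regression model instead of this count-based test.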
Dominant and recessive tests can be performed as well although that's now increasing the number of tests being performed and they're not completely independent of each other and that larger number of tests would need to be accounted for in determining what the most significant results are found. When thinking about a case control design in addition to determining what the p-value, what the statistical evidence of association is, the strength of that association needs to be quantitated. An odds ratio is a measure of effect that's often used when thinking about the qualitative traits. So shown here is an example of calculating the odds ratio that the measure of effect of that allele on the risk of developing disease. The alleles present are counted. How many A alleles are present, how many C alleles are present in the cases and compared to the controls and then the odds of having a C allele amongst the cases and the odds of having a C allele amongst the controls are compared to one another and so that ratio of those odds can be calculated from the numbers value determined here. So an odds ratio of one means that there's no effect of that particular SNP on disease. An odds ratio of greater than one means that there's an increased effect and if a confidence interval is built around that odds ratio then that can be used to determine whether that increases significant or not. So an odds ratio say the 95% confidence interval around that odds ratio can determine whether that increases significant at the .05 level. Many SNPs are being tested for association, often 300,000 or a million or more, so it's important to correct for multiple tests. One way that this is often done is to set the threshold for p-value significance high so that just the evidence of association that would be found by chance is not considered to be a likely influential finding. Now when the effect sizes are relatively modest then larger sample sizes are needed. 
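The odds ratio calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the allele counts are made up, the function name is hypothetical, and the confidence interval uses the usual normal approximation on the log odds ratio.

```python
import math


def allelic_odds_ratio(case_c, case_a, control_c, control_a):
    """Odds ratio for the C allele from allele counts in cases and controls,
    with an approximate 95% confidence interval."""
    odds_cases = case_c / case_a            # odds of C among cases
    odds_controls = control_c / control_a   # odds of C among controls
    odds_ratio = odds_cases / odds_controls
    # Standard error of log(OR) from the four allele counts
    se = math.sqrt(1 / case_c + 1 / case_a + 1 / control_c + 1 / control_a)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se)
    return odds_ratio, lower, upper


# Hypothetical counts: 600 C and 400 A alleles in cases, 500 and 500 in controls.
odds_ratio, lower, upper = allelic_odds_ratio(600, 400, 500, 500)
print(odds_ratio, lower, upper)
```

If the interval from `lower` to `upper` does not include 1, the increase would be called significant at the .05 level, which is the reading of the confidence interval described in the lecture.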
So either there needs to be a big effect, a large effect of a particular variant that would, say, increase the odds ratio dramatically, or there need to be a lot of samples to allow that significant result to be identified. Often in genome-wide association studies of European populations there's sort of a guideline that around a million SNPs are being evaluated, and so a standard .05 p-value, where results would be seen by chance one in 20 times, adjusted for the number of tests being performed, sets this threshold of approximately 5 times 10 to the minus 8th as a p-value threshold to try and achieve. Other populations that have more genetic variation would effectively be looking at more tests, and so the threshold would need to be more stringent. I'll show you some example data now from a genome-wide association study to explain that effective p-value. This was, from several years ago, the first analysis that the FUSION study did looking for evidence of association with type 2 diabetes. So now the genomic position of the SNPs is on the x-axis, chromosomes 1 through 22, and the evidence of association, the p-value from the statistical test, is shown on the y-axis. This is on the minus log 10 scale, so that the smaller p-values, the ones that show greater evidence of association, are higher up on the plot. So a standard p-value threshold of .05 would be down here on the plot. Anything above that line would be a p-value better than .05. Clearly when doing the 300,000 tests that are shown here, that's not a satisfactory p-value to be determining evidence of association; there are way too many results that were identified here by chance. Going forward, one strategy to try and figure out whether the data are identifying results that one might expect is to look for positive control signals: evidence of association that, prior to beginning the genome-wide study, one might know to be true, perhaps from candidate gene studies or from linkage analysis.
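The arithmetic behind the 5 times 10 to the minus 8th threshold, and behind the minus log 10 axis, can be written out in a few lines. This is just a sketch of the simple divide-by-the-number-of-tests correction described above; the function names are illustrative.

```python
import math


def corrected_threshold(alpha, n_tests):
    """Per-test p-value threshold after dividing by the number of tests."""
    return alpha / n_tests


def expected_by_chance(alpha, n_tests):
    """Number of tests expected to pass an uncorrected threshold by chance."""
    return alpha * n_tests


threshold = corrected_threshold(0.05, 1_000_000)
print(threshold)                         # about 5e-08
print(-math.log10(threshold))            # height of that threshold on the plot, about 7.3
print(expected_by_chance(0.05, 300_000)) # about 15,000 chance results at p < .05
```

The last line shows why the .05 line is unsatisfactory for a scan of 300,000 SNPs: on the order of 15,000 of them would be expected to pass it by chance alone.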
When looking at this particular set of data, this is a little bit more than 1100 cases and 1100 controls, looked for the location of variants known previously to be associated with disease. We were quite happy to see these variants here. In the top 10 of the results were one of the signals previously shown associated with diabetes. Here's another signal, maybe in the top few hundred of those tests of association. Here's another signal that we believed ahead of times in the top few thousand of those signals. The initial results here are showing evidence of association but some of the true signals are buried amongst a lot of the noise, the false positive associations due to chance. One way to assess the quality of that data and to look for evidence of population stratification and to look for an excess of significant results is to perform a quantile-quantile plot. This is to take the p-values from that test of association, those that were observed and plot them on the minus log 10 scale and compare them to the p-values that would be expected from just a uniform distribution of p-values plotted on the x-axis. In a case of no population stratification and a case of no interesting excess of significant results you would see the points for the SNPs aligned on this line right here. This is the data from that previous slide and it shows indeed that there's perhaps a little bit of excess signal, a little bit of the points falling a little bit off the line here but for the most part along the line. If there's evidence of population substructure the set of p-values may result in inflation across the entire range. This plot shows the p-values that might be obtained from a genome-wide association study where there is evidence of population substructure that's shown by these darker symbols here. You can see that they are falling off of the line of expected results for a large proportion of the distribution. 
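As a rough sketch of how the quantile-quantile comparison described above might be computed, the code below pairs each observed p-value with the value expected under a uniform distribution, both on the minus log 10 scale. It also computes a genomic inflation factor, a commonly used single-number summary of inflation that the lecture does not name; that part, and all function names, are additions for illustration.

```python
import math
from statistics import NormalDist, median


def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale for a QQ plot."""
    observed = sorted(p_values)
    n = len(observed)
    # The i-th smallest of n uniform p-values is expected to be about i / (n + 1)
    return [(-math.log10((i + 1) / (n + 1)), -math.log10(p))
            for i, p in enumerate(observed)]


def genomic_inflation(p_values):
    """Median observed 1-df chi-square divided by its expected median (~0.455).
    Values well above 1 suggest inflation across the whole distribution."""
    normal = NormalDist()
    chi_squares = [normal.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi_squares) / 0.4549364


# Perfectly uniform p-values fall on the diagonal and give a factor near 1.
uniform = [i / 1000 for i in range(1, 1000)]
print(genomic_inflation(uniform))
```

Points from `qq_points` that rise off the diagonal only at the far right correspond to the small excess of strong signals described above, whereas departure over a large part of the range is the pattern attributed to population substructure.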
After adjusting that data for the evidence of population substructure that was observed those values might should then get closer to that line of expectation. Now then the amount that those values are off the line of expectation at least at the most significant results show the p-values that might represent the SNPs that are interesting, show interesting evidence of association present in that study. Here's an example to show what sample sizes might be needed to identify particular genome-wide association signals. This is a set of genes and gene signals that have been reported for Crohn's disease, type 1 diabetes, myocardial infarction and type 2 diabetes and the power that would be available in say a genome-wide association study of a thousand cases and a thousand controls. This is the chance of identifying these variants using p-value thresholds of say 1 times 10 to the minus 8th or .01. So for a strong effect variant, this signal here that was relatively strong and has a relatively common risk allele frequency. This is a relative risk of that variant and the risk allele frequency. There's relatively good power to detect that signal of association. Another way of thinking of that is that to have a 90% chance of identifying this signal with a p-value less than 10 to the minus 8th would require 2,430 cases and controls. That's a signal that's relatively strong. Some of the other signals that have been identified such as this signal associated with type 1 diabetes would require 54,000 cases and controls to identify that signal at a p-value less than 10 to the minus 8th. A key way to gain power when performing a study is to collaborate and combine the data with other available data sets through collaboration. 
The most common way for these collaborations to proceed is that each study, each group performs their own genome-wide association and then the data, the p-values, the effect sizes or the odds ratios, the overall results are combined from all the studies through a meta-analysis. Now potential issues that come to play when performing these analyses are one that different genotyping and analysis strategies are used and another that the case definitions might be different. So if defining the disease is different then the data might be combined and we might be losing power through that combination. In fact the practice has shown that at least at this stage combining data from even if studies have defined diseases a little bit different that the gain in sample size outweighs the detraction of potentially defining those phenotypes a little bit differently. As we try to identify more and more variants than defining phenotype more accurately will become more and more important. Another issue though is that the genotype platforms might be different between the different genome-wide association between the different studies. One strategy that has developed a statistical strategy that has developed to enable the datasets to be combined also enables variants that were not genotyped to be predicted in the samples. This is imputation of genotypes and is described here say in your particular study sample. Here's an individual with two copies of chromosome and there are three SNPs that are analyzed in this sample. There are other SNP positions nearby but they are not being analyzed on that genotyping platform. These can be the genotypes observed here can be compared to a set of reference haplotypes. A very common set of reference haplotypes to be used to be those developed from the haplotype map project where a number of individuals have been genotyped for a much larger, more comprehensive set of genetic variants shown here. 
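The combining of per-study results by meta-analysis, described at the start of this passage, is often done by weighting each study's effect estimate by its precision. The lecture does not say which method the consortia used, so the following is only a sketch of a fixed-effect, inverse-variance meta-analysis on log odds ratios, with hypothetical function names and made-up inputs. Cochran's Q is included as one common measure of how much the studies disagree.

```python
import math


def fixed_effect_meta(log_odds_ratios, std_errors):
    """Inverse-variance weighted average of per-study log odds ratios.
    Returns (pooled log OR, pooled standard error, Cochran's Q)."""
    weights = [1 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_odds_ratios)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    # Cochran's Q: weighted squared deviations of each study from the pooled value
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_odds_ratios))
    return pooled, pooled_se, q


# Hypothetical example: three studies reporting odds ratios 1.2, 1.3 and 1.1.
betas = [math.log(1.2), math.log(1.3), math.log(1.1)]
errors = [0.08, 0.10, 0.06]
pooled, pooled_se, q = fixed_effect_meta(betas, errors)
print(math.exp(pooled), pooled_se, q)
```

Because the pooled standard error shrinks as studies are added, a modest effect that no single study could detect can reach a stringent threshold once the data are combined.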
We would identify and match the observed genotypes to haplotypes present in the reference. So for example the alleles present on this haplotype are found to match this reference haplotype, whereas the alleles present on this haplotype are found to match this haplotype at first and then this haplotype at a later position. The identification of those matching haplotypes then allows those genotypes to be filled in statistically, with different degrees of certainty depending on the range of haplotypes present and what the possible haplotypes were in the reference data set. So this imputation procedure means that you can take the genotype data from 300,000 variants, for example, and impute perhaps 2 million DNA variants that are present in the haplotype map reference sample. Most of the imputation algorithms come with a measure of the accuracy, or the expected accuracy, of that imputation. So a threshold can be used to determine which genotypes are most likely to be accurately imputed, and that data retained for analysis, whereas the genotypes less likely to be imputed accurately can be removed. So this imputation facilitates the meta-analysis, or the combination of data from different data sets. So for example, if this is a stretch of chromosome, one product, the Illumina 317K platform, might have SNPs tested at these positions shown in black. The Affymetrix platform would have SNPs tested at these positions shown in red. You can see that some of those positions overlap but many of them don't. So if we were to try and combine genotype data just as it stands, with one study typed on this platform and one study typed on that platform, we would have relatively few variants where the same variants were tested in both populations.
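The idea of matching typed alleles to reference haplotypes and filling in the untyped positions can be illustrated with a deliberately simplified toy. Real imputation software works with unphased genotypes and models each chromosome as a mosaic of reference haplotypes; the sketch below instead assumes one already-phased haplotype and exact matching across the whole region, and every name and sequence in it is invented for illustration.

```python
from collections import Counter


def impute_haplotype(observed, reference):
    """observed: dict mapping typed position -> allele on one haplotype.
    reference: list of equal-length strings, one allele per position.
    Returns {position: (most likely allele, fraction of matching haplotypes)}."""
    matches = [hap for hap in reference
               if all(hap[pos] == allele for pos, allele in observed.items())]
    if not matches:
        return {}
    imputed = {}
    for pos in range(len(reference[0])):
        if pos in observed:
            continue  # already genotyped, nothing to fill in
        counts = Counter(hap[pos] for hap in matches)
        allele, count = counts.most_common(1)[0]
        imputed[pos] = (allele, count / len(matches))
    return imputed


# Hypothetical reference panel of four haplotypes over five SNP positions;
# the study sample was typed only at positions 0, 2 and 4.
panel = ["ACGTA", "ACGTA", "ACGAA", "TCCTG"]
print(impute_haplotype({0: "A", 2: "G", 4: "A"}, panel))
```

The fraction returned with each allele plays the role of the certainty measure mentioned above: position 1 is filled in with full agreement among the matching haplotypes, while position 3 is less certain and might fall below a chosen threshold.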
However, if both studies were to perform imputation, then all of these variants shown here, present in HapMap, would be able to be analyzed by both studies and therefore able to be combined together by meta-analysis. So shown here is a representation of that: if the one platform is testing this subset of variants and the other platform is testing this subset of variants, there's a relatively small overlap, but through imputation a much larger set of variants, including all these and additional variants, can be analyzed. Shown here is an example of a test of association where imputation allowed the signal to be identified. So along the x-axis is a portion of chromosome 19, and you can also see some of the genes that are present in this region, including the low density lipoprotein receptor gene shown here in blue. Again, the minus log 10 p-value of the evidence of association with the trait, LDL levels, low density lipoprotein cholesterol levels, is shown here. And the variants that are present on the Affy 500K platform that were used in a pair of studies are shown in red. The evidence of association that they observed with LDL cholesterol after genotyping is shown by those relatively sparse signals, none of which reach a threshold any better than a p-value of about .01. However, when imputation was performed, the variants that were present in HapMap but not genotyped were able to be imputed, or predicted, and tests of association performed, and you can see that there's a signal for a SNP here and a SNP here that reached as good a p-value as 1.7 times 10 to the minus 6. So strong evidence of association at the LDL receptor gene for the LDL phenotype. This was a positive control signal that was able to be identified because of imputation. Here's an example showing what in practice a study design would look like for trying to combine data together.
So here's an example from a paper published this year where seven different groups performed genome-wide association studies for cholesterol levels in their separate samples. In total those seven groups have a little bit more than 19,000 individuals represented that were scanned. From that meta-analysis data, a subset of the SNPs that showed the strongest evidence of association were identified, so between 40 and 70 SNPs were then able to be tested in additional cohort samples. So these are samples that did not perform the genome-wide association study themselves but went in and genotyped that small subset of SNPs. So there are five additional cohorts represented here, representing about 20,000 individuals. So with those SNPs tested in these additional individuals, that data then can be analyzed together. So a sample size together of about 40,000 individuals to try to identify novel evidence of association. The results of that are shown here. Now there are three plots because there are three traits being analyzed. LDL cholesterol shown here, chromosomes 1 to 22, HDL cholesterol shown here, and triglyceride levels shown here, and you can now see that, compared to the signals that are down here in the noise, some signals are showing quite strong evidence of association, with P values as strong as 10 to the minus 40th. I'm going to zoom in a little bit on a portion of these to talk about the QQ plots that correspond to this data. So here's a subset of the LDL cholesterol data for a set of chromosomes here, and you can see that the signals are colored a little bit differently. Some are colored in blue. These are loci that had been previously reported. Not so many samples were needed to identify the first set of associated signals, and so those are shown.
So for example the apolipoprotein B association was known previously, whereas in this study, combining those sets of data together, some novel loci were identified, shown in green, and were able to be replicated in those follow-up samples. So those are labeled here. So for example TIMD4. Some of the SNPs chosen to be followed up in the additional cohorts did not show evidence of association in those follow-up samples. So those are more likely to be representing false positive signals out of that genome-wide association study. The QQ plot that's shown then shows what the evidence of association is for that whole set of data in comparison to the expected distribution. The distribution observed for the LDL cholesterol P values is shown here in black. One way to show which signals identified were new is to take this data and remove the P values corresponding to signals that were known before the study was performed, and so that's shown in blue. You can see that there are still some quite strong signals, and those are the ones that were reported as new in this particular study. Removing those, you can see that there still is a little bit of excess signal that is detectable, that's shown outside of the range of what would be expected from just the uniform distribution. One of the problems that can come up, or one of the important steps of interpretation that's needed when combining data together, is to consider potential heterogeneity between the studies. An example of this can be a signal that was identified first at the FTO locus, that was identified to be associated with type 2 diabetes by the Wellcome Trust Case Control Consortium. Now it was a quite strong signal in that data set, which compared type 2 diabetes cases to two different types of population-based controls.
When they compared their data to multiple other well-powered studies performed at the same time, the signal was mostly not observed in those other studies. When they went to look more carefully at that signal to try and determine why that might be, what those differences might be, they noted that the Wellcome Trust Case Control cases of type 2 diabetes were more obese than the controls, whereas the other studies had diabetes cases and controls that were more similar in their amount of obesity. When the Wellcome Trust group performed that test of association with diabetes accounting for obesity, or accounting for body mass index, the signal went away. That is, they identified that the source of the heterogeneity between those diabetes studies was the association with body mass index, and they detected then, by following up that signal of association with body mass index, that this was indeed a strong genetic variant influencing body mass index and obesity. It's shown here in the report from their paper. So this is an odds ratio plot, so we're looking at the odds of obesity for a given allele at this particular SNP. So an odds ratio of 1 would mean that there was no evidence of association with obesity. The first sample that they looked at was those Wellcome Trust cases of type 2 diabetes, and they saw quite strong evidence of association, with a quite strong odds ratio of 1.58 for the obesity signal, in comparison to the Wellcome Trust Case Control Consortium controls, which are here. So you can see that when they were looking for evidence of type 2 diabetes, the fact that these samples were more obese than these was leading to evidence of association for type 2 diabetes, when the effect was more directly on obesity.
They followed this up to confirm the evidence of association with obesity by looking at additional samples of individuals with type 2 diabetes, additional controls, and additional members of population-based cohorts. And you can see that in each of these samples the evidence of association, the odds ratio is greater than 1. When the 95% confidence interval does not cross 1 that means it's significant at the .05 level. So larger studies have more focused confidence intervals. Taken together all of these data showed that the odds ratio from the meta-analysis of all these studies for obesity is an odds ratio of 1.32 with a quite small p-value. So heterogeneity can arise from meta-analysis. It's sometimes possible to discover the basis for that heterogeneity. The genome-wide association studies have been quite productive in the past several years. This is a summary of the collected together at NHGRI reporting signals where the p-value has been reported at less than 5 times 10 to the minus 8. If you look back at the slide that Lynn Geordi showed you, it was from a little bit earlier in 2009. You can see that a larger number of signals showed up even just in the six months between when that snapshot was taken and when this one was created. This is representing from about November of 2009. And these are signals from all sorts of diseases and traits that have been mapped to different positions around the genome. So what are some of the things that are identified, being identified from genome-wide association studies? Well, first thing often that folks will look for is whether known signals of association are observed in the data and can be replicated. So shown here is evidence of association with LDL cholesterol levels for a SNP that's present at the ApoE signal. This one isn't precisely the variant that influences creation of ApoE4 allele, but it's in tight linkage disequilibrium with that. 
So that's a signal that's been long known to influence LDL cholesterol that's observed here. But novel signals also are being identified. And this is the main goal for genome-wide association studies. So sometimes the novel signal is present within the intron of a gene. Here's an example shown here. The coloring on these plots represents the index SNP, the SNP that's often reported, say, in a paper, or the strongest signal that's present in the genome-wide association data. And then the other variants that are being inherited together in similar patterns are colored based on their evidence of linkage disequilibrium using the statistic R squared. So the signals with the highest R squared, that are in strongest linkage disequilibrium with the top signal, are shown in colors closer to red. And those that are in lower linkage disequilibrium are shown closer to blue. And if they're gray signals, those are ones for which that LD statistic is not known. So what you can see here is evidence of association where all of the signals for the SNPs present in HapMap are contained in the intron of this gene, CDKAL1. This is association with type 2 diabetes. Sometimes signals are identified that are completely outside of known protein coding genes. So here's shown a strong signal that's located more than 100 kilobases from any known protein coding genes. In this case, CDKN2A and CDKN2B are pretty good candidates for this signal of type 2 diabetes. It turns out there's also a non-coding RNA that spans a portion of this region, and it's possible that that plays a role. In some cases, common variants are being identified near genes with known rare variants. Here's a case where evidence of association with LDL was discovered for some variants that are located near the PCSK9 gene. Mutations in PCSK9 have been found to change LDL cholesterol levels by more than 100 milligrams per deciliter.
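The R squared statistic used to color the plots described above measures how strongly the alleles at two SNPs travel together. A minimal sketch, assuming phased haplotype counts for two biallelic SNPs are available (the counts and the function name are illustrative):

```python
def ld_r_squared(n_AB, n_Ab, n_aB, n_ab):
    """r squared between two biallelic SNPs from the counts of the four
    possible haplotypes (alleles A/a at the first SNP, B/b at the second)."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / total      # frequency of allele A
    p_B = (n_AB + n_aB) / total      # frequency of allele B
    d = n_AB / total - p_A * p_B     # deviation from independence
    return d ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))


print(ld_r_squared(50, 0, 0, 50))    # alleles always inherited together -> 1.0
print(ld_r_squared(25, 25, 25, 25))  # alleles independent -> 0.0
```

SNPs with values near 1 relative to the index SNP would be drawn toward the red end of the scale, and values near 0 toward the blue end.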
Rare variants with frequencies around 1% have been found to change LDL levels by about 16 milligrams per deciliter. The common variants being identified here, with the minor allele present 20% of the time, in PCSK9 changed LDL by about 3 milligrams per deciliter. So what we're learning about some of the genetic architecture is that genes that have been identified for rare Mendelian disorders also have some DNA variation that influences common traits. It's also possible to do the flip of that: take genes identified through common variants associated with a signal, and go try to identify whether there are any rare variants that might be present in subsets of families where disease appears to have more of a Mendelian form, which might help confirm what the underlying gene is at a particular GWA signal. More and more frequently we're identifying that more than one signal, more than one set of variants that show association with the trait, is present in a given gene region. So that is, if we were to look at this evidence of association with HDL cholesterol in this portion of chromosome 15, if you were to ignore the colors for a moment, you would see that there's signal here and there's signal here showing evidence of association. Well, it turns out that there are likely at least two different variants underlying this signal. If you color this signal based on the linkage disequilibrium, the variants being inherited together are restricted to this portion of the peak. If you color it instead by this signal, you can see that the variants being inherited together in a similar pattern are restricted to this peak. If you test for association with HDL levels for a set of SNPs and account for the variation present by SNPs in this peak, this variation remains, and vice versa. That's a good test of the independence of those signals.
So there are independent signals present here both of which appear to influence both nearby this lipase gene that appear to influence HDL cholesterol levels. Now one set of these variants are variants in this signal that are found at the promoter of this gene that may be influencing the signal. The other set of variants are located at some distance upstream so perhaps there's a longer range regulatory element being affected by one or more of the SNPs in this region that influences that same gene. The larger the sample sizes are getting the more and more identifying that there are multiple signals present at a given location. Another characteristic that can be identified is that different populations that have different patterns of linkage disequilibrium can be used to narrow in and focus on where a signal might be present. So shown here is evidence of association with height across a region of chromosome 20 and the signal is relatively similar strength spanning relatively large region here. This was a study performed in individuals of European ancestry and the corresponding plot that's showing the evidence of linkage disequilibrium in the CEU population is shown here and you can see there's a relatively large set of variants that are being inherited in similar patterns. Now when SNPs chosen from this region were followed, the red SNPs chosen from this region were followed up in an African American population, evidence of association was only observed for one of them and not for the other showing that the signal was stronger here than here. 
When looking at the linkage disequilibrium pattern in the YRI population, you can see that there was evidence of more recombination events, perhaps, that have happened in this region, so that the set of SNPs being inherited together in a similar pattern is more focused, and so this evidence of association restricted to one region perhaps narrows the signal that was present in the other population, and in fact the variant here in this particular gene has been shown to have an effect on expression of that gene. Another characteristic being identified is that some of the same genes are showing evidence of association with sometimes rather diverse traits. So here's a subset of a curated list of genome-wide association signals. So for example there are variants near the PTPN22 gene that have been found to be associated, with P values less than 5 times 10 to the minus 8, with Crohn's disease and separately with type 1 diabetes and separately with rheumatoid arthritis. Well, that type of a variant might suggest that there's a similar underlying immune component to these different diseases, these different traits. Similar might be found in the glucokinase regulator gene being associated with C-reactive protein, a cardiovascular inflammatory marker, lipid levels, and waist circumference. These traits might be related, and it could be that the variants of this gene are either having a pleiotropic effect, or that the influence is on one trait and the traits are correlated with one another, and so the evidence of association is observed with multiple traits. Sometimes the signals are being identified for genes and for traits that seem to have less to do with each other, and so it would be interesting to determine what the evidence of association is and whether those two signals observed for two different traits are really acting on the same gene or on different genes.
And then bioinformatically trying to figure out what type of variants are being identified by genome-wide association studies. Well, the vast majority of the variants are intergenic and intronic, because that's where the vast majority of variants are located. When trying to determine whether there's a functional basis for particular classes of variants among those sets of variants that are associated at a given locus, groups have tried to characterize bioinformatically the prediction of what those variants might do. So in one study shown here they looked, in the sets of DNA variants, for whether there were annotations of, say, nonsynonymous sites, or a location in promoters or untranslated regions, or other sets of categories, and tried to determine whether the sets of variants were more likely to include variants that are, for example, nonsynonymous. So shown here is a calculation that shows that the odds that particular trait-associated SNPs are over-represented in the sets of nonsynonymous sites is increased over the expectation that there is no excess of nonsynonymous variants. This excess is pretty strong. They then tried to remove the effect of any variants that are nonsynonymous or in high linkage disequilibrium with nonsynonymous variants, and still found, and that's what the rest of these points show, that there's an excess of variants that are present in promoters, with different definitions of promoters, suggesting that perhaps there are more variants that are nonsynonymous and promoter-like that may be playing a role in some of these studies. This is sort of a global statistic, and really the true studies to try and determine what the functional signals are underlying the genome-wide association signals are going to require biological experiments. Okay, so at this point a small proportion of the variability in traits is being explained by common variants. Shown here are a set of diseases.
Shown are the number of loci that were identified at the time that this study was done and the proportion of the heritability being explained by those loci. So for some traits, such as age-related macular degeneration, five loci have been identified explaining a quite large proportion of the heritability; perhaps 50% of the heritable variation in disease is being identified by that relatively small number of genes. Whereas for other traits, even with larger numbers of loci identified, a smaller proportion of heritability is being identified. So whether or not this information can be used in a clinical setting is going to be disease dependent, based on the strength of the signals being identified. One might try to take the variants identified and look at whether using the variants can help predict what the outcome of phenotype or disease might be. This is data shown for 12 variants that were associated with height, counting how many of those variants are present in different individuals, and asking, among the individuals that had less than or equal to eight height-increasing alleles, what is their average height, and comparing that to individuals in the other categories. So individuals with 16 or more height-increasing alleles were on average, say, four centimeters taller than those in the other category. So this is having a relatively modest effect on overall measures of height. So as for the usefulness of these data in clinical translation, the variants being identified show evidence of association with susceptibility. The main contribution is these novel biological insights, and despite some of the effect sizes being quite small, they can still lead to clinical advances through potential therapeutic targets, through identifying biomarkers that might help predict disease better, and potentially through determining environmental factors that might influence disease and allow us to have public health impact in being able to prevent disease.
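The height comparison described above, counting height-increasing alleles across 12 variants and comparing average height between groups, can be sketched as a simple allele-count score. Everything below is illustrative: the function names are hypothetical and the individuals in the demonstration are invented, not data from the study.

```python
from statistics import mean


def allele_score(genotypes):
    """Total number of trait-increasing alleles; each entry is 0, 1 or 2."""
    return sum(genotypes)


def mean_height_by_score(people, low_max=8, high_min=16):
    """people: list of (genotypes, height_cm) pairs.
    Groups individuals by allele count and returns the mean height per group."""
    groups = {"low": [], "middle": [], "high": []}
    for genotypes, height_cm in people:
        score = allele_score(genotypes)
        if score <= low_max:
            groups["low"].append(height_cm)
        elif score >= high_min:
            groups["high"].append(height_cm)
        else:
            groups["middle"].append(height_cm)
    return {name: mean(heights) for name, heights in groups.items() if heights}


# Invented individuals with 12 variants each (scores of 0, 12 and 16 alleles).
demo = [([0] * 12, 168.0), ([1] * 12, 170.0), ([2] * 8 + [0] * 4, 172.0)]
print(mean_height_by_score(demo))
```

A score like this treats every variant as contributing equally, which is the simplest possible choice; weighting each allele by its estimated effect is a common refinement, though the lecture describes only the counting approach.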
The other side of this, trying to use these variants to improve measures of individual personalized medicine, is perhaps going to be a little bit further behind, because a relatively small proportion of the heritability is still being explained. However, these variants can potentially be used in diagnostics and prediction, and maybe in determining what the ideal treatments are for particular individuals. Okay, so in summary, we talked our way through genome-wide association study design and quality control measures, the need for very large samples to identify smaller signals, and sets of successful loci that are being identified. And finding an association signal doesn't immediately tell us about clinical utility and doesn't immediately tell us about the biology, although the functional studies that follow on from these can help identify novel pathways and genes that are relevant to traits. As for the future of genome-wide association studies, more and more loci are going to continue to be identified as larger and larger sample sizes are evaluated through meta-analysis. In addition, to date the number of SNPs chosen for follow-up in replication samples has often been relatively modest, and as more and more signals can be followed up in larger numbers of samples, additional loci will be identified. Panels will be developed with lower-frequency variants to allow a greater proportion of the variation to be assayed. More diverse populations need to be analyzed, because they can show different power, either through different allele frequencies of the underlying variants or through different environmental contexts in which different genes would be shown to be playing a role. Other types of sequence variants, such as copy number variants, are going to be analyzed better.
Phenotypes can be defined more accurately to better identify the genes related to them, and then the interactions between these genes and the environment will be identified. In all, the outcomes of these studies are going to be influenced by the ability to identify the molecular biological mechanisms underlying the associations. Thank you very much for your attention.