Good morning everyone, and welcome to week eight of our Current Topics in Genome Analysis series. I'm pleased to introduce today's speaker, Dr. Karen Mohlke, who will be presenting the lecture on genome-wide association studies. Dr. Mohlke is an NHGRI alumna, having done her postdoctoral work in Francis Collins' lab, where she used genome-wide approaches to localize diabetes susceptibility genes. She's currently an associate professor in the Department of Genetics at the University of North Carolina, a member of the Carolina Center for Genome Sciences, and a member of the Lineberger Comprehensive Cancer Center at UNC. Her lab studies traits with complex inheritance patterns, using many of the approaches she'll be presenting today to study conditions such as type 2 diabetes and obesity. It's always a pleasure to have Karen here with us at the NIH; please join me in welcoming her this morning. Thank you. Thank you very much. It's a pleasure to be here. I have no financial relationships or commercial interests. So we're going to talk today about the genetic basis of complex diseases and traits. By complex diseases and traits I mean those for which multiple genes contribute toward susceptibility, for which the contributing variants may be common, rare, or both, and for which there may be interactions between the variants and various environmental exposures. When we think about the genetic architecture of human disease, the susceptibility variants span a range of allele frequencies, from the very rare up through common frequencies, say 5% and higher. Variants also range in the strength of their effects: some contribute only a very small amount toward a trait or disease, and some have quite a strong effect, such that if you inherit the variant, say for an autosomal dominant trait, you may become affected.
Many of the disease and trait variants identified most recently are common variants, and the technology used to identify them has been the genome-wide association study. We're going to spend most of our time today talking about genome-wide association studies that detect these common variants, and then at the end we'll talk about how the increasing availability of sequencing to identify lower and lower frequency variants is moving us up the frequency scale. Many of the variants contributing to common traits and diseases have relatively small effects: inheriting a risk allele for height may change one's height by a very small amount, fractions of millimeters rather than centimeters. So they sit relatively low or modest on the effect-size scale. The approach to identifying variants that contribute to a disease or trait is similar between linkage analysis in families and the population-based analysis used in association studies. In a family with a rare autosomal dominant disorder, the goal of mapping is to trace the alleles present in the affected members of the family, say the A1 allele inherited by all the affected members. Association studies use this same concept, but on a much longer time scale. Here is today, with the copies of the chromosome that we're able to measure and ascertain currently, compared to the many generations back to common ancestors. A variant that arose in some ancestor is inherited by the individuals descended from that person; all of these individuals may carry a risk allele that increases their susceptibility to a trait, and we identify the variant by comparing them to the individuals who don't carry it.
So the goals of a genome-wide association study are to test a large proportion of the common genetic variation in the genome for association with disease susceptibility or with variation in a quantitative trait. We can identify the variants that contribute to these diseases or traits without having any idea what the underlying genes do or what the functions of their products are, based just on map position in the genome. And this approach has been very successful. This map of all the chromosomes, 1 through 22 plus X and Y, uses dots to represent loci that have been identified for various classes of traits or diseases and their positions in the genome. So, an outline of what we'll talk about today. First is the genome-wide association study design: the factors and strategies that are important in setting up a study, performing it, and carrying out the analysis. Then I'll show a few examples and talk about interpreting the results, the effect sizes, the significance levels, and some example loci and their characteristics. Toward the end we'll talk about the movement toward sequencing studies to look at lower frequency variants, and about how association studies of lower frequency and rare variants have different analysis needs. So, doing a genome-wide association study. The study design depends on whether you're studying a disease or a quantitative trait. For a disease, one can do a case-control association study. That has worked quite well, especially for common diseases, where many affected individuals can be ascertained and compared to controls.
When looking at a quantitative trait, a population-based study can be used, in which all members of the population are evaluated for a particular trait. That trait could range from something very easy to measure, such as weight or height, to something that requires an experiment, such as measuring the expression of one gene, or of all genes in the genome. In a case-control association study, which is the design I'm going to focus on rather than the population-based one, the cases and controls should be similar and comparable in all respects except their disease status. If we want to increase the potential genetic component and enrich the genetic effect size, we might set up the case-control study to enrich the cases for genetic contributions to the trait. We might choose the more severely affected individuals for a given disease, or require that the cases have other family members affected with the disease, which may make a genetic contribution more likely than a largely environmental one. Or we might choose those with a younger age of onset, which may suggest a stronger genetic component. For the controls, if we want to enrich the genetic effect size, we may choose individuals at lower risk of disease rather than population-based samples. Now, if we use these strategies to enrich the genetic effect size, we may improve our chances of identifying particular variants, although the estimated effect sizes may not reflect what's really going on in the population; a population-based study would do a better job of evaluating the true contribution of those variants. And we want the controls to be comparable to the cases in other respects: age, sex, demographics, and other factors that may contribute to disease risk.
One important aspect of making sure that cases and controls are comparable is to consider the underlying ancestry or subpopulations within a group. Shown here is a matched set of cases and controls, with similar numbers of individuals, but with different proportions of the solid and hatched dots between the cases and the controls. If we have ancestry differences, different subpopulations represented in the cases versus the controls, that can lead to false positive results in the association study. We can try to collect this information from the individuals, for example by asking about their ancestry, or we can determine it from the genotyping data, using that information to cluster the individuals and detect their similarities and differences. It may be that we have inadequate ancestry information until we do that genotyping step. The challenge is that if allele frequencies differ between the subpopulations, we may have population stratification, which can produce false positive associations, especially in case-control association studies. So a lot of effort goes into controlling for those differences to make sure that the case-control association study is performed accurately. Here's an example of population stratification, where the explanation was published back in 1988. A previous study had shown that a particular marker, this immunoglobulin Gm haplotype, was associated with type 2 diabetes in a particular community. The individuals in whom the marker is present have the age-related prevalence of disease shown here, and those in whom the marker is absent have a higher age-related prevalence. So it looks like the presence of the Gm marker is associated with lower prevalence of diabetes in this population.
However, that apparent association with diabetes was found to be due to an association between the marker and the degree of heritage from one subpopulation in this group. Based on great-grandparent ancestry, you can see that the prevalence of diabetes increases with the degree of American Indian heritage, whereas the prevalence of the marker sharply decreases. Through this and other analyses, they showed that the apparent association was due to that relationship. So, strategies to account for or avoid population stratification. One is to carefully match cases and controls, potentially even at the individual level: identify a case, then find a control with the same age, the same sex, and the same other characteristics that may contribute to the disease. That works well if there is a large pool of controls to choose from. One may restrict an analysis to one subgroup, defining that subgroup as carefully as possible to avoid the possibility that multiple subgroups contribute differently. One can also use an approach that adjusts for genetic background: take the genotype data, use principal components to infer ancestry, and then adjust for those principal components in the association analysis. Finally, some studies do not use a case-control design at all, but a family-based design, in which one genotypes relatives and analyzes the transmission of alleles from heterozygous parents to their affected offspring; the transmission disequilibrium test is a family-based association test, a class of methods I'm not going to talk about today, but they are one way to perform an association study that's not a case-control study. So, we've identified the individuals for the study, and now we genotype them for markers spanning the genome. SNP panels are available with on the order of 10,000 to millions of SNPs.
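As a rough illustration of the principal-components adjustment just described, here is a minimal Python sketch of my own (not code from the lecture): it standardizes a samples-by-SNPs matrix of 0/1/2 genotype counts and extracts the top components via SVD. Production tools such as EIGENSTRAT additionally prune SNPs for linkage disequilibrium and handle missing genotypes, which this toy version does not.

```python
import numpy as np

def ancestry_pcs(genotypes, n_components=10):
    """Compute sample principal-component scores from a samples x SNPs
    matrix of 0/1/2 genotype counts, for use as ancestry covariates in
    an association analysis. Toy sketch: assumes no missing data and
    no monomorphic SNPs."""
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0                       # allele frequency per SNP
    # Center each SNP at its mean and scale by its binomial std. dev.
    X = (G - 2 * p) / np.sqrt(2 * p * (1 - p))
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]  # PC scores per sample
```

With two subpopulations whose allele frequencies differ strongly, the first component essentially recovers subpopulation membership, which is exactly the axis one would adjust for in the regression.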
A couple of companies that market products for genome-wide association studies are Affymetrix and Illumina. The arrays that can be purchased may contain random SNPs across the genome, or they may contain selected haplotype-tagging SNPs; you've talked about this previously in the course, SNPs chosen to represent the variation across a region. So the SNP in the first column here, which has a C-T-C-T pattern across these four haplotypes, is represented by genotyping this other variant, which has a T-C-T-C pattern across those same haplotypes. If two variants are inherited together in the same pattern across many individuals, one can serve as a proxy, or tag, for the other, and using that information across the genome helps limit the number of variants and pick the smartest set to cover as much of the genome as possible. These arrays can also include copy-number probes, looking not only at changes of a single nucleotide but at whether regions may be deleted or duplicated. Arrays and panels are moving toward including more lower-frequency variants as these are identified in sequencing studies; the variants can be put onto arrays to allow cost-effective analysis. And in the past few years, arrays focused mostly on variants in the protein-coding regions of genes, the exomes, have been developed; these cover a lower range of variant frequencies, but they include the variants that are perhaps easiest to follow up functionally to figure out what role they may play. Some arrays are designed so that the investigator can use the framework markers present on the array and then add custom markers that may not be well represented on it, allowing the most efficient study design at the lowest cost. Here are two examples of how these assays work.
In the Illumina Infinium assay, genomic DNA from the individual is amplified and fragmented. A bead array of capture probes is purchased; the array carries oligonucleotides with specific sequences on them. The fragmented genomic DNA hybridizes to the array; shown here is a bead type with its attached sequence and the genomic DNA hybridized to it. The probe sequence ends right before the variable nucleotide being assayed, so the genomic DNA provides the template for a single-base primer extension, and the nucleotide added on, either the C or the A in this case, is labeled with one of two colors. After performing the chemical reaction and some fluorescent staining, you detect the fluorescent color and its intensity, so that the intensity of one color represents how much C nucleotide is present in the individual at that position, and the red color represents how much A is present. The Affymetrix Axiom genotyping strategy again amplifies and fragments the DNA from each individual. Here the capture is performed on an array, and a cocktail of labeled oligos is added to the reaction; discrimination is performed by ligation, which joins the captured fragment to one of the provided oligos only if there is perfect complementarity of the nucleotide sequence at that position. The ligase is very specific for that. The other oligos are washed off, and then a staining and imaging step detects the labels, with the intensities of the two different colors again representing the two different alleles that could be present at that position. These are genome-wide arrays, but their actual coverage across the genome varies; no available array covers every single variant.
So when choosing an array, you may wish to choose based on its coverage of the variants you want to investigate. Here are some examples of various arrays and their coverage across the genome based on different populations from the HapMap study. You can see that coverage ranges from moderate up to higher levels as more SNPs are included. More markers on an array allows higher coverage, but that coverage differs between populations, because variant frequencies differ between the populations. So, the genotype data is generated. There are several important quality-control steps to make sure the data is valid. One step is identifying and removing bad samples. If a sample is of poor quality, we may detect it through its success rate: if the sample yields successful genotype calls on less than, say, 95% of the variants on the array, that might suggest the sample is low quality and that some of the variants that did genotype successfully are giving erroneous data, so it would be useful to exclude that sample. If the number of heterozygous calls in a sample is higher than in the other samples being assayed, that might suggest the sample is contaminated and the genotype calls incorrectly represent a mixture of samples. We may be able to identify sample switches by analyzing the data, checking whether the sample is of the expected sex based on markers on the X and Y chromosomes, or by comparing genotypes to any previously existing genotypes on those same samples. A lot of sample handling happens in the genotyping process, and it's useful to detect any handling errors. We may also identify unexpectedly related individuals, or duplicates of the same person.
They may have participated in the study twice, and we can detect that identical samples are present, or twins, or relationships between individuals that were not known previously. And the data can be used, with that principal component analysis, to evaluate ancestry based on allele frequencies across the genome and identify individuals whose underlying ancestry differs substantially from the rest of the sample. There is also genotype quality control at the level of the SNP, the level of the variant. If a particular variant has a success rate of less than 95%, maybe the assay is a bit unstable and the genotypes being called are not accurate, so it might be better to remove that SNP. Some duplicate samples are included intentionally to determine whether the SNP assays are accurate; if a SNP has a high discrepancy rate between duplicates, it should probably be excluded. We can test Hardy-Weinberg equilibrium, whether the observed genotype proportions are consistent with those expected from the allele frequencies. If there are family relatives in the study, such as trios, we can check that alleles are inherited properly in those families. And in a case-control study, a different missingness rate in the cases versus the controls, or other signs that cases and controls were genotyped differently, may be important to track, because it could influence the association results. Here's an example of the readout of a genotyping assay, with the signal intensity of one allele shown on the x-axis and the signal intensity of the other allele on the y-axis. For example, at one SNP we've genotyped several hundred individuals. These individuals had high levels of the x allele and low levels of the y allele.
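The per-sample and per-SNP filters just described can be made concrete with a short Python sketch (illustrative only; the function names, the NaN-for-missing encoding, and the exact cutoffs are my assumptions, not part of the lecture):

```python
import numpy as np

def hwe_chisq(genotypes):
    """1-d.f. chi-square statistic for Hardy-Weinberg equilibrium from
    0/1/2 genotype calls (NaN = missing). Basic sketch; exact tests
    are preferred when genotype counts are small."""
    g = genotypes[~np.isnan(genotypes)]
    n = len(g)
    obs = np.array([(g == k).sum() for k in (0, 1, 2)], dtype=float)
    p = (obs[1] + 2 * obs[2]) / (2 * n)            # alt-allele frequency
    exp = n * np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
    return ((obs - exp) ** 2 / exp).sum()

def qc_masks(G, sample_call=0.95, snp_call=0.95, hwe_cut=23.9):
    """Boolean masks of samples and SNPs passing call-rate and HWE
    filters. G is samples x SNPs with NaN for missing calls; the HWE
    cutoff 23.9 corresponds roughly to p < 1e-6 on 1 d.f."""
    sample_ok = (~np.isnan(G)).mean(axis=1) >= sample_call
    snp_ok = (~np.isnan(G)).mean(axis=0) >= snp_call
    hwe = np.array([hwe_chisq(G[:, j]) for j in range(G.shape[1])])
    return sample_ok, snp_ok & (hwe < hwe_cut)
```

A SNP where nearly every call is heterozygous, for example, produces a very large Hardy-Weinberg statistic, which is a classic signature of a clustering artifact.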
These would be homozygotes for the x allele; here are homozygotes for the y allele, and here are the heterozygotes. That's a nice example. Sometimes the software that calls these genotypes and recognizes these clusters may accidentally merge two clusters, calling all of these heterozygotes and these the other homozygotes; we need to detect when that kind of thing happens. Sometimes the assay just doesn't discriminate well between the clusters, and then we need to decide whether we can keep the assay and whether we're confident in the genotypes. Maybe particular samples need to be removed because we're less confident of the genotypes where the clusters overlap, or perhaps the whole marker should be removed from the analysis. So now we have high-quality genotype data, and we can do tests of association. In a case-control study, we can count how many individuals have each genotype among the cases and the controls. We could do a test for trend, or an analysis that includes other covariates that may influence the disease outcome. We could do an additive test, which is the most commonly performed, or a dominant or recessive test. In an allelic analysis, just looking at the presence of the A and C alleles, for example, at a variant, we count the alleles in the cases and in the controls and calculate the odds ratio, asking whether the odds of having disease are increased in carriers of one allele. Here's an example from an association study: an odds ratio plot from a number of different studies, all looking at the same variant. The odds ratios are shown down here, and the value of one, meaning the variant has no influence on disease, is shown by this vertical bar.
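The allelic odds-ratio calculation just described is simple enough to write out; here is a minimal sketch (my own illustration, using the standard Woolf log method for the confidence interval; real analyses would more often use logistic regression so that covariates can be included):

```python
import math

def allelic_odds_ratio(case_a, case_b, ctrl_a, ctrl_b):
    """Odds ratio and 95% CI for allele A versus allele B, from allele
    counts in cases and controls. Assumes all four counts are nonzero;
    small counts would need a continuity correction."""
    or_ = (case_a * ctrl_b) / (case_b * ctrl_a)
    # Standard error of log(OR): sqrt of summed reciprocal counts
    se = math.sqrt(1 / case_a + 1 / case_b + 1 / ctrl_a + 1 / ctrl_b)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, (lo, hi)
```

If the A allele is seen 60 times out of 100 in cases but only 40 times out of 100 in controls, the odds ratio is (60x60)/(40x40) = 2.25, with a confidence interval that excludes 1, mirroring the dot-and-bar display in the forest plot described above.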
Each study, with its reference shown over here, is represented by a dot for its odds ratio and a bar for the 95% confidence interval. The smaller studies may have larger confidence intervals, some of which cross that value of one; the larger studies have larger boxes and narrower confidence intervals. Data can be combined across populations to show the odds ratio in each population and to summarize the data together. This is a very strongly associated variant from a genome-wide association study: it had an odds ratio of 1.46, and the p-value across all of these studies is quite significant, at 10 to the minus 140th. For a quantitative trait, we might instead do a linear regression analysis, graphing the trait values for the different genotypes and looking at the beta value, the slope of this line for this particular SNP. In reality, we'd include covariates. If my trait is toe size, then toe size might also be affected by sex and by age; maybe the relationship between age and toe size isn't completely linear, maybe my toes get smaller as I become more elderly, maybe body mass index plays a role. Any other variable that is significantly associated with toe size I'd want to include, so that the value I'm focusing on, the association between the SNP and the trait, is most representative of the SNP's contribution. Now, we're assuming that the trait is normally distributed for each genotype with a common variance and that the subjects are independent, so traits may need to be normalized or transformed ahead of time so that those distributions are normal.
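The per-SNP regression described above can be sketched in a few lines (a toy of mine; real pipelines such as PLINK also report the standard error and p-value for the genotype term, and handle missingness):

```python
import numpy as np

def snp_beta(genotype, trait, covariates=None):
    """Additive-model effect estimate for one SNP on a quantitative
    trait: regress the trait on genotype dosage (0/1/2) plus an
    intercept and any covariates (e.g. age, sex). Returns the beta
    on the genotype term only."""
    n = len(trait)
    cols = [np.ones(n), np.asarray(genotype, dtype=float)]
    if covariates is not None:
        cols.extend(np.asarray(c, dtype=float) for c in covariates)
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, np.asarray(trait, dtype=float), rcond=None)
    return beta[1]          # slope for the genotype dosage
```

Including the covariate columns is what isolates the SNP's own contribution, as in the toe-size example: the genotype beta is estimated after age, sex, and so on have absorbed their share of the trait variation.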
And if the subjects are not independent, if there are relatives within the study, then that relatedness needs to be taken into consideration. So: we've collected the samples, genotyped them, and performed an association analysis on every available marker, and we can generate a plot like this, where each dot represents one variant that was evaluated. The chromosomes are lined up end to end, 1 through 22, and the evidence of association, the negative log 10 of the p-value, is shown on the y-axis. This is a very large study of 188,000 individuals from 60 different studies that combined their data, and the p-values are quite strong, exceeding 10 to the minus 100 for some variants; this is often called a Manhattan plot because it's supposed to look like the Manhattan skyline. If we're going to test many, many variants, the threshold of significance needs to take that number of tests into consideration. Testing 300,000 to millions of SNPs, one approach used to correct for those multiple tests is a Bonferroni correction, starting from the p-value of 0.05 that's more typically applied to, say, a clinical study. If the number of common variants present in, at least, European-ancestry genomes is approximately equivalent to a million independent tests, then 0.05 divided by that million gives a p-value of five times 10 to the minus eight. This is a commonly applied threshold for declaring that an association is significant. To reach this level of significance, we need either a variant with a large effect on the trait or a large sample size. Two more aspects I want to talk about in the design and analysis of these studies, and one is the ability to impute ungenotyped variants.
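The arithmetic behind that genome-wide threshold is worth spelling out explicitly (a trivial sketch of mine, just to fix the numbers):

```python
def bonferroni_threshold(alpha=0.05, n_tests=1_000_000):
    """Per-test significance threshold after Bonferroni correction:
    the family-wise error rate alpha divided by the (effective)
    number of independent tests."""
    return alpha / n_tests

# 0.05 split over ~1 million independent common-variant tests gives
# the conventional genome-wide significance threshold of 5e-8.
genome_wide = bonferroni_threshold()
```

Note that the million is an estimate of *independent* tests; because of linkage disequilibrium, genotyping more than a million correlated SNPs does not keep shrinking the threshold proportionally.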
So here is a comparison of two different arrays, some of the introductory arrays from Illumina and Affymetrix that had 300,000 and 500,000 variants on them. Position across a region of the genome is shown here, with the variants present on one array shown as black hatches and those on the other array as red marks. You can see that not many variants overlap; in fact, of the pool of variants on those two arrays, only a small proportion are the exact same variants being genotyped. However, given the linkage disequilibrium relationships that exist in the human genome, we can predict the genotypes at nearby variants that are in high linkage disequilibrium with the ones being genotyped. This process is called imputation. So if we impute, in this case, to the variants genotyped in a reference sample as part of the HapMap study, we can use either of these array data sets to impute the much denser set of variants shown here. Briefly, the way this works is that I have my study sample, with a copy of a portion of a chromosome from mom and a portion of a chromosome from dad. This particular individual is AG at this position, AC at this position, and AA at this position. And then I have a set of reference haplotypes, maybe from the HapMap project, or the 1000 Genomes Project, or my own sequencing study of many more individuals in my population of interest: a reference set of samples where the genotypes are much more dense, with many more variants than were observed in my study sample.
We can take those observed genotypes, look for the reference haplotypes most similar to them, and evaluate the likelihood that these particular alleles are carried on a given haplotype; this has to do with the frequencies of the haplotypes and the presence of those particular alleles. So perhaps the A-A-A haplotype in this region best matches this purple haplotype here, whereas the G-C-A doesn't best match any single haplotype across the whole region but is best represented by a portion of one haplotype here and a portion of another here. That then allows me to impute, or predict, the missing genotypes. The various algorithms that exist for imputation usually provide a quality score along with each imputed genotype. I may impute a variant that is in extremely high LD with one of my genotyped variants much better than a variant in lower LD, where I'm relying on relationships across other haplotypes and doing more estimation with less confidence. Shown here in red are SNPs that were directly genotyped on a particular array, in a test for association with LDL cholesterol across a few hundred kilobases near the LDL receptor gene. You can see that none of the red genotyped variants shows strong evidence of association with the trait, but when the ungenotyped variants are imputed, variants are identified that show strong evidence of association in this individual study. So performing imputation, allowing a greater proportion of the variants in the human genome to be tested for association, can increase the signals identified. Another aspect of an association study is combining data from multiple studies.
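To make the haplotype-matching intuition concrete, here is a deliberately naive toy in Python (my own illustration; real imputation methods such as the hidden-Markov-model-based tools allow switching between reference haplotypes along the chromosome and return per-genotype quality scores, neither of which this brute-force sketch does):

```python
import numpy as np

def impute_toy(observed, typed_idx, ref_haps):
    """Pick the pair of reference haplotypes whose summed alleles best
    match the observed 0/1/2 genotypes at the typed positions, then
    read off genotypes at every position, typed and untyped."""
    ref = np.asarray(ref_haps)
    best, best_err = None, None
    for i in range(len(ref)):
        for j in range(i, len(ref)):
            g = ref[i] + ref[j]                     # implied genotypes
            err = np.abs(g[typed_idx] - observed).sum()
            if best_err is None or err < best_err:  # ties keep first pair
                best, best_err = g, err
    return best
```

The key point the toy captures is that genotypes at untyped positions are filled in "for free" from the reference panel once the best-matching haplotype pair is found.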
Maybe I perform a study in thousands of individuals looking for association with a disease or a trait, I find some evidence, and I publish my paper; but there's likely more to find, and I'd like to combine my data with that of colleagues performing similar studies. In combining, we'd like to give more weight to studies that have greater precision; a larger study with more subjects deserves more weight than a much smaller one. This increases our power to detect signals compared to the individual studies alone. We can also use this to investigate the consistency of effects across studies, and to investigate potential sources of heterogeneity: what are the differences between these studies? The phenotype definitions, how the cases and controls were selected, or how individuals came to participate may differ. Studies may use different genotyping and analysis strategies, and the environmental exposures of the participants may differ; we may be able to recognize those differences in effect when we test for heterogeneity in the meta-analysis. Some common meta-analysis methods: one is to combine based on p-values, taking into consideration the direction of effect, whether the C allele, say, is associated with an increased or a decreased trait value, when combining the evidence. Perhaps more common is an effect-size meta-analysis, which combines the normalized effects from each of the studies.
In combining those effects, we could do a fixed-effects analysis, where we assume the between-study variance is zero, that each study is estimating the same effect size; or, if we don't believe that assumption, a random-effects meta-analysis. Other methods exist that incorporate uncertainty in those beliefs and perform a Bayesian-type analysis. Meta-analysis also offers another chance to adjust for population stratification, and this has been termed genomic control. The plot shown here compares expected values, here from a chi-square test of association between alleles, to the observed values, for every variant analyzed in the study. Most of the variants in a genome-wide association study will fit the null expectation, showing no association, and should fall right along this line of identity. When variants are more significant in the observed data than expected under a uniform distribution, they rise up off the line, and these may represent true signals of association. But if the values rise off the line very early, at lower and lower levels of evidence, that may indicate population outliers or structure in the data, or relatedness between individuals. The factor by which this inflation is present can be calculated, and the results of the tests of association can be adjusted by that factor. So a study that reports performing genomic control calculated that factor, perhaps for each of the studies going into the meta-analysis, and adjusted accordingly.
So if there were lots and lots of relatives and there was a lot of inflation in one study, then those p-values were adjusted to become less significant in their contribution to the analysis. So we've set up, designed, and performed our genome-wide association study. Let's talk about interpretation of the results of these studies. As an example, I'm going to use an analysis with seven individual contributing studies that each performed a genome-wide association study; a meta-analysis was then performed across those studies, and the lead variants were identified and followed up in an additional, stage-two set of cohorts. The initial analysis was done in almost 20,000 individuals and the follow-up in approximately another 20,000 individuals. This was a study looking at traits involving cholesterol and other lipid levels. The main results are shown here. Here's the full set of results; I'm going to zoom in in a second. There's a Manhattan plot provided for each of the traits examined, such as LDL cholesterol and triglycerides, and you can see that signals were identified across the genome. Also shown are quantile-quantile plots, like the one I just described, where the expected minus log 10 of the P value is shown on the x-axis and the observed values on the y-axis. Now I'm going to zoom in a little bit here on a portion of the Manhattan plot. In this case they colored the loci based on whether the signal had been previously reported in earlier studies or whether it was a novel signal from this paper, and they also pointed out the ones that they tested in their stage-two analysis but that didn't meet the threshold for significance. In this QQ plot, the gray line here shows the expected result if there's no evidence of association, with a confidence interval in this shaded pattern, and the results from all variants in the association test are shown here.
And that includes lots and lots of very strongly associated variants. Then, when they remove the variants, and the regions around them, that had been shown to be associated previously, the remaining signal is shown here. If they also remove the variants that they are newly reporting, asking whether there are additional variants yet to be found associated with this trait, that's the green bars here. And you can see that perhaps there are additional signals, so if a larger analysis were performed, a greater meta-analysis, additional signals might be identified. The results are often reported in a table, and this is very small here, but it shows a portion of the data for the loci found to be associated with LDL cholesterol. Here are four rows representing four signals that were newly identified in the study, and then some of the loci with prior evidence of association with that trait. The given variant is indicated, with results from combining the data from the initial stage, the genome-wide association study, as well as the follow-up studies that only genotyped a subset of the most significant variants. Combining that data together, they show the p-value and the sample size. They also show a measure of the size of the interval in kilobases and the number of genes in that interval. There are many different ways to define a locus, and this is the one they use here. They then labeled some genes of interest that may or may not be involved, but that looked good for potentially playing a role in that particular trait. Now in this case, to report the effect size associated with each of these variants, instead of using the combination of studies, they chose to use one particular study, here the Framingham Heart Study, which was a population-based study.
And so perhaps those effect sizes are more representative of the population than if we chose, say, only the cases and controls at the extremes of the population and were missing the people more in the middle. So they report here the two different alleles and then the effect size for, in this case, the minor allele. It says that the T allele is associated with increased levels of LDL cholesterol; this is the beta value from the regression analysis and its standard error. So we can look at the beta value and determine whether the allele is associated with increased levels of the trait, if it's positive, or decreased levels, if it's negative. Now for some examples of what the zoomed-in association plots look like in particular regions, and some of the potential kinds of results that can be observed. Zooming in on a portion of chromosome 19, you can see every dot now representing a variant that was tested for association, here using variants imputed to the HapMap reference panels. Down below are genes in the region, including the APOE gene, known to be associated with LDL cholesterol far before the genome-wide association study era. Also plotted is the recombination rate, in centimorgans per megabase. You can see here, with this blue line, hotspots of recombination, and sometimes these hotspots point out, or are consistent with, the limits of the region of association. What's found in this case is that we can replicate a previously known association signal. That's really nice for a quite new type of study, a good positive control when the same signals can be identified. But novel signals are perhaps the most exciting, the reason to do the study, to identify new things. Some of those studies may report variants that are completely localized within introns of genes.
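The beta values and standard errors being read off these tables come from regressions of the kind sketched below. This is a toy single-SNP example on simulated data; the trait, genotypes, and effect size are invented, and a real analysis would also include covariates such as age, sex, and ancestry components.

```python
import numpy as np
from scipy import stats

def snp_association(genotypes, trait):
    """OLS regression of a quantitative trait on allele dosage
    (0/1/2 copies of the coded allele). A positive beta means the
    coded allele is associated with higher trait values."""
    X = np.column_stack([np.ones_like(genotypes, dtype=float), genotypes])
    coef, _, _, _ = np.linalg.lstsq(X, trait, rcond=None)
    resid = trait - X @ coef
    df = len(trait) - 2
    sigma2 = resid @ resid / df
    # standard error of the slope from the (X'X)^-1 matrix
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    t = coef[1] / se
    p = 2 * stats.t.sf(abs(t), df)
    return coef[1], se, p

# Simulated toy data: each coded allele adds about 0.3 trait units
rng = np.random.default_rng(1)
g = rng.integers(0, 3, size=2000).astype(float)
y = 0.3 * g + rng.normal(size=2000)
beta, se, p = snp_association(g, y)
```

The sign convention is exactly the one in the table: if the coded allele is the T allele and beta is positive, T is associated with increased levels of the trait.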
So these are the strongest associated variants here. You can see the recombination hotspots that are bounding this region of the variants that showed the strongest evidence of association. None of those variants are in coding regions. Many signals are localized outside of known protein-coding genes in the genome. Shown here is a signal, a set of variants in a nice, small region, but more than 100 kilobases from the closest known protein-coding gene. This may be due to variants acting at a distance on a protein-coding gene, or maybe there's an underlying non-coding RNA that is being influenced by the variants. Or maybe there's an underlying protein-coding gene that we don't know about yet. At many loci, there are multiple genes in a region, and the process of suggesting what the most likely influenced gene is, as done in a paper, is often based on looking at the genes in the region, looking at the literature, and using various approaches to try to indicate whether particular genes could play a role. In this case, looking at the literature, there are multiple good candidate genes that may be involved in this trait, here HDL cholesterol. And you can see that we're at the limit of what the evidence of association is telling us from this particular study, because there's a whole set of variants that are inherited in very much the same pattern across all those individuals and that show very similar evidence of association. The lead variant is a tiny bit more strongly associated than some of these others, but that could be due to just fluctuations in the samples contributing here, or small differences that may not be significant; the likelihood that this variant itself is the causal variant is not much bigger than the chance that some of these others are playing that role.
So really, the set of candidate variants that may be influencing the trait at this locus could be considered to include at least this set, and maybe this set, and maybe even others. It's important to interpret locus names with caution: many of the names provided to go along with a given variant in a table may just be the closest gene to the lead variant that gets reported. So this locus may be labeled with one of these gene names when these others are also in the region. How do we identify plausible candidate genes? Here's a set of approaches that were used in a recent paper from the fall of this year, reporting loci newly found to be associated with, here, HDL cholesterol levels. In this table, they report what the nearest gene is to the lead variant and how far away that nearest gene is. In this case, the nearest gene was 13 and a half kilobases away; in many of the examples here, the lead variant is within the bounds of the gene, either in an exon or an intron. They also report how many genes are within an arbitrary distance, say 100 kilobases, so you can have a sense of how gene-rich or gene-poor a given region is. And then there are approaches to try to estimate, or provide some sense of, how to interpret that result. One approach is to look at the literature for all of the genes in the region, maybe even at greater distances, and try to say which ones have a plausible role in lipid metabolism or HDL levels, or have some piece of data from a model organism, or a known biochemical function, that might suggest they're good candidates for this trait.
Another approach is to take all those candidate variants, and this might be done by taking the top variant and looking at those that are in linkage disequilibrium with it at an r-squared level greater than 0.8, and ask whether any of them change the protein-coding sequence of any of the nearby genes. So which genes have a non-synonymous variant in them? You can see that in this top case here, there are three genes that have non-synonymous variants; two of them, marked with asterisks, are a little further away. In other cases, there's a single non-synonymous variant that's been shown to have a functional effect on that particular protein, and that might be a stronger piece of evidence than some of the others, because not all non-synonymous variants affect gene function; some of them are functionally silent. Another approach is to look for whether the variant that is associated with the trait, such as HDL cholesterol, is also associated with the expression of a nearby gene. That's termed an eQTL, an expression QTL. In this particular case, the variant that is associated with HDL cholesterol at this locus is also associated with the expression of the RBM5 gene. So, making up the alleles here, maybe the A allele that's associated with increased HDL cholesterol is also associated with increased expression of RBM5; that might provide stronger evidence that RBM5 is involved in HDL cholesterol levels. An important aspect of interpreting these studies is to figure out whether the variant that is most strongly associated with RBM5 expression in that region is inherited together with the HDL-C-associated variant or not. Sometimes these lookups can just be variants that are in modest or low LD with some other variant that is driving expression of that gene.
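The r-squared greater than 0.8 criterion used above to collect candidate variants can be sketched as the squared correlation between allele-dosage vectors. A minimal illustration on simulated genotypes; the 0.8 threshold follows the approach described in the talk, and everything else (sample size, variants) is made up:

```python
import numpy as np

def ld_r2(g1, g2):
    """Squared correlation of allele dosages between two variants,
    the usual r-squared approximation to LD from unphased genotypes."""
    r = np.corrcoef(g1, g2)[0, 1]
    return r**2

def ld_partners(lead, others, threshold=0.8):
    """Indices of variants in strong LD (r^2 > threshold) with the
    lead variant; these form the candidate causal set."""
    return [i for i, g in enumerate(others) if ld_r2(lead, g) > threshold]

# Toy example: variant 0 is a near-copy of the lead, variant 1 is independent
rng = np.random.default_rng(2)
lead = rng.integers(0, 3, size=500).astype(float)
near_copy = lead.copy()
near_copy[:10] = rng.integers(0, 3, size=10)   # a few discordant genotypes
independent = rng.integers(0, 3, size=500).astype(float)
partners = ld_partners(lead, [near_copy, independent])
```

The candidate set returned this way would then be annotated for non-synonymous changes or eQTL overlap, as described above.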
And finally, approaches are being developed and used that look at the genes present at multiple of the identified loci, take them together, and look for whether particular pathways are represented. If a gene at one locus, a gene at another, and a gene at a third all work in the same pathway, perhaps those are the more likely affected genes than ones that are unrelated to the other loci. A given locus doesn't necessarily have just one signal of association. Here's an example: the same association plot is shown twice, but colored based on the linkage disequilibrium with a lead variant over here, and you can see that these variants are often inherited together. The lead variant has a p-value of 10 to the minus 15, and the same thing is true for these variants over here, which are inherited in a similar pattern, but there's not a lot of linkage disequilibrium between these two signals, even though they're relatively close together near this gene. Here, this interval overlaps variants at the promoter of the LIPC gene, whereas these variants are a little bit further upstream. These signals appear to be independent, suggesting that there are two or more underlying variants contributing to expression or function of that gene. It makes a lot of sense that there could be allelic heterogeneity for complex traits. We know it's true for single-gene disorders; many different variants in the BRCA1 gene, for example, can lead to breast cancer. In a similar way, we'll likely be identifying many different variants that can influence the expression or function of a given gene to influence a complex trait. As studies increase in size, especially meta-analyses that have increased power to detect signals, we're identifying more and more loci where more than one signal appears to be present.
One way to determine whether there appears to be more than one signal is to perform a conditional analysis. Take, for example, that same linear regression equation, but now include two SNPs in the analysis, say the lead variant of each of those two signals, again with the covariates included, and ask, for the trait of interest, whether the beta value for one SNP changes when the other is included in the model. If it does, that suggests this variant is sometimes inherited with the other one, and that those signals are not necessarily independent but may be contributing together. If neither of those betas changes in the reciprocal tests, looking at the beta-1 value when SNP 2 is included and vice versa, then the two SNPs appear to be independently affecting the trait. That's the clearest situation for being able to say that multiple independent variants are affecting a trait. The variants may not necessarily be independent; some of them are going to be dependent on each other. That will be perhaps harder to define and explain, but is likely reflecting how variants really influence genes. Another approach to narrowing a signal when performing an association study is to look across populations. Here's an example of a meta-analysis performed using individuals of European ancestry, with the strongest evidence of association at an HDL locus distant from this gene here. You can see a plateau of the most significantly associated variants, whereas a study of approximately similar sample size in individuals who describe themselves as African-American shows a narrower region of association.
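The conditional analysis described a moment ago, comparing a SNP's beta in a marginal model versus a joint two-SNP model, can be sketched like this on simulated data. The effect sizes and sample size are invented, and a real analysis would include the covariates as well; here the two SNPs are simulated as independent, so the joint betas should barely move.

```python
import numpy as np

def ols_betas_ses(X, y):
    """OLS coefficients and standard errors (intercept added here)."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    ses = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef[1:], ses[1:]                  # drop the intercept

rng = np.random.default_rng(3)
n = 5000
snp1 = rng.integers(0, 3, size=n).astype(float)
snp2 = rng.integers(0, 3, size=n).astype(float)   # independent of snp1
y = 0.2 * snp1 + 0.2 * snp2 + rng.normal(size=n)

# Marginal (single-SNP) models
b1_marg, _ = ols_betas_ses(snp1[:, None], y)
b2_marg, _ = ols_betas_ses(snp2[:, None], y)
# Joint model: if the betas barely change, the signals look independent
b_joint, _ = ols_betas_ses(np.column_stack([snp1, snp2]), y)
```

If the two SNPs were instead in strong LD, the joint betas would shrink relative to the marginal ones, which is the signature of a single shared signal rather than two independent ones.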
The same or a very similar lead variant is seen across these studies, but if we assume that it's the same underlying variant at this locus in both, then the population history, that is, a greater number of recombination events, limited the region that still shows evidence of association in this population compared to this one, and that can help us localize the underlying signal. Taken together, genome-wide association studies have identified many variants for many different traits. The amount of the trait variation explained by those variants differs by trait. Shown here are a number of different traits or diseases, with the heritability estimated from pedigree studies and the heritability that can be explained by the GWAS variants included in this particular summary. For some traits, the genetic variants that have been identified explain a larger proportion of the heritability; that's especially true here, perhaps, because the HLA locus has such a strong contribution to type 1 diabetes risk. For many traits, the percent of the trait variation explained so far by these common variants is much more modest. So the use of that information is going to be disease-dependent. Now I'd like to turn to the current and future use of lower-frequency variants that were perhaps not included in previously performed genome-wide association study analyses. Part of this has to do with the design of the arrays that have been used: they tended to include variants that were common in the population. The lower-frequency variants were perhaps not included, are perhaps less easily imputed based on reference panels, and are better identified by ongoing sequencing studies. So in addition to genotyping analyses for complex traits, exome sequencing and whole-genome sequencing are being performed.
Variants are being identified and genotypes called across these sequencing studies, and, similar to genotyping-based studies, imputation can be performed. Using lower-frequency variants, the same analysis could be performed: a single-point association analysis, taking each variant one at a time and asking whether it's associated with disease or not. Given that sequencing is still expensive for very large studies, the results may not be significant and may require a larger level of replication. In that case, you could follow up the particular variants that showed suggestive evidence of association. But another approach is to combine the data across variants at a locus, where different variants may be present in different individuals but together help implicate that particular gene in association with the trait. If you're looking at different variants in different individuals, then yes, you could follow up those variants, but you may also want to sequence that particular gene in additional individuals to find out what new variants those individuals carry that you didn't know about from the initial study. So, some sequencing study designs for complex traits, and I have examples of a couple of these. One is to sequence selected individuals, say those with extreme trait values: sequence individuals from the very top of the distribution and from the very bottom, expecting that you may identify specific variants with larger effects that are driving disease in those individuals. Even though these variants are rare across the population, they may have a stronger effect, either increasing or decreasing a trait value, in those individuals. A really large number of individuals may need to be sequenced to get a significant result: if the variant is low frequency, it's going to take more and more individuals to observe it enough times to have confidence in the effect of that particular variant.
One way to increase the number of individuals assayed could be to decrease the sequencing coverage, that is, not use as many reads as needed to get very high-confidence genotypes in every single individual, but use fewer reads per individual so more individuals can be assayed, perhaps with somewhat lower confidence, but with a larger sample size to still call the variants and find more people who carry a particular variant to test. Another approach to increase the number of individuals is to sequence one sample set, then take the variants that are identified and collect them onto a genotyping array that can be used, perhaps more cost-effectively, in many, many more individuals, increasing the sample size by following up the sequencing with genotyping. It also may be useful to sequence population isolates, where variants that are otherwise rare have drifted to higher frequency and may be more easily detected. So, a couple of examples. Here's one of sequencing at the extremes of body mass. In this study they sequenced the coding regions and splice junctions of 58 different genes, choosing 379 individuals with a very high average body mass index and a similar number of individuals with a lower body mass index. They found a number of new variants that hadn't been identified previously, including eight in this particular gene, the MC4R gene, one that was known previously to play a role in obesity. They then tested those variants for function, to have greater confidence that a variant being detected was really influencing how this gene acted. The variants are shown here, whether they were known previously or novel, with a summary of their effects in the functional studies that were done.
What they showed is that some of these rare variants that had not been previously identified had functional effects, and that this approach could be used to identify other variants that, collected together across individuals, would help identify and implicate a given gene. One could also sequence additional individuals at a locus identified by a genome-wide association study. These would be positional candidate genes, based on their position at the GWAS locus; the idea is to sequence those regions in cases and controls, or in individuals with extreme trait values, and look for variants that are present in only one group compared to the other. Perhaps finding even one smoking-gun, obviously functional variant that has a strong effect could implicate one gene in that locus region compared to the many others. In this strategy, it may not be the variant underlying the association signal, but some other variant influencing the gene, that helps implicate what gene may be responsible at the GWAS locus. That might be easier than identifying a variant with a more modest effect, common in the population, that really underlies the GWAS signal. An example shown here is for type 1 diabetes. This group did a sequencing study following up on loci that had been identified from a GWAS for type 1 diabetes, and one of the genes here is the IFIH1 gene. They re-sequenced the exons and splice sites of 10 candidate genes, using pools of DNA from patients and controls, and then took the variants they identified and tested them for evidence of association in more than 30,000 subjects. This is a paper published in 2009, so their strategy for decreasing the cost of the sequencing was pooling of DNA. Variants that they identified include a couple of critical splice-site variants, at the first nucleotide into intron 8 and into intron 14.
Here's a stop codon, a nonsense variant, that was identified, and then a couple of non-synonymous substitutions. They tested these in the much larger sample sizes; shown here is a given variant, with pairs of rows showing how frequent it was in their type 1 diabetes cases, a minor allele frequency of 1.1%, and in 9,000 controls, a frequency of 2.2%, so 8,000 to 9,000 individuals in each group. This particular variant is associated with a reduced risk of type 1 diabetes, with a p-value of 10 to the minus 14. Similarly here, the splice variant, the stop codon variant, and the other splice variant had frequencies around 1 to 1.5%, around half a percent, or below 1%, in terms of how often they were identified even in these large numbers of individuals. In all four cases, the variants that were identified were found more frequently in the controls than in the cases: these variants appear to be protective against type 1 diabetes, some with stronger evidence of association than others. Here's also an example where they compare the results of a case-control study to one of those family-based association tests, looking at the number of transmitted versus non-transmitted alleles and the relative risk in additional individuals. So they used these data to implicate variants in IFIH1 as influencing risk of type 1 diabetes. These rare variants are decreasing risk of disease, and the evidence shows that it's possible to implicate that particular gene at this locus as playing a role in disease. Now, many individually important variants like those may be too rare to detect association with the trait. However, they could be important when taken all together. So statistical methods are being developed that allow these variants to be grouped together in an association test. These are called burden tests, or gene-based tests.
And the approach is to combine the information from multiple variants. Shown here are many chromosomes from individuals affected with disease, where the X's represent given variants. We have a candidate gene down here that spans this region. Some of the variants are present within the candidate gene, and some are present nearby. Some of them are common, shown by these vertical bars, while others are rare, found in only a few individuals. Gene-based tests can be used to combine the information from the multiple variants into a single test statistic to use as the predictor in the association test. Now, one of the challenges, and one of the differences between the methods, is what information about the variants to use. Some tests will choose the variants by excluding the most common ones, because those are perhaps less likely to have a strong or functional effect on the underlying gene. So they might exclude these common ones here and use a threshold, saying, I'm going to combine variants together in a gene or a gene region where the frequency is less than 3%, or 1%, or less than half a percent; it depends how big the initial study is what that threshold can be. Then maybe choose an annotation of those variants that suggests a more functional effect: maybe not include variants that don't change the protein sequence, but include the ones that are loss-of-function, or that are non-synonymous, changing the protein sequence, with the idea that one of those has a stronger effect on an individual and could be responsible in that individual. So you develop a test that selects sets of variants and asks, together, whether that set of variants is more often found in the cases compared to the controls, or vice versa. Some tests will allow variants to both increase and decrease risk of disease.
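A minimal sketch of the simplest form of burden test described here: filter to variants below a frequency threshold, sum the rare-allele counts per person into a single burden score, and compare cases to controls. The frequencies, threshold, and choice of a t-test are illustrative assumptions, not any published method's exact recipe.

```python
import numpy as np
from scipy import stats

def burden_test(genotypes, is_case, maf_threshold=0.01):
    """Simple burden test. genotypes is an (n_individuals, n_variants)
    dosage matrix (0/1/2). Keep variants below the MAF threshold, sum
    rare-allele counts per individual, and compare the burden between
    cases and controls with a two-sample t-test."""
    maf = genotypes.mean(axis=0) / 2.0
    rare = genotypes[:, maf < maf_threshold]
    burden = rare.sum(axis=1)                 # rare alleles per person
    t, p = stats.ttest_ind(burden[is_case], burden[~is_case])
    return t, p

# Toy data: 20 rare variants in one gene, enriched in cases
rng = np.random.default_rng(4)
n_case, n_ctrl, n_var = 1000, 1000, 20
cases = rng.binomial(2, 0.008, size=(n_case, n_var)).astype(float)
ctrls = rng.binomial(2, 0.002, size=(n_ctrl, n_var)).astype(float)
G = np.vstack([cases, ctrls])
is_case = np.arange(n_case + n_ctrl) < n_case
t, p = burden_test(G, is_case)
```

Note that this simple sum assumes all rare variants push risk in the same direction; the variance-component style tests mentioned next relax that assumption.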
Others are based on the assumption that all such variants act in the same direction, to increase or to decrease risk. So there are many alternative forms; this is a very active area of research, with different approaches being used that essentially collapse information into a single test, and the choice of variants that are included can have a very big impact on the test. If we include variants that don't have an effect, too many null variants can really reduce the statistical power, so we end up not identifying a significant result even if many other variants do have a functional effect. Filtering, choosing which variants to include based on frequency and predicted function, is the most commonly applied approach at this point. I think time will tell which approaches most usefully identify novel genes influencing disease susceptibility or a trait based on different variants being present in different individuals. So together, genome-wide association studies of common variants, and rare variants found from sequencing used in association studies, can all lead to the identification of susceptibility variants for common and complex traits. The biggest utility of these variants may be the novel biological insights for the trait or disease of interest. When we identify a locus that is not near any gene known to play a role in disease, then we have the potential to learn a huge amount about other types of genes and pathways that may be involved in disease, which may lead to clinical advances such as new targets for drugs, or other biomarkers to detect and predict disease, maybe even leading to prevention. It's also possible that identifying particular variants can help in predicting an individual's response, especially a response to a particular drug.
As for the future of complex trait analyses: using genome-wide association studies and sequencing, more and more loci are continuing to be identified. The proportion of the heritability that can be explained by variants known to date is still relatively modest, and so there are many more discoveries yet to come. Larger and larger meta-analyses are identifying more and more of these signals. There's deeper follow-up now of the signals that are identified, to try to understand what the underlying variants and genes are. Studies are being performed in more diverse populations, where allele frequencies differ, environmental contributions differ, and genetic variants may play stronger roles in different populations, all of which will lead to identifying the biological basis of disease. We'll see more and more of these gene-based tests of rare variants, and more of the identified loci being used to look for gene-gene and gene-environment interactions that may better explain the overall contributions to disease. And we have a lot of work to do for all those loci being identified, to understand the molecular and biological mechanisms by which these DNA variants contribute to disease and trait variability. Thanks a lot for your attention.