All right. Thank you very much. It's a pleasure to be here and to talk to you today about these approaches to studying complex traits. By complex traits, I mean the variation between individuals that arises from the combination of both genetic and environmental factors, traits for which multiple genetic variants, perhaps in multiple genes, contribute to inter-individual variation or risk of a disease. Today we'll talk largely about variants that are common in populations, such as those blue stars, and towards the end talk more about identifying the variants that are lower frequency or rare in populations and how they contribute to complex disease, as shown by that yellow star. The identification of variants that contribute to complex traits and diseases is similar to the approach that would be used to find an autosomal dominant rare variant traced in a family. So, for example, if we're trying to identify such a variant and have a three-generation family where the blue individuals are the ones affected with disease, we might search through the genome looking for an allele at a marker, such as this A1 allele, that is found in all of the affected individuals but not in the unaffected individuals. The same principle applies to looking for variants in populations, but the timescale is often longer. So, for example, as shown over here, the individuals that have inherited a risk allele are present in the present day, and they may share a common ancestor many generations back, at this point here, where a mutation arose that was then inherited together with nearby alleles. Because more time has passed, more recombination events have occurred around that initial variant, and so the region shared around the risk allele is typically smaller in this kind of setting than it is in a three-generation family. 
So the principle of genome-wide association studies, which we'll discuss a lot today, is to look at a large proportion of the variation across the human genome and look for alleles at variants that are associated with risk of disease or with variation in a quantitative trait. The overall strategy is unbiased with respect to what the function of those alleles and those genes might be. The strategies by which one may find variants are influenced by the frequency of those variants and their effect size. Shown here on the x-axis is the allele frequency of risk alleles, where common variants are typically defined as having frequencies of 5% or greater in a population, down to the rare variants that might be specific to an individual or a family. The effect size of the variants is on the y-axis, where a high effect size is more causal with respect to disease, and those with more intermediate or modest effect sizes contribute towards risk of a disease or variation in a trait. A lot of what we're going to talk about today, the genome-wide association studies, falls into sort of this area of the plot: common variants that can be identified through these strategies. And a lot of these have been found to have relatively modest or low individual effects. If there were variants that were common and had strong effects on disease, they would be identified with this strategy; there just are not very many of them. Then, as we move into talking about sequencing-based studies, we're going to move on this plot towards the lower frequency variants that become detectable once they can be analyzed. So genome-wide association studies in the past 10 or so years have been very successful at identifying loci across the genome. 
This map shown here is from a database collecting together these genome-wide association study loci, which was originally based here at NHGRI at the NIH and is now based at this site, where you can go and look at the set of loci that have been identified for a range of different traits and their positions, shown by these colored dots on the different chromosomes. So today we're going to talk about genome-wide association study design, talk through the identification of samples and study participants, then the genotyping process and data cleaning, then tests of association between those variants and a quantitative trait or risk of disease, and some approaches that are useful for identifying those variants, namely imputation and meta-analysis. Then we'll talk about interpretation of specific results, the use of both effect size and significance when describing an association, and we'll look at some example loci. And finally, we'll talk about where the field is moving as technology develops and sequencing allows more variants to be identified, including those at lower frequencies. Study designs for looking at complex traits can be cohort designs or case-control studies for a specific disease. So, for example, a population-based cohort ascertains all members of a population without respect to their health or disease status at some particular point and analyzes them. A related type of study design is a prospective cohort, where the individuals are ascertained early and various traits are measured over time, waiting, say, for disease onset to then look at risk factors for disease. And in a case-control design, when there's a specific disease to look at, you ascertain the cases and the controls, enroll them, and then look back at what happened prior to disease onset. Here, with a genetics-based study, we're asking what genetic variants are associated with the difference between the cases and the controls. 
So in a case-control design, it's important to try to identify cases and controls that are similar and comparable in every way except for that disease status. Try to ascertain individuals that have a similar age, a similar ratio of sexes, and similar other demographic-type characteristics. If you want to enrich the cases for the potential to identify variants that cause or influence disease, you might ascertain individuals that are more severely affected with that disease, or require that cases have other family members affected with that disease; there's then a greater chance that there's more of a genetic component than, say, an environmental component. Or, if it's a disease with an older age of onset, look for the individuals that are affected earlier in life; they may have a greater genetic load. When considering the ascertainment of the controls, one could use a population-based sample of people who simply don't have the disease, but the chances of identifying genetic variants can be enriched if you look for controls that have a lower risk of disease than population-based samples, perhaps individuals that don't have other family members affected with the disease. Another important aspect is to try to match those individuals based on ancestry, because allele frequencies differ between populations. So, for example, the different shading here in the cases and controls represents individuals of different subpopulations that may have different allele frequencies of variants across the genome. If there are ancestry differences, say in this example where there is more of one subpopulation in the set of cases than in the controls, then this can lead to population stratification, which can lead to false positive associations between variants and the risk of disease. In classic confounding, an exposure that's correlated with a true risk factor but that's not causal can misleadingly be seen to be associated with disease. 
So we can adjust for that risk factor and eliminate the bias. For example, if you're studying alcohol consumption and lung cancer, and smoking and drinking alcohol are related, then if you don't measure smoking you may overestimate the effect of drinking alcohol on lung cancer risk. In population stratification, a genotype can appear to correlate with disease because both the genotype and a true risk factor are correlated with ancestry; if you adjust for that true risk factor, you can eliminate the bias. So ethnicity per se does not explain the risk, but it's a marker of individuals that may have similar risk. As an example, an allele that has a gradient of allele frequency across Europe, from northern Europe to southern Europe, might track with a factor that affects risk of disease, such as a dietary factor influencing disease that differs between northern and southern Europe. That could induce bias when studying the effect of that allele on disease. So population stratification is systematic differences in the allele frequencies between subpopulations, due to different ancestry. And, especially in a case-control study, if you oversample the individuals from one group in the cases versus the controls, you can get spurious associations. The same principle applies to quantitative trait studies as well. So how can one account for or avoid population stratification? One way is to try hard to match the cases with the controls so that there aren't subpopulations that differ between the groups; one could perform the study restricted to one subgroup, or try to adjust for the genetic background. For that, one can take the allele frequencies of the variants across the genome, use a principal component analysis to infer the ancestry from the genotype data, and then adjust for those main axes of variation in the association analysis. 
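The principal component step just described can be sketched in a few lines: standardize each SNP column of a genotype matrix and take the top eigenvectors of the individual-by-individual covariance. This is a minimal illustration, not a production tool (real studies typically use software such as EIGENSTRAT/smartpca and prune SNPs for linkage disequilibrium first); the function name and the 0/1/2 genotype encoding are assumptions of this sketch.

```python
import numpy as np

def ancestry_pcs(genotypes, n_pcs=2):
    """Infer ancestry axes from an (individuals x SNPs) matrix of
    0/1/2 allele counts, a minimal sketch of the principal-component
    approach to detecting population substructure."""
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0                       # allele frequency per SNP
    sd = np.sqrt(2.0 * p * (1.0 - p))              # expected SD per SNP
    Z = (G - 2.0 * p) / np.where(sd > 0, sd, 1.0)  # standardized genotypes
    cov = Z @ Z.T / Z.shape[1]                     # individual x individual
    vals, vecs = np.linalg.eigh(cov)               # eigenvalues ascending
    return vecs[:, ::-1][:, :n_pcs]                # top PCs, one row per person
```

The returned columns would then be included as covariates, alongside age and sex, in the association model.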
An alternative strategy is a family-based study design, which I'm not really talking about today, where you genotype the relatives and analyze the transmission of alleles from heterozygous parents to offspring. Okay, so we've ascertained individuals for the study; now let's genotype them. Genotyping is most efficient and cost-effective when a standard set of variants is present on a genotyping array that can be purchased and genotyped relatively inexpensively. Arrays are available that have, say, 10,000 up to 5 million preselected variants on them. Some companies that offer commonly used genotyping products are Affymetrix and Illumina. These arrays can be designed in different ways. Some arrays may have random variants, random single nucleotide polymorphisms, across the genome. Others have variants selected to best represent the variation in populations, as I'll show in a second. Some have copy number probes. Some have a greater collection of lower frequency variants on them, to allow studies to move from those common variants into lower frequencies. Some arrays have been designed specifically to look at variants in the coding regions of genes. And some of these arrays have a preset set of variants, and then the company allows you to select additional variants to be included on the array, so if you have variants of special interest to follow up, those can be incorporated as well. This principle of selecting variants that represent the variation across the genome builds on the idea of linkage disequilibrium and the similarities and differences between populations. Shown here, for example, are four example chromosomes with a set of variants on them and the haplotypes that those variants comprise. In this region there is a set of variants present, but the number of patterns they form is limited, due to the relatively young age of the human species. 
Recombination does not happen uniformly across the genome, and so there are regions that are still inherited together in blocks. So tag variants can be chosen that represent all of the variation present at any of these single nucleotide polymorphisms. You might identify an association with one variant and then follow it up by considering whether the other variants with the same pattern in the population could be causal or contributing to the disease. A couple of the strategies for the genotyping process: the allelic discrimination can differ between the different types of platforms. Here's one example where the whole genomic DNA is used for the assay, with strategies used to select and know which variant is being assayed at which position on the array. The allelic discrimination in this case is a short primer extension, where a primer sits here and the base that is incorporated depends on the allele that's present in the captured human genomic DNA. Then there's a visualization step; in this case it's using fluorescent staining to enable high-throughput analysis of many, many different variants on a given array or bead chip. Here's a different strategy. There's still a target preparation step and a capture of specific regions of the genome at given spots on the array. In this case, the allelic discrimination happens by hybridization of two oligonucleotides and a ligation step that closes the gap between the capture probe and this additional probe. That ligation step is very specific and requires a perfect match of the nucleotides; if there's a mismatch, the ligation doesn't work so well and that signal doesn't happen. Then the staining is dependent on this captured probe. Different arrays cover, or analyze, portions of the genome at different levels. 
So here are a set of example genotyping arrays, and you can see the proportion of the common variants across the genome that are well represented by a SNP on the array; that can differ by the populations being analyzed. Depending on the amount of variation that exists in a population, one array may do a better job than another at capturing that information. So when deciding which array to genotype individuals on, you can consider how well it is likely to perform based on the population those individuals are from. So genotype data gets generated. There are a couple of quality control steps that are important when analyzing those thousands to millions of variants across the samples that were collected. One strategy is to identify the poor quality samples. If a given DNA sample has a success rate such that less than 95% of the variants were successfully genotyped, that might suggest that the sample has lower quality and that the genotypes that are obtained might be inaccurate. If there are too many heterozygotes amongst those genotypes, it could be that the DNA sample is contaminated with another DNA sample. You can look to try and identify whether samples are truly from the people that you think they are from the ascertainment: are they the sex that you believe they should be, based on markers on the X and Y chromosomes? You can look for unexpectedly related individuals by doing a pairwise comparison of genotypes across the genome, and also look for duplicates, maybe an individual that participated twice in the study; the test of association is going to assume that everybody is independent. You can also take the allele frequencies across the genome and look to see whether they are relatively similar across all members of the group, to see whether they appear to be part of the same ancestry group. Those are ways to evaluate the samples. You can also evaluate individual variants, individual SNPs. 
Again, if a SNP has low genotyping success across the individuals, then perhaps the genotypes it is generating on the other individuals are not accurate. If you include purposeful duplicate samples and identify some variants that are inconsistent between those samples, then those may just be poorly performing assays and should be removed. You can look at the proportions of the genotypes, compare the genotype frequencies with the allele frequencies, and use this to figure out whether the assay is accurately detecting heterozygotes compared to homozygotes. If you have related individuals, such as parents and a child, you can look to see whether the alleles are being inherited in the appropriate patterns, and you can also look for differential missingness between cases and controls that could generate some sort of bias if you're doing a case-control study. Here's an example looking at the results out of an individual genotyping assay, where the two axes are the signal intensity for each allele of a two-allele variant. Samples that have high intensity of one allele and low intensity of the other would be one homozygous genotype, those with high intensity of the alternate allele would be the other homozygous genotype, and the heterozygotes would be the individuals in the middle. Here, nicely, the software that analyzes the data identifies those clusters, colors them correctly, and calls the genotypes automatically. However, in some cases those clusters might get miscalled: in this case, both of these sets of individuals were called as heterozygotes, so some of them have an inaccurate genotype assigned. And sometimes those clusters are not very well defined, which is the most common type of problem, and then it's difficult to accurately call the genotypes of the individuals at the boundaries of the clusters. Those might be examples of variants that should be excluded or reanalyzed. Okay, so now we have the samples and we have genotypes on all the samples. 
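As a sketch of two of the checks just described, here is a minimal version of a sample call-rate filter and a Hardy-Weinberg goodness-of-fit statistic, which compares observed genotype counts to those expected from the allele frequencies and is one way miscalled heterozygote clusters get caught. The function names, the use of None for a failed call, and the 95% threshold are illustrative assumptions; real pipelines use dedicated tools such as PLINK.

```python
def low_quality_samples(genotypes_by_sample, min_call_rate=0.95):
    """Flag samples whose genotyping success rate falls below the threshold.
    A failed genotype call is recorded as None (an assumption of this sketch)."""
    flagged = []
    for sample_id, calls in genotypes_by_sample.items():
        call_rate = sum(g is not None for g in calls) / len(calls)
        if call_rate < min_call_rate:
            flagged.append(sample_id)
    return flagged

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 df) comparing observed genotype counts to
    Hardy-Weinberg expectations; a large value can flag an assay that is
    mis-detecting heterozygotes relative to homozygotes."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)            # frequency of the A allele
    expected = [n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2]
    observed = [n_aa, n_ab, n_bb]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

For example, a SNP called with no heterozygotes at all (counts 50/0/50) gives a very large statistic, while counts in Hardy-Weinberg proportions give zero.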
Let's do the tests of association. Here we're looking at a quantitative trait. To analyze whether a given variant is associated with a quantitative trait, a typical approach is to use a linear regression strategy. In this case the trait that we're looking at is toe size, and we're asking, for a given variant, a given SNP, here rs123456, what is its effect size, the relationship between that SNP genotype and toe size? So shown plotted here for a given SNP, perhaps its two alleles are A and G. If the individuals that are homozygous for the A allele have generally lower toe size and the individuals with the G allele tend to have a higher toe size, the slope of this line shows the relationship between the allele and toe size and would represent an association. In reality there will be some covariates that may also be associated with that trait whose effects we'd like to remove, so that we're more precisely measuring the effect of that variant. Maybe sex and age are associated with toe size; maybe body size is associated with toe size. We'd remove the effects of those covariates in the linear regression so that we're focusing on the relationship between this variant and what's left to explain of toe size. The assumptions in this analysis are that the trait is normally distributed for each genotype and that the subjects are unrelated to each other. If we're looking at a case-control study, we can instead calculate the odds ratio. For this measure we would count the number of cases and the number of controls that carry each allele, say A and C, of a given variant and calculate the odds ratio as a measure of risk. In this case, the value of 1.33 represents increased risk of disease for this variant. 
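The odds ratio calculation just described can be written directly from the 2x2 table of allele counts, and the standard Wald interval on the log odds ratio gives the kind of 95% confidence intervals shown in forest plots. The counts in the example are made up for illustration, and the function name is an assumption of this sketch.

```python
from math import exp, log, sqrt

def allelic_odds_ratio(case_a, case_b, ctrl_a, ctrl_b):
    """Odds ratio for allele A versus allele B from allele counts in cases
    and controls, with a 95% Wald confidence interval on the log scale."""
    or_ = (case_a * ctrl_b) / (case_b * ctrl_a)
    se_log_or = sqrt(1/case_a + 1/case_b + 1/ctrl_a + 1/ctrl_b)
    lo = exp(log(or_) - 1.96 * se_log_or)
    hi = exp(log(or_) + 1.96 * se_log_or)
    return or_, lo, hi
```

A confidence interval that excludes 1 corresponds to a result significant at the 5% level.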
Here's an example looking at odds ratios from a number of different studies that looked for association between a given variant and type 2 diabetes risk, and it shows how odds ratios and their confidence intervals can be displayed. On the x-axis here is the odds ratio for type 2 diabetes; a value of one, shown here by that dotted line top to bottom, means no effect of that variant on type 2 diabetes risk. You can see that in a given study, here this top one is from a study based in Iceland, the odds ratio is shown by the black symbol and the confidence interval around it is shown by the horizontal line. This is a 95% confidence interval, and since it doesn't overlap an odds ratio of one, that's a significant result at a 5% false positive rate. You can see a number of other studies that also identified significant results. Here's a study that had a confidence interval that spanned one; on its own it would not show significant evidence of association, but the data can be combined, and combined together the odds ratio is significant and has a smaller confidence interval. The ability to identify a significant association depends quite a lot on sample size. Shown here is the relationship between genome-wide association study sample size and statistical power, with odds ratios on the x-axis and the power to detect an association on the y-axis, with the different colored lines representing different sample sizes. In a more modest sample size, only the strongest effects, the variants that have the highest odds ratios, can be detected, and some of those even with not very much power. Whereas as the sample sizes get larger, you can see that there's quite good power to detect associations with more modest odds ratios. 
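Power curves like the ones being described can be approximated analytically for the per-allele test; this sketch uses the normal approximation to the Wald test on the log odds ratio. The function name and the simple balanced design are assumptions of this sketch, and dedicated power calculators handle more detail (disease prevalence, genotype models, and so on).

```python
from math import log, sqrt
from statistics import NormalDist

def gwas_power(ctrl_freq, odds_ratio, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a per-allele case-control test at significance
    level alpha, via the normal approximation to the Wald test on log(OR)."""
    # Risk-allele frequency in cases implied by the odds ratio
    case_freq = odds_ratio * ctrl_freq / (1 + ctrl_freq * (odds_ratio - 1))
    # Expected allele counts in the 2x2 table (two alleles per person)
    a, b = 2 * n_cases * case_freq, 2 * n_cases * (1 - case_freq)
    c, d = 2 * n_controls * ctrl_freq, 2 * n_controls * (1 - ctrl_freq)
    se = sqrt(1/a + 1/b + 1/c + 1/d)          # SE of the log odds ratio
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(log(odds_ratio)) / se - z_crit)
```

Under these assumptions, power at genome-wide significance for a modest odds ratio rises from near zero to near certainty as the samples grow from thousands to tens of thousands of cases and controls, which is the shape of the curves on the slide.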
When a whole set of variants across the genome has been analyzed, we can use those association statistics to look for any evidence of population substructure and try to account for it. Here the results of the statistical tests, in this case chi-square tests, are plotted against the values expected under the null distribution. If there were no associations and the test were perfectly calibrated, the variants would fall on this line of expected results. If the association results are more significant than you would expect early on, across more of the distribution than expected, this can be evidence that there's population substructure. The variants that might represent true associations are the ones that show the greatest effects or that are most significant, and you'd expect some of these at the extreme to be off the line if there are true associations present. But you wouldn't expect deviation across the early part of the distribution; that could represent population outliers or structure. Now sometimes this is unavoidable; you can't simply remove those individuals if there's a lot of complex relatedness within the study, and so one approach to account for it has been proposed here, the genomic control value. You can calculate a value that represents how much the association statistics are inflated, adjust for it, and report the association statistics after adjustment, to try to account for that inflated evidence of association. So we perform those tests of association, plot them across the entire genome, and the results might look like this. 
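The genomic control adjustment just mentioned is simple to compute: the inflation factor lambda is the median of the observed 1-df chi-square statistics divided by its expected value under the null (about 0.4549), and each statistic is then divided by lambda. This is a minimal sketch with assumed function names.

```python
def genomic_control_lambda(chi2_stats):
    """Genomic control inflation factor: the median observed 1-df chi-square
    statistic divided by the null median (~0.4549). Values well above 1
    suggest population substructure or other confounding."""
    s = sorted(chi2_stats)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return median / 0.4549

def adjust_stats(chi2_stats):
    """Divide every association statistic by lambda to deflate the test;
    lambda is floored at 1 so statistics are never inflated."""
    lam = max(genomic_control_lambda(chi2_stats), 1.0)
    return [x / lam for x in chi2_stats]
```

A well-behaved study has lambda close to 1; a value like 1.1 or more is usually taken as a sign of residual structure worth investigating.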
So here's an example of genome-wide association study data for variants across the autosomes, lined up end to end along the x-axis, and the statistical evidence of association, the p-value, is shown on the y-axis as the minus log10 of the p-value, so that the most significant results are higher up on the graph. These are often called Manhattan plots, because ideally what you see are tall buildings of signal that represent true associations more significant than you'd expect by chance. Well, how many associations would you expect by chance? We do a lot of statistical tests here. If we test several hundred thousand up to millions of variants, then to determine the appropriate level of significance we need to consider how often we'd expect a result by chance. Considering only the common variants, it's been estimated that in, say, a European ancestry population there are approximately one million common variants across the genome. If we would typically consider a p-value of .05 to be significant, then an adjustment that accounts for that approximately one million SNPs tested suggests that the threshold we should use to call something significant is five times 10 to the minus eighth. To achieve this level of genome-wide significance, we need either a large effect or a large sample size. The two approaches I'm going to talk about are imputation and meta-analysis, which help make it possible to get larger sample sizes by combining data sets together. The principle behind imputation is this: one study has tested variants across the genome, with each tick mark representing a variant analyzed in some genomic region, and another study may have used a different genotyping array and analyzed a different set of variants. 
We can combine these data by imputing, or predicting, the genotypes that would be present at a much denser set of variants across the genome. If each study can impute this greater set of variants, then instead of analyzing just the subset of variants that overlaps between the two studies, we can analyze a much larger set of variants across the genome. For example, suppose here's a study sample I have from a given individual, and I've typed it at three different variants across a region: at this position they're heterozygous for the A and G alleles, here they're heterozygous, and here they're homozygous for the A allele. I can compare the genotypes that are observed to a reference panel of samples that have been genotyped more densely at markers across the genome. Perhaps those samples came from the HapMap study, which analyzed variants across the genome, or more recently a denser imputation panel available from the 1000 Genomes Project; and as more and more sequencing studies take place in populations around the world, newer panels are being developed that allow a greater ability to impute a larger number of variants across the genome. In these reference samples with much denser genotyping, the haplotypes are inferred. Then we compare the genotypes that were observed to those haplotypes to identify the best match amongst the sets of variants across the region. In this case, this haplotype may be the best match for one set of alleles, whereas a combination of these two different regions might be the best match for the other set of alleles. These matches can then be used to impute the missing genotypes at the other positions. 
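The matching step just described can be caricatured in a few lines: pick the pair of reference haplotypes whose allele sums best agree with the observed genotypes at the typed sites, then read off genotypes at the untyped sites. This is only a toy to convey the idea; real imputation programs (e.g. IMPUTE2, Beagle, minimac) use probabilistic hidden Markov models that allow switching between haplotypes to model recombination and return genotype probabilities rather than hard calls.

```python
from itertools import combinations_with_replacement

def impute_genotypes(observed, typed_sites, ref_haplotypes):
    """Toy imputation. `observed` holds 0/1/2 alt-allele counts at the
    positions listed in `typed_sites`; `ref_haplotypes` are 0/1 allele
    sequences over all sites. Returns imputed 0/1/2 genotypes everywhere."""
    best_pair, best_err = None, None
    # Try every (unordered) pair of reference haplotypes, including a
    # haplotype paired with itself, and score the mismatch at typed sites.
    for h1, h2 in combinations_with_replacement(ref_haplotypes, 2):
        err = sum(abs(h1[i] + h2[i] - g) for i, g in zip(typed_sites, observed))
        if best_err is None or err < best_err:
            best_pair, best_err = (h1, h2), err
    # The best-matching pair supplies the genotypes at all sites.
    return [a + b for a, b in zip(*best_pair)]
```

With three reference haplotypes over four sites and an individual typed only at the first and last site, the middle two genotypes are filled in from the best-matching haplotype pair.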
A number of statistical approaches have been developed to do this, and they come along with measures of the likelihood that a given imputed genotype is accurate at a given position, depending on how well it is tagged, whether it's represented by only one set of nearby variants as opposed to multiple sets. In this way, a given data set can be used to impute and analyze a much larger set of variants across the genome. Here's an example of a region of the genome where the evidence of association is shown for LDL cholesterol, near the LDL receptor gene, and the variants shown in red were the ones that were directly genotyped by a given study. You can see that none of those showed very strong evidence of association. The variants shown in blue are the ones that could be imputed by comparison to a denser reference panel, and you can see that variants can then be identified that show stronger evidence of association. So this increased coverage of the genome can allow signals to be identified that might otherwise have been missed. Imputation also allows studies that genotyped on different platforms to combine their data. So, combining genome-wide association study data by meta-analysis: we can combine studies together. A larger study might be given more weight because it has more precision in estimating the effect size, and we gain increased power compared to looking at the individual studies. We can also look to see whether the effect sizes are consistent across the studies or whether there might be some heterogeneity in the effect size, perhaps due to differences in the environment or different contributions of that variant to the disease in a group. Perhaps different studies defined the phenotype a little bit differently; perhaps the genotyping and other analysis strategies led to some sort of heterogeneity between the groups, on top of those environmental differences. 
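The weighting idea can be made concrete with an inverse-variance fixed-effect meta-analysis, the standard approach when each study reports an effect size (a beta or log odds ratio) and its standard error; weighting by precision, 1/SE², is exactly how larger, more precise studies get more weight. The function name is an assumption of this sketch; software such as METAL implements this, and a p-value-based alternative, for real GWAS.

```python
from math import sqrt
from statistics import NormalDist

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis.
    Each study's effect is weighted by its precision 1/SE^2; returns the
    pooled effect, its standard error, and a two-sided p-value."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = sqrt(1.0 / sum(weights))
    z = pooled / pooled_se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return pooled, pooled_se, p_value
```

Two studies that each just miss significance can reach it together because the pooled standard error shrinks, which is the situation shown in the forest plot earlier.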
So, some common meta-analysis methods. One main category is p-value-based meta-analysis; there you'd also want to take into consideration the direction of association, whether the variants were associated with increased or decreased risk. Another main strategy is to compare the effect sizes, such as the betas from the linear regression or the odds ratios. Within that type of meta-analysis, you can perform a fixed-effect meta-analysis, which assumes that each of the individual studies has the same level of effect of the variant on disease risk; if that's true, then a fixed-effect analysis increases the power to detect an association. A random-effects approach allows there to be some heterogeneity between the studies, and if there's still an underlying shared association, it can detect it. Other methods exist as well. Okay, so we have now identified our samples, genotyped them, and done the tests of association. Let's consider how to interpret the results. Here is our Manhattan plot for HDL cholesterol from this particular example, and what's emphasized here is the p-value for association. Looking at a table that represents those loci, you can see that each association result is reported to show what the lead variant is at that association and which allele of that variant is associated with an increased or decreased trait value, here HDL cholesterol. So both the effect size and the evidence, the p-value, for association are shown. In this red box here, you can see a locus. It's named by two nearby genes, and a marker represents that locus and its location in the genome. The two alleles are shown here; in this particular case, they're shown as the minor and major alleles in this population, so which one had a frequency below 50% and which one above 50%. And then the effect of a given allele; this is represented in different ways in different studies. 
Here it's presented as the effect of allele A1. A1 is the minor allele, so we can go figure out that it means the G allele at this variant. The G allele has a positive effect on the trait: the G allele is associated with increased HDL, whereas the T allele is associated with decreased HDL. And here, in 181,000 individuals, the p-value was two times 10 to the minus eighth. So all of the data can be represented both by an effect size and a p-value. Now, zooming in and looking at a particular locus: we're identifying loci, and we don't yet know what the underlying genes might be. This remains a challenge at many loci; some are easier to interpret than others. Here are some examples of what the association results might look like. In this case we've zoomed in to a region on chromosome eight, and I've chosen these results from different studies. I think it's easiest to look at these plots when every variant has been tested in a similar sample size. You can see in this particular region there's a whole cluster of variants that show similarly strong evidence of association, and these variants are located right near a gene that might be a good candidate gene for this trait based on its previously known biology. Sometimes a signal is identified, as in this case, as a set of variants that show evidence of association in a pretty narrow region of the genome, but there are no protein-coding genes right in the nearby region. There are protein-coding genes 100 kilobases away that could be good candidates, or it could be that there are non-protein-coding genes in the region, or regulatory elements that these variants influence to act upon one of those genes or others nearby. At some loci the evidence of association is not narrow, and is in fact a broad signal where multiple genes may lie in the region, so there might be many different candidate genes. 
Looking at the literature may or may not be useful for trying to identify an underlying candidate. When a locus is named in a paper, the names of some nearby genes are often used to represent the signal, but those may or may not be the most biologically relevant genes for the signal. You can see in this case the lead variant is over here at this end of the sort of plateau of signal, and so these genes were named, and they look like good candidates that could be contributing to the trait. However, if one of these other variants had shown stronger evidence of association, some of these other genes might have been named as the signpost to represent the signal. So the important lesson here is to interpret locus names with some caution. Many of them are merely the nearest genes to a signal and are not necessarily the underlying genes responsible for that signal. So here's an example, again back to that HDL cholesterol study, where several different methods were used to try to interpret what the underlying candidate genes could be at a given locus. Shown here, each row represents a locus identified by the study, and the columns are several different approaches to identifying good candidate genes nearby that could be driving the signal. One approach is to go to the literature and look at the genes in the region (what counts as the region? maybe within 500 kilobases, maybe within a megabase of the lead association signal) and ask whether any of those genes have been implicated in a biological function that might be relevant to the disease in some previous study, maybe an animal model study or a biochemical study that has analyzed and shown what the function of a gene might be.
As you can see, there are some examples here identified at some of the loci, but not all of them, that might help suggest those genes play a role. Another approach is to say, well, a variant that changes the protein sequence and shows evidence of association with the trait might have a higher chance of being responsible for that signal than one at any other location within or outside the genes. So variants like that within the association signal at a given locus can be identified. At some loci, multiple genes show such variants, while at others they point to a single gene that maybe has a better chance of being involved in the disease. Another approach is to ask whether the variants associated with a trait are also associated with the level of expression of a nearby gene. This is an expression quantitative trait locus (eQTL) analysis, performed the same way as a genome-wide association study, but now the trait being analyzed is not, say, HDL cholesterol, but the expression of gene one: do the variants show association with higher or lower expression of that gene? And then across the entire genome, look at, say, 20,000 genes and their evidence of association with expression, and ask whether the same variants are associated with expression. So here at this locus, variants that are associated with HDL are also associated with the level of RBM5 gene expression. Perhaps in this case the variants influence RBM5 expression, and that's what leads to the effect on the trait. And then finally, different strategies are being developed to exploit the fact that multiple loci show evidence of association with a trait, combining together the data across those loci, the genes present at those different loci, to look for patterns or collections of genes that together may implicate a particular pathway or help identify a given gene.
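To make the eQTL idea concrete, here is a minimal sketch, assuming a simple additive model, of the regression such an analysis performs for one variant and one gene: expression level regressed on allele dosage (0, 1, or 2 copies of the allele). The function name and numbers are hypothetical; real eQTL pipelines add covariates and test thousands of gene-variant pairs.

```python
import statistics

def eqtl_slope(dosages, expression):
    """Least-squares slope of expression on allele dosage: the eQTL beta.
    A positive slope means each extra copy of the allele is associated
    with higher expression of the gene."""
    mx = statistics.mean(dosages)
    my = statistics.mean(expression)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, expression))
    sxx = sum((x - mx) ** 2 for x in dosages)
    return sxy / sxx
```

The same machinery used for HDL cholesterol is simply rerun with expression as the phenotype, which is why overlapping GWAS and eQTL signals are suggestive that the trait association acts through gene expression.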
Several different approaches have been developed to do that, say using text mining to identify all the potential relationships between genes in a given region and a trait, looking across loci to see if similar patterns are identified, and then suggesting that if a given gene in a locus shows that kind of pattern, maybe it's a stronger candidate for playing a role. So you can see from this example that it's not straightforward to identify what the underlying genes are at these loci; it's left to future biological studies to figure some of that out. Loci can also be a little bit complex. There may be more than one common variant associated with a particular trait located near each other in the genome. Shown here are two signals, independent of each other, that are both associated with the trait. This is the same plot shown twice, but colored a little bit differently. In the overall evidence of association there are sort of these two main humps of variants. The variants here in the top plot, shown in this gray rectangle, are inherited in a very similar pattern to one another and show evidence of association with the trait. The variants shown in the lower plot also show evidence of association with the trait, but they're not present at the same frequency as the upper set of variants and they're not inherited in the same pattern. So two underlying signals may be influencing the trait in this region. Variants in the top signal include some located right at the promoter of this gene, and so they may be influencing that gene's expression through promoter activity. Some of the other variants could be influencing the same gene, or maybe even a different one, through more of a long-distance regulatory kind of function. One way to determine whether two signals are independent or distinct from each other is to perform a conditional association analysis.
So instead of testing for association with just one variant at a time, we consider two variants at once. If we're looking for the effect of one variant independent of the other, we include both in the analysis and then ask specifically whether that one variant still remains associated with the trait even after accounting for the other. And of course we'll include other covariates that may be influencing the trait as well, to ask whether the SNP effects for, say, these two SNPs are independent of one another. If one of those effect sizes changes when the other is included in the model, that would show that they are at least sometimes inherited together in the same pattern. If neither effect size changes when the other is included, then the two SNPs are independent of each other and represent two clearly distinct signals. If a signal's association is shared across different populations, then it's possible to combine data from different populations to narrow the location of that shared signal, taking advantage of the different linkage disequilibrium present in different populations. Shown here is an example of an association signal in a given region on chromosome eight in a European study, where the variants that show evidence of association form sort of a broader signal, whereas the same variants tested for association in an African American study showed a much narrower signal, with some shared variants between the two signals.
If there's truly a single underlying variant responsible for these associations in the two populations, this may reflect that more historical recombination events have happened around the signal in the African American population than in the European population, suggesting that it's possible to use these data together to narrow in on the best representation of the signal. That might help identify what the underlying variants are and better determine how they relate to nearby genes. Okay, so a lot of the results I've shown so far have focused on common variants and their association with complex traits, using genome-wide association studies and the genome-wide arrays that focus on common variants. I'd like to move on to how exome sequencing and genome sequencing are allowing lower frequency and rare variants to be identified, and how those can also be related to complex traits. In some cases, different statistical methods are needed for those lower frequency variants, because they're present in fewer individuals and it's harder to see the evidence of association in reasonable sample sizes. So here's an overview of some strategies that can be taken to look for lower frequency and rare variants affecting complex traits. In addition to the genotyping-based strategies, exome sequencing, that is, sequencing the coding regions of genes across the genome, and whole genome sequencing can be performed. Different strategies for analyzing the sequence data through variant calling and genotyping are used, and in both cases the variants that are identified can potentially be used in an imputation-based strategy to identify additional variants nearby using reference panels.
Then the variants can be, as we described earlier, analyzed one at a time in a single-point test of association, each variant asked whether or not it shows association with disease; this may require some replication follow-up of the variant in additional samples to see if it represents a true association. Another strategy, though, that allows those lower frequency variants to be used and to implicate a gene, even when the frequency of the underlying variants is low, is a locus-based association analysis. This is sometimes called a gene-based association analysis, or a burden test, looking at a collection of variants together for association. In this case, you could follow up the specific variants that are identified, or perhaps sequence additional individuals for a gene that's identified, because they may carry yet different variants involved in the risk or trait. So, some sequencing study designs for complex traits; here are a few examples. One might choose to sequence selected individuals from a trait distribution, choosing the individuals with the most extreme trait values, the highest and the lowest, or, if it's a disease, choose cases and controls and look for variants that differ between the two groups. I should say that sequencing is still expensive, and so strategies are really designed around cost-effective approaches for identifying these associations. One strategy to increase the number of individuals is to decrease sequencing coverage: it's feasible to sequence individuals at a lower coverage of the genome and thus identify variants across the genome, but perhaps with less confidence at a given variant. Alternatively, variants identified by a set of sequencing studies, say a set of exome sequencing studies, can be collected and put together on a genotyping array that's more cost-effective for follow-up in other studies.
Another strategy may be to sequence population isolates, where rare variants may have drifted by chance to higher frequencies that allow the association between those variants and the trait to be detected. That association, the function of the variant on a gene and a trait, may be true in other populations too, just harder to detect because the variant is observed so few times. An example of that top strategy, sequencing selected individuals, is shown here. Here's an early study looking at the extremes of body size. In this study, they sequenced the coding regions and splice junction regions of a number of genes in a set of obese individuals and a set of lean individuals. And here in this region, they show the results for one gene, the MC4R gene, and there are different variants present in different individuals. You can see that the number of individuals each variant was found in is small, ones and twos. They then followed up and asked whether these variants that are present at rare frequencies have a functional effect on the gene, and in fact many of them do; some have a severe effect, some an intermediate effect. So these variants are harder to identify from an association study of lots of individuals, but they're present and they can have functional effects. Another strategy to identify low frequency variants for a complex trait might be to look specifically at variants near loci that have been identified by genome-wide association studies. These are, say, the genes that are near an association signal; they might be called positional candidates. We're choosing them as candidate genes based on their position, and then analyzing them in cases and controls, or again individuals with different trait values, and looking for variants that are present in one group and not in the other.
The thought here is that we're not necessarily identifying the variants underlying the genome-wide association study signal itself, but perhaps we can identify variants that implicate a given gene in that region. So identifying variants that affect the gene's function and lead to the same disease or trait can be used to implicate, say, a gene at that genome-wide association study signal. Shown here is an example of that strategy at a given gene for type 1 diabetes. In this study, they sequenced the exons and splice sites of 10 candidate genes in a set of 480 patients and 480 controls, and then followed up the variants they identified by testing for association in more than 30,000 subjects. In this particular gene, with the domains shown below here, they identified a set of variants. Some are variants at splice site positions that could influence the splicing of the gene and lead to a different functional product; there are stop codons present, and also missense variants. And shown here are the association results. In this case, we're going to focus on the left half of the table, from the case-control study. You can see four variants shown, with the two alleles that are present. The frequency of these variants is quite low, and you can see that the controls have a somewhat higher frequency of the rare variant than the cases do. So this low frequency variant is protective for disease: the association is with a lower risk of disease, an odds ratio below one, and the frequency of the common allele is correspondingly higher in the cases. All four variants here show some evidence of association, strongest with the lead variant, but also with these splice site variants and this other coding variant.
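As a quick illustration of reading these case-control results, the odds ratio for an allele can be computed directly from allele counts, and a protective allele, as described above, gives an odds ratio below one. The counts here are made up for illustration, not from the study:

```python
def allele_odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Odds ratio for the alt allele from allele counts in cases and
    controls. OR < 1 means the alt allele is relatively more frequent
    in controls, i.e. it appears protective."""
    return (case_alt / case_ref) / (ctrl_alt / ctrl_ref)

# Hypothetical counts: the rare allele appears 5 times in 1000 case
# chromosomes but 10 times in 1000 control chromosomes.
or_value = allele_odds_ratio(5, 995, 10, 990)
```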
Even though they're present in not very many individuals and have quite low frequencies, it's the collection of the variants together that helps establish the role of this gene in type 1 diabetes, showing that a collection of variants together can implicate a gene in disease, and suggesting that this IFIH1 gene may also be playing a role in the common risk variants identified from the GWAS. So recall the idea that we could use multiple variants in a gene: from that MC4R study, variants found in one and two individuals each, these low frequency variants shown here; it's collecting them together that helps implicate the gene in disease. This strategy can be used in larger studies in a broader sense, not just in candidate genes, but across the genome, by looking at genes and trying to identify an increased burden of variants, a set of variants that together can help implicate a single gene or locus. The principle, again, is that many of these variants on their own might be too rare to observe evidence of association in the hundreds or thousands of samples that exist, but there may be more than one variant, present in different individuals, that together help implicate the gene. So the idea behind the gene-based tests is to capture that set of variants together and look for evidence of association with, say, case-control status, as shown here, or with a quantitative trait. Some variants are shared between cases and controls, and especially the common ones may not be driving the difference between cases and controls; but there might be different variants in different individuals that are, say, more common in the cases than in the controls, and that together, especially if they're functional variants that you think really could be influencing the effect of that gene, can be captured in a statistically sound way to implicate the gene.
So these rare variant burden tests are gene-based tests. The principle is to collapse the information from multiple variants into a single statistical test. One way you might think about that is to count the risk alleles across a set of variants in the cases and compare that count to the controls. Different statistical tests have been developed; some of these allow the direction of effect of each variant to be different. You could imagine variants in a gene that increase its function, and other variants in the same gene that decrease its function; both could affect risk of disease. The choice of which variants to include in the statistical test can also have a big impact. If we include too many variants that have no effect at all on the risk of disease, that will decrease the power of the test; if we can somehow figure out, or guess, which variants are likely to be relevant, that will give us the greatest power to see a difference. So we might filter the variants, looking only at, say, missense variants that change the protein sequence of the gene, or those with a lower allele frequency, which perhaps are more likely to be involved; or we could use predicted function, trying to predict, say, which amino acid changes will lead to loss of function of the protein or have a likely functional impact on the sequence. This is an active area of research, and here's a list of the many different statistical tests that have been developed to identify association in such a collected set of variants. Some of them use strategies that look at all the variants and may require them to act in the same direction.
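As an illustrative sketch of the collapsing idea (hypothetical function names and data; published methods such as the CMC burden test or SKAT are more sophisticated), one can first select qualifying variants by frequency and annotation, then count each individual's burden of qualifying alleles, to be compared between cases and controls:

```python
def qualifying_variants(genos, maf_max=0.01, functional=None):
    """Indices of variants whose sample minor allele frequency is at most
    maf_max and that (optionally) carry a functional annotation, e.g.
    missense or nonsense. genos: per-individual lists of allele counts."""
    n_ind, n_var = len(genos), len(genos[0])
    keep = []
    for j in range(n_var):
        af = sum(g[j] for g in genos) / (2.0 * n_ind)   # allele frequency
        maf = min(af, 1.0 - af)                          # minor allele frequency
        if maf <= maf_max and (functional is None or functional[j]):
            keep.append(j)
    return keep

def burden_scores(genos, keep):
    """Per-individual count of qualifying alleles: the 'burden' that a
    collapsing test would then compare between cases and controls."""
    return [sum(g[j] for j in keep) for g in genos]
```

Note how the `maf_max` and `functional` filters encode exactly the choices discussed above; including too many neutral variants dilutes the burden signal, while a well-chosen filter concentrates it.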
Others allow the direction to differ between variants, increasing or decreasing risk, and these are implemented with different approaches to the test of association. I'm going to show one example of a gene-based test. This looks at loss of function variants in a gene, SLC30A8, and the identification of a set of variants that protect against type 2 diabetes. In an initial study, some young, lean type 2 diabetes cases and some older, obese controls were sequenced, to increase the contrast between the groups and the potential for identifying risk alleles. The group identified a set of variants and then tested them for association in another 6,000 to 7,000 cases and controls. They found a nonsense variant in seven cases and 21 controls. That is nominally significant, a p-value of about 0.05, and the variant is associated with reduced risk of disease, right? It's present in a higher proportion of the controls than of the cases. That's nice, but a p-value of 0.05 is not so convincing if you're testing a lot of variants across the genome. So this variant was included on one of those high-throughput arrays, which was then tested in many, many more individuals from different studies. In 48,000 individuals, the association persists, with a p-value now of 6.7 times 10 to the minus third, so continuing support for a role of this variant. Now, it's difficult to increase the sample size beyond this, because the variant is not very common. It arose in what appears to be Western Finland, is most common there, and is really not present so much around the rest of the world. So it's going to be hard to show more evidence of association with that variant given the restricted pool of people available. They therefore expanded to look at more variants in the gene and in other populations.
So that initial association of the nonsense variant, a change introducing a stop codon, is shown up here at the top, and that p-value of 6.7 times 10 to the minus third is for the collection of individuals analyzed for that variant. Here is an additional variant, a frameshift variant originally identified in an Icelandic study, that also shows some evidence of association. It was observed both in Iceland and in Norway and has a p-value on its own of 0.0019. It's also a loss of function variant in the gene, and also protective. Looking across samples from a wider range of populations around the world, other variants are identified, shown here: some are nonsense variants, some are frameshifts that change the protein sequence, some are substitution variants. Individually they're present in, say, one case and one control, not very large numbers. However, when you collect all of that data together, you still see a protective effect of these loss of function variants on association with type 2 diabetes, with a p-value of 0.0021 for those together. Collecting all the variants together across all of these studies is effectively what a gene-based test is doing, and it shows a much more significant association, at 10 to the minus six, demonstrating a collective effect of loss of function variants in this gene on type 2 diabetes risk. So this is an example of the strategy of collecting variants together. You can imagine how the choice of which variants to include can influence the test, and how using different variants across different populations may be needed when there are a lot of individually rare variants. Okay, and then finally, here's an example of a more recent sequencing-based study, to show what the state of the art is and maybe what's coming.
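The collecting-across-studies step described here can be sketched, under simplifying assumptions, as a Mantel-Haenszel combined odds ratio over per-study 2x2 carrier tables, which keeps each study or population as its own stratum rather than naively summing counts. The function and counts below are illustrative, not the study's actual method or data:

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel combined odds ratio across studies.
    tables: list of (a, b, c, d) per study, where
      a = case carriers,    b = case non-carriers,
      c = control carriers, d = control non-carriers.
    Each study's contribution is weighted by its size, so a stratum with
    one carrier still adds evidence without dominating the estimate."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den
```

A combined odds ratio below one across all strata would mirror the protective loss-of-function signal described in the lecture.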
This is from the UK10K sequencing study, and the approaches they used bring together the different things we've talked about. They had a set of cohorts, and they performed single-variant tests in whole genome sequencing data. In this case they did whole genome sequencing in about 3,800 individuals and then followed up the variants in an additional 9,000 individuals. Through that sequencing, they identified more than 13 million variants and did single-variant tests of association. Another strategy was to use the whole genome sequence data but focus on variants in the exome, performing gene-based tests on variants found at frequencies less than 1%; shown here are different strategies for choosing which variants to include in those tests, and the number of genes analyzed differs by whether variants meet the criteria for each test. And finally, they used the genome-wide data from the whole genome sequencing and did burden tests on selected variants in windows across the genome. You can imagine how to choose which variants to collect together when looking at the coding regions of exons, picking the ones with protein coding changes; this is now being expanded to ask how we look at variants in other parts of the genome that may collectively act, say, on the same regulatory element, to help identify rare variants in non-coding regions that together may influence risk of disease.
Shown here is a plot of the variants identified in this study, coming back to the plot from the beginning: the relationship between minor allele frequency and effect size on these axes. A lot of the variants they identified are common variants, and you can see that by doing these sequencing-based studies they're moving along this axis into being able to look at the lower frequency and rare variants. To be detected, a lot of these end up needing to have somewhat higher effect sizes given the sample sizes available. Okay, so all together, across the different approaches, common variants and rare variants, the value of these susceptibility variants will likely be largely in the potential to identify new biology related to risk of these traits and diseases. Perhaps the variants help implicate new target genes for the development of drugs to treat those traits or diseases. They may also help identify different biomarkers of disease that might help predict who's going to become affected. At the moment, a lot of the variants have relatively modest individual effects on risk, though some may be identified that, at the personalized level, could eventually lead to better diagnostic or prognostic approaches. So, for the future of complex trait analysis: as technology continues to advance, more and more loci are being identified, more groups are combining data, and larger meta-analyses are possible. Then the functional translation, following up on those signals, looking in more diverse populations, and identifying more of the rare variants that may be contributing to association, will allow a better understanding of the biology of disease. Some of those variants may be combined together into sets that together affect a gene or a trait.
Different variants may be influenced by different environmental stimuli, and the combination of gene-gene and gene-environment interactions will give a greater ability to explain complex traits. And then eventually, following up on those genes and on the individual variants that are identified can help us all understand the mechanisms by which variants lead to these complex diseases. Thank you very much for your attention. If you have any questions, please come down to the podium.