So I'd like to tell you about some of the experiences we've had in handling and analyzing data from a genome-wide association study. In particular, what I'd like to leave you with is a sense of the things you need to think about before you get your data, so that you're able to deal with it. The things I'm going to go through today are storing large amounts of genotype data; quality control, which we've already started to cover; generating the initial association results and viewing those results; imputation of missing SNP genotypes; and then storing results and planning specialized analyses.

The takeaway message from this, if you haven't already figured it out, is that genotype data is huge. If you have 500,000 SNPs and 2,000 cases and controls, you end up with a billion genotypes. That means you're going to need compact ways to store the data. One efficient way is to code each genotype as 00, 01, or 11, so you have a file that looks like rows of zeros and ones, with each person indicated in orange and each SNP as a row. The total file space for 300K SNPs is about 4 gigabytes, and the largest single-chromosome file is about 0.4 gigabytes.

Now, this has strong ramifications for how people are used to doing things, unless you're used to dealing with very, very large data sets. First, these chromosome data sets, even when stored efficiently, are too large for SAS or other commonly used analytical packages to handle. So you're going to need programs to select genotypes out of these files and write them out in the multiple formats required by the programs you'll use to analyze them. It also means that testing your procedures on large-scale trial data sets before you get your actual genotype data will be very useful.
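To make the storage point concrete, here is a minimal sketch, in Python, of packing genotypes at two bits each, assuming they are first recoded as 0, 1, or 2 copies of one allele with a reserved code for missing calls. It is similar in spirit to PLINK's binary files but is not the actual .bed layout; at two bits per genotype, a billion genotypes occupy roughly 250 megabytes.

```python
# Sketch of 2-bit genotype packing: four genotypes per byte.
# Codes: 0/1/2 = copies of one allele, 3 = missing call.
import numpy as np

MISSING = 3  # 2-bit code reserved for a missing genotype

def pack_genotypes(genos):
    """Pack an array of genotypes (0, 1, 2, or MISSING) into bytes."""
    g = np.asarray(genos, dtype=np.uint8)
    pad = (-len(g)) % 4                      # pad so every byte holds exactly 4 genotypes
    g = np.concatenate([g, np.full(pad, MISSING, dtype=np.uint8)]).reshape(-1, 4)
    return (g[:, 0] | (g[:, 1] << 2) | (g[:, 2] << 4) | (g[:, 3] << 6)).astype(np.uint8)

def unpack_genotypes(packed, n):
    """Recover the first n genotypes from a packed byte array."""
    p = np.asarray(packed, dtype=np.uint8)
    out = np.empty(len(p) * 4, dtype=np.uint8)
    for i in range(4):
        out[i::4] = (p >> (2 * i)) & 0b11
    return out[:n]

if __name__ == "__main__":
    snp = [0, 1, 2, 1, 0, MISSING, 2, 2, 1, 0]   # one SNP, ten people
    packed = pack_genotypes(snp)
    assert list(unpack_genotypes(packed, len(snp))) == snp
    print(f"{len(snp)} genotypes stored in {len(packed)} bytes")
```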
There are other sorts of data that will be useful to collect. One is the pretty simple stuff of chromosome and position. We've also found it very useful to have files that contain gene and functional annotation for the different SNPs. Another issue that comes up, both in trying to combine data across studies and in keeping things straight within your own study, is how you're going to call the alleles, because there are two different strands. One strategy that we and others have taken is to call the alleles on the forward strand of a given genome build; that gives somebody outside your study a quick reference and a way to compare their data to yours.

So one of the first questions you're going to want to ask is: how good is the data? You're going to want to identify and remove bad samples and SNPs. Then, for both you and anyone reading about your data, you'll want to compute summary statistics, including the percentage of successfully genotyped samples, the average genotyping success rate, the duplicate sample error rate, and the non-Mendelian inheritance error rate, which counts errors that are not consistent with normal transmission of chromosomes between family members.

So how do you identify the bad samples? As Elizabeth has started to show with examples, you can find bad samples by their genotyping success rate; typically a bad sample is defined as one with a genotyping success rate below a cutoff somewhere between 95% and 97.5%. You can also look for a greater than expected proportion of heterozygous genotypes. I don't think Elizabeth showed an example of sample contamination, where two samples are actually mixed together; if that happens, you'll get more heterozygous genotypes than you expect. You'll want to remove related individuals, and this happens more often than you might expect: in your control group or your case group you end up with people who are brothers or sisters, or even people who decided to participate twice and forgot they had done so. If you're treating them as independent, you'll want to remove them, and that can be done by looking at pairwise comparisons of genotype similarity. There will often be sample switches, which can be caught when somebody appears to have switched sex during your experiment. And if you're working with cell lines, cell lines often lose or gain chromosomes in order to survive transformation, so you'll want to check your data for large regions of homozygosity, particularly those not compatible with life as a human, because this happens fairly often.

For identifying poor-quality SNPs, one approach is to look for genotype proportions that are not consistent with the observed allele frequencies; this is the test for Hardy-Weinberg equilibrium. People use different thresholds to call SNPs bad; for genome-wide data I've seen thresholds from 10^-4 to 10^-6. You can also do something that I think is good to do with all of your quality control measures, which is to look for deviation from the expected distribution of p-values under the null and see where your distribution starts to deviate. For the genotyping success rate of a SNP, less than 95 percent is usually considered the cutoff. You can look at the duplicate sample or non-Mendelian error rate and see whether it's elevated; again, it's a good thing to calculate the overall error rate and then ask whether the rate observed for each SNP is consistent with it. And as Elizabeth brought up, looking for differences between cases and controls, especially differential missingness between the two groups, is very important, particularly if your samples were collected in different ways or genotyped at different times, because there certainly are cases in which lovely associations turn out to be differences in missingness between the two groups.

There are programs available for large-scale quality control analysis. It's possible, and perhaps likely, that when you get your genotype data back from the genotyping service they will already have calculated these, but you may want to select out the specific sets of samples you're using for publication, so you may want to run these on your own. Two that I know about, and I'm sure there are more, are a program called PLINK, which I'll mention multiple times in this presentation, and a program called GAIN QC. PLINK was developed by Shaun Purcell and GAIN QC by Gonçalo Abecasis. Both have comprehensive suites of quality control tests, and both have been optimized to deal with large numbers of samples and large numbers of SNPs.
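As a concrete illustration of the per-sample checks just described, here is a minimal sketch of the call-rate and excess-heterozygosity screens. The 95% call-rate cutoff comes from the talk; flagging samples whose heterozygosity is more than three standard deviations from the mean is an illustrative choice, not one the speaker gives.

```python
# Per-sample QC sketch: genotyping call rate and heterozygosity.
import numpy as np

MISSING = -1  # code used here for a failed genotype call

def sample_qc(genos, call_rate_cutoff=0.95, het_sd_cutoff=3.0):
    """genos: (n_samples, n_snps) array coded 0/1/2 or MISSING.
    Returns (pass mask, call rate, heterozygosity) per sample."""
    called = genos != MISSING
    call_rate = called.mean(axis=1)

    # heterozygosity: fraction of called genotypes that are heterozygous (code 1)
    het = ((genos == 1) & called).sum(axis=1) / np.maximum(called.sum(axis=1), 1)
    het_z = (het - het.mean()) / het.std()

    passed = (call_rate >= call_rate_cutoff) & (np.abs(het_z) <= het_sd_cutoff)
    return passed, call_rate, het

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.choice([0, 1, 2], size=(100, 1000), p=[0.49, 0.42, 0.09])
    g[0] = rng.choice([0, 1, 2], size=1000, p=[0.15, 0.70, 0.15])  # "contaminated" sample
    g[1, :200] = MISSING                                           # poorly genotyped sample
    passed, call_rate, het = sample_qc(g)
    print("failed samples:", np.where(~passed)[0])
```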
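And for the per-SNP side, a sketch of the Hardy-Weinberg check described above, comparing observed genotype counts with those expected from the allele frequency using a simple one-degree-of-freedom chi-square statistic; dedicated QC software typically uses an exact test, which behaves better when an allele is rare. The counts in the example are made up.

```python
# Hardy-Weinberg equilibrium check from genotype counts (AA, AB, BB).
import numpy as np
from scipy.stats import chi2

def hwe_pvalue(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions; returns a p-value."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)                      # frequency of the A allele
    expected = np.array([n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2])
    observed = np.array([n_aa, n_ab, n_bb], dtype=float)
    if np.any(expected == 0):
        return 1.0
    stat = ((observed - expected) ** 2 / expected).sum()
    return float(chi2.sf(stat, df=1))

if __name__ == "__main__":
    print(hwe_pvalue(360, 480, 160))  # consistent with HWE -> large p-value
    print(hwe_pvalue(500, 200, 300))  # strong heterozygote deficit -> tiny p-value
```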
Once you get to this point, doing your initial association analysis is pretty straightforward. For case-control analysis, you want to use a test that's not affected by deviations from Hardy-Weinberg equilibrium; the Cochran-Armitage test for trend is a popular one, and it's actually equivalent to the score test in logistic regression. If you have family-based samples, you'll be using the TDT or other family-based tests, and for quantitative traits, a straightforward quantitative trait association analysis. Again, there are programs available to do these kinds of basic analyses on a large scale. PLINK handles case-control analysis, some family-based tests, and quantitative traits. MERLIN can handle quantitative traits in independent samples and in families, and it actually has the ability to impute genotypes for untyped individuals based on genotyped family members; so if you have large families to analyze, you may want to think about whether you need to genotype every family member.

Okay, so the next question you want to ask is: are the results of your study believable? As David has shown, one way to check is a QQ plot, where you compare the expected distribution of p-values with the observed one. Some other questions you might ask: are the stronger associations correlated with poorer SNP quality control measures? Is there confounding from differences in the genetic origins of the case and control samples, in other words population stratification? These can be dealt with in a couple of different ways, using genomic control or an EIGENSTRAT analysis. If your cases and controls come from populations that are very different, it's still not clear to me that you can completely correct for the differences with these analyses, but I think relatively few SNPs would sneak through.

Viewing your data is a good thing for helping you understand it once you get it, and there are multiple ways to go about graphing it. The program PLINK has various graphical outputs. You can add custom tracks onto the UCSC browser; for folks who aren't familiar with it, this is a great resource you can go to to look at different biological features of the genome. It shows you the genes, and it can show you LD. And then you have homemade graphs. My feeling is that there's no one particular graph or program that's going to meet everything you want to do with your data.

To give you a sense of this, here is an example where the bottom tracks, all except the top one with the purple lines, are the typical tracks available on UCSC, and what we've done is upload the FUSION data onto the browser. The height of the purple line is the strength of the association in our region of interest, so we're able to overlay our association results with any of the genomic annotation we'd like on the genome browser. There are a couple of different ways of doing this. The other thing is that many different people have many different ways of displaying similar data. These are three pictures from three papers on the type 2 diabetes genome-wide analyses, and what each one shows, in the top panels, is the strength of association across a region with some annotation of different SNPs, and then, in two of them, a picture of the linkage disequilibrium underneath the top panel.

What that gives you is a sense of how much repetition there is, basically, in the haplotypes and in the SNPs in each region. You can see here in this graph, on an association scale of negative log10 p-values from zero to five, that there's a large set of SNPs that show association. But if you look down at the graph below, at the r-squared, which essentially tells you how correlated these different SNPs are, you'll see that this is a region of very high correlation between the SNPs, so that in fact, if you correct for any one of these SNPs, all of the other associations go away. That sense of knowing what the linkage disequilibrium looks like underneath a SNP is very useful in trying to decipher what sort of information you have in your data.
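To make the association step concrete, here is a minimal sketch of the Cochran-Armitage trend test computed from a 2x3 case/control-by-genotype table, together with the genomic-control inflation factor that summarizes what a QQ plot of the resulting statistics shows; the counts in the example are invented.

```python
# Cochran-Armitage trend test and genomic-control lambda.
import numpy as np
from scipy.stats import chi2

def cochran_armitage(case_counts, control_counts, weights=(0, 1, 2)):
    """case_counts/control_counts: genotype counts (AA, AB, BB) in cases/controls.
    Weights (0, 1, 2) give the additive (allele-dosage) trend test."""
    w = np.asarray(weights, dtype=float)
    r = np.asarray(case_counts, dtype=float)      # cases per genotype
    s = np.asarray(control_counts, dtype=float)   # controls per genotype
    n_col, R, S, N = r + s, r.sum(), s.sum(), (r + s).sum()

    T = np.sum(w * (S * r - R * s))
    var = (R * S / N) * (N * np.sum(w ** 2 * n_col) - np.sum(w * n_col) ** 2)
    stat = T ** 2 / var
    return float(stat), float(chi2.sf(stat, df=1))

def genomic_control_lambda(chisq_stats):
    """Median observed 1-d.f. chi-square divided by its expected median (~0.455)."""
    return float(np.median(chisq_stats) / chi2.ppf(0.5, df=1))

if __name__ == "__main__":
    stat, p = cochran_armitage(case_counts=(200, 500, 300),
                               control_counts=(280, 480, 240))
    print(f"trend test: chi2 = {stat:.2f}, p = {p:.3g}")
    null_stats = np.random.default_rng(1).chisquare(df=1, size=100_000)
    print(f"lambda on null statistics: {genomic_control_lambda(null_stats):.3f}")  # ~1.0
```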
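And since the regional plots lean so heavily on r-squared, here is a small sketch of estimating it as the squared correlation between 0/1/2 genotype vectors at two SNPs, a common approximation to haplotype-based r-squared when phase is unknown; the two example SNPs are made up to be in near-perfect LD.

```python
# Pairwise LD as squared genotype correlation.
import numpy as np

def genotype_r2(g1, g2, missing=-1):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    g1, g2 = np.asarray(g1, dtype=float), np.asarray(g2, dtype=float)
    keep = (g1 != missing) & (g2 != missing)    # drop pairs with a missing call
    if keep.sum() < 2:
        return np.nan
    r = np.corrcoef(g1[keep], g2[keep])[0, 1]
    return r * r

if __name__ == "__main__":
    snp1 = np.array([0, 1, 2, 1, 0, 2, 1, 1, 0, 2])
    snp2 = snp1.copy()
    snp2[3] = 2                                 # one discordant genotype
    print(f"r^2 = {genotype_r2(snp1, snp2):.2f}")
```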
So one of the questions that has already been brought up is essentially getting more for your genotype dollars, that is, imputation of SNP genotypes. There are two reasons why you might want to do this. One is that even among the markers that pass your quality control, you will sometimes have missing data within the genotyped markers; a lot of them are genotyped completely, but it would be nice to have complete data for every marker. And then, probably of even greater interest, is having information on untyped markers. There are a couple of programs out there that allow you to impute, or infer, these genotypes, and the approach they take is to use the haplotype structure of an existing sample, such as the HapMap, to infer data for samples with sparser marker sets.

To give you a sense of how this works: the top line is a study sample, where you have a genotype AG, a genotype AC, and a genotype AA for that individual, and below are the reference haplotypes from the HapMap sample, where the data are much, much denser. If you look at how the haplotypes from the HapMap sample line up, you'll see that the haplotype in purple lines up with the sample genotypes on top, and the other chromosome looks like a composite of the bottom green haplotype and the yellow haplotype shown here. So you can use data from phased reference samples to impute essentially all of the data you're missing, or at least most of it.

The advantages of this, of course, are that it allows testing of untyped variation and easy combination of data across genotyping platforms. It's interesting: in our type 2 diabetes study, when we first combined data with the Wellcome Trust and the Diabetes Genetics Initiative, we didn't have imputed data, so we lined up SNPs based on r-squared and tried to put that together, because only about 40,000 SNPs overlapped between the Illumina platform we had used and the Affymetrix platforms the other two groups had used. By the time we published, we had imputed data, so we could match up to the Broad and Wellcome Trust data, and now we're going ahead with matching up data from all of the groups based on imputation. It's going to be very interesting to see; I still don't know how many more signals we'll find based on completely imputed data, and I think that's one of the interesting questions that will come out of this.

Imputation provides, as I said, complete data for analysis, which is important especially if you're using multiple SNPs. And imputation, depending on the product, can increase your coverage: for example, with the Illumina 300K, using an r-squared, or correlation, threshold of about 0.8, we increased our coverage of the common variation in the genome from 71% to 89%, a substantial increase.

Okay. Imputation does come with some caveats; you have to take some care in using these data, and it's not a complete solution. One, it requires large-scale computing resources. You need to carefully assess the quality of the imputation by comparing imputed genotypes with actually genotyped SNPs, and the error rates are higher than for genotyped SNPs. It works, unfortunately and not surprisingly, less well for rare alleles, so it's not going to be the way you get reliable information on alleles with frequencies below 1%. And when you do the analysis, the results you get out of the imputation are essentially a fractional distribution, the probability of each genotype. If you impute a genotype with, let's say, 100% probability that it's AA, you can use that genotype just like any other genotype. But if you impute a genotype with a 10% probability that it's AA and a 90% probability that it's AG, you need to take that uncertainty into account in your analysis, which means you're going to need ways to handle fractional genotype counts. There are a couple of programs available: IMPUTE, from a recently published paper by Jonathan Marchini, and a program called MACH, developed by Gonçalo Abecasis.
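To illustrate, in a deliberately simplified form, the haplotype-matching idea behind these programs: the sketch below scores every pair of phased reference haplotypes against a person's sparse genotypes and fills in the untyped positions from the best-scoring pair. MACH and IMPUTE actually use hidden Markov models that allow for recombination and genotyping error, so this brute-force version only conveys the flavor; the reference haplotypes and genotypes are invented.

```python
# Naive imputation sketch: pick the reference haplotype pair most consistent
# with the observed genotypes and read the missing genotypes off that pair.
from itertools import combinations_with_replacement

# phased reference haplotypes over six SNP positions (invented for illustration)
REFERENCE = ["AGGCTA", "ACGCTA", "TGACTA", "TCACGA"]

def impute(genotypes, reference=REFERENCE):
    """genotypes: list with a 2-letter string per typed SNP and None per untyped SNP.
    Returns the full set of genotypes implied by the best-matching haplotype pair."""
    def score(h1, h2):
        # number of observed genotypes this haplotype pair reproduces
        return sum(1 for pos, g in enumerate(genotypes)
                   if g is not None and sorted(g) == sorted(h1[pos] + h2[pos]))

    best = max(combinations_with_replacement(reference, 2),
               key=lambda pair: score(*pair))
    return ["".join(sorted(a + b)) for a, b in zip(*best)]

if __name__ == "__main__":
    observed = ["AT", None, "GA", None, "TG", None]   # typed at positions 0, 2, 4
    print(impute(observed))                            # fills in positions 1, 3, 5
```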
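And for the fractional genotype counts mentioned above, here is a minimal sketch in which each person contributes their posterior genotype probabilities, rather than a hard call, to the counts that go into the same one-degree-of-freedom trend statistic used earlier. This only mirrors the general idea; the analysis tools that accompany MACH and IMPUTE handle the uncertainty with their own likelihood- and regression-based methods, and the probabilities in the example are invented.

```python
# Trend test on fractional (expected) genotype counts from imputation.
import numpy as np
from scipy.stats import chi2

def fractional_trend_test(probs, is_case, weights=(0, 1, 2)):
    """probs: (n_people, 3) posterior genotype probabilities (AA, AB, BB).
    is_case: boolean array marking cases. Returns (chi-square stat, p-value)."""
    probs = np.asarray(probs, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)

    r = probs[is_case].sum(axis=0)       # fractional genotype counts in cases
    s = probs[~is_case].sum(axis=0)      # fractional genotype counts in controls
    w = np.asarray(weights, dtype=float)
    n_col, R, S, N = r + s, r.sum(), s.sum(), float(probs.shape[0])

    T = np.sum(w * (S * r - R * s))
    var = (R * S / N) * (N * np.sum(w ** 2 * n_col) - np.sum(w * n_col) ** 2)
    stat = T ** 2 / var
    return float(stat), float(chi2.sf(stat, df=1))

if __name__ == "__main__":
    probs = np.array([[0.10, 0.90, 0.00],   # 10% AA, 90% AB: an uncertain call
                      [0.00, 0.25, 0.75],
                      [1.00, 0.00, 0.00],
                      [0.00, 1.00, 0.00],
                      [0.00, 0.00, 1.00],
                      [1.00, 0.00, 0.00]])
    is_case = np.array([True, True, True, False, False, False])
    print(fractional_trend_test(probs, is_case))
```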
The last things you need to think about are storing, viewing, and merging results, which are not trivial with this much data. We've developed a system where we use an SQL database; I know there are capacities for this in PLINK, though I haven't used them. It's also very important to test the speed of specialized analyses in different statistical packages. We went to do some of our standard analyses without quite thinking about this, calculated the amount of time it would take, and it was days or sometimes months in SAS or in R. So you really need to think about this ahead of time, because it's quite staggering, and if you have specialized analyses, it may mean you need to start thinking now about whether you need software developed to do that analysis effectively.

So, in summary, there are lots of things you need to think about before you get the data. How am I going to store, select, and write out the genotype data? What quality control and analysis programs am I going to use? How am I going to store the results? Do I have adequate computing resources for intensive computing, if I decide to do large-scale analyses outside of the set of programs that have been optimized for this? And I can't emphasize enough the value of testing beforehand, with large-scale test data sets, of either standardized or specialized processes. The fun of all this, of course, is that you get to work with great people and you get to find genes. Thank you very much.