Thank you, Teri. Thank you all. Just a slight addition: I'm no longer director of statistical genetics. Yeah, you said I took over that role for CIDR, but I'm now doing special projects for CIDR. And for those of you who don't know what CIDR is, we're a large genotyping lab. We started out over 10 years ago doing linkage analysis with microsatellites, and then moved into SNPs. And as the products have changed, we've gone from doing small numbers of SNPs to now very large numbers of SNPs. So I'm going to talk to you a little bit about factors that you might consider in selecting a genotyping platform and in looking at data. For GWAS studies, you could be genotyping 300,000 to a million SNPs. There are at least three platforms to choose from, and multiple products for those platforms. So how do you choose? I can't possibly cover everything you ought to think about in choosing a platform in 20 minutes, so I'm going to talk a little bit about the basics of calling genotypes, examples of good and bad data, and a few things to consider. Basically, how it works. I'm going to skip the chemistry. There's not time, and that's not my part of this business anyway. But basically all of the products are generating intensity data for the two alleles, and then assigning the genotypes based on the clustering of those two allele intensities into three groups for the three genotypes that are expected. And one thing to remember is these are phenotypes. There is measurement error. These are not perfect genotypes and perfect data. So you're going to have to think about the data that you have, and it's also impossible to review 300,000 or a million SNPs. When we genotype 1,536 SNPs for folks, we manually review each one of those. There is no way on earth we're going to review a million. So you're going to have to find high-throughput methods to do it. This is a good SNP. This is Illumina data, and you've got intensity for samples. Each point is one sample for a particular SNP.
So the samples here only have intensity for the A allele. The samples here in blue only have intensity for the B allele, and the purple ones have intensity for both, and those are the heterozygotes. This is a very nice SNP. There's very good clustering. These are basically the raw intensities on the two axes. This is a slightly different view. This has been put through a transformation, more or less normalized, not quite. And then this is another view of the same SNP. This time they took the Cartesian coordinates and flipped them into polar coordinates. The shaded areas are the areas where Illumina would call the data, and in the non-shaded areas, if there were genotypes out here, they wouldn't be called. This is another nice SNP. This is Affymetrix data. This is from the 5.0 product, and this is actually data that we ran in our lab plus data that was run in labs at Affymetrix. So at least in this case, running across two labs didn't make a difference. And another good SNP. Most of the SNPs look like these, but some don't. So most of the data is good for all the platforms. There are slight differences in quality. There are slight differences in call rate, but most of the data really looks pretty nice. But some samples, some SNPs, and some genotypes fail, and you have to find them without manual review. So how do you find the bad data? You use summary statistics across SNPs and across samples. You include investigator and control replicates in your samples to be genotyped, even though you have to pay for them. You include control and, where possible, investigator trios. So even if you're sending cases and controls, if you have any relatives, you may want to include some parent-child pairs or parent-child trios, if you've got that kind of data, to look at inheritance in your population, especially if it's really different from whatever controls are being run.
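The idea of calling genotypes by clustering two-allele intensities can be sketched in a few lines. This is a toy illustration, not any vendor's algorithm: it converts each sample's A/B intensities to a polar angle and runs a simple 3-means on that angle, with the function name, cluster count, and starting centers all chosen just for illustration.

```python
import math

def call_genotypes(a_intensity, b_intensity, n_iter=25):
    """Toy genotype caller: cluster the polar angle of (A, B) intensity
    into three groups and label them AA, AB, BB by angle order.
    Purely illustrative; real products use much richer models."""
    thetas = [math.atan2(b, a) for a, b in zip(a_intensity, b_intensity)]
    # Start the three centers spread across the 0..pi/2 angle range.
    centers = [math.pi * f / 4 for f in (0.2, 1.0, 1.8)]
    for _ in range(n_iter):
        # Assign each sample to its nearest center.
        labels = [min(range(3), key=lambda k: abs(t - centers[k]))
                  for t in thetas]
        # Recompute each center as the mean of its assigned angles.
        for k in range(3):
            members = [t for t, l in zip(thetas, labels) if l == k]
            if members:
                centers[k] = sum(members) / len(members)
    # Low angle = mostly A intensity = AA; high angle = BB.
    order = sorted(range(3), key=lambda k: centers[k])
    name = {k: ["AA", "AB", "BB"][i] for i, k in enumerate(order)}
    return [name[l] for l in labels]
```

On well-separated data like the good SNPs shown here, even this crude approach recovers the three clusters; the bad SNPs later in the talk are exactly the cases where any such simple clustering breaks down.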
You can use HapMap controls, and if you do, you can compare the genotypes to the HapMap data. But you have to be a little bit cautious, because there is error in HapMap, and some of the differences are just going to be caused by error in HapMap or by differences between genotyping platforms. And so you have to be a little bit careful about how to interpret that. Finding bad SNPs: QC checks. Call rate: did most of the samples get called for this particular SNP? Mendelian inheritance: if you've got trios, did the child inherit one allele from each parent? Replicates: did it type the same way twice? Hardy-Weinberg equilibrium: are the proportions of the genotypes close to what's expected? That's a little bit tricky. Some investigators like to use it; some don't. It depends on your population, and whether you'd expect the population to be in Hardy-Weinberg equilibrium or not. There are quality scores and clustering metrics for a lot of these platforms. You can use those as well. There are some bad SNPs that are going to pass any QC filter you're going to build, and there are some good SNPs that might fail QC, so it's always a trade-off. This is a bad SNP. There are not really three clusters there. There are more like five, and there are a bunch of points that didn't get called. There's another really awful one. There are actually kind of three clusters there, but the clustering algorithm didn't do well in terms of grouping them into genotypes. Here's another one. The Illumina Infinium clustering won't let the alleles move too far over, so it just totally missed one. There's another one with a bunch of points not called, and clearly more than three clusters. There are lots of these. This is Affy data. Affy data has the same problem. Perlegen data has the same problem. This is one where, again, there was not much intensity, and it just couldn't cluster it at all. This is one where the clusters weren't really where the data was. Those are awful.
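Two of the per-SNP checks listed here, call rate and Hardy-Weinberg, are easy to compute once you have the genotype calls. A minimal sketch, with an invented function name and a simple "AA"/"AB"/"BB"/None genotype encoding:

```python
def snp_qc(genotypes):
    """Sketch of two per-SNP QC metrics: call rate, and a chi-square
    statistic against Hardy-Weinberg expected genotype proportions.
    `genotypes` is a list of "AA", "AB", "BB", or None (no-call)."""
    n_total = len(genotypes)
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / n_total if n_total else 0.0
    n_aa, n_ab, n_bb = (called.count(g) for g in ("AA", "AB", "BB"))
    # Allele frequency of A estimated from the called genotypes.
    p = (2 * n_aa + n_ab) / (2 * n) if n else 0.0
    q = 1 - p
    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e
               for o, e in zip([n_aa, n_ab, n_bb], expected) if e > 0)
    return {"call_rate": call_rate, "hwe_chi2": chi2}
```

As the talk notes, how much weight to put on the Hardy-Weinberg statistic is a judgment call that depends on whether your population should actually be in equilibrium.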
You don't want to be analyzing them as if those are real genotypes, because you're going to get very strange results, and if the problems are all correlated with your phenotypes, you're going to get really interesting, significant-looking results that aren't real. Hopefully we can find most of them, but you need to use the intensity data to plot your most significant SNPs before you publish, take a look at them, and make sure you really could live with those plots. Use a lab that will give you the intensity data. If you have intensity data, you can plot the intensity to check the clustering. You can cluster with a different algorithm. As new and better algorithms come along, you can recluster the data and see if you can save SNPs that you couldn't save before. You can recluster subsets or supersets of the data. If you've got sets of a particular sample type that might be behaving differently, you can think about clustering them separately and seeing if that helps. You can create your own metrics. You could look at the number of samples that had no or very low intensity for SNPs and maybe decide to drop those out, even if they had passed the other filters. Finding bad samples: you want to do the same kind of thing. Look at the sample-level metrics, starting with call rate. Bad samples have genotypes. Even water has genotypes. And you may want to remove some of the bad samples before you do your clustering, because they can kind of draw the clusters in odd directions. This is another way of looking at the data. This is what Illumina calls a sample plot. So this is all of the SNPs across one of these arrays for one sample. So this is where one group of homozygotes is clustered. This is all the heterozygotes. This is the other group of homozygotes. You want to see at least some semblance of three clusters. This is a sample that failed utterly. There's almost no intensity and there's no clustering. It had a call frequency of 41%.
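Sample-level call rate filtering, the first sample metric mentioned here, might look like this in outline. The 0.97 cutoff is an arbitrary example, not a recommendation:

```python
def flag_failed_samples(genotype_matrix, min_call_rate=0.97):
    """Sketch of sample-level QC: `genotype_matrix` maps a sample id to
    its list of per-SNP calls (None = no-call).  Samples below the
    call-rate threshold, like the 41% sample in the talk, get flagged
    so they can be dropped before clustering."""
    flagged = {}
    for sample, calls in genotype_matrix.items():
        n_called = sum(1 for g in calls if g is not None)
        rate = n_called / len(calls) if calls else 0.0
        if rate < min_call_rate:
            flagged[sample] = rate
    return flagged
```

Running this before reclustering matters because, as noted above, failed samples can drag the clusters in odd directions if they are left in.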
So you don't want to be analyzing those genotypes. They're just meaningless. They're way at the bottom. And again, you're going to want to look at and filter your data before you start analyzing it. All these data deposit repositories, like dbGaP, are probably going to have filtered and unfiltered versions of the data. The unfiltered data is really useful for trying to develop calling algorithms and QC metrics. But you want to be really cautious about analyzing the unfiltered data, because there are things like this lurking in there. These failed samples tend to fall outside of the clusters. The normalization algorithms do odd things to them. So sometimes they wind up looking like they've got some intensity, even if they don't. Sometimes failed samples have intensity, and instead of getting that curve where everything's low, you just get a uniform distribution across everything. So again, you may want to remove some of the failed samples before you apply your clustering algorithms, and a lot of labs will do that. WGA samples, whole genome-amplified DNA: can I use them? So if you have a little bit of DNA and it's not quite enough to do your study, can you amplify that DNA in order to get more of it, to be able to do your genotyping? Maybe. The performance ranges from really awful to really good. It depends a lot on what DNA you had to start with. How much of it was there? How badly was it degraded before you tried to amplify it? It depends on what method you used to amplify it. And even if you've got WGA samples that work very well overall, they may perform poorly for at least some of your SNPs. So you're going to need to pay additional attention to clustering and to your decisions for analysis. And also, you want to make sure your lab knows the sample type, so that they don't just say these aren't working very well and drop them. These are the WGA samples in orange.
And they're clustering, in this case, very well with the SNPs. Again, this is Illumina data, so at least in our hands, it works very well at least some of the time. The majority of the SNPs for this project look like this. The WGA samples clustered right in with everything else. But for some SNPs, there's lower intensity. They did pretty well, but you're losing a few of them and the clusters are beginning to run together. The call frequency for that SNP was 98%. Here they kind of just failed for the SNP. It wasn't a great-looking SNP to start with, and they're all mushed together at low intensity. And the call rate for this one was 93%. So that SNP probably wouldn't have made it through most QC filters to start with. So again, even for projects where in general the WGA works, there are going to be a few SNPs where it doesn't work very well, and you're going to have to think about putting additional checks into your study, looking at, say, call rates for WGA versus other sample types. If you've got multiple sample types in your study, and a lot of us do, especially as we start taking this study and that study and putting them together, you're going to need to look at the data by sample type, both your metrics and probably your plots. And if they're not performing equivalently, you're going to have to do lots of extra QC by sample type. If you have to cluster separately, then you've got even more QC and checks needed to make sure, again, that you didn't just do a really bad job for one set and a good job for the other set and then mush the data together. You could still have errors. If the sample type isn't random, it could cause even more headaches. For example, if you've got different types of samples for your cases and controls, then you've got lots of potential issues downstream, and you're going to have to put a lot of checks in to make sure that the differences aren't caused by sample type rather than case versus control. Preventing bad data.
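The by-sample-type check suggested here, comparing call rates for WGA versus other samples at each SNP, can be sketched like so (the type labels and layout are illustrative, not from any particular pipeline):

```python
def call_rate_by_type(calls, sample_types):
    """Sketch of a per-SNP, per-sample-type call rate: `calls` holds one
    genotype per sample (None = no-call) and `sample_types` the matching
    type label (e.g. "WGA" or "genomic").  SNPs where WGA underperforms
    can then be caught even when the overall call rate looks fine."""
    totals, called = {}, {}
    for g, t in zip(calls, sample_types):
        totals[t] = totals.get(t, 0) + 1
        if g is not None:
            called[t] = called.get(t, 0) + 1
    return {t: called.get(t, 0) / totals[t] for t in totals}
```

A large gap between the per-type rates at a SNP is exactly the kind of signal that would prompt the separate clustering, or the extra QC, that the talk describes.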
Discuss your sample types with your lab. What's their experience? And they may want to test a few samples before you start your project. All of these arrays are relatively expensive. You do not want to type a thousand samples that are going to fail when you could pre-test 10 and decide, maybe I shouldn't try this. Discuss plating with your lab. You may want to place controls uniquely, or arrange male and female samples uniquely by plate. Differences in intensity, batch effects, aren't common, but they're possible, and they may only be present for a subset of the SNPs. And again, you may want to mix cases and controls across plates or across batches or across time, depending on how things are coming to your lab, to make sure that the effects of any sort of lab or plate effect are minimized. Genotypes. Even for great SNPs and great samples, some genotypes will fail. They may not be called. They might be called with low confidence or low quality scores, or they may be called wrong. There's one isolated genotype that didn't get called for that otherwise great SNP. Here's one for Affy where one just got called wrong. The clusters just weren't quite in the right place for that one. There's gonna be a little bit of bad data. And again, hopefully it's small enough that it doesn't affect the results. Copy number. This is one of the new things that people are really getting interested in with both Affymetrix and Illumina. There is intensity information, and that can be used to infer copy number, at least for relatively large variants. This works really well with small numbers of samples. In manual review, we've done a number of studies where we've looked at known, well-characterized deletions, insertions, and variants in clinical samples with investigators and been able to identify all of them. It seems pretty good, but it's not really a high-throughput system. You can do an awful lot if you have somebody looking extensively at one sample across all chromosomes.
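The plating advice above, mixing cases and controls across plates so any batch effect hits both groups instead of confounding case status, amounts to shuffling before filling plates. A sketch, with an arbitrary plate size and seed:

```python
import random

def assign_to_plates(cases, controls, plate_size=96, seed=1):
    """Sketch of randomized plating: pool cases and controls, shuffle
    them with a fixed seed (so the layout is reproducible), and fill
    plates in order.  Plate size and seed are example values only."""
    samples = [(s, "case") for s in cases] + [(s, "control") for s in controls]
    rng = random.Random(seed)
    rng.shuffle(samples)
    # Slice the shuffled list into consecutive plates.
    return [samples[i:i + plate_size]
            for i in range(0, len(samples), plate_size)]
```

In practice the lab's own plating constraints (controls in fixed wells, sex arrangement) would be layered on top of this, which is why the talk says to discuss it with the lab.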
You can't do that on 1,000 cases and 1,000 controls, or 10,000 cases and 10,000 controls. You need software to do it, and the software isn't really sensitive or specific enough yet. It's getting better. Hopefully we'll get closer, and I'll talk about it in a little bit. The companies are adding more technologies to try to make this better, but it's coming. You'll have some information in your data sets, but it's not quite easy yet. This is just one look to show you a little bit about it. This is the genome viewer from the Illumina software. This is just intensity across one chromosome, chromosome 1, for one sample, and it stays right around zero, going up and down a little bit. These are the homozygotes, the heterozygotes, and the other homozygotes. So what you're basically looking for is just that there are three distributions. This is chromosome X in a female. Intensity is there. Three distributions. Here's a male. All the heterozygotes went away and the intensity is lower. So that just sort of proves the idea works in theory. This is a known copy number variant that's fairly frequent on chromosome 10, in one sample in investigator DNA. This is one of the SNPs in that region. So you can see that there are multiple distributions, and that was, whoops, that's actually an amplification, and it's in a region with known segmental duplications. Also, you can see there are not very many SNPs there. That's because of the segmental duplications; it's really hard to find the SNPs, and they didn't put them on the chips. Factors to consider in choosing a product: your population, study design, sample types, combining data with other studies, and interest in copy number variants. The product: the coverage of the genome varies. You get better coverage with more SNPs. Which SNPs are they tagging? Are they near genes? What's the quality of the data? What's the performance on your sample types? And is there information on CNVs, and what is it?
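A very crude version of the intensity-based copy-number screening described here is a sliding window over per-SNP log intensity ratios; real CNV callers (including the vendors' tools) are far more sophisticated, and the window and threshold below are invented for illustration:

```python
def flag_low_intensity_runs(log_r, window=5, threshold=-0.3):
    """Sketch of deletion screening from intensity: slide a window over
    per-SNP log intensity ratios (0 = normal two-copy intensity) and
    flag windows whose mean drops below a cutoff as candidates.
    Returns (start_index, end_index, mean) tuples."""
    hits = []
    for i in range(len(log_r) - window + 1):
        mean = sum(log_r[i:i + window]) / window
        if mean < threshold:
            hits.append((i, i + window - 1, mean))
    return hits
```

Even this toy version shows why the talk calls the problem hard: the window and threshold trade sensitivity against specificity, and amplifications, mosaicism, and noisy SNPs all need separate handling.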
Comparing platforms: make sure the numbers are comparable. I just went through this. The QC rates reported vary. The denominators can differ. Mendel errors can be reported per trio or per sample, replicate errors per pair or per sample. I was testing some new software and looking at new platforms and emailing back and forth with "I'm getting this number." "Oh, I'm getting this number." And a lot of it was people reporting different things. The same thing can happen with looking at coverage. The SNPs are correlated with many others. There are multiple measures of strength of correlation, and there are multiple lists to use as a proxy. Are they comparing to the HapMap? Which version? Are they comparing to some other list of SNPs? Cost: hard to say. It's changing really rapidly, generally coming down a lot. It increases with the number of SNPs on a chip, and it may decrease with the number of samples in your study. And always remember that the reagents, the chips, are only part of the cost. You also have to pay for the lab equipment and the other reagents and the people to run it and all the computers to deal with all the data. New stuff, real quick. New arrays from Affy and Illumina: a million SNPs and copy number content. Very different strategies between the two companies. We'll see how it works. Improved coverage in the Yoruba population, which should help for African-American populations and other African-derived populations. The Illumina 1 million is still in pre-release. It's the same chemistry, workflow, and probe designs as their other chips, so it should work fairly well. There shouldn't be too many new things there. Affymetrix 6.0 is just released. Same chemistry and workflow as the 5.0, some changes in the probe design and the software, so we'll see how it goes. We're going to be trying both these products very shortly in the lab. We have chips on order, and hopefully when I get back we'll start getting some soon. More SNPs are better, right?
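Putting reported QC rates on a common denominator, the point above about Mendel errors per trio versus replicate errors per pair, is simple arithmetic once you pick the denominator. This sketch uses per-genotype rates as that common basis, which is an illustrative choice, not a standard:

```python
def mendel_errors_per_genotype(errors, n_trios, n_snps):
    """Convert a Mendel error count reported per set of trios to a
    per-genotype rate: each trio contributes 3 people x n_snps calls.
    Illustrative arithmetic only; other denominators are defensible."""
    return errors / (n_trios * 3 * n_snps)

def replicate_discordance_per_genotype(discordant, n_pairs, n_snps):
    """Same idea for replicate errors reported per pair: each pair
    compares n_snps genotype calls."""
    return discordant / (n_pairs * n_snps)
```

The specific conversion matters less than doing the same one on both sides before comparing two platforms' numbers.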
I should just do a million, no matter what. Maybe, but not always. Methods that use genotypes on your samples plus HapMap data to infer untyped SNPs exist. They're beginning to work pretty well, so you might be able to infer other genotypes to use in your analysis, and this has been used also to combine data from studies that use different chips. More samples with fewer genotypes may give you more power, so if it comes down to should I use the more expensive chip or should I type more samples, you may have to think about typing more samples. One- or two-stage designs. A year ago everybody was thinking of two- or three-stage designs: doing a genome-wide scan on part of the sample and then following up a subset of the significant results on the rest of the sample. Now it may cost less to do the genome-wide scan on all your samples, because using a standard product is cheaper than using a custom array. Effect sizes of 1.2. Recent results have found effect sizes, but they're really tiny, and so the order of magnitude, as Debbie said, of the sample size we need may have gotten much, much bigger, and this is affecting how people are thinking about their studies. Choosing a platform, you're gonna have to balance coverage, quality control, and cost per sample, and design the most powerful study you can. The cost, the products, the clustering, the QC, and all the analysis methods are changing really rapidly, and what's best today is probably gonna change by the time you put in your grant, and it's definitely gonna change before you get your genotyping done or finish analyzing it. This is where I'm from. Thank you.