There are so many things we need to do with these genome-wide association studies where I think we need much more input from epidemiological scientists. So, as an overview, I'm going to talk a little bit about some of the nitty-gritty decisions you need to make on thresholds of all types as you think about how to choose the SNPs that you carry forward. I'll talk about primary analyses beyond the SNPs genotyped on your platform as a way of figuring out which SNPs to carry forward, and about additional information to consider in choosing those SNPs. This was alluded to on one of the early slides; the paper in Nature that Teri was involved with talked about this: publication-level information, bioinformatics approaches, pathway studies and so forth.

So let's talk a little bit about these nitty-gritty decisions. You heard a lot from Elizabeth and from Laura about data filters, and I think Gemma Stella will talk a little bit about filtered versus unfiltered data and the quality flags you might set up. Some people prefer to start out with basically unfiltered data and carry things through using quality flags for a lot of different metrics: Hardy-Weinberg equilibrium flags, genotyping quality flags and so forth. So you can filter in a number of different ways. Some people don't even want to look at markers with relatively low minor allele frequencies, because those are very frequently associated with extreme P values that don't replicate well across studies. So you have to make a decision right away about whether you're going to filter out a lot of what you think is noise and have a sort of pure set of SNPs that you carry forward, or basically take through a lot more of the SNP data, recognizing that at the end you're going to have to check those flags before you decide to genotype a SNP, because it may be pure crap. A minimal sketch of both strategies appears at the end of this passage.

So, as I said: thresholds before, or flags after, analysis. Using the quality flags has the advantage of preserving the fleas, as we call them, sometimes the kind you get when you lie down with dogs. These are funny-looking signals, and as Debbie Nickerson alluded to earlier, the copy number variations were basically discovered as these fleas on what looked like otherwise good dogs. The point is, some of these may be telling us about very important parts of the genome that could be related to our phenotypes of interest. So some groups prefer to just use the quality flags.

An underappreciated challenge comes from the different choices that groups make. As Laura alluded to, a lot of emphasis is now being put on the possibility of combining data sets, even pre-publication. As people go to share these really big data sets, you have to keep in mind whether a group filtered prior to analysis, so that their list of signals is a relatively pure one, or whether they used relatively unfiltered data and set quality flags, in which case you have to pay attention to the quality flags as you try to combine the data. People need to be much more aware of this going forward as these data sets start to build and groups make different choices about where they filter and where they flag.

When we talk about follow-up studies and optimal designs, the bottom line is that there are just very practical constraints, a lot of the time, on what you can do for follow-up.
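Before getting to follow-up, here is the promised minimal sketch of the filter-versus-flag choice, assuming per-SNP QC metrics sit in a table; the file name, column names, and thresholds are all illustrative, not taken from any particular pipeline.

```python
# Minimal sketch of "filter before analysis" vs "flag and check after",
# assuming per-SNP QC metrics in a pandas DataFrame; all names are
# hypothetical and thresholds illustrative.
import pandas as pd

snps = pd.read_csv("snp_qc_metrics.csv")  # hypothetical file: one row per SNP

MAF_MIN, HWE_P_MIN, CALL_RATE_MIN = 0.01, 1e-6, 0.95

# Strategy A: filter up front and carry forward a "pure" SNP set.
filtered = snps[
    (snps["maf"] >= MAF_MIN)
    & (snps["hwe_p"] >= HWE_P_MIN)
    & (snps["call_rate"] >= CALL_RATE_MIN)
]

# Strategy B: keep everything but set quality flags, to be checked before
# any signal is followed up (this preserves the "fleas").
snps["flag_low_maf"] = snps["maf"] < MAF_MIN
snps["flag_hwe"] = snps["hwe_p"] < HWE_P_MIN
snps["flag_call_rate"] = snps["call_rate"] < CALL_RATE_MIN
snps["any_flag"] = snps[["flag_low_maf", "flag_hwe", "flag_call_rate"]].any(axis=1)
```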
For relatively rarer diseases and phenotypes, people may put every single case they've been able to assemble from the world's data into a first-stage analysis, because they feel they need the maximum power, and then you have different types of follow-up that you can do. Alternatively, for very common diseases, and we talked about type 1 diabetes, type 2 diabetes, and various forms of cancer that are quite common in populations, what goes into the first-stage study is clearly just a very small subset of the individuals who would potentially be available for study. So then you've got a real tension in terms of what you can afford to do in follow-up.

Sometimes these follow-ups are limited by cost. You may be competing ferociously to get that first publication out, so sometimes the follow-up studies are going to be limited by time, or by the number of samples you have available. And the follow-up studies may be very different when you're talking about copy number variations than when you're talking about SNPs, because following up signals from what may be copy number variation can entail a completely different kind of follow-up.

I think you'll start to see arguments that approaches like the one I'm going to talk about for looking at untyped alleles, this TUNA approach, or the imputation approaches that Laura talked about, certainly reduce and may ultimately eliminate the need for follow-up of additional SNPs within the same samples you included in the original study. That is, for common variation in the genome, as you get upwards of a million SNPs, between what you directly type and what you can indirectly interrogate using imputation or these other statistical approaches, you're basically testing everything in that primary sample; you're getting all of the common variation. You may then choose other SNPs to type in follow-up samples, but you may not need to type much else in your original sample. That may not be true with copy number variation, where you sometimes get direct hits for copy number variants that haven't been described before, and you really need validation before you can fully characterize them. And even the types of follow-up studies you would do for known copy number variants use different technologies. So that's another thing to consider.

The number of SNPs that you carry forward can be something that's predetermined. You may be planning to do an Illumina bundle, for example, so you'll have 1,536 SNPs that you're going to follow up, and that's it, and then you're faced with: well, how do I choose those 1,536 SNPs? Alternatively, you may design your follow-up study based on thresholds that you predetermined for P values or false discovery rates, or you may decide after you look at your results what merits follow-up. I'm of the "let a thousand flowers bloom" belief: there are lots of scientifically justifiable strategies here, but these are the things you need to be thinking about in considering the designs of these studies. A sketch of the fixed-panel and threshold-based choices follows.
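Here is a minimal sketch of those two selection strategies, assuming per-SNP association P values in a table; the file and column names are illustrative, and the threshold step is a standard Benjamini-Hochberg procedure rather than anything specific to this talk.

```python
# Sketch of two ways to pick follow-up SNPs: a fixed panel size (e.g. a
# 1,536-SNP panel) versus a predetermined false discovery rate.
import numpy as np
import pandas as pd

results = pd.read_csv("gwas_results.csv")  # hypothetical: snp_id, pvalue

# Option 1: predetermined panel size -- take the 1,536 smallest p-values.
panel = results.nsmallest(1536, "pvalue")

# Option 2: predetermined FDR via the Benjamini-Hochberg procedure.
def bh_select(pvals: np.ndarray, q: float = 0.05) -> np.ndarray:
    """Boolean mask of discoveries at FDR level q."""
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= q * np.arange(1, m + 1) / m
    k = np.nonzero(passed)[0].max() + 1 if passed.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True  # reject the k smallest p-values
    return mask

results["follow_up"] = bh_select(results["pvalue"].to_numpy(), q=0.05)
```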
Now, for the primary analysis beyond the direct tests you do, I'm going to talk a little bit about a different approach for testing untyped alleles. This I call TUNA Nicolae, because Dan Nicolae developed the approach. I think of it as a nice Romanian dish where the key ingredients are a multi-locus measure of linkage disequilibrium and the availability of a reference sample like the HapMap.

Laura talked about this a little bit indirectly, but the idea with these imputation approaches, or with what we call TUNA, testing untyped alleles, builds on the fact that in the HapMap we have these reference samples with some set of SNPs that have been genotyped. So here I show three SNPs that we'll say have been directly genotyped on our platform, and a test SNP that we know about because it exists in the HapMap. In fact, if we look at our HapMap samples, with these biallelic polymorphisms, we see that given the three SNPs on our platform, the test SNP can be completely determined by the combination of the three: the genotypes are easily determined because the zero allele at our test SNP is found only on this one haplotype, and the one allele is found on all other haplotypes. Knowing that allows you, with high statistical certainty, to assign the alleles and genotypes at our test polymorphism in individuals outside this reference sample, simply by having typed the polymorphisms that are on our platform.

The advantages of these kinds of approaches: you utilize existing information on linkage disequilibrium from something like the reference samples, but without arbitrary block definitions. You can still construct basically one-degree-of-freedom tests for each known variant, with you deciding on the degree of uniqueness you want to go after. As we discussed, the in silico comparisons do require some kind of biological validation. And of course, with these approaches you can't capture information you don't know about, so with these direct approaches you're limited to the variation that's known in the HapMap, although the Marchini and Donnelly approach actually does extend to considering each nucleotide, as it were, as a possible site of variation you don't know about.

Something like TUNA, this testing of untyped alleles, you can use for in silico follow-up: you could set a low threshold and TUNA-type every SNP in the vicinity. You can convert lower-density screens to higher density. And, as Laura mentioned, these approaches are really useful for enabling comparisons of studies across disparate platforms.

Basically, the way this works is that you start out with your set of SNPs that are directly genotyped. Then, for every other SNP that you know about in your reference sample, you determine whether that SNP provides sufficiently unique information to merit interrogation; if it does, that implies there's no or little pairwise linkage disequilibrium with something you've already typed, so an R-squared less than some threshold that you set, and here I put 0.7. Then you can use multi-locus haplotypes to try to impute the genotypes, as the imputation approaches actually do, or, with TUNA, basically just estimate the allele frequencies. So we find the smallest subset of SNPs able to interrogate the genotypes with sufficient accuracy, looking again at these multi-locus R-squared values. The nice thing about TUNA is that your primary template needs to be derived only once for each high-throughput SNP set; the Affymetrix 500K template would need to be derived only once, theoretically, though it's so fast to do this that you may choose to optimize it for each project with the set of SNPs that are actually passing your QC.
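As a toy illustration of that reference-haplotype logic, here is a deliberately simplified sketch. Real TUNA works from unphased genotypes and multi-locus R-squared measures; this version assumes phased haplotypes purely to keep it short, and all the data are made up.

```python
# Toy version of the idea: use reference haplotypes (HapMap-like) over a few
# typed SNPs to predict the allele at an untyped test SNP, then estimate its
# frequency in a study sample. Phased haplotypes are assumed for simplicity;
# this is NOT the actual TUNA algorithm.
from collections import Counter, defaultdict

# Reference haplotypes: (typed SNP alleles..., test SNP allele).
reference = [
    (0, 0, 1, 0),  # test allele 0 rides only on this typed background
    (0, 1, 0, 1),
    (1, 0, 0, 1),
    (1, 1, 1, 1),
    (0, 1, 0, 1),
]

# P(test allele | typed background), tabulated from the reference sample.
cond = defaultdict(Counter)
for *typed, test in reference:
    cond[tuple(typed)][test] += 1

def predict_freq(study_haps):
    """Estimated frequency of test allele 1 from typed-SNP haplotypes only."""
    total = 0.0
    for hap in study_haps:
        counts = cond[tuple(hap)]
        n = sum(counts.values())
        total += counts[1] / n if n else 0.5  # unseen background: uninformative
    return total / len(study_haps)

study = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (1, 1, 1)]
print(predict_freq(study))  # estimated untyped allele-1 frequency: 0.75
```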
With the TUNA approach, where we're not really trying to impute individual genotypes but rather to estimate the frequencies of the untyped alleles, this is a few hours of computing on a very modest-sized cluster, a 10-processor Linux cluster of the kind we have in our own lab, as opposed to the bioinformatics cluster we use for really serious crunching. So it's basically an overnight job to test all of the polymorphisms in the HapMap given any platform.

And you get substantial return on that kind of investment. Even with the Affymetrix 100K set, you start out, say, in the CEPH European samples with about 95,000 SNPs that are actually polymorphic in the set of Phase II HapMap samples and pass all the HapMap QC filters we set up. You get a lot of SNPs for free, remember, because a lot of the SNPs in the genome are in high pairwise LD with SNPs on your platform, but on top of that you can interrogate an additional couple hundred thousand SNPs in the Europeans, and somewhat fewer in the Asian or African samples in the HapMap, so that only about 1.4 to 2 million SNPs are left uninterrogated even using the 100K set.

If you go to a higher-throughput platform, you do much better. With the Affymetrix 500K set you're getting 1.4 million SNPs for free, that is, in high pairwise linkage disequilibrium with SNPs you've already typed, but you can interrogate almost an additional couple hundred thousand, so that what's left is only about 500,000 SNPs, and those are really from the rare end of the frequency spectrum. So it's possible, even with this medium-throughput platform, to get really all of the common variation, and what's left is from the very rare end of the minor allele frequency spectrum. The same is true for the Illumina 317K platform: you get a lot that comes for free, because the Illumina platforms were designed knowing about the linkage disequilibrium, and you're able to pull in another 256,000 SNPs using multi-locus LD approaches, so that what's left, again, is four or five hundred thousand SNPs from the very rare end of the minor allele frequency spectrum. With something like TUNA, biased or inaccurate estimates of your reference frequencies affect only the power, not the type I error, so you're not really biasing yourself in that way, and that can be particularly appropriate for samples that are really outside the HapMap reference groups; we study Mexican Americans and consider that an advantage.

Here are some examples. This is the 317K platform in a set of Europeans. The direct values come from the actual 317,000 SNPs that are examined; for the indirect values, we drop one SNP out and interrogate it with TUNA exactly as if it were an untyped SNP. You can see there's a very good correlation between the allele frequencies indirectly estimated using TUNA and those directly assessed by genotyping, and a relatively tight distribution of what might be called the error in the estimate.
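That drop-one-out check can be expressed as a small sketch. The `indirect_freq` callable below stands in for a TUNA-like estimator and is hypothetical, as are the array layouts; only the evaluation logic is shown.

```python
# Sketch of the leave-one-out validation described above: drop each typed
# SNP, re-estimate its allele frequency as if it were untyped, and compare
# with the directly genotyped frequency.
import numpy as np

def leave_one_out(genotypes: np.ndarray, indirect_freq) -> float:
    """genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts.
    indirect_freq(genotypes_without_j, j) -> estimated frequency for SNP j
    (a hypothetical TUNA-like estimator supplied by the caller).
    Returns the correlation between direct and indirect estimates."""
    n, m = genotypes.shape
    direct = genotypes.mean(axis=0) / 2.0          # observed allele frequencies
    indirect = np.empty(m)
    for j in range(m):
        others = np.delete(genotypes, j, axis=1)   # pretend SNP j was untyped
        indirect[j] = indirect_freq(others, j)
    return np.corrcoef(direct, indirect)[0, 1]
```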
If we go out to Mexican Americans, where we only had a 100K scan, there are only 100,000 SNPs available to do the interrogations, so we don't see as tight a correlation, but there is a very clear correlation. We see a little more error in measurement, but it's still a lot better than a sharp stick in the eye for getting additional information from the rest of the genome. Two minutes, Nancy? Okay.

So we get pretty tight correspondence, and the higher you set your threshold for the multi-locus measure of LD, the better you do. Here's something where we're allowing anything to come in as long as it has a multi-locus R-squared value between 0.7 and 0.9; if we require it to be above 0.9, you do better. And if you look at the test statistics, so this is a chi-square value, again comparing everything versus requiring the multi-locus R-squared measure to have been above 0.9, we're getting pretty good correspondence.

Now, additional information for prioritizing SNPs. Remember that a lot of the first genome-wide association studies are drawing their cases from samples that were included in linkage analyses, or have phenotypes that were previously used in linkage mapping studies, so you might choose to prioritize based on existing linkage signals. Of course, there are a lot of genome-wide association studies on the same or related phenotypes being done contemporaneously, and it's hugely valuable to try to put that information together and have your initial in silico replications at the time of publication, as they did with the diabetes studies that were published in Science recently.

But we all recognize we're doing this in part because we want to get at all the genes, to get at some sort of pathway. Take the genes first identified for monogenic forms of type 2 diabetes: knowing that one of these was a transcription factor got us into an entire pathway that enabled the discovery of many others simply through candidate gene studies. That's the position we want to be in, and one of the speakers alluded to that before. There are lots of databases available to do that. The problem is that those downstream annotations require input of genes, and we get our signals from SNPs, so we need better annotation. We have physical annotations that we can use; people talked about LD relationships to local genes and expression phenotype information. We also need to know how well each gene is interrogated, either directly or indirectly, by our platform as we go into these studies, because if we don't, we'll make mistakes. You can put a set of genes into one of these pathway programs and look for functional annotations that are over- or under-represented, and one of the things that might come out, for example, is that genes involved in immunity and defense are underrepresented just because the platform did a poor job of interrogating those genes. We need to take that into account as weights, and this sort of annotation can really change how you think about the meaning of a SNP.

Here we've got a SNP in the middle of an intron within a single gene, so physical annotation classifies it as an intronic SNP in a certain gene. But if we look at how well variation at nearby genes allows interrogation of that SNP, how strong the LD between that SNP and the other genes is, we see that it's in very strong LD not only with the gene it's in but with an adjacent gene as well. That's really useful information to take forward, as in the sketch below.
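A toy version of that gene-level LD annotation might look like the following; the data structures and the source of the pairwise R-squared values are hypothetical, and the threshold is illustrative.

```python
# Toy sketch: annotate a signal SNP with every gene whose variation it
# interrogates, by taking the best pairwise r-squared with SNPs assigned
# to each nearby gene.
def annotate_by_ld(signal_snp, gene_to_snps, r2, threshold=0.8):
    """gene_to_snps: {gene: [snp ids]}; r2(a, b) -> pairwise r-squared
    (a hypothetical lookup, e.g. precomputed from reference haplotypes).
    Returns genes whose best r-squared with the signal SNP passes threshold,
    sorted strongest first."""
    hits = {}
    for gene, snps in gene_to_snps.items():
        best = max((r2(signal_snp, s) for s in snps), default=0.0)
        if best >= threshold:
            hits[gene] = best
    return dict(sorted(hits.items(), key=lambda kv: -kv[1]))
```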
If we then consider the information on expression, just in lymphoblastoid cell lines, of that SNP with nearby genes, we learn that globally this SNP predicts expression of genes on a couple of different chromosomes, but it also predicts quite strongly, at a P of 10 to the minus 6, the expression of this adjacent gene. So not only is it in LD with that gene, it predicts its expression reasonably well. And if we look at the local rank, it's only fifth out of the 22 SNPs in the gene it sits in for predicting that gene's expression, but it's the top SNP for this adjacent gene, with a P value of 10 to the minus 6, and it's not really doing much of anything here. That changes a lot how we think about which genes might feed into the downstream bioinformatics approaches. This is a database we're working on and hope eventually to turn over to people who really do databases, like dbGaP.

So, thanks to my colleagues and collaborators. Teri talked about the waves of data, and we're just hoping to keep everybody from ending up like this poor surfer dude when they do their genome-wide association studies.