I'd really like to thank Terry and the organizers for the opportunity to speak; this is a terrific and very timely session. As you've already heard from the speakers so far this morning, it's all about replication. So I'm going to make a few remarks about how that can work and about choosing studies in which to attempt replication, starting with how to choose the studies for the original genome-wide association study. We'd all agree, this being an epi meeting, that a sine qua non would be to select a study with cases and controls as well matched as possible with respect to a variety of confounders, and in the case of genetics, ancestry is key. We want to minimize this thing called population stratification, which is really just confounding by ancestry: if our cases are drawn from a population with a different mix of ancestries than our controls, our case-control analysis will pick up a lot of alleles that merely differ between the ancestral populations. One immediate strategy, by and large the one employed to date, has been to restrict to a single self-identified ethnic group. The question is how well that handles substructure within populations, differences in ancestral background that can't be summarized by census-based definitions of race or ethnicity. So there's been a lot of enthusiasm for statistical control of potential confounding by ethnic substructure using genomic control techniques. A very popular approach now is essentially a factor analysis of SNPs across the genome, be they ancestry-informative SNPs known to differ between major ethnic groups or essentially random SNPs from these large SNP chips. And the question is how good those techniques are at reducing, perhaps even eliminating, differences between outcome groups that we can't control for merely by restricting to a single self-identified group. Here's a little data from Pete Kraft using a GWAS in the Nurses' Health Study, which I'll return to later. The intent was to take a phenotype where we would strongly suspect potential confounding by substructure even within people of self-described European ancestry, and that phenotype is hair color, where there is essentially a known cline between Northern and Southern Europe: more dark hair in the South, more light hair, up to red hair, in the North. So this is a phenotype where, as we analyze hair color, we would expect to pick up allele prevalence differences that, as a generalization, differ between Southern and Northern European populations. And that's exactly what he finds. This is a QQ plot of the expected distribution of P values across 528,000 SNPs from the Illumina platform in about 2,400 women with self-reported hair color and a variety of other phenotypes, plotted against the observed P values. If there were no difference between expected and observed, that is, no truly causal alleles for hair color and no systematic allele prevalence differences, on the basis of population, between dark and lighter hair color, the observed distribution would just line up on the diagonal. And in the crude analysis there's a substantial departure, with a large excess of SNPs in this region whose P values differ between observed and expected.
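To make the QQ plot construction concrete, here is a minimal Python sketch, with simulated p-values standing in for the real 528,000-SNP scan; the mix of null and "causal" SNPs below is purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-in for the real scan: mostly null (uniform) p-values,
# plus a small set of truly associated SNPs with very small p-values.
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=528_000), rng.uniform(size=50) * 1e-8])

n = len(pvals)
# Under the global null, the i-th smallest p-value is expected near i / (n + 1).
expected = -np.log10(np.arange(1, n + 1) / (n + 1))
observed = -np.log10(np.sort(pvals))

plt.plot(expected, observed, ".", markersize=2)
plt.plot([0, expected.max()], [0, expected.max()], "k--")  # diagonal: no inflation
plt.xlabel("Expected -log10(p)")
plt.ylabel("Observed -log10(p)")
plt.show()
```

Points lifting off the diagonal across the whole range, as in the crude hair-color analysis, suggest systematic inflation (for example from stratification), whereas only a few extreme points departing at the tail is the pattern expected from a handful of true associations.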
And the statistical parameter often used to summarize substructure is the lambda, 1.24 in this case, which is substantial and says there are likely to be substantial allele prevalence differences between dark versus light or red hair color. At the end of the day we're only expecting a limited number of truly causally associated SNPs that actually control that phenotype, so the suspicion is that a large number of these SNPs reflect, essentially, confounding by ethnicity. Then when Pete applies EIGENSTRAT factor analysis, derives four components, essentially clusters of alleles that appear to group by ethnic background, and controls for them in the analysis, the lambda settles right down to 1.02, and far fewer SNPs are associated with this phenotype, just on the basis of statistically controlling for clustering that is, or is presumed to be, on the basis of ethnic substructure. Interestingly, going from four to 50 such dummy variables in the regression makes very little difference. So there's evidence, at least in these data, for a limited number of population subgroups that, if controlled for, substantially account for the potential confounding by ancestry for a phenotype where we would expect such confounding. There is now a large number of examples of this in the literature suggesting that, in populations we have made our best effort to match initially, these statistical techniques are actually pretty good. Examples appear in the Wellcome Trust Case Control Consortium study, which Terry referred to earlier, and I really recommend digging through the 40 or 50 pages of supplementary materials; they also discuss population stratification and show examples of this in action in their studies. But that study meets no epidemiologist's definition of great epidemiology: seven case groups of up to about 2,000 cases each, rather eclectically sampled from different clinical and other populations, compared with two control groups, blood donor controls and a birth cohort from the late 1950s. And yet, at least at the first cut, in their crude analysis, and they do a substantial amount of additional adjustment in their supplementary analyses, they have hits. As Debbie mentioned, no hits for hypertension, but hits for the other diseases, which in the initial paper they show replicate, and in a whole string of papers now coming out a number of the strongest hits replicate. So, not to be pejorative in any sense, you might call this quick and dirty epidemiology by our usual Society for Epidemiologic Research pristine standards, and yet the top hits at least are clearly robust if you take the first step of doing the study within a single country and restricting on self-reported ethnicity to limit, in this case, to populations of European ancestry. So broad matching on ancestry and region is probably adequate for discovery of the strongest hits, and statistical methods for control of population stratification, at least with the demonstrations so far within populations of European ancestry, are adequate to help weed out initial hits that are merely based on confounding by ethnicity. The big question is, would more rigorous, better technique in study design permit, or even be necessary for, the discovery of weaker associations?
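A minimal sketch of the two ingredients just described, assuming simple array-style data rather than the actual Nurses' Health Study analysis: the genomic-control lambda computed from 1-df test statistics, and a logistic regression that adjusts a single SNP's association for top principal components in the spirit of EIGENSTRAT. Variable names and the statsmodels call are illustrative assumptions.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

def genomic_control_lambda(chisq_1df):
    """Lambda = median of the observed 1-df chi-square statistics divided by
    the median of the null chi-square(1) distribution (about 0.4549).
    Values well above 1 (e.g. 1.24) suggest inflation from stratification."""
    return np.median(chisq_1df) / stats.chi2.ppf(0.5, df=1)

def assoc_with_pcs(genotype, phenotype, pcs):
    """Logistic regression of case/control status on one SNP (coded 0/1/2),
    adjusting for top principal components of the genome-wide genotype matrix,
    which act as surrogates for ancestry."""
    X = sm.add_constant(np.column_stack([genotype, pcs]))
    fit = sm.Logit(phenotype, X).fit(disp=0)
    return fit.pvalues[1]  # p-value for the SNP term

# Toy illustration: inflate null chi-square statistics by 1.24 and recover
# roughly that lambda. Real analyses use dedicated tools (EIGENSTRAT, PLINK).
rng = np.random.default_rng(0)
chisq = rng.chisquare(df=1, size=100_000) * 1.24
print(round(genomic_control_lambda(chisq), 2))  # ~1.24
```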
And I think that's really a live question, something we're going to discover over the next few years, and it could usefully be the substrate for some simulations and methodological analysis. Because the real issue is that when the signal is low, when we're looking at much weaker hits than these initial ones, there's a huge potential for false positives, noise due to multiple comparisons, and how does that compare with the noise due to poor matching of controls? We can deal with the false positives with enough replication, as I'll discuss, but the real issue is: if we do a poorly designed GWAS and then go on to replicate subsets of those SNPs, are we missing hits initially, false negatives, that we then fail to even attempt to replicate? I think everybody would agree that we should always do the initial GWAS in the best-designed population available for that particular phenotype, but for many phenotypes we have to admit that we don't actually have best-quality epidemiologic studies of substantial size, and so we're forced into these more convenience-based approaches of clinical cases with common controls. Now, I'm not going to go through this in any detail, but it's already been referred to: Steve Chanock, Terry, and others put together a very nice review of replication, which sort of makes my job here unnecessary if you just read the review that came out in Nature a couple of weeks ago. It points out that we have to be very careful about language when we talk about replication: if we are talking about replicating initial studies, we should be careful to replicate the same association, not a different phenotype, and the same genotype, not the same gene with a different SNP that may not be in LD, et cetera. They make some very nice recommendations for how to organize and present this, to think about it in prospect, but also how to analyze, present, and even review and edit the data, to minimize the amount of noise we introduce at the replication stage by people essentially overreaching, claiming that an association replicates because a SNP sits in the same gene as a previous association when that SNP may be in no LD with the previously reported SNP; that's clearly a very different form of replication, if it is replication at all, from replicating the same SNP for the same phenotype. A few considerations about how to approach the initial GWAS: there's clearly a wide variety of technical issues that other speakers are going to talk about. The standard advice we think about for any biomarker, handle the case and control samples exactly the same way at every stage, would be best practice, though it's not always achievable. John Todd has a very nice paper in Nature Genetics last year showing that DNA from cases extracted by exactly the same DNA extraction method as DNA from controls, but with the method operationalized in two different labs, gave rise to different genotyping characteristics, different signals on the genotyping platform, merely because the DNA was extracted in different laboratories using the same method. So ideally we would always handle case and control specimens exactly the same way, and we're going to be substantially limited in epidemiologic studies by whose study has actually collected DNA; the bottom line is that blood and buffy coat seem to yield very good quality DNA in most people's hands. Then there is the large number of studies that have collected buccal cells using a variety of protocols.
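On the earlier point that a SNP in the same gene may be in no LD with the originally reported SNP: the basic check is to compute r-squared between the original and the candidate "replication" SNP from genotypes in a reference or control sample. A minimal sketch with simulated dosage data follows; the dosage-correlation approach is an approximation, and phased haplotype data would give the exact value.

```python
import numpy as np

def ld_r2(geno_a, geno_b):
    """Approximate r^2 between two SNPs from their 0/1/2 allele-dosage vectors
    (composite LD; phased haplotypes give the exact haplotype-based value)."""
    r = np.corrcoef(geno_a, geno_b)[0, 1]
    return r ** 2

# Simulated example: SNP B constructed to be correlated with SNP A.
rng = np.random.default_rng(1)
snp_a = rng.binomial(2, 0.3, size=2000)
snp_b = np.where(rng.random(2000) < 0.9, snp_a, rng.binomial(2, 0.3, size=2000))

print(f"r^2 = {ld_r2(snp_a, snp_b):.2f}")
# If r^2 is near zero, an association at SNP B is not a replication of the
# SNP A signal, even when both SNPs sit in the same gene.
```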
Some of the initial data there are encouraging: Heather Feigelson has a recent paper in CEBP looking at the swish-and-spit Scope mouthwash protocol and got very good large-scale genotyping completion rates and concordance with buffy coat samples from the same people. But anybody who has worked with buccal cells knows that quality varies widely, and probably the majority of these samples are not going to easily sustain very large-scale analysis, particularly the older samples collected under other protocols longer ago. As for whole-genome amplified DNA, right now Affymetrix will accept it and it will run; Illumina is not quite there yet, but they say it's in development. So again, this has major implications for studies that have only a tiny amount of DNA and don't have the one or two micrograms you really need to apply one of these SNP chips with the current technology. What do we mean by replication? For statistical replication, again, the recommendations from the group in Nature point out that you'd like initially to attempt replication in studies that are very similar to the initial study, with a similar definition of phenotype and similar ancestry; but, as one of the questioners asked about generalizability, we'd obviously like to know how replicable these findings are across populations with different ancestries, and here different ancestral backgrounds, as Debbie Dickerson referred to, might actually help us narrow the interval of linkage disequilibrium and move from linkage to causal association. So there's definitely a role for essentially cookie-cutter, very similar studies, but also for reaching out to studies involving people of different ancestries. On study design, we would almost always like, if we could get it, prospective data rather than case-control data, particularly for diseases with fairly rapid fatality, because that protects us against survivor bias. In theory, if you have a blood resource, informed consent, and samples from everybody, then you have essentially 100% participation of both cases and controls, and that should protect against selection bias. In prospective studies the environmental data are also not susceptible to recall bias the way they can be in retrospective studies, which will aid the interpretability of gene-environment analyses, and there's always the chance that, if you're doing a prospective study with plasma or serum measurements, those biomarkers will be simultaneously available to the GWAS. Study quality is a tricky issue, and we've spent a lot of time in epidemiologic meetings trying to define what we mean by it. When we apply it to at least these initial-phase GWAS, the slightly sobering thing, for somebody brought up with best practices of epidemiology and thinking about methods a lot, is that today it's not apparent that there's really a strong relation between what we would usually use as measures of quality and the probability of replication, and I'll show an example of that. But that may be because we're looking for the stronger signals, which are robust to some selection bias or limited participation, and quality may matter more for weak signals.
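On why weak signals push everything toward sample size, a back-of-the-envelope power sketch for a per-allele (trend) effect, using a normal approximation to the log odds ratio; the minor allele frequency, odds ratio, and significance threshold below are illustrative assumptions, and in practice dedicated tools such as Quanto or CaTS would be used.

```python
import numpy as np
from scipy import stats

def approx_power(n_cases, n_controls, maf, per_allele_or, alpha=5e-8):
    """Approximate power of a per-allele test in a case-control study,
    using an allele-counting approximation to the variance of the log OR."""
    beta = np.log(per_allele_or)
    var = (1.0 / (2 * n_cases * maf * (1 - maf))
           + 1.0 / (2 * n_controls * maf * (1 - maf)))
    z_alpha = stats.norm.isf(alpha / 2)        # two-sided threshold
    ncp = beta / np.sqrt(var)                  # expected test statistic
    return stats.norm.sf(z_alpha - ncp) + stats.norm.cdf(-z_alpha - ncp)

# A weak effect (per-allele OR ~1.2, MAF 0.3) at genome-wide significance
# is essentially undetectable with hundreds of cases but tractable with thousands.
for n in (900, 2000, 8000):
    print(n, round(approx_power(n, n, maf=0.3, per_allele_or=1.2), 3))
```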
But again, the initial information really suggests that sample size might even trump quality, as long as we stay within some reasonable limits, don't throw our standard methods right out the window, and make some attempt to match on region and ancestry. Sample size is important because we're dealing with much weaker signals than we thought we might be when this really got going a few years ago. Here's an example from the NCI Breast and Prostate Cancer Cohort Consortium, almost 8,000 cases of prostate cancer, and this is a SNP in one of the regions Debbie referred to where there's no gene in sight, 8q24, initially identified by the deCODE group and the Multiethnic Cohort. Here's the pooled result across 8,000 cases: a relative risk of about 1.3 for the heterozygotes, 1.8 or so for the homozygotes, 10 to the minus 19 for the test for trend. But that was obtained by pooling data across seven nested case-control studies, what we usually think of as best-practice quality, and for most of the individual studies the rare, or less common, homozygote relative risk is not significant, and some of the tests for trend are significant and some are not. If we had only looked at the EPIC cohort, for instance, as a large-scale, high-quality replication with about 900 cases, that test for trend isn't even significant, and we would have called that a failure to replicate. So again, the message that's been said earlier: we need large numbers and we need to pool across multiple studies to pick up these relatively weak effects. Very quickly, here's another example, from the Easton et al. Nature paper, looking at five SNPs that came through their three-stage design in breast cancer. Here's the strongest SNP, the SNP in FGFR2: their initial association in stage one, which had to be strong to be picked up for screening in stage two, where they screened about 13,000 SNPs out of the initial 240,000 they looked at. Here's the winner's curse in action: we know that when we screen on the top P values we're going to be biased in favor of picking up the associations that just happened to be strongest in that first stage, which suggests that a lot of the initial screens need to be looked at more carefully, because there will be things that didn't pass the threshold for confirmation that are probably truly associated but just weren't that strong in the initial data. And then a large stage three with 20,000 cases across almost 20 studies. The quick bottom line is that, at least for this SNP, there's very strong, robust, 10 to the minus 70 ultimate confirmation of an effect of about 1.3 per allele and about 1.6 for homozygotes. And in stage three, it's interesting that, no names mentioned, this is really a mix of studies that epidemiologists would consider best practice, like the nested case-control studies in the list, and other studies that were almost in the convenience category, a series of cases with hospital-based controls or other control information. And there was really little relation, at least for the strongest hit, between what we'd identify as best quality and what we'd identify as least good quality. So at least for the strongest hits, replication seems fairly robust to quality, perhaps because, as Wendy pointed out, there's often not likely to be a very strong relationship between probability of participation and genotype.
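To make the pooling concrete, here is a minimal sketch of inverse-variance, fixed-effect pooling of per-study log relative risks. The numbers are made up for illustration (not the BPC3 estimates), chosen so that several studies are individually non-significant while the pooled result is strong.

```python
import numpy as np
from scipy import stats

def pool_fixed_effect(log_rr, se):
    """Inverse-variance weighted fixed-effect pooling of per-study log
    relative risks; returns the pooled estimate, its SE, and a two-sided p."""
    w = 1.0 / np.asarray(se) ** 2
    pooled = np.sum(w * np.asarray(log_rr)) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    z = pooled / pooled_se
    return pooled, pooled_se, 2 * stats.norm.sf(abs(z))

# Seven hypothetical nested case-control studies, each with RR around 1.3
# but with wide confidence intervals (several not significant on their own).
log_rr = np.log([1.25, 1.40, 1.15, 1.35, 1.28, 1.45, 1.20])
se     =        [0.12, 0.15, 0.13, 0.11, 0.14, 0.16, 0.12]

est, se_pooled, p = pool_fixed_effect(log_rr, se)
print(f"pooled RR = {np.exp(est):.2f}, p = {p:.1e}")
```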
Once you get into the weeds a little bit, for the less strongly associated SNPs, it's an open question, but I think sample size is really the dominant factor here. So how are we going to deal with the problem of replication? I just want to point out that, as has already been described, posting these data early and often is going to be key. Here's a resource from the NCI CGEMS project, where you have been able to obtain, since October of last year and April of this year respectively, the Illumina-platform results from large-scale analyses of prostate cancer and breast cancer. So essentially you can get instant replication: you don't even have to have an initial genotyping project, you can just have an idea for a gene, dial it up in this database, and immediately get the P value ranking, the P value from a large-scale nested case-control study of prostate cancer and a different one of breast cancer, and that is all essentially open access, no registration, no problems. So basically instant replication is increasingly becoming state of the art, and there is going to be a large number of resources. dbGaP has already been mentioned; Dan Levy is going to talk about Framingham; the Wellcome Trust will tell you that you can get access to their data through a registration and application procedure; and the Diabetes Genetics Initiative has had data up for about three months now on a variety of biochemical phenotypes. So if you have a hypothesis, for some phenotypes you can already go and test it in silico. Just very quickly, to run through how this worked in the CGEMS project for the same FGFR2 hit: here are the six top SNP hits for breast cancer in the nested case-control study in the CGEMS project, and the main point is that two of these top six extreme P values did replicate, but the others did not, including a gene that, when I saw it, looked like a home run: TLR1/TLR6, which has been associated with prostate cancer in a number of studies, so it seemed like a home run here for breast cancer. Zero evidence of replication. So again, replication is key, and just looking at the name of the gene on the initial list will take us down a lot of blind alleys, as well as blind assays, I guess. There were four FGFR2 SNPs on the initial chip, and here are three studies replicating the initial association with about the same strength of signal, but again, one of these studies actually wasn't significant and two of them were, so we need to put together as many studies as we can to get pooled analyses. And even here there's a substantial unfinished agenda, because these are essentially linkage-disequilibrium signals: we have to find the causal variant, and that, presumably, will tell us about mechanisms of carcinogenesis. So, to finish, the hits keep coming, the season's still got a long way to run, and there's a very substantial unfinished epidemiologic and public health agenda, which I think we can all contribute to whether or not we're in the GWAS, or even the genetic epidemiology, business at all. Gene-environment interaction, obviously: what will knowledge of the main effects of these genes tell us about new hypotheses for environmental risk factors, et cetera? Gene-gene interaction, and a lot of interest in trying to summarize all of this in pathway analysis, a huge area. And what are the clinical implications for risk stratification, for screening, for instance?
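On combining those three replication studies, where one was not significant on its own: when only p-values (rather than effect estimates) are available, a simple option is Fisher's method, sketched below with illustrative numbers; pooling the effect estimates themselves, as in the earlier sketch, is generally preferable when they are available.

```python
import numpy as np
from scipy import stats

def fisher_combined(pvalues):
    """Fisher's method: combine independent one-SNP p-values from several
    replication studies into a single chi-square test on 2k degrees of freedom."""
    p = np.asarray(pvalues, dtype=float)
    statistic = -2.0 * np.sum(np.log(p))
    return stats.chi2.sf(statistic, df=2 * len(p))

# Hypothetical replication p-values: one study "fails" on its own,
# but the combined evidence across the three studies is still strong.
print(fisher_combined([0.004, 0.03, 0.18]))
```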
How are we going to manage the science if we agree that some of these SNPs are worth screening for? What interventions could be proposed? What level of evidence are we going to want to see before we know we can safely recommend those interventions or not? What are the health policy implications of being able, increasingly, to identify high-risk and low-risk strata for a wide variety of phenotypes across the population? And the good news is that most of the data for these analyses are going to be either publicly available or relatively cheap: once the GWAS are done, once the multi-stage designs are done and the GWAS are put together, we're back to essentially single-plex or very low-throughput genotyping for the validated hits that come through. And that, again, is much, much closer to everybody's budget, if you have the samples, than these initial, very expensive, large-scale analyses. I'd just like to thank all the team at the Harvard cohorts, particularly Peter Kraft, for some of those analyses, and the BPC3 investigators; again, these are examples of the large-scale multi-center collaboration we've been talking about; and also the CGEMS team at NCI, ACS, and the Harvard School of Public Health. Thank you very much.