So, as people are coming to the microphone again, I'll raise a question with you, Nancy. Could you perhaps describe how we might use epidemiologic data to enhance the annotations that we already have? I mean, it took me a long time to know what an annotation was as an epidemiologist, so you might explain that. You did quite nicely, but then how might we use some population-based data in this? I mean, I think there are obvious things that come to the fore right away.

As for how you could use epidemiological data to annotate SNPs, other than to say that associations with other phenotypes are a hugely valuable annotation to think about going forward: at some point, ideally, you put in a SNP list of interest, say because those SNPs came out of a lung cancer study, and you try to get an understanding of, okay, what are these SNPs about? Some of them may have come up in other cancer phenotypes, so that becomes, okay, maybe it's a generic cancer predisposition or susceptibility locus. Or maybe instead it comes out in studies on asthma, so it's about lung function or lung development. I think you start to build up a whole set of information about polymorphisms when we start to think about the annotations being not just physical, where they're located, but also what they've been implicated in, what they appear to predict in terms of phenotypes. And one nice thing about that is that it's really useful information even if we're not looking directly at the causal SNP but are still just thinking about correlations, because a really important follow-up study is going to be the deep re-sequencing to try to get at the real causal variants. Debbie talked about some of that. But to even know what we should be re-sequencing, I think it's going to be really important to get really good depth on the annotation information. Do you want to comment on that?

Yeah, I think that looking across multiple diseases is interesting even outside of a particular disease type. So there's an example in the type 2 diabetes studies that have been published. Many of them have found association with a locus that looks like it's upstream of a gene called CDKN2A. And independently there was a study published on heart disease where there's also association. But what's interesting is it doesn't look like it's acting through diabetes; it looks like it's a very separate locus there. And so you have two different diseases at the same locus, which makes you start to ask whether the genome isn't actually a very small place. And this is also a locus that's actually involved in cancer risk susceptibility. So if there are certain places in the genome where, when you get variation, you're just more likely to get disease of any type, I think it's going to be very interesting to see.

Now that's a very good point to make: as Laura said, although the genome is a very big place, it's also in many ways becoming a very small place. And so the Wellcome Trust papers published an association between a region, which I forget now, and type 1 diabetes, and the identical association was found for Crohn's disease. And who would have thought that? And there were some for rheumatoid arthritis also that overlap. Yes, exactly. So that I think will help us a lot. Debbie?
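To make the cross-phenotype lookup described above a bit more concrete, here is a minimal sketch of annotating a hit list against a catalog of previously reported associations. The file name, column names, rs numbers, and phenotype labels are hypothetical placeholders, not any particular resource.

```python
# Minimal sketch of annotating a SNP hit list with previously reported associations.
# The catalog file, its columns, and the phenotype labels are hypothetical placeholders.
import pandas as pd

# SNPs of interest, e.g. hits from a lung cancer scan (placeholder rs numbers)
hits = pd.DataFrame({"rsid": ["rs0000001", "rs0000002", "rs0000003"]})

# Hypothetical catalog of published associations: one row per SNP-phenotype pair
catalog = pd.read_csv("association_catalog.csv")  # columns: rsid, phenotype, p_value

# Attach every previously reported phenotype to each hit SNP
annotated = hits.merge(catalog, on="rsid", how="left")

# Separate the two kinds of clues discussed above: other cancers (generic cancer
# susceptibility?) versus lung-related traits (lung function or development?)
other_cancers = annotated[annotated["phenotype"].str.contains("cancer", case=False, na=False)]
lung_related = annotated[annotated["phenotype"].isin(["asthma", "lung function"])]

print(other_cancers)
print(lung_related)
```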
A lot of times we annotate genes based on what we know they function in, and actually we have no idea how many different processes they may be involved in across the system. We have 30,000 genes actually encoding a really complex machine, if you want to think of it that way. These things are going to be redundant parts that are used in many different ways, and it may be through association studies that we'll really get an idea of how redundantly they're used. They also may give us clues to regions of the genome that have been selected over time, sort of the thrifty gene hypothesis, where we've had regions that show unusual properties that protected us in the past, eons ago, but today they are at such high frequencies that they're now causing diseases that we really didn't see showing up in the population ages ago.

Debbie's point is really good. And I mean, even the names of genes are somewhat historical and almost misleading sometimes. So you'll see things called kinases because they have a kinase domain, and then it may turn out that the most frequently observed splice form of that gene never has the kinase domain as part of it. So its main function may be something that's not anything like a kinase, or maybe it is. I mean, that's the problem, and the alternative splicing certainly multiplies the number of gene products that we need to deal with in terms of association studies as well.

And what is alternative splicing?

So if you envision the linear set of exons that make up a gene, we often see that different exons are knitted together to make different proteins. And sometimes it's a tissue-specific differentiation. So you may see exons 1, 3, 5 and 7 expressed in liver and 2, 4, 6 and 8 expressed in brain, so different subsets that way. You may just remove the last exon for some expression at a different time in life. So there are temporal differences sometimes, tissue-specific differences, and that adds another complexity to figuring out what the function is of associations that get examined.

Which you won't really get from sequence, from genotyping data?

Right, and so when we talk about the non-coding sequence polymorphisms that get associated with disease, some may be about cues for alternative splicing, some may be more directly about expression of that gene, and some may be about expression of genes hundreds of kb away that, in the DNA secondary structure, sit sort of right on top of what they're working on. So, great. Yes?

To impute the data, Mark Daly used the allele frequency, and Gonçalo Abecasis used the genotype. So I think they impute the data a little bit differently, and the two methods are not 100% consistent with each other. Do you know how similarly they work, or, if they're not working in a similar way, how can you deal with the inconsistency problem?

So the imputation methods that I talked about from Jonathan Marchini and Gonçalo Abecasis are looking at the phased haplotypes in the HapMap data. So the first thing is, in both what Nancy and I talked about, there are the kind of multi-marker methods that have come out of the Broad, which I think are similar to TUNA. TUNA used the allele frequency. Right, it's just estimating allele frequencies. It's estimating allele frequencies; it's not directly imputing genotypes. Right. So you're starting with the same underlying data. So if you took the same portion of a haplotype in both of these methods, then in theory you should get out the same results from both of them. I don't know how far out the genotype data goes when you're imputing in TUNA, but in the methods that take advantage of the phased HapMap data, you're really taking advantage of any LD information that exists along the chromosome.
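To make the distinction the panelists are drawing concrete, below is a toy illustration, not the actual TUNA, IMPUTE, or MACH algorithm, of estimating an untyped SNP's allele frequency from a phased reference panel versus producing a per-haplotype imputed dosage. The haplotypes and the crude similarity weighting are invented for the example.

```python
# Toy illustration of the distinction discussed above: a population-level allele
# frequency estimate for an untyped SNP versus a per-haplotype imputed dosage.
# This is NOT the actual TUNA, IMPUTE, or MACH method, just a sketch of the idea.
import numpy as np

# Phased reference haplotypes (rows) over three typed SNPs plus one untyped SNP
# (last column). 0/1 code the two alleles; in practice these come from phased HapMap data.
reference = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])
typed = reference[:, :3]       # columns observed on the genotyping chip
untyped = reference[:, 3]      # the SNP we want to fill in

# Population-level estimate: allele frequency of the untyped SNP in the reference panel
est_freq = untyped.mean()

# Individual-level estimate: weight each reference haplotype by how well it matches
# one study haplotype at the typed SNPs, then take the expected untyped allele
study_haplotype = np.array([0, 0, 1])
matches = (typed == study_haplotype).mean(axis=1)   # crude similarity score
weights = matches / matches.sum()
dosage = float(np.dot(weights, untyped))            # expected allele count on this haplotype

print(f"panel allele frequency: {est_freq:.2f}, imputed dosage for this haplotype: {dosage:.2f}")
```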
So my guess is you may be getting a slightly more precise estimate for the imputed data than you are for the allele frequency estimate, but most of the time they're going to give you very similar sorts of results. The difference comes in when you want to do case-control studies where you're looking at differences in allele frequencies; there you get very similar things. If you want to look at quantitative traits, you then need to have the individual genotype data, and so at that point the allele frequencies are not going to be useful to you. Andy?

So I wondered if it would be worth just discussing briefly the problems of admixture, or the perceived problems of admixture. I think there's a general appreciation that admixture is not a huge problem in Caucasian populations when we look at European populations. But clearly one point of interest in the US is to look at associations in cohorts like African Americans, Asian Americans, and people from South America that are admixed into the US population. And I wonder how that will affect the results, particularly to do with imputation methods, where you're looking at a very admixed genome.

So we've studied Mexican Americans a lot and have looked at this from a variety of perspectives. You can certainly use estimates of ancestry as covariates in the analysis, and also, or instead, the principal components from something like EIGENSTRAT. In fact, in our studies on Mexican Americans, those are so highly correlated that it doesn't make any difference which one you use: the correlation between the first principal component from EIGENSTRAT and the ancestry estimates from something that's more population-genetics-based, like STRUCTURE, was 0.95. So it made no difference which way you did it. And in fact, for the phenotype we were interested in, it actually made no difference in the analysis, which simply implies that for that particular data set there wasn't really a big difference between the cases and the controls we had in terms of the ancestry proportions. But it's absolutely going to be possible to do that, I think; whether you're looking at African Americans, Mexican Americans, or other populations of mixed ancestry, it should be possible to allow for the possibility that the individuals are not identically admixed and take that into account in the analysis.

I want to come back to the question on imputation, because I think one thing that's very important, no matter what method you're using to do this, is that if you're finding SNPs, particularly in a region where you didn't have any evidence of association from genotyped SNPs, or SNPs that appear to have a much stronger association than genotyped SNPs, it's very important to go back and actually genotype those SNPs in the original population to verify that the imputation is working. Because if you do imputation across a genome and you take the strongest associations from the imputation, there's some not insubstantial fraction that upon re-genotyping are going to have a weaker association. So you can't just take the imputed data and believe it.

I'd like to add to that a little bit. I think that imputation for European and Asian samples is much better than imputation for African-descent samples. I think there's a lot more missing variation in that population in the HapMap sample. So I worry a little bit about collapsing the data set and using imputation until we know a lot more about that.
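As a rough sketch of the covariate adjustment described in this exchange, the snippet below assumes you already have per-individual ancestry estimates (from a STRUCTURE-style program) and principal components (from an EIGENSTRAT-style analysis). All of the data here are simulated placeholders, and statsmodels is just one of several ways to fit the adjusted model.

```python
# Sketch of adjusting an association test for admixture, as discussed above.
# Ancestry proportions and principal components are assumed to have been estimated
# elsewhere; the inputs below are simulated placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
ancestry = rng.uniform(0, 1, n)          # estimated ancestry proportion per individual
pc1 = ancestry + rng.normal(0, 0.1, n)   # first principal component, highly correlated in practice
genotype = rng.binomial(2, 0.3, n)       # allele count at the SNP being tested
case = rng.binomial(1, 0.5, n)           # case/control status

# The two adjustment variables are often nearly interchangeable (a correlation of 0.95 was quoted above)
print("corr(PC1, ancestry):", np.corrcoef(pc1, ancestry)[0, 1])

# Logistic regression of disease on genotype with PC1 (or the ancestry estimate) as a covariate
X = sm.add_constant(np.column_stack([genotype, pc1]))
fit = sm.Logit(case, X).fit(disp=0)
print(fit.params)  # index 1 is the genotype effect adjusted for ancestry
```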
I think the way it's being used now, we're looking mostly at people of European ancestry, for whom we have a better handle on the variation in the data sets. And you saw that, at least with respect to TUNA, in the Mexican Americans we didn't have nearly as tight a correlation as we did for the European samples. Nevertheless, the correlation is quite good. And so I think it's just an additional measure of uncertainty for populations outside the HapMap. And of course there are a variety of ways to do it. So we imputed using both European and Asian chromosomes in the Mexican Americans. And for many SNPs that have very similar frequencies in the two populations, you get exactly the same results with TUNA regardless of whether you were looking at the Asian or the European panel. But for a sizable proportion of SNPs, you only even get an answer in one population or the other, because the SNP simply isn't polymorphic in Europeans, for example, but only in the Japanese plus Chinese. Or it's only polymorphic, only informative, in the European subset. So there are some very interesting differences, and I think you want to take into account sort of local versus global ancestry in trying to interpret some of those things.

So with the imputation programs, I know that one of the things that has been tried is actually taking all of the HapMap chromosomes and using them as your starting set. And if I'm remembering correctly from the slides that I've seen, that actually does a pretty good job in most populations. Even if you take a panel of populations from all across the world, it does a better job in some of those, or an equally good job, compared with taking any one of the three specific HapMap panels of European, African, or Asian origin.

Another comment? Debbie, last comment?

I believe that it should still be used with caution on African data sets, where the variation is underrepresented on the chips and underrepresented in HapMap.

Yeah, of course.
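As a closing illustration of the point about reference panels: before imputing an untyped SNP, one simple check is whether it is even polymorphic in each HapMap analysis panel, or in a combined panel, since, as noted above, some SNPs are informative only in one panel. The panel names here mirror the HapMap samples, but the allele data are invented placeholders.

```python
# Toy check of whether an untyped SNP is polymorphic in each reference panel,
# and in a combined panel, before attempting imputation. All numbers are invented.
import numpy as np

# Allele codes (0/1) for the untyped SNP on reference chromosomes from each panel
panels = {
    "CEU": np.array([0, 0, 0, 0, 0, 0]),      # monomorphic here: no information to impute from
    "JPT+CHB": np.array([0, 1, 1, 0, 1, 0]),  # polymorphic: imputation can say something
    "YRI": np.array([0, 1, 0, 0, 0, 1]),
}

combined = np.concatenate(list(panels.values()))

for name, alleles in {**panels, "combined": combined}.items():
    freq = alleles.mean()
    usable = 0.0 < freq < 1.0
    print(f"{name}: frequency {freq:.2f}, polymorphic = {usable}")
```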