So I was asked to speak about the challenges of an epidemiologist working in genomics. And Terry has mentioned that there is this need to bridge the chasm between geneticists and those of us who consider ourselves traditional epidemiologists, who are now wondering how we can apply this genome-wide association study technology to our studies. And there's going to be a wealth of data that will soon be available, if not already available: for those of you who are interested in analyzing some of the results of this genetic technology, the genome-wide association results will be available for people to apply to use and analyze on their own. So what I thought I would do is try to tell you a little bit about what some of the challenges are that you might encounter. One of the reasons Terry asked me to speak is that I was one of the collaborators on one of the first genome-wide association studies, which was published in 2006, and I was the epidemiologist in this group, and we were looking at genetic predictors of sudden cardiac death. And using 100,000 polymorphisms throughout the genome, we were able to discover an association with a gene not previously considered a candidate gene for arrhythmic death, which is called CAPON, or NOS1AP, which is a regulator of nitric oxide synthase. And using a two-stage design, we were able to genotype individuals at the extremes of the QT interval, which is a marker of abnormal repolarization, in the German KORA study, and then replicated these results in an additional KORA cohort and in the Framingham Heart Study. And as shown on this slide, we found that a polymorphism in CAPON was significantly associated with prolongation of the QT interval. So we're currently in the process of looking at polymorphisms in CAPON and whether these polymorphisms predict risk for sudden cardiac death in the Atherosclerosis Risk in Communities (ARIC) study and the Cardiovascular Health Study, CHS.
And I additionally replicated these results in a genetically isolated population, the Old Order Amish, and found, again, similar associations between variants in NOS1AP, or CAPON, and abnormal repolarization as manifested in the QT interval on EKG. But much of the work that many of us have done previously has been with candidate gene association studies, and the Multi-Ethnic Study of Atherosclerosis (MESA) is now undergoing extensive genotyping and analysis of hundreds of candidate genes to look for associations with subclinical atherosclerosis. And we've been involved in a number of those studies as well. But I wanted to acknowledge my background, which is in what many of us would call traditional epidemiology. I started my work at the Framingham Heart Study, and my early work was with my first mentor, who was on the panel as well, Dr. Dan Levy, and we were able to show that left ventricular mass is a heritable trait, and that's led to further studies to identify genes related to left ventricular hypertrophy, which is a potent predictor of cardiovascular risk. So as a traditional epidemiologist, I wanted to talk with you about some of the issues you might encounter when you start to work in the field of genetic epidemiology. And I think, as Terry mentioned, one of the biggest challenges is the confusing genetics nomenclature. And so now I think it's a little bit easier to identify, when we're talking about a specific polymorphism, what polymorphism it is, because as new polymorphisms are discovered, they're getting cataloged with rs numbers. An rs number is a reference SNP accession ID, and the dbSNP database has a listing of all the polymorphisms that have been identified to date, and an extensive description of these polymorphisms. But what I find particularly challenging is that when we find a new association, I want to go back to the literature and see if this has been shown before.
And before the rs numbers, there were all different types of nomenclature used for various SNPs, and these changed over time, depending on what was known about that gene. So that gets pretty confusing. And then you think you've got that under control, and then you come to issues like, well, I just looked at the results of this analysis, and when I went to dbSNP, I'm a little confused because they said the polymorphism was a T to A change, but on dbSNP it says that it's a C to G, or there's something different, and they're like, oh, well, maybe that's forward strand versus reverse strand, and I'm like, you know, you guys are totally confusing me. So it's not all that straightforward, and every time I start doing another analysis, I come to something else that just isn't quite making sense to me. And then I thought I sort of understood a dominant model, a recessive model. I learned all that in medical school, and then I realized when I'm talking to someone that they are interpreting dominant and recessive in a different way, and it depends on whether you're talking about dominant relative to the minor allele or dominant relative to the major allele. And so there are all these sort of communication issues, where we're trying to learn this new language, and then we think we have it under control, and then we realize that we're still not all talking the same language. So I think that as we realize what some of these issues are, we can try to come to a consensus about exactly what all the nomenclature means and try to make it easier for those of us who don't have the extensive background in genetics. And then I try to go back to my biochemistry, which seems to be longer and longer in the past from medical school, and I thought I sort of had the refresher course and understood, you know, what an exon is, and an intron, and that an exon generally is translated into the final protein.
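As an aside on the strand and dominant/recessive ambiguities just described, here is a minimal Python sketch of why the same data can be reported in ways that look contradictory. The function names `flip_strand` and `code_genotype` are hypothetical and chosen only for illustration; they are not from dbSNP or any genetics package.

```python
# Hypothetical helpers for illustration only; not part of any real library.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_strand(alleles):
    """Report the same pair of alleles on the opposite DNA strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def code_genotype(genotype, reference_allele, model):
    """Code a two-letter genotype (e.g. "AG") relative to one chosen allele."""
    count = genotype.count(reference_allele)
    if model == "additive":
        return count            # 0, 1, or 2 copies of the chosen allele
    if model == "dominant":
        return int(count >= 1)  # at least one copy
    if model == "recessive":
        return int(count == 2)  # two copies
    raise ValueError(f"unknown model: {model}")


genotypes = ["AA", "AG", "GG"]
dominant_for_a = [code_genotype(g, "A", "dominant") for g in genotypes]
recessive_for_g = [code_genotype(g, "G", "recessive") for g in genotypes]
```

With these definitions, an A/G polymorphism reads as T/C on the opposite strand. "Dominant relative to A" and "recessive relative to G" split the three genotypes into exactly the same two groups with the labels reversed, which is why it helps to state which allele a model refers to.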
And then you look at, well, someone's describing an untranslated exon, and so then you have to go back and say, okay, well, I know what an exon is, well, it's a region of DNA that's transcribed into the final mRNA, but not necessarily translated into the protein. And so it gets pretty complicated. But not to scare you all away, I think that generally it's pretty understandable and that our genetics colleagues are usually pretty helpful at trying to explain the things that we don't quite understand. Another very significant challenge: of course many of us are in academics and our career depends a lot on publication, you know, publish or perish, and it's, of course, key to be either first author or senior author, and so, you know, when you're doing these studies it involves a significant amount of collaboration, and collaboration is great. I've met a lot of people that I never would have had an opportunity to collaborate with, but the collaborations involve people of varying backgrounds. Of course, there's the phenotypers. Many of us, as the epidemiologists, were the phenotypers: we spent years and years writing the grants to recruit these people into our studies and doing extensive phenotyping and cleaning the data, et cetera, et cetera. And then someone comes around and wants to, you know, genotype the population, so there's the genotypers. Then there's the statistical geneticists: you know, I know how to do statistics, but when it comes to haplotypes and all this stuff, you've got some people who really have expertise in statistical genetics. And then there's bioinformatics, a tremendous amount of bioinformatics with these databases with 500,000 SNPs. So how you work all that out in terms of who's going to do what, who's going to be the first author, who's going to be the senior author, is not an easy issue, and so it's something that I think is important to decide before the work is done so that there don't end up being too many issues afterwards.
And then the other issue is that once you find an association, often you go to another cohort in order to replicate or validate your findings, so then, you know, you're bringing in another whole group, where you have the initial discovery and then you're just asking for replication, so it gets pretty complicated. Another issue that's come up recently when we've had our Publications Committee meetings in MESA is we have the epidemiologists who are writing these proposals and they say, and I'm going to adjust for, and they list 10 or 15 different covariates in the model, and of course, as epidemiologists, we worry a lot about confounding. I mean, that's sort of what we do, right? Well, remember, a confounder is a variable that's associated with the outcome, in this case phenotype, but also with the predictor, which in this case is genotype. So few of our traditional confounders are actually going to be associated with the genotype. It's hard to imagine that if you're looking at a gene related to left ventricular hypertrophy that it also would be related to smoking, for example, or other environmental factors that we often control for in our epidemiology studies. So we need to sort of redo our thinking a little bit about how many of these covariates we want to put in the model, and many of these covariates we might put in the model more as a way to get a more precise estimate of our outcome, but realize that they're not exactly confounders. The other advantage to putting the known covariates in the model is that then you can try to estimate how much of the variability in a phenotype might be explained by the genotype after accounting for what's known about predictors from previous studies. So it's a new way for us to think about covariates. And of course in Epidemiology 101, we learned about how important it is to choose the appropriate control groups. And so the cases and controls need to be collected in a similar fashion.
And what becomes more complicated in genetic epidemiology is that we need to make sure that our control group is derived from the same ancestral background, and also that they have similar environmental exposures. And we'll talk a little bit more about confounding potentially related to admixture. So getting to admixture or population stratification, you try to learn, to understand what this is all about and how you can control for it. And then you hear one person say it's really not a big deal, and somebody else says it really is a big deal. And so you have to kind of integrate all these different ideas about population stratification and how much of an issue it really is. So there are a variety of ways you can deal with population stratification. You can try to get a sense of it just by self-described race/ethnicity. Just ask, are you white? Are you African-American? Hispanic, et cetera. And often that does a pretty good job at categorizing correctly what someone's ancestry is. But a more sophisticated way to get at that is using what are called ancestry informative markers, or AIMs. And ancestry informative markers are polymorphisms that have allele frequencies that differ based on the parental population. And so it's a way that you can potentially estimate the ancestral proportion of an individual. And so some people have advocated using ancestry informative markers in your analysis to further categorize the background of the individual subjects in your cohort study so that you can appropriately adjust for potential confounding due to admixture. And we'll probably hear more about admixture analyses further in the program today. So another issue that I've been dealing with is, well, working in the Multi-Ethnic Study of Atherosclerosis, which is a very unique cohort because it has four different racial/ethnic groups. And that's wonderful for trying to understand more about racial/ethnic disparities.
But when we do these genetic analyses, we perform the analyses stratified by race/ethnicity. And then if we don't see any statistical interaction, then sometimes we combine all the groups together. And so there are questions about when it is potentially appropriate to combine different racial/ethnic groups into the same analysis. And it gets pretty complicated. Well, gene-environment and gene-gene interactions are really very fascinating. And although we feel like we've advanced a lot in our ability to perform extensive genotyping and also to analyze this very complicated data, we realize that most of the time we're looking at single polymorphisms with one outcome, whereas really what we're eventually going to need to do is look at multiple genes and environmental interactions because these disorders are very complex disorders. And it's going to be multiple genes that each have small effects and potential interactions with the environment that are going to explain somebody's risk for having a specific outcome. So there are a variety of different ways that we could potentially put multiple polymorphisms, multiple genes, and then environmental exposures into the same model. How many interactions with the environment should we be testing for in our analyses? We already have issues with multiple testing. And so then if you, in addition, start adding in multiple interactions, it gets pretty complicated. And of course there are always issues with power. How much power do you have to show that these interactions do or do not exist? So how can we combine multiple genes and polymorphisms into the same prediction model? The major issue, of course, is the multiple testing issue. You heard Terry talk about looking at 500,000 polymorphisms throughout the genome. Very soon, if not now, we're going to be using a million polymorphisms. And we always were taught in epidemiology that this kind of fishing expedition is poor science, and it was always looked down upon.
Well now, fishing expeditions are the way to go. And that's what we're supposed to be doing. We don't know a lot about these genes. And so we're trying to discover the function of genes that we might not potentially know all that much about. The key, of course, is replication and validation. So I guess we think of genome-wide association studies as a really big, sophisticated fishing expedition. And having grown up on Long Island and recently taken a cruise to Alaska, where they say there are, I think, seven different kinds of salmon, I sort of think of this as a fishing trip to Alaska, fishing for seven different kinds of salmon, as opposed to the routine fishing expedition that my brother did on the Long Island Sound. So of course, what p-value do we use? And you'll hear lots of different people discuss what they think is the best way to adjust for multiple testing. And the traditional way is the Bonferroni adjustment. But that's incredibly conservative. If you're going to look at a million SNPs, and then you divide your p-value threshold by a million, then unless you have a huge data set, you're going to have very limited power to find these associations, where each individual SNP is likely to have a pretty small effect anyway. And so you're going to have potential for many false negatives. Of course, the issue also with the Bonferroni correction is that many of these SNPs are correlated with each other. They're in linkage disequilibrium. And so adjusting as if 1,000 SNPs were 1,000 independent tests is not necessarily the right way to go. So we need to figure out what to do about that. So there are many new procedures that have been developed and are continuing to be developed, including the false discovery rate, which is a way to estimate what proportion of your discoveries are likely to be false positives.
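To make the contrast between the Bonferroni adjustment and a false discovery rate procedure concrete, here is a minimal Python sketch. It uses the Benjamini-Hochberg step-up procedure as one common way to control the false discovery rate; the talk does not say which procedure the speaker had in mind. The function names and the ten p-values are made up for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni adjustment."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg procedure at level q."""
    m = len(p_values)
    # Rank the tests from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is at or below q * rank / m.
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Illustrative p-values only; not results from any study.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

threshold = bonferroni_threshold(0.05, len(example_p))
bonferroni_hits = [i for i, p in enumerate(example_p) if p <= threshold]
fdr_hits = benjamini_hochberg(example_p, q=0.05)
```

In this toy example the Bonferroni threshold of 0.005 keeps only the first test, while Benjamini-Hochberg keeps the first two. With a million SNPs and alpha of 0.05, `bonferroni_threshold` gives 5e-8, which shows how stringent the adjustment becomes. Neither function accounts for linkage disequilibrium between SNPs.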
But the real key is finding an initial association through this very sophisticated fishing expedition, and then replicating or validating it in an additional population. Well, how do we determine what cut point we use? What p-value do we decide is the first step to move from stage one to stage two, which is used for replication? And when do we consider something a true result as opposed to potentially a false positive? So another issue that I find challenging is the lack of reproducibility. Some of the epidemiology colleagues say, this genetic stuff is just never going to work, because every time we find something, then it doesn't get replicated, and it just doesn't seem like it's going anywhere. Well, it seems more recently we've been able to replicate many of these findings and have much more enthusiasm for the potential to be able to find genetic predictors of complex disorders. But when we do find something that looks like it's real and then it doesn't replicate, well, what does that mean? Well, is it a false positive? Or is it that the replication cohort has a different environmental exposure, potentially different haplotype structure, different ancestral background? Maybe the study design was different. And so you see some debates in the letters to the editor about whether the lack of replication means that your initial finding is a false positive or whether it really is just a difference in the study design. And so we have to deal with those issues as well. And then Hardy-Weinberg equilibrium. So when we get our thousand SNPs back from the people who are doing the genotyping for us, we tend to throw out the ones that aren't in Hardy-Weinberg equilibrium. For those who don't know about Hardy-Weinberg equilibrium, it's a test that we perform to look at whether the relative frequencies of alleles for a SNP are stable in a population, meaning that they're not changing over successive generations.
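As a rough illustration of the kind of Hardy-Weinberg check applied to each SNP, here is a minimal Python sketch of the standard one-degree-of-freedom chi-square test. It compares observed genotype counts with the counts expected from the allele frequencies. The function name `hwe_chi_square` and the genotype counts are assumptions for illustration; real pipelines often use an exact test instead, especially for rare alleles.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions for one biallelic SNP.

    Returns (chi_square, p_value). A small p-value suggests the observed
    genotype counts depart from Hardy-Weinberg expectations.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip cells with zero expected count (a monomorphic SNP).
    chi_square = sum(
        (o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0
    )
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Made-up counts: the first SNP matches Hardy-Weinberg proportions exactly,
# the second has no heterozygotes at all.
chi_ok, p_ok = hwe_chi_square(360, 480, 160)
chi_bad, p_bad = hwe_chi_square(500, 0, 500)
```

The first example gives a chi-square of essentially zero and a p-value of essentially one, and the second gives a chi-square of 1000 with a vanishingly small p-value. Note that the p-value is the probability of seeing a departure at least this large if the SNP really were in Hardy-Weinberg equilibrium, so the choice of exclusion cutoff remains a judgment call.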
And you can generate a p-value that tests whether your polymorphism departs from Hardy-Weinberg equilibrium. And often if it's not in Hardy-Weinberg equilibrium, it's due to genotyping error. But it also could be because some of the people who have the at-risk allele may not be in your sample because they've died off, or because the genotype frequencies in your sample are shifted by the association with the outcome that you're looking for. And so it gets a little complicated about how you determine when you exclude SNPs based on lack of Hardy-Weinberg equilibrium. And then what genetic model to test? There are a variety of different models. You can use a two-degree-of-freedom model. You can use an additive model, a dominant model, or a recessive model. And of course when we're doing these genome-wide association studies, not only do we not know what some of these polymorphisms are, but of course we don't know what the most appropriate genetic model is. So again, issues of multiple testing arise if you want to test every potential model as opposed to one model. And if you test just the additive model or just the two-degree-of-freedom model, then maybe you're missing the real model. So I think these are some of the issues that we grapple with. And you hear different ideas about what the best way to go is. And then something I'd never dealt with as an academic epidemiologist was patenting results. And so this came up in one discussion. Epidemiologists rarely patent our findings. And there is a history in science and genetics of patenting scientific discoveries. So I think that many of us feel that this can potentially hinder scientific progress. And it's just a totally different thing that we haven't necessarily dealt with before. So coming from Hopkins, I heard Victor McKusick speak about genetics. And he displayed this abstract in his slides. And so I asked him for a copy because I just thought it was absolutely amazing. So notice 1958.
So this is a paper that Victor McKusick published, obviously. I'm just going to read it for you. "There is a clinical impression of longstanding that heredity plays some role in the pathogenesis of all four major types of cardiovascular disease: atherosclerosis, hypertension, susceptibility to rheumatic fever, and congenital malformations. These are diseases of multifactorial causation. To point to a genetic factor in coronary disease does not deny the possible importance of diet, stress, cigarette smoking, et cetera. To point to a genetic factor in susceptibility to rheumatic fever does not deny the paramount importance of streptococcus. If one demonstrates a hereditary influence in some cases of congenital malformation, the greater importance in other cases of factors operating in the intrauterine environment is not excluded." But this is what I really thought was key, and it seems like it was just written last year: "Study of genetic factors is important because potentially it will permit recognition of genetic susceptibilities for more effective application of preventive measures. And because from our understanding of the mechanism whereby the gene or genes operate in these disorders can come preventive or therapeutic measures for breaking the chain leading to disease." And so I think we're all at a very exciting point in our careers, being able to analyze this exciting genetic data. And I'm very optimistic that we will be able to break the chain leading to disease and develop some very important preventive strategies that can have an impact on public health. Thank you very much.