So, I'd like to cover case-control and cohort studies. I won't talk to a large degree about the design issues involved in these, because I realize I'm sort of preaching to the choir, talking to a group of epidemiologists, but we'll make a couple of comments about that. And then candidate gene studies, genome-wide association, and a little bit about randomized and experimental designs. You may have seen this piece, which came from Francis Collins in Nature in 2004. This was actually shortly after we had started working together; he had asked me to come over and help out with a workshop they wanted to have on the possibility of doing a very large — and when genomicists say very large, remember, they think in terms of three billion base pairs — so he was thinking of a 500,000-person cohort study of genes and environment. This piece came out after a year-long effort to map out what such a study might look like. A panel of experts that we had brought together recommended that such a study would obviously need to be large, because you want to capture lots of different diseases in the diversity of the U.S. population; you'd want to fully represent U.S. minority groups and have a broad range of ages, genetic backgrounds, and environmental exposures. You'd probably want some family-based recruitment for at least part of the study, to account for the population stratification that Tom will be talking about tomorrow. Lots of data would need to be collected on these people if you go to all the trouble of recruiting them. You'd want very technologically advanced measures, you'd want to collect and store biologic specimens, you'd need a sophisticated data management system — a whole number of things that were recommended for the next big, big, big cohort study, as it were. And this received sort of a variable response. This is another Gary Larson cartoon — now let's hear what they said to Bill.
So there were those who said this would cost way too much, it was way too big, it wasn't necessary, everything was already being done — the variety of responses one often gets. We then went back and tried to lay out in more detail what some of the problems are with case-control studies, because in the genetics literature in particular this has really been the darling approach, primarily because it's easy — or people think it's easy. It's hard to do well, but it's easy to collect cases and compare them to a bunch of controls. And then there was this back and forth with Walt Willett's group suggesting that, well, if we needed a new cohort study — because we all recognize the strengths of cohorts — maybe we could just merge the cohorts we already have rather than start an entirely new one. We were asked to write a point-counterpoint to that, saying, yes, that's all well and good, but if you were to do that, you would end up with a conglomeration of existing cohorts. As a simple example, the age distribution of existing cohorts, shown here in a survey we had done in preparation for this planning process, did not at all approximate the age distribution of the U.S. population, shown here according to the U.S. census — and there were a variety of other shortcomings of existing cohorts. There are many strengths, and we should make use of those strengths, but in addition we should also do a large new cohort study. We've talked a fair amount with geneticists about the pros and cons of case-control studies, recognizing that a major strength is that they're probably the only way to study rare diseases or those of long latency — again, not something I need to emphasize with this audience. Existing records can be used, and you can study multiple etiologic factors simultaneously.
They may be, and often are, less time consuming; often they're less costly; and if the assumptions are met, the inferences are reliable. But then one tries to communicate that there are some real challenges with these. You're relying on recall or records for information on past exposures, and it can be very difficult to validate that information. Selecting the appropriate comparison group can be difficult, and no matter what group you select, reviewers or colleagues or competitors will criticize it. Multiple biases can give spurious evidence of an association, you usually can't study rare exposures, and temporal relationships between exposures and disease can be very difficult to establish. When I say these things to my genetics colleagues, they say, but this is genetics, you dumb epidemiologist — this is different. Genes are measured the same way in the cases and controls, so that's not a problem. Information on the key exposure is very easy to validate, there's no recall or reporting involved, and as for the temporal relationship, the genes have been present since conception, so that's a piece of cake. My response is often that biased ascertainment of cases and controls is still a major concern — we'll talk more about some of the biases involved in collecting these, and cases in most clinical series are highly unlikely to be representative. And assessment of risk modifiers, or gene-environment interactions, in a case-control study is highly likely to be incomplete or flawed for a variety of reasons. So we go back and forth on this, and as you can imagine, sometimes being the only epidemiologist in the room is a challenge — it's why it's been nice having Tom around.
So, on appreciating the weaknesses of case-control studies: I think at times epidemiologists tend to view them as this Larson cartoon, the monster coming in the window, while the geneticists tend to view us as kind of Chicken Little, complaining that the sky is falling — can't we look at some of the successes, and there have been some major successes in this area. The truth probably lies somewhere in between, and we need to be sure we use both designs but recognize their weaknesses. So I thought I'd talk a little bit about candidate genes, especially in this election year. Genetic studies prior to 2005 were almost exclusively this kind of work, candidate gene work. The goal was to characterize what we would call a candidate, based on what we might know about the biologic pathway, the genetic mechanism, et cetera — what might be related to the disease. Usually these kinds of studies weren't intended to find genes or variants related to disease, as linkage studies were. In linkage studies, you set up signposts across the genome and try to figure out which of the signposts seems to be inherited with the disease, and then maybe something near the signpost — some variant — is related to disease. Candidate gene studies were generally done after the potentially disease-related variants had been identified. One could assess the generalizability of family-based observations, such as the BRCA1 variants, which showed very strong influence on risk in Ashkenazi Jews but much less of an effect and much lower prevalence in other groups. One could assess the importance of allelic variation at the population level — population attributable risk — and penetrance, as the geneticists call it, which is the likelihood that the disease will be present in someone who carries a particular variant.
And then identifying modification of genetic associations by environmental factors. So these are all the sorts of things that population-based epidemiologists would tend to do with candidate genes. One of the first and perhaps best characterized candidate gene studies involved the angiotensin-converting enzyme (ACE) variants. The ACE enzyme had been identified long ago — this figure is just from a textbook, but this was probably in Guyton's work long ago — in terms of the pathway that leads to increased blood pressure and vasoconstriction. The gene was identified through a variety of very elegant steps that were very difficult at the time, 1989, 1991, et cetera, linked to elevated blood pressure in rat models, and finally mapped to human chromosome 17 by Jeunemaitre. It was then also shown to be associated with ACE enzyme levels, which is very nice — if you think you've found your gene, you'd like to see that it's associated with differences in levels. As you can see, this was an insertion-deletion polymorphism — Tom mentioned indels before. A small segment of DNA was either present or deleted in carriers of this variant, and it was found through RFLP analysis. Again, that's the 250-base-pair insertion. People who were homozygous for the insertion had lower ACE levels than those who were heterozygous, and much lower than those who were homozygous for the deletion — and this is true on a log scale as well. What really shook the world was this Cambien paper. Some of you may remember when this came out — this was typed in the ECTIM study; I've forgotten what the acronym stands for, maybe it's on one of these slides — but at any rate, it identified this particular deletion polymorphism as a potent risk factor for myocardial infarction, from this French group.
What they showed was that in 1,304 MI cases and a group of matched controls — sorry, that's the total number of cases and controls — those with the DD genotype had somewhat greater risk, the insertion-deletion heterozygotes had lower risk, and risk in those homozygous for the insertion was basically unchanged. This was a relatively weak association, although it was significant given the numbers they studied. But when they stratified on low risk versus high risk, the association really came out: among those at low risk, people who were homozygous for the deletion had a much higher risk — an odds ratio of three — versus those who were heterozygous or homozygous for the insertion. And those who were already at higher risk based on other cardiovascular risk factors really didn't seem to show much association with this polymorphism. This caused lots of excitement. Many, many prospective studies rushed to try to replicate the finding; not many were able to do so. In hypertension, you'd see papers come out like this one from my colleague Chris O'Donnell at Framingham, showing an association of the DD genotype with hypertension, for example, with a reasonable effect size — but really nothing in women. A lot of inconsistencies, and it really wasn't clear what was going on. What was probably going on was a spurious association — an association that seemed to make sense, that was found in one study for whatever reason, luck or whatever, but was not found when others tried to replicate it. Hirschhorn did a very nice review of this issue in 2002 in Genetics in Medicine, showing that the number of association studies really started to take off as genotyping technology got better and as we identified more and more polymorphisms — suddenly there was this huge peak, and it pretty much continued on up, leveling off a little bit into 2002.
And yet, of the 600 or so reported associations they identified that had been examined in more than two or three studies, I believe only six — about 1% — were significantly associated in more than three quarters of the studies that examined them. One of them is near and dear to us in cardiovascular epidemiology: the Factor V Leiden variant with deep venous thrombosis. There are a couple of others here; ApoE with Alzheimer's disease is one of the strongest associations ever found. But those were the only six that really came out as robust and replicated. We did a similar exercise in carotid atherosclerosis. We basically took all of the variants that had been reported somewhere as being associated with coronary artery disease and said, well, we recognize that coronary disease is sort of a distal phenotype — many things lead to coronary disease, one of them being atherosclerosis, which may be a little closer to the genetic product, whatever it might be. So perhaps as an intermediate phenotype, a more objective measure, it would show a stronger association. When we looked at that with a variety of variants — you'll notice the ACE insertion-deletion — 13 studies showed an association, one of them with the opposite allele, and 18 showed no association, so the summary favored none. Many of the variants that had the strongest evidence initially ended up pretty much equivocal. The only one that kind of came out was MMP3, matrix metalloproteinase 3, probably because only four studies had looked at it at the time we did this review; as it was looked at more, it dropped out as well. So candidate gene studies were not terribly fruitful, and the initial enthusiasm has been markedly dampened by failure to replicate findings.
And the point has been made that you can probably find a story or some kind of biologic pathway to fit almost any candidate to almost any disease or trait, and our understanding of genome function was just too preliminary at that point to project more than a handful of plausible candidates. I would point out that APOE was never in a million years thought to be related to Alzheimer's disease — in a way, that was a fluke finding; it was found in a linkage study. But it was incredibly strong, it was replicated again and again and again, and now we have some really good pathophysiologic reasons for it to be there. So this wasn't a fruitless effort, but it came up with a lot of false positives. A paradigm change was needed, much like the one these fellows are having here — hey, they're lighting their arrows, can they do that? We needed, perhaps, to burn the barn down a little bit, and the way we did that, as we talked about last time and Tom has mentioned, was through genome-wide association studies. So here's a karyogram with the various bands of the genome shown. The bands come from staining — they're just GC-rich regions; if you're interested in cytogenetics, there are people who spend entire careers doing this kind of work. In 2005, we had basically no genes for common, complex diseases — and we refer to complex diseases as the ones you had hoped were single-gene Mendelian, but you didn't find the Mendelian gene, so it must be complex. At any rate, in 2005 there was one variant identified for complex disease, here on chromosome 1: complement factor H in age-related macular degeneration. In 2006, there were two others, both on chromosome 1 — sorry, QT interval is down here — for QT interval and inflammatory bowel disease, and another on chromosome 10 for age-related macular degeneration. Then 2007 really was the year of genome-wide association studies.
We sort of broke it up into quarters. Just to page through this: the second quarter was incredibly productive — several simultaneous studies in prostate cancer, breast cancer, diabetes, the Wellcome Trust Case Control Consortium, all kinds of findings, all of them replicating, really very exciting. Then the third quarter, the fourth quarter, and the first quarter of 2008. This has been a really incredible progression. I think when genome-wide association started, there was a prediction that within five years we'd have identified five variants for maybe 10 diseases, and we've done far better than that. So 2007 was dubbed the year of genome-wide association studies — it was Science, I believe, that dubbed human genetic variation, made possible through the HapMap and similar kinds of genetic tools, its Breakthrough of the Year. And just look at the number of studies in our catalog — which I'd encourage you to look at, and let us know if there are ways you think it can be improved, or things you find wrong with it. There have been 53 traits now with published genome-wide studies, some of them things you wouldn't think would be of particular public health importance, although many are. Restless legs is the one that's often singled out — gee, why do we have genome-wide studies of that and not of fetal malformations? But if you've ever dealt with somebody who has restless leg syndrome, it's actually a very common and very troublesome condition. And here's our catalog — just to give you a screenshot of it, if you Google GWAS Catalog, I think it comes right up. These are the pieces of information we're pulling out. We had to track this anyway for some work we were doing, and we thought that, in the spirit of the genomics community, we would make the information available to the scientific community. So we collect all of this information.
The stuff in white is the easy stuff to pull out. It's the things in blue — the gene region, the genes in that region, the strongest SNP and the risk allele, risk allele frequencies, p-values, and odds ratios — that are a little harder to pull out, and we want to be sure that information is right, so it tends to lag a little. We'll identify a study and then mark that stuff as pending. We've committed to doing this through the end of 2008, and if we're all still standing at that time, we may continue — we may not. And this is just an example of the full catalog. It shows disease trait, replication sample size, region, gene, strongest SNP, et cetera. Some studies had more than one SNP identified, and we tried to show them there — we picked the top five just as a starting point, and we may go back and pull out more. We tended to pull out the top five new ones, so this isn't a really good resource for telling what's been replicated; that's been a criticism, and one we're trying to address. [Audience question:] I'm just curious — for the strongest risk alleles, the ones you've been tracking, what percent of variation is actually explained by these? [Response:] Right, so your question is, for the strongest risk allele, or alleles, what percent of variation is explained. It's very small — probably less than 5% of the genetic variation. I showed you a couple of examples where the authors claim it explains more, and I question that. But it's very little, 5%, maybe 10% at most. And often we're really not calculating it — there's no estimate. So, as I say, we'll do this as long as we hold out. Anyway, just to remind folks what a genome-wide association study is.
It's a way of interrogating all 10 million variable points across the genome, recognizing that since variation is inherited in groups, you don't have to test all 10 million points — you can test just a few of them, 300,000 to 500,000, for example. And what one can do, taking the Samani coronary disease study from the New England Journal, is look at the SNP most strongly associated — and when I say strong, I'm coming from the genomics world, where strength actually means the p-value, the significance level. These folks calculate significance levels to three digits on the exponent — 10 to the minus 100. When I was a baby epidemiologist, they taught us that nothing less than 10 to the minus fourth was even worth reporting; you'd just report those all as p < 0.0001 and move on. But times have changed. So their SNP with the smallest p-value that survived replication, et cetera, was this one. A friend taught me something I've found very useful: keep track of SNPs by their last four digits, the way you get asked for the last four of your Social Security number. So I know this one fondly as 3049 — it makes it very easy when you're looking through papers and trying to find these things. The 3049 C allele was found in 55% of the cases but only 47% of the controls, and conversely for the G allele, giving an allelic odds ratio — that is, for carrying the allele, one or two copies doesn't matter — of 1.4, with a very high chi-square and a very small p-value. One can also look at the three genotypes: you can either ask, do you have C or not, or you can ask, are you CC, CG, or GG? In the study, 31% of the cases were the CC genotype versus 23% of the controls, and conversely for the GG genotype — a very high chi-square, now with 2 degrees of freedom because you have three genotype groups, at 10 to the minus 14.
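The allelic odds ratio arithmetic just described can be sketched in a few lines. The allele counts below are hypothetical, chosen only to mimic the roughly 55% versus 47% C-allele frequencies mentioned; they are not the actual counts from the Samani study.

```python
# Sketch of the allelic odds ratio and 2x2 chi-square described above.
# Allele counts are hypothetical (~55% C in cases, ~47% in controls);
# they are NOT the actual data from the study under discussion.

def allelic_odds_ratio(case_c, case_g, ctrl_c, ctrl_g):
    """Odds ratio for carrying the C allele, from chromosome counts."""
    return (case_c / case_g) / (ctrl_c / ctrl_g)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# 1,000 cases and 1,000 controls -> 2,000 chromosomes in each group
or_allelic = allelic_odds_ratio(1100, 900, 940, 1060)   # roughly 1.4
chi2 = chi_square_2x2(1100, 900, 940, 1060)             # clearly significant
```

Note that the odds ratio here treats each chromosome, not each person, as the unit of counting — that is what "allelic" means in this context.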
You can also calculate a heterozygote odds ratio — the odds of disease in the heterozygote group relative to those homozygous for the ancestral allele — and a homozygote odds ratio, the odds in the homozygous variant group relative to the ancestral homozygotes. I might mention we've gotten away from the terms wild type and mutant. Wild type, because one could imagine Martha and George sitting in their doctor's office, and the doctor saying, George, you have the wild type of such-and-such, and Martha saying, I never knew you were such a wild type myself. And mutant because it carries certain connotations. So we prefer to refer to the ancestral or common allele and the variant allele. What one does, then, is calculate these chi-square values, or whatever association statistic you have, across all of your SNPs — 300,000 or 500,000 times. And as I may have mentioned earlier, because DNA is a linear molecule, you can just start with the p end of chromosome 1 — p, remember, the chromosomes have two arms; p is the part above the centromere, usually smaller than the part below; these were named by the French, so p for petite, and then q for the long arm — and just line these up. Here are your p-values. In this particular study by Bierut, they've plotted p-values across the genome, and you can plot them in many different ways — this one, as I mentioned before, shows the negative log of the p-value. This is the Klein study of macular degeneration, which really got this whole field going in March of 2005. And here's another, even more colorful, way of plotting them, with multiple studies on a single page. In fact, the Wellcome Trust, I think, was, I don't know, a $15 million exercise, and this has been called the $50 million plot. And as Tom mentioned, this deluge of data — not only data, but positive findings — has been likened to drinking from a fire hose.
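The genotype-level odds ratios just described — each genotype's disease odds relative to the ancestral homozygote — can be sketched the same way. The counts here are made up, loosely echoing the 31% versus 23% CC figures quoted above.

```python
# Genotype odds ratios relative to the ancestral homozygote (GG here).
# Counts are hypothetical, loosely echoing the percentages quoted above.

def genotype_odds_ratios(cases, controls, reference="GG"):
    """cases/controls: dicts of genotype -> count.
    Returns odds ratios for each non-reference genotype vs the reference."""
    ref_odds = cases[reference] / controls[reference]
    return {g: (cases[g] / controls[g]) / ref_odds
            for g in cases if g != reference}

cases = {"CC": 310, "CG": 480, "GG": 210}       # hypothetical case counts
controls = {"CC": 230, "CG": 480, "GG": 290}    # hypothetical control counts
ors = genotype_odds_ratios(cases, controls)
# ors["CC"] is the homozygote OR, ors["CG"] the heterozygote OR
```

With counts like these, the homozygote odds ratio comes out larger than the heterozygote odds ratio, the pattern you would expect under a roughly additive allele effect.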
David Hunter and Peter Kraft at Harvard wrote a very nice summary of statistical issues in genome-wide association studies, concluding that there have been few, if any, similar bursts of discovery in the history of medical research — and I think that's probably true over this short a period of time. Probably the biggest lesson from the initial burst of genome-wide association studies has been that we really don't know much about disease pathophysiology or biologic pathways. These are just a few of the signals in genes that nobody would have suspected of being related to these particular diseases. Macular degeneration, as I mentioned, is related to complement factor H, which is part of the inflammatory pathway — macular degeneration was thought to be an ischemic disease until this finding. Coronary disease is related to CDKN2A and 2B, cell cycle genes actually related to cancer, which wouldn't have been on anybody's candidate gene list. Childhood asthma is related to ORMDL3, which I'll go into a little more; type 2 diabetes to another cell cycle variant; and QT interval prolongation, as Tom mentioned, to a nitric oxide synthase variant. A number of variants have also been found in places where there aren't any known genes at all. In the past, when linkage signals were found in these areas, they were discounted — just like linkage signals in introns, people said, oh, that can't be right, it must be a false positive; we know there are lots of those. But the 8q24 association in particular has been found over and over and over again. It's there, it's not going to go away, and it really is going to change our understanding of the biology of the genome to understand how a variant in a place that has nothing to do with protein coding is so related to cancer. Crohn's disease also has several strongly replicated variants in areas without any known genes.
And then there are signals in common across diseases — diseases that people would not have thought were terribly related. Diabetes and CHD, maybe you might have thought so, but even when you control for diabetes as a risk factor for CHD, these associations remain. These variants, as I mentioned, are related to cancer, particularly familial invasive melanoma — again, not something one would have expected — and they're also associated with frailty in a study from the UK. Prostate, breast, and colorectal cancer are all associated with this 8q24 region. Crohn's disease and psoriasis do have some characteristics in common, so maybe that's not a big surprise, but I think Crohn's disease and type 1 diabetes were never expected to be related beyond the immune signal in the major histocompatibility complex, where they do share a locus — yet they also share one in this phosphatase, PTPN2, and rheumatoid arthritis and type 1 diabetes share another phosphatase. So lots of new things have been learned from these studies that are, in many ways, setting biology on its ear. Unique aspects of these studies are that they permit examination of variability at an unprecedented level of resolution — down to the 5 to 10 kb region, or even tighter. As Tom has mentioned a couple of times, they permit an agnostic, genome-wide interrogation, so you don't have to identify candidate genes for the places you want to look. And one of the nice things is that once you've measured the genome, you can relate it to just about any trait: if you have a genome-wide association study in a cohort or a group that's been characterized in extensive ways, you can then relate the genotypes to anything, whereas with candidate gene studies you had a separate set of candidate genes for each trait.
And as I mentioned, most of the robust associations have not been in genes previously suspected of association, nor even in regions known to harbor genes. But as Hunter and Kraft point out, the chief strength of the new approach is also its chief problem: with more than 500,000 comparisons, the potential for false positives is unprecedented. We were worried about false positives with candidate genes; here it's a huge, huge problem. And again, Gary Larson knew this: God, colleagues, I hate to start on Monday with a case like this — and here you see the knife sticking out of a back at the Butlers of the World annual convention. The challenge is, how do you find the murderer among all of these false positives? There are a number of ways of dealing with this multiple testing problem. Probably the most familiar, the easiest to grasp, and the most commonly used is the Bonferroni correction — simply dividing your alpha level by the number of tests performed. There are a couple of others in the literature. Probably the second most common is the false discovery rate, the proportion of significant associations that are truly false positives — so it gives you a different denominator. There's also the false-positive report probability of Wacholder at the NCI, the probability that the null hypothesis is true given a significant finding. Logically, I don't see much of a difference between these two, at least as described by their authors, but they are different mathematically and are said to give somewhat different results. Probably the best approach, though, is replication of a finding — replication many, many times. To address this, we recognized a need to define what replication consists of; this is a report from an NCI-NHGRI working group convened in November 2006. And there are a number of ways of going about replication.
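A minimal sketch of two of the corrections just mentioned: the Bonferroni threshold, and a Benjamini-Hochberg step-up procedure, which is one standard way the false discovery rate idea is operationalized (the lecture names the FDR concept, not this exact recipe).

```python
# Two multiple-testing corrections discussed above.
# Bonferroni: divide the alpha level by the number of tests.
# Benjamini-Hochberg: one standard procedure controlling the false
# discovery rate (a common concrete choice; the lecture names only
# the FDR idea, not this specific algorithm).

def bonferroni(pvals, alpha=0.05):
    """True where p passes the Bonferroni-corrected threshold."""
    cutoff = alpha / len(pvals)
    return [p <= cutoff for p in pvals]

def benjamini_hochberg(pvals, q=0.05):
    """True where p is declared a discovery at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its step-up threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    keep = set(order[:k])
    return [i in keep for i in range(m)]

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.5]
# With 8 tests, Bonferroni keeps only 0.001; BH also keeps 0.008.
```

The difference in behavior is the point: Bonferroni controls the chance of even one false positive, while the FDR procedure tolerates a small, controlled fraction of false positives in exchange for more discoveries — which matters when you are testing 500,000 SNPs rather than 8.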
This is the design of a study to do replication, given an initial study, with the replication sets all lined up. It was proposed by the group studying prostate cancer at the NCI — Bob Hoover, maybe a name some of you recognize. They were going to start with 1,200 cases and 1,200 controls but test, you know, 500,000 tag SNPs using the Illumina platform, a very wide, dense genome array. Then in their first replication study they were going to expand the number of cases and controls and test a smaller but still fairly large number of SNPs — about 5% of them. The second replication study had an almost similar number of cases and controls but a smaller number of SNPs — and you notice this funnel narrowing down — then the third replication study went down to maybe 200 or so, perhaps ending up with 25 to 50 loci at the end. When they actually did the study, they ended up with about five or six loci. So replication is key. One of the things we recognized in the working group is that the initial study has to be described in enough detail to be replicated, and many times these studies are very, very poorly described. One Crohn's disease study famously described its cases as "Belgian," and that's all — it's very difficult, then, to know how to try to replicate such a thing. Participation rates and flow charts of selection would be very useful for an epidemiologist or others to know how selected a population is. Methods for assessing affected status are very often not described, or it's just "a clinician's diagnosis." A table one describing how cases and controls compare on other factors, rates of missing data, assessment of population heterogeneity, genotyping methods, and quality control metrics all matter.
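The funnel can be sketched as a simple ranking exercise — genotype everything once, then carry only the smallest p-values forward at each stage. The SNP counts and carry-forward fractions below are illustrative stand-ins, not the actual NCI design.

```python
# Toy sketch of the multi-stage "funnel" replication design described
# above: rank stage-1 SNPs by p-value and carry the top slice forward.
# SNP counts and carry-forward fractions are illustrative only; here
# the p-values are random (pure null), just to show the mechanics.

import random

random.seed(42)
N_SNPS = 10_000  # stand-in for ~500,000 tag SNPs
stage1_pvals = {f"snp{i}": random.random() for i in range(N_SNPS)}

def carry_forward(pvals, fraction):
    """Keep the `fraction` of SNPs with the smallest p-values."""
    k = max(1, round(len(pvals) * fraction))
    return sorted(pvals, key=pvals.get)[:k]

stage2_snps = carry_forward(stage1_pvals, 0.05)  # ~5% go to replication 1
stage3_snps = stage2_snps[:200]                  # narrow further down the funnel
final_loci = stage3_snps[:25]                    # candidate loci at the end
```

Since the p-values here are pure noise, every "finding" this toy funnel produces is a false positive — which is exactly the point the lecture makes about why genuinely independent replication samples, not just ranking, are needed at each stage.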
More and more, these things are coming to be included in genome-wide reports — particularly because our replication working group had four journal editors as co-authors, which helped a lot. Then the replication study should have a similar population, a similar phenotype, the same genetic model, and the same SNP in the same direction. It's amazing how some replication studies — and I'll show you some tomorrow — didn't even find the same allele or the same direction of association and still claimed replication. And it should be adequately powered to detect the postulated effect. So how was this done in some of the genome-wide studies out there? One of the earliest very big ones was this breast cancer study from the UK, where they actually started with a relatively small number of cases and controls — about 400 of each — but selected them to be strongly familially loaded; I believe these women had to have at least two first-degree relatives with breast cancer. They tested 260,000 SNPs in them. Their stage two was ten times as large — 4,000 cases, 4,000 controls — and 5% of their SNPs were carried forward into it. Stage three, several times as large again, brought forward only 30 SNPs, and they finally came out with six SNPs that were significant across all of these stages. These are all the cohorts they used to get to their 40,000 or 50,000-some subjects — a huge, huge undertaking, really a global collaboration. One of the things that's important to screen out, too — or make sure you don't miss — are the false negatives, shown here: now Edgar's gone, something's going on around here. You want to be sure you're not missing even subtle signals. The way this was approached in the CGEMS prostate cancer study was to take a larger number of cases and controls, and more SNPs, because they wanted to catch as many as they could.
But then they basically took 4,000 cases and 4,000 controls, so they modified their design a little bit. They brought forward 5% of their SNPs, as the Easton breast cancer study had done, and they selected everything at P less than 0.068. I've forgotten how they came to this; it was some fancy false positive report probability parameter, but at any rate. What's interesting is what happened when they compared their first and second stages. The way these studies are analyzed, they're often analyzed together in a joint analysis, correcting your P values for that joint analysis. Here are the SNPs that came out and the genes that were associated, with the P values from stages one and two. But what's neat to look at is the initial rank from the stage one study. This particular SNP was ranked around 24,000, so it was way, way, way down; it almost didn't get pulled into the 26,000 or so that were carried forward. And similarly for these other SNPs that were strongly positive: one of them was just above, sorry, just below that cutoff, and several of them were certainly not near the top 100. And among the initial P values, only this one would even have been on anybody's radar screen. So recognize that it is easy to miss important associations as well. Again, these need further replication and further investigation, but it is sort of a harsh lesson, I think: if you just carry forward your top 100 SNPs, you may be carrying forward all your genotyping error and not a whole lot else. Tom asked me to comment a bit about genome-wide association in cohort studies. This is beginning to be done in cohorts that have been prospectively collected: people either free of disease, or population-based samples with and without disease, however they may come, with exposures measured over time. Particularly in the Framingham study, you may have seen this report in BMC Medical Genetics. This is sort of the cover paper that describes the 100K SNP genotyping study resource.
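The rank-24,000 anecdote has a simple arithmetic explanation: a real signal that lands at a modest z-score in an underpowered first stage is outranked, on average, by tens of thousands of null SNPs. A small illustrative calculation, under the idealized assumption of independent standard-normal null statistics (the function is mine, purely for illustration):

```python
from statistics import NormalDist

def expected_rank(z_observed, n_null_snps):
    """Expected 1-based rank (ordering SNPs by two-sided p-value) of
    one real signal observed at z-score z_observed, among n_null_snps
    independent null SNPs. Each null SNP outranks the signal with
    probability equal to the signal's two-sided p-value."""
    nd = NormalDist()
    p_two_sided = 2 * (1 - nd.cdf(abs(z_observed)))
    return 1 + n_null_snps * p_two_sided

# A real but modest effect that happens to land at z ~ 2.0 in a weak
# first stage sits, on average, somewhere around rank 23,000 out of
# 500,000, so a "top 100" cutoff would have discarded it.
print(round(expected_rank(2.0, 500_000)))
```

By contrast, at z = 5 the same calculation gives an expected rank near the very top, which is why only strongly powered first stages can afford aggressive cutoffs.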
And then there were 17 phenotype working group reports published in the same issue. Framingham has since undergone 500K genotyping, and that's available as the Framingham SHARe resource; the SNP Health Association Resource, I believe, is what it stands for. Those data are also available through dbSNP, sorry, dbGaP, through a controlled-access process. And the Women's Health Study, which is one of the Harvard women's cohorts, has also had a genome-wide association study done in 25,000 women. These women were nutritionists, dietitians, and physical therapists; I think they were health professionals but not nurses. And these data will also be made available through a controlled-access process. This is the dbGaP entrance page. The Database of Genotypes and Phenotypes was developed by the National Center for Biotechnology Information, in recognition that genome-wide association was coming and that these data needed to be made available in a way that could be managed responsibly and still be accessible. It was developed basically as the Framingham resource and the GAIN study, the Genetic Association Information Network, which I was involved in, were moving forward; we kind of developed the policies and developed the resource together. And you can see, this may not be the latest screenshot, but Framingham SHARe is certainly on it. If you click on it, and I don't have a screenshot from Framingham SHARe, but going down to the ADHD study, which I know a little bit better, there's a description of the study that actually can go on for quite some time; this is a summary of it. You can search within it for various things. You can also look at particular variables; the Vs are for variables, D I think for data that's available, et cetera. You can also ask for information on how to apply for individual-level data. And if you have an eRA Commons number at NIH, you're eligible to apply.
Anyone who has submitted a grant should have an eRA Commons number. You have to get certain credentialing through your institution in order to get such a number, and we felt that that was the best way of credentialing a requester, so that we didn't have kids in garages requesting data. It's not that hard to get an eRA Commons number; there are some steps you have to go through. So if you don't have one and you want access to these data, it is possible to get one, but it is a bit more of a credentialing step. And then there are descriptions of how one gets to these data and what the use restrictions might be on a given data set, so that you're aware of those. And then, maybe just to finish up and let you out a little bit earlier on this lovely day: genetic association in clinical trials. There has not been as much work done in clinical trials on genes and genetic associations, certainly not as much as has been done in observational studies, even though most of that was candidate gene work. Some interesting stuff came out in the year 2000 on beta-adrenergic receptor polymorphisms and response to albuterol in asthma, showing a really pretty profound association with a particular variant, the arginine-16-glycine variant, associated with actually worsening pulmonary function if you had the variant allele. TCF7L2 polymorphisms, this is that variant in diabetes that I mentioned earlier, have been typed in the Diabetes Prevention Program. The Diabetes Prevention Program was the clinical trial that looked at incidence of diabetes in pre-diabetics, people who had elevated blood sugar or impaired glucose tolerance or obesity. There were three arms: they were randomized to increased physical activity and diet control, to metformin, or to sort of standard health advice, and it showed that the increased physical activity and weight control arm was actually the most effective.
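The reason to genotype a trial like this is to ask whether the intervention's effect differs by genotype. A minimal sketch of that comparison on the odds-ratio scale; all counts here are hypothetical, purely for illustration, and a real analysis would test the interaction formally (for example, with a logistic regression interaction term):

```python
def odds_ratio(cases_tx, noncases_tx, cases_ctl, noncases_ctl):
    """Odds ratio for the outcome, treatment arm vs. control arm,
    from a 2x2 table of counts."""
    return (cases_tx * noncases_ctl) / (noncases_tx * cases_ctl)

def interaction_ratio(stratum_a, stratum_b):
    """Ratio of treatment odds ratios across two genotype strata.
    A value of 1.0 means the treatment works the same in both
    genotypes; a departure from 1.0 suggests gene-by-treatment
    interaction on the multiplicative scale."""
    return odds_ratio(*stratum_a) / odds_ratio(*stratum_b)

# Hypothetical counts per genotype stratum:
# (cases on treatment, non-cases on treatment,
#  cases on control,   non-cases on control)
carriers    = (30, 170, 60, 140)   # treatment roughly halves the odds
noncarriers = (50, 150, 55, 145)   # little apparent treatment effect

print(interaction_ratio(carriers, noncarriers))  # well below 1
```

With balanced counts and no effect in either stratum the ratio sits at 1.0, which is the null being tested when a trial is mined for genotype-specific treatment response.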
And in fact, TCF7L2 and a couple of other genes have been genotyped in that study, showing basically that the interventions work differently in some genotypes, not for TCF7L2 but for some of the others. It's a nice way, actually, to make use of clinical trial information, to see if you can find variants that affect or interact with various treatments. This was also done in the ALLHAT study, this paper from Boerwinkle and Arnett and their colleagues, looking at a variant in NPPA, and I've forgotten what NPPA is, but at any rate, this was just recently published in January of 2008. It is again a way of making use of clinical trials that have DNA available to test genetic associations. There really have been only two genome-wide association studies in clinical trials that we could find. One is a very small one looking at hepatic adverse events, elevated transaminases, in people who received an anti-clotting agent called ximelagatran, which reminds me of sort of a Transformer name, but at any rate it was associated with one of the MHC DRB1 alleles. When you look at what their data consist of, it seems to me that, boy, this is just begging to be a false positive, but it needs to be replicated. And then a second one, a little bit more robust perhaps, looking at response to interferon beta therapy in multiple sclerosis and showing an association with a particular variant. So to sum up, then: candidate gene association studies have been enormously prone to spurious associations and have, I think, received some appropriate skepticism because of that. Genome-wide association really provides a new paradigm, unconstrained by our current imperfect understanding of the genome. Initial findings have been really surprisingly positive. Sometimes we sort of pinch ourselves, are we awake? But they're robust, they're replicated, and now some important aspects of biology are coming out of these, so they're probably real.
It's beginning to be applied in cohort studies, though more needs to be done there, and very little work has been done in clinical trials and treatment response. So I think one of the key lessons from this is that we need to get epidemiologists, clinical trialists, and geneticists together. And that's my closing Gary Larson: "What have I always said? Sheep and cattle just don't mix," and you can see they're having trouble here. So I think I'll stop at that point and be happy to take any questions. Why don't you go first? Good. Oh, yes, you need your microphone. Yep, I turned it off. Just a second. Up, there you go. I just want to make sure I understood you correctly about genome-wide associations and twin cohorts. Is it that in a co-twin design, you sort of toss out the exposure-concordant twins, and the only thing that contributes to the association is the exposure-discordant twins? So if you take exposure-concordant twins who are then disease-discordant, does that sort of speak to a genetic difference? It may, if you're talking about monozygotic twins. Monozygotic, yeah, absolutely. So exposure concordant? Disease discordant. Disease discordant. Yeah, so now it sort of adds an interesting spin to the co-twin design, because usually we think of monozygotic twins as giving an association that's adjusted for genes, right? Because they're matched. But now, I'm not sure, you're saying that with this epigenetic stuff, they're looking for changes in the genetics. You know, in a twin pair, if they're both exposed and one gets the disease and one doesn't, there must be some genetic change in the one. Well, no, I mean, it could be. If you have exposure-concordant twins, and assume that the exposure is exactly the same in both of them, is that exposure then doing something to their genomes in a different way, I guess, is what we're asking.
And I'm not sure I'd necessarily put the constraint of matching on the exposure, since it's so difficult to do anyway; it's sort of ideal. But epigenetic changes happen in various ways. They generally are methylation changes, and so they're thought to be related to things like dietary folate, but nobody knows really how they come about and where they're going. So maybe both twins smoke, and one of them has an epigenetic modification from somewhere else. Maybe their glutathione transferase gene, which is important in the metabolism of nicotine, is turned on in one twin and turned off in the other. So that might be a design, you're right, for finding some of those genetic variants that are not the sequence itself but something related to the sequence. But there's also a lot of other variability in there to try to tease out. Okay, yeah, thank you. Yeah, twin studies are challenging, but there are real opportunities there and we shouldn't just discount them. Recognize too that even dizygotic twins are matched for age, which is great. They're matched for a lot of exposures, and half the time they're matched for sex, so that's cool. Sir? Can I ask a really stupid question? It just shows my ignorance. No, there are no stupid questions. But let me ask a non-stupid question. Well, recently we received some, well, sort of questions; the NHLBI project officer asked us to reconsider. I'm sure they were brilliant questions. No, no, no, this is brilliant. To reconsider, oh yeah, you just showed that for these genome-wide association studies, or the cohort studies, you know, like SHARe or CARe, they plan to put the data on the web, so people can apply and do the analysis and so on. One of the issues that concerns a lot of investigators is confidentiality. And in the past, we received a letter from OHRP saying this is not human subjects research. Not human subjects, for sure.
But now it sounds like there is some different consideration on that, that they consider this letter no longer valid. You know, that letter has been very controversial from the beginning. This was something that was put up by the Office for Human Research Protections in August of 2004, and it was their finding that basically, if data are de-identified, and one can debate as to what de-identified means, but their definition of de-identified was the 18 HIPAA identifiers, which is, you know, age and birth date and stuff like that. If those were off, then this was not human subjects research. Now, that was guidance put out to IRBs, and it really was up to an IRB whether or not to accept that guidance. Most IRBs have accepted it and have basically said this is not human subjects research. The institutes at NIH or elsewhere can't tell an IRB what to think and how to act. Some institutes have asked their IRBs, are you sure, do you really feel that this is or is not human subjects research? But others have just accepted whatever an IRB says. So I'm not sure where the question you got came from. Well, OHRP is saying that it's not human subjects. So they actually gave administrative approval rather than going through the whole panel. Yeah, I think it's important not to perceive or portray what's going on in dbGaP as posting the data on the web, because it's not that. What's posted on the web is a description of the study. To receive the data, you have to go through a fairly complex process, and you have to agree to keep them secure, to maintain confidentiality, not to send them to anybody else, and not to try to identify anybody. And there are some fairly significant sanctions for not doing that, which have to do with your relationship with the NIH.
I mean, people could sue you and that sort of thing, but it's much more a matter of: if NIH were to find out that you had misused these data, you'd never get any more, that's for sure, and you'd probably have some difficulties with other aspects of your interactions with NIH. Now the stupid question. Well, in a genome-wide association study, sometimes you just identify a region, right? The region that's associated with certain diseases. I remember last year in Science, I think there were five studies that confirmed a certain region associated with MI, with the risk increased by 23% or something. I thought you were going to go with the obesity one; I'm not sure about the obesity. Yeah, I forget exactly. But I wonder, if you identify a pretty narrow region, why can't you just identify the genes? I mean, just zero in and do it. Oh, sure, and in some regions that you identify there's just one gene, and isn't that wonderful? You say, oh, there's just one, it must be it, like I showed you there with IL-23R, oh, that must be it. Well, yeah, but we also know that there are some regions where there aren't any genes, and you've got an association. So maybe it's something about the way that region interferes with the way something else happens in the genome. Or there are regions that have two or three genes, and I'll show you an example of that tomorrow. So trying to really figure out which variant it is that's causing your phenotype is a real challenge. You go through various steps of trying to identify what's in the neighboring region, what it's in linkage disequilibrium with, whether it's conserved across species. Did evolution somehow think it was important to keep it the same? If it's heavily conserved, that suggests that it's pretty important. And then there are other ways you can knock it down; you can give an interfering RNA and see what effect that has.
If you reduce the function of it, you can see if it's expressed. There are a variety of ways of testing that, which, I'm not a molecular geneticist, so I can only kind of skim the surface for you, but we'll go into that a little bit tomorrow. Hi, I think you are already familiar with the Korean Genome Epidemiology Study. Yeah. So it was a very big study, and it's still ongoing. We've already collected data on more than 200,000 people, but it's still ongoing. Over the last several years, we had lots of debates about the study design, mostly focused on the sample size and what kinds of phenotypes we have to measure. But I think we paid little attention to the representativeness of the study population. Some of the centers are recruiting cohort members from health screening centers and even from some hospitals, because it is easier and costs less, but I'm not sure about the representativeness for our general epidemiology studies. Can you explain a little more about the importance of representativeness? Yeah, well, you're probably in the center of a place that knows about representativeness and its importance in epidemiologic studies. I could comment on it for genome-wide association studies. It doesn't seem to be quite so critical, at least for the genetic variants we've found to date. Usually what you'd expect is that either you miss a whole bunch of stuff because you bias yourself toward the null, or you find a lot of spurious things because of the bias. And yet the one study where I think we would all have said, don't do it this way, was the Wellcome Trust case-control study, where they used blood donors as basically half of their controls, and then the 1958 birth cohort, sort of the survivors of that birth cohort, as the other half. None of them were in the same cities or the same places as their cases, and they hadn't been ascertained in the same way.
And probably that's why they didn't find an association with hypertension, because hypertension was likely very common in the controls as well and they hadn't phenotyped their controls. And yet the associations that they did find have been replicated again and again and again. So maybe, if all you're looking at is that DNA sequence, maybe it's not so bad. But once you want to get into either understanding gene function or understanding gene-environment interaction, how those associations are modified, how they change over the lifespan, I think you're sunk with a study like that. So that's where representativeness helps you. We're going to talk about some of the issues of bias, of which the non-representativeness of the cohort could be one among a bunch. I agree with Terry that at the current stage it's been amazingly robust; the polymorphisms got identified all the same. But I think with a study that large, you'd have the opportunity and the power to look at quite a robust view of the genetic causation of diseases. My reading of the literature is that we've been kind of skimming across the causative genes and identifying perhaps the prevalent ones with large odds ratios. And usually what bias does, as you all know, is bias you toward the null. So your sensitivity to find all of the genes may be hindered, or findings may even be made spurious, by non-representativeness. As we go through and look for the first-cut major genes, I agree with Terry that it's been amazingly insensitive to bizarre groups being compared. But if we really want to get down to saying this is the whole biology of this disease, down to some of the very small effects, you're going to need some of those larger, elegant, well-representative studies. I mean, I think it's a real sort of creative tension.
And when you talk to the Wellcome Trust folks who designed that study, they did it at a time when this had not been done before. There was the macular degeneration study, and that was basically it. And they just wanted to find something. They sort of said, we don't want to find it all, we don't even want to find a majority, we just want to find something. And they achieved that.