I was asked to describe sort of what this new approach is that everyone's so excited about in terms of genome-wide association, and I hope I'm being picked up now. Larry had asked me to make the point that there's a revolution going on, so I went for the most revolutionary picture I could find, which is Willard's "Spirit of '76." And it really is worth making the point that something dramatic has changed here. Larry talked about it a bit, as did Emily and Francis: we're much more able to scan the genome and look for differences between people than we have ever been before. Technologic advances now allow us to measure hundreds of thousands, now millions, of variable points across the genome at a relatively low cost, certainly not the $50 billion that Francis mentioned earlier, probably about $500 per person depending on the platform that you use. And using relatively little DNA: it used to be that we needed close to a microgram of DNA to measure really even just one genotype, and you only get maybe 20 or 100 micrograms out of a blood sample. That went down quite a bit, but it was still so much that you couldn't measure this many variants in one person without a whole lot of blood or DNA. Now we can do all of these measurements with a microgram or even considerably less than that. What this also means is that these technologies can be applied to unrelated individuals. You heard earlier that many studies funded by NIH and many other groups around the country and the world have identified lots and lots of people and studied them for the development of a whole bunch of different diseases, schizophrenia or autism or psoriasis or whatever. Many of those folks have DNA stored in freezers around the country and the world, just waiting for these kinds of technologies to be available, and now they can be applied.
We can also identify a multitude of subtle genetic effects that increase the risk of complex diseases. Neil Risch, who many of you may have heard speak, likes to say that genetically complex diseases, which we describe as diseases that are due to multiple genes, are only "complex" because we looked for single genes and didn't find them, so they must be complex. And remember that when we talk about Mendelian diseases, we're really talking about a single gene, and those are sort of the lampposts that we had been studying for so many years. So what is a genome-wide association study? It's a method for interrogating all of the 10 million or so variable points across the genome. We've heard already that this variation is inherited in groups, luckily for us. So we don't have to measure all 10 million points; we can just measure a small subset of them. These blocks are longer in people who are more closely related. They're very long in identical twins; in fact, they're the entire genome. In siblings or parents and offspring, they're maybe 10 million base pairs long, or perhaps a little longer than that. And the less closely you're related, the shorter those blocks are. So when you have people who are not related at all, except for 100,000 years ago when we all came out of Africa, you do need to test more and more of these SNPs. But you don't have to test them all. And so we now are able to do studies in unrelated people, assuming about a 10,000 base pair length that's shared, and that does vary by population. In older populations, like populations in Africa, that length is much shorter. In younger populations, American Indians and other such populations, those lengths are much longer. One of the challenges in studying populations of recent African ancestry is that you do need to test more spots.
And until we realized this, a lot of times we would look and really not see an association in Africans, and yet we would see it in non-Africans, and we'd say, oh, must be something funny here, let's just focus on the non-Africans. I think Vince will talk a little later about how that's been a challenge to deal with, but one that we can deal with. So just to go back over this concept of linkage disequilibrium: this is a paper from the New England Journal just very recently, talking about what SNPs and genome-wide association may mean for medicine. It shows a chromosome and pulls out a gene, and here are various SNPs. These red things are the exons, and you see SNPs 1, 2, 3, and 4; again, a hypothetical gene. What's shown down here, you may see these kinds of triangles. This is very much like the maps AAA would give you, saying how far is it from New York to Chicago, or from New York to San Francisco, or New York to Tokyo; that's basically what this is. So this is the correlation between SNP1 and SNP2, and when this block is very dark, it means that, boy, if you know SNP1, you can be pretty sure, like Larry's gray sock, that the other sock is probably gray. SNP1 and SNP3 are pretty closely correlated, as are SNP1 and SNP4, but when you get down to SNP1 and SNP5, they're not well correlated at all. So probably what happened, between SNPs 4 and 5, was a recombination event: the DNA crossed over, and maybe there was some advantage, or maybe it was just a random event, but at any rate, those two are not well correlated. So say we just look at SNPs 3, 4, and 5 in this very nice diagram that they did here, showing SNP3 and SNP4. This could be a G or a T at SNP3, a C or a T at SNP4, but notice that every place where a person has a G at SNP3, they have a C at SNP4, every place; and every place where there's a T, there's a T.
So those two are very closely correlated; if you know one, you know the other. As opposed to SNP4 and SNP5: sometimes you have a C and you've got a T, sometimes you have a C and you have an A; sometimes you have a T here and there's an A, sometimes a T and a T. So those SNPs are not well correlated, and this is the concept of tag SNPs. This one can act as a proxy for that one; it can't act as a proxy for the other, so you need a different one. That's all we're talking about here. Mapping those was what the HapMap did: to really define which ones are closely related to which others. And while that was going on, and partly stimulated by the HapMap project, genotyping technology became much, much more efficient and less costly. This is a slide from my colleague Stephen Chanock showing that way back in 2001, we were probably spending about a dollar per genotype, maybe a little less than that, for the standard genotyping technology of that day. And over time to 2005, those costs, and these are the different platforms, have gone down, and the number of SNPs that one can measure has gone up fairly dramatically. So here at the end of 2005, we were measuring between 100,000 and 500,000 SNPs at a cost of about a penny a genotype or even less. Those costs have continued to decline. This is only through October 2006, from my colleague Stacey Gabriel, and I should probably update it further. And now we're showing these not by cost per SNP, but cost per person. So for a person's entire genome, both sets of their DNA, initially starting at maybe $1,600 per person for the Affymetrix platform in July of 2005, that has declined dramatically. Other products have come on the market that have more and more SNPs, and Affymetrix now has one that's a million SNPs, as does Illumina. These have probably dropped down to about $200 or $300 per person genotyped, and the 1 million SNP ones will come down in cost as well.
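The tag-SNP idea above can be sketched in a few lines of Python: compute the squared correlation (r², a standard measure of linkage disequilibrium) between genotype columns coded as minor-allele counts. All genotype values here are made up purely for illustration.

```python
# Sketch of the tag-SNP idea: if two SNPs are in strong linkage
# disequilibrium, the genotype at one predicts the genotype at the
# other. Genotypes are coded as minor-allele counts (0, 1, 2).
# All data here are hypothetical.

def r_squared(snp_a, snp_b):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(snp_a)
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    return cov * cov / (var_a * var_b)

# SNP1 and SNP2 travel together, like the matched gray socks ...
snp1 = [0, 1, 2, 0, 1, 2, 0, 2]
snp2 = [0, 1, 2, 0, 1, 2, 0, 2]
# ... while SNP5 lies beyond a recombination event.
snp5 = [2, 0, 0, 2, 1, 1, 0, 2]

print(r_squared(snp1, snp2))  # 1.0: SNP1 tags SNP2 perfectly
print(r_squared(snp1, snp5))  # about 0.03: a separate tag SNP is needed
```

A dark cell in the triangle plot corresponds to r² near 1; the light cells, like SNP1 versus SNP5, are the pairs where one SNP cannot stand in for the other.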
So this has been really quite a dramatic change, and it has enabled us to afford these kinds of studies. Larry talked a bit about the chips, and you see them around there. This is the data that you get off of these. When a genotyping lab does this, basically their computer produces a picture like this, which for SNP rs290510 shows you the three different genotypes. So here you have someone who's homozygous for one allele, I don't know which. Here's the heterozygote, and here's the other homozygote, and likewise here; you can probably ignore these for the moment. But anyway, these are the numbers of people, and shown up here is basically the intensity of the light that's reflected back and read by the computer. And then there's a clustering algorithm, and these algorithms are very important and very complicated, and they also change fairly rapidly, that tries to basically read three different intensities, assuming that you have a SNP that is polymorphic. So you have two different copies; you have the A and the T. You could have picked up a sample that just by chance only had T's. That SNP would be called monomorphic in that population, and in that case, you should see everybody clustering at one end. Sometimes the computer algorithms get confused when they see that, and they try to make it into two or three clusters or whatever. And so it's important, when you have a SNP that you're very interested in, to really take a look at these plots. You can't look at all 300,000 or one million, but you can look at the two or three or five or 10 that you're quite interested in. As you can see here, these are called, these purple ones are called, etc., but then there are a couple of folks kind of hanging out here that the algorithm doesn't quite know what to do with. And so there are errors, or challenges, in the technology and in being able to read these. These would be "not called," or missing, SNPs.
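As a toy illustration of what a calling algorithm does with those intensity clusters, here is a nearest-centroid sketch. Real calling algorithms are far more sophisticated; the centroid positions and the no-call cutoff below are invented for the example.

```python
# A toy version of a genotype-calling step: each sample yields two
# fluorescence intensities (one per allele); the fraction of signal
# from the B allele clusters into three groups (AA, AB, BB). A point
# too far from every cluster is left uncalled, i.e. a missing SNP.

CENTROIDS = {"AA": 0.0, "AB": 0.5, "BB": 1.0}  # expected B-allele fraction
NO_CALL_DISTANCE = 0.15                         # arbitrary cutoff for this sketch

def call_genotype(intensity_a, intensity_b):
    """Assign the nearest genotype cluster, or None for a no-call."""
    frac_b = intensity_b / (intensity_a + intensity_b)
    genotype, dist = min(
        ((g, abs(frac_b - c)) for g, c in CENTROIDS.items()),
        key=lambda pair: pair[1],
    )
    return genotype if dist <= NO_CALL_DISTANCE else None

print(call_genotype(980, 20))   # 'AA'  clear homozygote
print(call_genotype(510, 490))  # 'AB'  clear heterozygote
print(call_genotype(300, 700))  # None  hanging out between clusters
```

The third call is the "couple of folks hanging out" case: the point is closer to the AB centroid than any other, but too far away to call with confidence, so it becomes a missing genotype.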
There are different rates of missing SNPs on different platforms, and different genotype DNA quality will give you different rates. All of these things are recommended to be reported in the report of a genome-wide association study. Unfortunately, these days the reports are so short that you end up having to look for that in the supplementary material. But most labs that are doing these now will report out their quality control, and it's very, very good; it's like 99.7% fidelity for these measures. So if you wanted to look at a dataset from a genome-wide association study, you could actually look at the Coriell website. The National Institute of Neurological Disorders and Stroke has done a study of Parkinson's disease, 297 cases, 297 controls. You can go onto their website, agree to keep the data confidential, not to try to identify anyone, and to use them only for scientific purposes, and then basically you'd have a chance to look at these data. So if you pulled up chromosome 22, which I picked because I'm a wimp and it's the smallest chromosome there is, and it's still a huge dataset, the first two SNPs and the first three cases in that dataset are shown here. And here are the alleles: allele one at this SNP for person 14 is a T, allele two is also a T, so they were homozygous. Person 20 is heterozygous. And then for the controls, the first three controls are shown here. And for this SNP, you notice that the frequency of the A allele is much lower; it's about 8%, actually. When you get these results back, they actually give you a file that says, okay, in your sample, we had 8% A's at this point, we had 50% T's at this point, whatever. So what you can then do is what Emily and Francis were showing you: basically count up all the cases and the controls that you have and see how many of them have A's and how many of them have G's.
And if you were to do this, and these are totally made up data, do not report this. But suppose you took a look at this one SNP, which was the second one that I showed you here, the one where only about 8% of people have an A at that spot. Say you took a whole bunch of people, a thousand people, that you collected from greater, you know, Richmond, Virginia, and you genotyped them, and you'd find that maybe about 8% of them, 80 people, have an A, the variant allele, at this particular point. So 920 of them don't have the A; they have the G variant there. And then you follow them forward in time, and you ask, how many of these people actually develop Parkinson's disease? And you find that, gee, of the 80, 10 of them develop Parkinson's disease. And of the 920, only 40 developed Parkinson's disease. You could then estimate a risk, a relative risk, it's called. This was sort of the standard measure of disease risk for many, many years, until we got other computer programs that started calculating other things, which we'll talk about. But basically, you could look at the risk in the exposed, which is 10 out of 80, or 12.5%, and compare it to the risk in the unexposed, 40 out of 920, or 4.3%. And you would get a relative risk of 2.9. So somebody carrying this A allele is 2.9 times more likely to develop disease than somebody not carrying that allele. And that's a measure of risk. Usually, we see estimates for things like smoking or family history in the three to four-fold range for common diseases. The measures that we get for genes for common diseases are much less than that. They're much less than 1.5, usually in the 1.2, 1.3 range, typically. Well, there's a measure called the odds ratio, which you're much more likely to see in genome-wide association studies. And there are two reasons for that. One is that you have to have a certain study design in order to be able to calculate a relative risk, because you have to know what the denominator of your population is.
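The relative-risk arithmetic just described, using the same made-up Parkinson's numbers, comes down to a couple of divisions:

```python
# Cohort-style relative risk from the (made-up) Parkinson's example:
# follow 1,000 people forward and compare the risk of disease in
# carriers of the A allele versus non-carriers.

def relative_risk(cases_exposed, total_exposed,
                  cases_unexposed, total_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    risk_exposed = cases_exposed / total_exposed        # 10/80  = 12.5%
    risk_unexposed = cases_unexposed / total_unexposed  # 40/920 =  4.3%
    return risk_exposed / risk_unexposed

rr = relative_risk(10, 80, 40, 920)
print(round(rr, 1))  # 2.9
```

Note that this only works because the cohort design gives you the denominators, 80 and 920; that is exactly the point made next about why case-control studies report odds ratios instead.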
So you had to know that there were 80 people total, of whom 10 had the disease and the allele, and 920 total, of whom a certain proportion had the disease. Sometimes you don't know that. In a case-control study, you won't, and we'll talk about that in a second. In addition, there are many algorithms, many modeling systems, that basically focus on the odds ratio because it's computationally simpler. And so just to talk about odds, and everybody really intuitively, I think, knows what odds are. Odds are related to probability: they're the probability of an event over the probability of it not happening. So the probability of it happening over 1 minus the probability of it happening, which is the probability of it not happening. If the probability of a horse winning a race is 50%, we all know the odds are 1 to 1. If the probability is 25%, the odds are 1 to 3 for a win, or 3 to 1 against a win. So those are odds. And again, if a person who's exposed to a given risk factor has a probability of getting a disease of 25%, their odds are 25% over 75%, or 1 to 3; pretty simple. When we don't have denominators for risk estimates, which is typical in a case-control study, we calculate an odds ratio. You may have heard about this, if any of you took sort of basic statistics long ago, as a cross-product ratio, AD over BC, and I'll show you a 2 by 2 table where we get the names of these cells. And again, it's computationally easier. And if the disease is rare, the odds ratio approximates the relative risk; it always tends to overestimate it a little bit. So your relative risk might be 2.9 in that example that I showed you earlier. If you calculated an odds ratio, which we would do with the cross-product, we call these cells A, B, C, and D, very novel, but anyway, AD over BC, we would get an odds ratio of about 3.1. So not that far off.
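Both ideas above, odds from a probability and the cross-product odds ratio, can be checked directly on the same made-up 2-by-2 table (A carriers with and without disease: a = 10, b = 70; non-carriers: c = 40, d = 880):

```python
# Odds and the cross-product odds ratio, using the made-up Parkinson's
# 2x2 table:
#                  disease   no disease
#   A carriers        a=10        b=70
#   non-carriers      c=40       d=880

def odds(p):
    """Odds corresponding to a probability: p / (1 - p)."""
    return p / (1 - p)

def odds_ratio(a, b, c, d):
    """Cross-product ratio AD over BC."""
    return (a * d) / (b * c)

print(odds(0.25))                           # 0.333..., i.e. 1 to 3 for a win
print(round(odds_ratio(10, 70, 40, 880), 1))  # 3.1, vs relative risk 2.9
```

Because the disease here is fairly rare, the odds ratio (3.1) lands close to, and slightly above, the relative risk (2.9), as described.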
So here I actually took the data that Francis is going to show you from the Helgadottir paper, which is under embargo. I won't give you the name of the SNP, but many of you probably have already seen it anyway. I basically took the data that they had in their tables, which you had to back-calculate a bit. At any rate, they had basically a group of 1,607 cases, people who had myocardial infarction, and 6,728 controls who did not have myocardial infarction. But we really don't know what the denominators of these are; these are sort of cases that they identified through a whole series of different studies. And I'm sorry, this is not in the book. We were trying to update things and be as up to date as we possibly could be, so my apologies. What I had given you was a totally made up example from the Parkinson's data, and I thought a real example might be more fun. So anyway, if you look at the frequencies of their alleles in their cases and controls, you can calculate these numbers. And then you can look at the odds in the exposed: the odds of disease in the people with the variant allele would be 813 over 3,061. That's probability P over 1 minus P. And the odds in people who did not have that allele would be 794 over 3,667. If you do the cross-product, it's 813 times 3,667 over 794 times 3,061, which equals 1.23. And in the paper, they quote 1.22. So hopefully I did my math right, and maybe we just have a little bit of a rounding error. And again, just remember that's embargoed until Thursday, I think. So the thing that's really conceptually very, very different for somebody like me, as Larry was describing, is that we used to do these one at a time. At the very end of my talk, I have some slides that they made me take out because they said they were boring. But what they show is basically genotypes for one person on chromosome 22, several slides of them.
It's basically about one page of Word, single-spaced type. For chromosome 22 for one person, what you get from the NINDS website is about seven pages. And that's only chromosome 22; there are lots of other chromosomes that are much bigger. Trying to manage these data is just mind-boggling. So basically, new approaches are needed for accessing, manipulating, and visualizing these data, and there have been some very creative approaches to doing this. But it does require an entirely new perspective. We're no longer looking under the lamppost, essentially saying, gee, I know there's a gene for angiotensin converting enzyme, ACE, which I know is somehow related to hypertension, so I'm going to relate my ACE gene polymorphisms to hypertension. And in some studies it was associated and in some studies it was not. In this kind of a paradigm, we're basically saying, we don't know much at all about the genome; we're going to interrogate across the entire thing and see what comes up as being associated. And we do have to recognize that when you do two or five or 10 tests like this, or hundreds of thousands of them, it is possible that the differences we observed just happened by chance. Differences do happen by chance; that's why people gamble. And what you want is to try to filter out the ones that might be due to chance versus the ones that are likely to be real. So I'm sorry, I know it's before lunch and we haven't had a break, but I do have to give you a little bit of statistical epidemiologic kinds of stuff. You probably all have heard of P-values. A P-value is the probability of finding a result as extreme or more extreme than you observed in your study by chance alone. We used to focus on P-values of about P less than 10 to the minus 4th, 0.0001. When I was in epidemiology school, I was told don't bother to look for P-values any smaller than that; they don't mean anything.
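That old rule of thumb collides with testing hundreds of thousands of SNPs at once. A minimal sketch of the usual fix, dividing the conventional 0.05 by the number of tests performed (a Bonferroni-style correction; the test counts below are just arithmetic, not from any particular study):

```python
# Why GWAS p-value thresholds get so small: divide the conventional
# alpha of 0.05 by the number of independent looks at the data.
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after correcting for n_tests looks."""
    return alpha / n_tests

for n_tests in (20, 100_000, 1_000_000):
    threshold = bonferroni_threshold(0.05, n_tests)
    print(f"{n_tests:>9} tests -> p < {threshold:.1e} "
          f"(-log10 = {-math.log10(threshold):.1f})")
```

With 100,000 tests the threshold becomes 5 x 10^-7, which is why genome-wide scans demand p-values that would have seemed absurdly small in the one-test-at-a-time era.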
No, truly, this is what they used to teach us. And then the geneticists came along and said, hey, we can test 100,000 or 500,000 things. We actually want to know if our P-value is 10 to the minus 10th or the minus 20th or the minus 30th, because we want to correct for the number of times we've looked. And when you're looking a million times or so, you do want to have a much smaller P-value. You may have heard of type I error, or alpha error. This is the probability of finding a difference when really, in the truth of the universe, there isn't one. It's also called a spurious association. This has been the bane of what we called candidate gene association studies, because you test the ACE gene and the angiotensinogen gene and lots of different genes for a relationship with hypertension, and if you did them in small samples and you just happened to get lucky, or unlucky, you might find an association. Very few of those associations were subsequently replicated. A type II error is the opposite: with a type I error, you find a difference where there isn't one; with a type II error, you don't find a difference when really there is one. This is one that we tend to worry about a bit more, because we're concerned that we have done a study that basically isn't big enough to be able to detect a difference. The difference was smaller than we expected it to be, we didn't look hard enough, whatever; we might have missed it. Power of a study is closely related to type II error. Basically, if there really is a difference, you can either find it or not find it. If you don't find it, you've committed a type II error. If you do find it, that's the power of your study, which is just one minus the type II error. We usually like to have studies that are powered for about 80% power, so you have an 80% chance of picking up a difference if it's really there.
Some people actually prefer 90% or even a little more than that. And then the effect size is the magnitude of risk associated with the variant. So those are those measures that I mentioned: relative risk 2.9, odds ratio 1.23. There's also something called a hazard ratio that you'll see in some of these papers, which is the risk of a disease occurring over a given time period; it takes into account the amount of time it takes for a disease to develop. And just be aware that you need very large sample sizes for these. If you're looking for very small p-values, which we tend to do because we make so many comparisons, and if the effect size is smaller, so if you're looking for a 1.2-fold relative risk, a 20% increase in risk, you need many, many more people to detect that than for a three-fold or five-fold increase in risk. Now, you might ask yourself, well, do I really care if the increased risk is only 1.2? And it used to be, again, that we would sort of say, mm, you know, that's probably not all that important. But we are finding that there are genes that actually are quite important pathophysiologically, and hence to treatment, that have risks about this size. So we probably do want to detect those. Allele frequency: if you have an allele that's only present in 8% of the population, you're going to need a lot more people to find the association with the disease or the trait than if it's present in 40% or 50%. And the measure that you're studying: if it's very variable, if you have a lot of error in your measurement, it's going to be more difficult to separate out your groups with one variant versus another, if that variant is having an effect on that measure. The more variable a measure is, and blood pressure is a very variable measure, it changes minute to minute, essentially, the larger the sample size you need. And you'll see displays like this.
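The sample-size point can be made concrete with the standard two-proportion approximation. The baseline risk, the genome-wide alpha, and 80% power used here are illustrative assumptions for the sketch, not figures from any particular study:

```python
# Rough per-group sample size to detect a difference in disease risk
# between allele carriers (p1) and non-carriers (p2), using the usual
# normal-approximation formula. Baseline risk, alpha, and power are
# illustrative assumptions.
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=5e-7, power=0.80):
    """Approximate sample size per group for a two-proportion comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

baseline = 0.05                                  # assumed 5% baseline risk
n_big = n_per_group(baseline * 3.0, baseline)    # 3-fold relative risk
n_small = n_per_group(baseline * 1.2, baseline)  # 1.2-fold relative risk
print(n_big, n_small)  # the 1.2-fold effect needs dozens of times more people
```

Shrinking the effect from 3-fold to 1.2-fold inflates the required sample size by well over an order of magnitude at the same alpha and power, which is exactly why these 1.2-range gene effects drive such large studies.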
This is probably the first and best-known truly genome-wide association study, published by Klein et al., looking at age-related macular degeneration. What they did was plot, with these little lines here, every spot along the genome that they had tested, 100,000 of them. This is minus log 10 of P. So remember, your logs were those exponents: if you're looking at a P-value of 10 to the minus 4th, the minus log 10 of that would be four. And so here is your four level, here is your six level. And just above six, at 10 to the minus 7th, is where this one that turned out to be very, very suspicious for being causative, complement factor H, was hiding. So that's one way of looking at them. This is another way, much more colorful, from the Broad and MIT folks, looking at their diabetes scan. Basically what they did was to color code by chromosome. And as you can see, this chromosome is very big; this gap here is the centromere, where you can't measure. And then they get smaller and smaller, and here's my chromosome 22 way down there. But anyway, they're looking at the SNPs that are associated and just plotting the minus log of the P-value. So here's one that's really, really strongly associated, at least highly unlikely to be due to chance. It could be due to things like genotyping error, or it could be due to things like having picked sort of a funky population. So you need to be able to replicate them. But at least it's not due to chance. Okay. This same group recently published a genome-wide scan for prostate cancer. What you'll sometimes see is that instead of showing you the entire genome, although often they'll do this, they'll say, you know, this area looks very interesting. We think it's interesting because there's a clump of SNPs, and because we know that this particular chromosome region, which Francis will talk about, is related to the disease, whatever it might be. And that was done here for prostate cancer.
This is the 8q24 region, which has everyone sort of scratching their heads: why is it that this is related to prostate cancer when there don't seem to be any genes there? It's probably because we know so little about the genome, and we'll learn a great deal about it from this kind of example. What they did was to look at this particular area, and here's the SNP that they found most strongly associated with it. And then, statistically, they adjusted for the presence of that SNP. So you're using a model; you're calculating an odds ratio for each one of these things, and here's the p-value for that odds ratio. And once you hold this one constant statistically, all of the rest of these kind of fall down to the bottom. Their association is much less strong, and that's because they're correlated with this one. And here, this one is sort of the next most strongly associated, and once you adjust for that, all of the rest of these fall down below the threshold. And they do this about five times. It's really a very nice progression in the paper, Haiman et al., in Nature Genetics. And this is one that you'll see from Francis in a little bit, sorry to have stolen it from you. We're looking at chromosome 11. Again, here's a SNP that was of great interest, and here are a bunch of SNPs that are associated with it. Once you adjust for that one, all of the rest of these fall out. And here's one of those triangle AAA diagrams that we showed you previously. This shows why: basically, there's a strong block of linkage disequilibrium; all of these things are correlated with each other, and that's basically what you're picking up with this one SNP. Okay, I mentioned how candidate gene associations have had some challenges in replicating their findings. As you see here, of 600 associations, only six of them were significant in more than three studies.
This is a nice paper by Joel Hirschhorn. And this is not to say that candidate gene studies are bad. What it says is that it's very easy for us to find spurious associations when we only look once or twice. What this taught us was that when we start doing things like genome-wide association, we have to replicate multiple times. And as we've seen, replication really now is considered to be the sine qua non. So you'll see these papers coming through where they've done three or four or five studies at a time, showing, yes, it does replicate in all of these populations. So large sample sizes and multiple studies are needed to replicate the findings. These produce massive data sets. The analysis requires a huge and very specialized effort, and better analytic methods are needed. And we recognize that if we make these data widely available, that will stimulate the development of those methods. In addition, once you measure somebody's genome, you can relate it to anything. You've already got it measured; you can look at their height, you can look at their weight, you can look at lots and lots of different things. So these data sets are very rich. And one of the things that we are focusing on a great deal at NIH is making sure that the data sets are made available to lots of different investigators, so that you don't have sort of the syndrome of the first fly on a beached whale, who lands on it and says, dibs, this is all mine. We certainly don't want that happening with genome studies, and there has been a tendency for that to happen. So we are pushing very hard to make these data widely available. So the revolution is probably here. Extensive characterization is now possible. It can be applied to unrelated individuals to find putative genetic causes of diseases. Many existing studies are out there basically waiting for this technology to be applied to them. But we do need new approaches to manipulating the data.
And we need responsible approaches to sharing data, so that participants are protected and the investigators who produce the data sets also get some recognition for their efforts. And we believe strongly that collaboration, for both replication of findings and investigation of function, is absolutely crucial. So I think at that point I'll stop and be happy to take questions. Yes, Jim. I'm just curious, since the p-values are so important in terms of giving some sort of credibility, and you know that you're getting multiple comparisons so you have to have smaller p-values, isn't there some statistic that accommodates the fact that you are doing, I mean, how does that work? Sure, well, there are a couple of different ways; people debate what's the best way to correct for that. It's very interesting how we ever got, back in the olden days, to the p-value of 0.05. If you had a chance of less than one in 20 of picking up a difference totally by chance, people were sort of comfortable with that. Well, where did that come from? The way it was explained to me is with flipping a coin. You flip a coin and you get heads, and you say, well, you know, I could have gotten heads anyway, it's about 50%. You flip it again and you get heads, and you say, well, 25% chance of that. You flip it again and you get heads, and you kind of say, hmm, that's a little bit odd. But you do it a fourth time and you get heads, and now you want to look at the coin. Okay, so that's something that's unusual, and that's 6.25%. So maybe that's how we got to 5% being a level that people were uncomfortable attributing to chance. But it really is totally arbitrary. And when we look more than one time, we may say, well, you know, I take these 5% of differences as being statistically significant, or unlikely to have happened by chance.
Well, if I do 20 tests, roughly one of them, on average, is going to appear to be different even though it really isn't. So maybe I need to correct for the number of times that I've looked. And one way that people do that is to divide the p-value threshold by the number of times that you've looked. That's called a Bonferroni correction. It's thought to be very conservative, because it assumes that every test you're doing is independent, and these tests are not independent, because many of them are stuck on the same chromosomes. So there are other ways of doing this. You can permute the genotypes: you say, I'm going to randomly shuffle the genotypes, where I know the result is random, and see how often I see an association with my trait. That's another way of correcting; there are a couple of others. Here, I'll slide the microphone to Dr. Collins. Yes, thank you. Please page back to the macular degeneration slide; you just went past it. So, see that p-value of 4.8 times 10 to the minus 7th. Where does that come from? Well, that comes from the fact that in this study, they tested about 100,000 SNPs, and they were assuming sort of this conservative correction. So they said they wanted to achieve effectively a p-value of 0.05, but they're doing 100,000 independent tests, so they've got to divide 0.05 by 100,000, and that gives you that dashed line. They were arguing that any result that fell above that, that is, where the p-value was even better, was likely to be significant, and anything that fell below that dotted line might be significant, but you haven't proved it. Yes. Yeah, you said that a p-value like that can't happen by chance alone, but... I didn't say that. It would be very unlikely, but yes, you might come across one like that. Okay, but you also said it could be a genotyping error. And I guess I don't know, what is a genotyping error?
Like I showed you before with these calling algorithms, sometimes they get confused, particularly if you genotype your cases and your controls in different batches. So when you're doing this test, say that for some reason your controls' DNA is different, it's come from a different source, say buccal swabs, for whatever reason. When you do this test, instead of getting these nicely separated clusters, they're actually more of a smear together. So just from that kind of an error, you could generate a difference between them where really there is no difference. That's a technological error. I'm not explaining it well, and I'm sorry, I'm caffeine-deprived. But you could also get other errors. It is possible or conceivable, and maybe, Larry, you'd want to comment on this as well, that there might be other genes or other variants in the region that would interfere with this, that might be related to cases and not controls or something. What other kinds of genotyping errors can give you spurious associations? I think those are the main ones. I mean, some things don't behave well in the assay in cases and controls and get tossed out; they never get into the final data set. The other issue, the population stratification one that you might want to discuss, that's another place where you can produce false positive results, which Terry could explain. And population stratification is another really horrible name for differences in the ancestry between cases and controls. So say you selected all your cases, not that anyone would do this, but it does happen sometimes in not very good studies.
Say you selected all of your cases from people from Finland and all of your controls from people from Japan. They're going to have different allele frequencies just because of the population history, and so if there are differences in disease as well, you're going to start ascribing those allele differences to the disease, where they might not be related at all. This actually has been used as a way of finding genes that might be related to diseases, called admixture mapping, and it's a technique that's been used in the past; it's not used very often anymore. But that's another thing that could cause it. And if there are systematic differences between your cases and controls, old-time epidemiologists call those confounders; it's just confounding by genotype instead of by environmental exposure factors. Does that help? Could you go back to the Science embargoed slide and talk slowly about the probability issue again and how you got to where you did? I can go back; whether I can talk slowly is something I've never quite managed. But anyway, what they basically published in the paper, or in the table you may have there, is the number of cases, the number of controls, and the allele frequencies. So for this particular SNP, they gave the number of cases, what is it, 1507. The number of controls was 6728. And then they said that 0.453 of the cases had allele A at this point, and they gave the corresponding proportion of the controls with allele A. So basically what I did was just take those proportions and multiply them by those totals to figure out how many people were cases and had allele A, and how many were controls and had allele A. Subtracted that from this total number to get this number, and this from that total number to get that number, and then did the cross-tab.
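The arithmetic just described can be reproduced directly. Only the group sizes and the case proportion come from the talk; the control proportion below is a made-up placeholder, since the transcript does not preserve the actual value from the paper:

```python
# Rebuild the two-by-two table from group sizes and allele-A proportions.
n_cases, n_controls = 1507, 6728
p_cases = 0.453       # proportion of cases with allele A (from the talk)
p_controls = 0.400    # hypothetical placeholder for the control proportion

cases_a = round(n_cases * p_cases)
controls_a = round(n_controls * p_controls)
table = [
    [cases_a, n_cases - cases_a],           # cases: allele A / not A
    [controls_a, n_controls - controls_a],  # controls: allele A / not A
]

# Pearson chi-square on the 2x2 table -- the "cross-tab" step.
def chi_square(t):
    rows = [sum(r) for r in t]
    cols = [sum(c) for c in zip(*t)]
    total = sum(rows)
    return sum(
        (t[i][j] - rows[i] * cols[j] / total) ** 2
        / (rows[i] * cols[j] / total)
        for i in range(2) for j in range(2)
    )
```

With real published proportions, the chi-square statistic from this table is what gets converted into the p-value plotted on the slide.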
Sorry, it was allele A in the other example and I'm perseverating here, I don't know what I'm doing. Does that help? I think so. Okay, so it's just a matter of filling out your two-by-two table, actually. I think I'm still not understanding why it is that genome-wide association studies are, as I think you're saying, less likely to find spurious associations than the earlier single candidate gene approach. Could you take another stab at that? Please, please don't leave this room thinking that genome-wide association studies are less likely to find spurious associations; they are more likely to, because you're doing many more tests. The reason now that we're less likely, hopefully, to find spurious associations is that we recognize that so many of them are possibly spurious, so we do replications of them, and we require them, sometimes in the same study, even the same population, but more likely in a different population or a completely different study. And so you may see, in one of the Science papers, there were several different groups around Canada and Texas and different places. Some of them used different phenotypes, which is a little bit risky. Some used MI, myocardial infarction, and some used coronary calcification. And if the gene you're looking at is related to both MI and coronary calcification, then you're golden, and it actually gives you more reassurance that you're finding something that's likely to be important to the story. The reason why candidate gene studies were particularly likely to give false positives is because there were so few true positives, right? So if you're assuming that the genes that are on your short list must contain some that are actually right, and you keep trying over and over again, sooner or later, by chance, you're going to get one that looks encouraging. There's a natural pressure, of course, to publish something that looks positive. You say, well, let's put it out there and see if anybody can validate it.
In most of those instances, the validation didn't happen, so you ended up with a paper reporting the finding and then a paper refuting the finding, and we filled up the literature. One of the things that's different here is that with a genome-wide association study, for most of the diseases we go after, if you have a sufficient number of samples, there are going to be true positives. And as long as you're rigorous about your statistics and making sure your cases and controls are well matched, you're going to have real results to be able to write about. And so when you see those publications coming out, it's likely, if they did everything right, that the top tier of what they have found will be validated. And I think that's certainly been true in the last month or so with these particular studies. So again, if you're trying to do a study where you know you're doomed to failure, but there's still a pressure to publish, there's going to be stuff coming out. If you have a study where you're almost certainly going to find something, then if you do it right, you'll publish something that's right. In these replication studies, very often it's not the most strongly associated SNP in the first study that actually survives through replication. There are those who argue that the SNPs that are really, really extreme are the ones you should be nervous about, while others say, oh no, I like the ones that are really, really strong. Regardless, when you do the replication study, very often not all of them replicate. So you see a large number of those tested in the second sample, and then a smaller number that are associated in both studies, and you replicate those in a third and maybe even a fourth and fifth sample before you say, well, you know, this looks pretty good, I think I'll buy it. And then still, you put it out in the literature and, you know, 20 groups try to replicate it, and hopefully most of them do.