What we thought we'd do this morning is get a little bit more into some of the details of genetic association studies. I think we forgot to mention yesterday that we did want to get some course evaluations from you, so we'll be passing those out at the break. If you would fill them out at the end, obviously, that would be very helpful; just leave them at the front there.

So, okay. The things that I was going to cover this hour were discrete and quantitative traits, measures of association, false positives, quality control, which is a big issue, then initial looks at the data with what are called Q-Q plots, odds ratios, which you are all quite familiar with, but how they're calculated in genetic studies, and then a little bit on transmission and interactions.

And because we're being quantitative: "Yes, I know that, Sidney, everybody knows that. But look: four wrongs squared, minus two wrongs to the fourth power, divided by this formula, do make a right." So Gary was also a very quantitative person.

Quantitative genetics is actually something that we don't tend to learn in high school and college. We tend to focus more on the qualitative components, but quantitative genetics is concerned with inheritance of differences between individuals that are of degree rather than of kind. I like the way they put this, so I quoted Falconer and Mackay. Their textbook of quantitative genetics is very readable and really the classic in the field, so if you're interested in reading more, I would highly suggest it. So what are the differences?
They're continuous gradations, obviously, rather than sharply demarcated types. The effects of the genes generally are small, in order to give you a smooth distribution as opposed to, say, multiple means, whereas in qualitative traits the effects are large. And usually there are many genes in quantitative traits, whereas there tend to be single genes inherited in Mendelian ratios for discrete traits. I put a question mark there because that was kind of the central dogma until a few years ago, when it became very clear that there were important complex traits that were inherited in families in very predictable ways that didn't seem to follow Mendelian ratios.

Models for single-gene traits: say you have a big A allele that gives you a pink flower and a little a allele that gives you a white flower, and you have three different genotype groups. If big A is dominant, anywhere that you have a big A you're going to have a pink flower, as you see here. If A is recessive, you need to have two copies of big A in order to have a pink flower. And if A is codominant (there are a variety of terms for this, but codominant is one that you tend to see fairly often), with two copies you'd get a really pink flower, with one copy you'd get a sort of pinkish-whitish flower, and with no copies you'd have your white flower.

For quantitative traits, say that your big A allele gives you x units increase in height and your little a allele gives you x units decrease in height. If these are of equal frequency, your population mean would then be some centered value, zero, and minus x and plus x would be the extreme values for your little a and big A alleles. So if big A is completely dominant, both your big A homozygote and the heterozygote would be at this end of the spectrum. If A is only partially dominant, you wouldn't be right down here at the middle. But you
would be a little bit closer over to the big A homozygote. If A is not dominant, or is codominant, you'd be sort of smack in the middle. And you can even have, although it's rare, A being overdominant in the heterozygote. This may be due to some kinds of interactions between the two homozygotes or some other thing. Again, this doesn't happen very often, but it is wise to be aware that it could happen.

The quantitative traits that to date have had published genome-wide studies: actually, yesterday there was another one on protein levels, but for the time being these are the ones that we can list. The Framingham study, as I mentioned, had these 18 groups of traits, many of which were quantitative, and I never quite know how to count them. Is it 16 or is it 34? But at least it's around 20-ish.

We went through an example yesterday of how one looks at associations to get allelic odds ratios. So this, again, is for a discrete trait, myocardial infarction: 55% of your cases have the C allele and 47% of the controls have the C allele. This gives you a chi-square of 55 and a p-value of 10 to the minus 13th. You can calculate what's called an allelic odds ratio, which is the odds of having disease if you carry one or two copies of the allele, so it doesn't matter how many you have, just the presence of the allele, and here that would be 1.4.

You can also calculate these by genotype group; again, we did this yesterday. So the CC homozygote is 31% of the cases and 23% of the controls, while the GG homozygote is 20% of the cases and 28% of the controls, and again a strong chi-square, with two degrees of freedom now, and a p-value of 10 to the minus 14th. Here you can calculate the heterozygote odds ratio specifically, which sort of lets these float free. It might be that the homozygote doesn't have all that much increased odds over the heterozygote. It'd be a little hard to explain
biologically unless you truly have a dominant trait, which we tend to be thinking now may not be very common at all, at least not in complex diseases. Calculating a heterozygote odds ratio between these two groups gives 1.5, and a homozygote odds ratio between this group and that group would be 1.9.

This is the way that these data are displayed, and again, we talked about this yesterday, with the log of the p-value here. This is a nice example because they really found just one very strong association, on chromosome 9 at 9p21, and this association has been found in many other studies as well. And I mentioned that there are other ways of looking at these in addition.

Here's another one, for a continuous trait now, serum uric acid levels. What this group did was to regress inverse-normalized levels (the reason they were inverse normalized was just to transform them into a normal distribution) against the number of alleles. This was an additive model, so if you have one allele you get part of the effect, and with two alleles you get twice the effect, and they're sort of forced into that kind of a distribution. One can also use covariates, and you can insert any covariates that you want in here. So it's nice to see that, while with the qualitative traits the analyses tend to be relatively simple, just chi-squares essentially, or even Fisher's exact test, with the quantitative traits people are getting a little bit more sophisticated in terms of linear regression and adjusting for covariates.

This one is another example in the uric acid literature, showing here, in two different cohorts, the mean uric acid levels in the three genotype groups, AA, AG, and GG, and showing the additive effect for a single allele: having one G allele drops your genotype mean, as you can see here in both of these cohorts.

The association methods for quantitative traits that tend to be used are, as I mentioned, linear
regression, sometimes of multivariable-adjusted residuals, just to take out the noise effects of covariates. One can use linear regression of log-transformed or centralized BMI; this was done in the Frayling paper. Variance components was very, very popular at one point in linkage analyses, and it sort of carried over into these studies as well. And you can do a z-score analysis; Sanna did it with quantile-normalized height there. As I said, there are a variety of ways of doing this.

Ways of dealing with multiple testing, and I think we talked about that a little bit yesterday: the family-wise error rate is the general term that's given to corrections such as either Bonferroni, the alpha prime, or the Šidák, which is a way of sort of correcting for the number of tests, adjusting your type I error across the universe of possible type I errors. We mentioned also the false discovery rate, the proportion of significant associations that are actually false positives, and the false positive report probability of Wacholder et al., which sounds very much like the false discovery rate but is a little bit different. There's also Bayes factor analysis. Bayesian analyses are sort of creeping into analysis of genome-wide studies, particularly with UK influence, so the Wellcome Trust did part of this in one of their papers. It is a challenge because you have to basically identify a reasonable alternative model that you're comparing your data to. So not a lot of people use it, but you may see more and more of it.

So, moving on to quality control. As these folks are doing... hold your horses, everyone. Let's let it run for a minute and see if it gets any colder.
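To make the Bonferroni and Šidák corrections mentioned above a bit more concrete, here is a minimal Python sketch. It assumes a nominal alpha of 0.05 and 500,000 independent tests (roughly the size of the arrays discussed in this talk); the function names are illustrative and do not come from any particular package.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

def sidak_threshold(alpha, n_tests):
    """Per-test threshold under the Sidak correction, 1 - (1 - alpha)**(1/n).

    Written with log1p/expm1 so that it stays accurate when n_tests is large.
    """
    return -math.expm1(math.log1p(-alpha) / n_tests)

def family_wise_error(per_test_alpha, n_tests):
    """Chance of at least one false positive across n independent null tests."""
    return -math.expm1(n_tests * math.log1p(-per_test_alpha))

# Assumed values for illustration only: alpha = 0.05, 500,000 SNPs tested.
n_snps = 500_000
print(bonferroni_threshold(0.05, n_snps))
print(sidak_threshold(0.05, n_snps))
```

With these assumed inputs the Bonferroni threshold comes out at about 10 to the minus seventh, and the Šidák threshold is very slightly more lenient. Both treat the tests as independent, which SNPs in linkage disequilibrium are not, so this should be read as a conservative sketch.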
So, unfortunately, with genotyping we tend to sort of take it at face value, and there's this sort of in silico truth, that you just assume that it must be right if it comes out of the machine. There are actually a number of things that one needs to do to be sure that the data that you get are not artifactual. These include being sure that the samples that you have are the right samples and that they're high-quality samples.

We talked yesterday about how microsatellites, variable number of tandem repeats, are used in forensic genotyping to identify individuals, and there are automated kits that are very easy to use and relatively inexpensive. Identifiler is one of them; I'm sorry, I don't know the company name or their location. But some of the genotyping labs will run these SNPs, I mean these markers. They're highly polymorphic, there are, I don't know, 13 to 20 of them, and the labs use that essentially as a barcode on a sample, so that whenever you go to test that sample again, you run this barcode on it and make sure that you have the right sample. It's a better label than anything you could stick on the outside of the tube.

Blind duplicates, of course, make sense. What tends to be done, rather than blind duplicates from within the study, is duplicates of known specimens like the CEPH samples or any other standard samples from the HapMap.

Gender checks: it's surprising how many women end up in the prostate cancer samples and how many men end up in the female breast cancer samples, et cetera. This is a little bit more of a scary one because, while you have a very good genotype measure, generally for sex,
you don't have the best recording of that information, and it's sometimes incomplete or incorrect, surprisingly enough. So those corrections can be made, but you wonder what other kinds of corrections needed to be made that can't be picked up genotypically.

Cryptic relatedness, or what we sort of refer to as outbreaks of twinning: when there are duplicate samples in there that you didn't realize, suddenly you have people who have exactly the same markers. They probably are not twins, but those are also things to be looked for, and we'll talk a little bit about that. Degradation of the samples, or fragmentation, is something that the laboratory should be testing for. Then there are call rates greater than 80 to 90 percent, and we'll talk about those, and heterozygosity, and plate or batch calling effects. So all of those are things that one looks at in the samples.

Once you've done the samples, you also want to look at the SNPs within samples. So again, you would look at duplicate concordance rates for a single SNP, because some SNPs are hard to type, just like some samples are hard to type. Mendelian errors: if you have family data, you can then see if there are errors in transmission. So you might have siblings where there are more than four alleles, which you shouldn't have, or certainly a parent and offspring where the offspring has things that the parents couldn't possibly have transmitted to them. Usually only one of these will be accepted for a given SNP, and sometimes not even that many, but usually not more than that.

Hardy-Weinberg errors: we talked a little bit yesterday about the expected proportions of the common allele homozygote, the heterozygote, and the variant allele homozygote, and those should be in binomial proportions if those conditions that I mentioned apply, so things like random mating, which actually doesn't apply, no in- and out-migration, no selection, and so on. But in general most SNPs tend to be in Hardy-Weinberg
equilibrium, and usually the threshold for throwing out a SNP on violations of Hardy-Weinberg is fairly high. So, you know, 10 to the minus seventh; you have to have a pretty strong difference in distributions in order to question that SNP.

Heterozygosity, and we'll talk a little bit about this: based on population genetics, there should be a proportion of people who are heterozygous at a given SNP. It's usually in the 30% range. It seems to fall there, a little bit higher for populations of recent African ancestry, a little bit lower for populations that are younger, as it were, or inbred populations, where you wouldn't expect to have quite so many alleles floating in the population.

Call rates for a given SNP: there are some SNPs where you're just not sure, from the intensity data that are generated, what to call one of them, and so you leave it out. I'll show you some examples of that. You'd really like those call rates to be greater than 98%, and that tends to be the standard these days. It had been lower in the past, and a lot of those that were lower actually turned out to be artifact, so we've sort of learned a lesson there.

Then there is minor allele frequency, which is the frequency in a population of the uncommon allele; the concern is generally alleles of less than about 1% frequency.
There are some real difficulties in trying to genotype those now. This is going to be a challenge when we get to sequencing, for rarer and rarer SNPs that may be in only 0.5 percent or 0.1 percent of the population. They may not be able to be typed on the existing platforms, because those are not terribly sensitive to rare alleles. And then perhaps the most important thing is validating the most critical results on an independent genotyping platform. So you take the results from your Affymetrix array or whatever, and then type them on something that's specific to just a couple of SNPs.

I mentioned Hardy-Weinberg yesterday, and here it is again: the ideal conditions for it, and these are the proportions that you would expect to see.

These are some data from the Genetic Association Information Network, testing the two platforms that we used on the HapMap samples. The HapMap samples are available from the Coriell repository; there's a small fee, I believe, for requesting them and sending them out. But they've been genotyped by, you know, everybody in the world, so the data are very stable. What we show here is the Perlegen platform of 481,000 SNPs and the Affymetrix platform of 439,000 that were used; this is the 5.0 platform that was used in the GAIN study. These estimates of coverage are the number of SNPs in the genome estimated to be covered with an r-squared of 0.8 or greater. This was really quite good for the European population, and a little bit lower, as expected, for the Yoruba population. One can use just a single marker, estimating whether one marker is in LD with another marker not on the platform, or you can use information from a couple of markers nearby and use multi-marker imputation, which gives you much more information using the LD that's surrounding that SNP, and there you can increase your coverage slightly. And the
Affymetrix platform had sort of similar numbers, a little bit lower. The average call rate in both these platforms was over 98 percent, which is great. Looking at homozygous genotypes, the concordance with the HapMap was 99.8% on the Perlegen for both, and a little bit higher for the homozygotes; it tends to be a little tougher to type the heterozygotes, and I'll show you why in a sec.

And then, some of the sample QC metrics for the Affymetrix 5.0, which had about 500,000 markers, and the 6.0, which has about a million markers, are shown here. These are our data again from GAIN. This was actually after we started doing the study; the previous ones were sort of gearing up for it in looking at the HapMap samples. What you see here is that we threw out about 0.4 percent of total samples because they didn't pass a variety of QC metrics, and 0.55 percent didn't have a greater than 98 percent call rate. These numbers are obviously not exclusive, and we need to have sort of a flowchart of things as they drop down through these metrics. With the 6.0 the percent failing was higher, but there are many more SNPs and many of them are much, much tougher to type: so about four percent of the samples failing, and about 1.4 percent without a 98 percent call rate. And that's 98 percent for that sample across all of the SNPs that are typed.

Then, looking at the SNP-by-SNP data quality information, again about 500,000 and about a million shown here: we dropped out about six percent of SNPs that did not pass QC. Only 0.04 percent of the 5.0 SNPs had a minor allele frequency of less than 1%, while in the much more dense platform, as you would expect, a larger proportion of them had less than one percent minor allele frequency. For the greater than 98 percent call rate, about 8% of the SNPs fell out on this metric, and about 9% on the 6.0 platform. If you
drop that level to 95 percent (people tend to want to do this because they don't want to lose data, and there are many various combinations of babies and bathwaters talked about in discussing this), at any rate you'd lose about 4% in the 5.0 platform and about 4% here. The Hardy-Weinberg here is now set at 10 to the minus 6, so really very stringent, and less than a percent dropped out for that. Mendel errors: in this particular study with the 5.0 platform we dropped out a lot, because we had family samples. In most studies you don't have family samples, so you really can't tell; about the only way you can tell is if you have more than two sibs and you can see that they have more than four alleles. So here we lost a lot. And for less than or equal to one duplicate error, again, relatively small. So you can see here that it's the call rate that really tends to drop out your SNPs, and also the Mendelian error in that one particular study.

This is a plot of heterozygosity.
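As a rough illustration of the SNP-level metrics described above (call rate, minor allele frequency, heterozygosity, and the Hardy-Weinberg test), here is a minimal Python sketch that works from genotype counts for a single SNP. The function names and the dictionary layout are hypothetical; the default thresholds are the ones quoted in the talk (98% call rate, 1% minor allele frequency, Hardy-Weinberg p-value of 10 to the minus 6), and the Hardy-Weinberg test is assumed here to be a one-degree-of-freedom chi-square, whereas real pipelines often use an exact test instead.

```python
import math

def snp_qc_metrics(n_aa, n_ab, n_bb, n_missing):
    """Summarise one SNP from counts of AA, AB, BB genotypes and no-calls.

    Assumes at least one genotype was called.
    """
    called = n_aa + n_ab + n_bb
    call_rate = called / (called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1 - freq_a)
    heterozygosity = n_ab / called

    # Expected genotype counts under Hardy-Weinberg proportions.
    expected = (
        called * freq_a ** 2,
        2 * called * freq_a * (1 - freq_a),
        called * (1 - freq_a) ** 2,
    )
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))

    return {
        "call_rate": call_rate,
        "maf": maf,
        "heterozygosity": heterozygosity,
        "hwe_p": hwe_p,
    }

def passes_qc(metrics, min_call_rate=0.98, min_maf=0.01, min_hwe_p=1e-6):
    """Apply the thresholds quoted in the talk; all three are adjustable."""
    return (
        metrics["call_rate"] >= min_call_rate
        and metrics["maf"] >= min_maf
        and metrics["hwe_p"] >= min_hwe_p
    )

# Made-up counts: 1,000 called genotypes and 10 no-calls.
print(passes_qc(snp_qc_metrics(250, 500, 250, 10)))
```

Relaxing `min_call_rate` to 0.95 corresponds to the "babies and bathwater" trade-off mentioned above: fewer SNPs are lost, at the cost of keeping some that may be artifacts.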
What one does is just basically calculate the proportion of heterozygotes in the population for each SNP, and as you can see it really clusters around this 0.27 to 0.30 range. You can't see it very well, but there actually are SNPs shown along here that kind of fall out. What I've done here is just to blow up this area, in the frequency of a hundred, and dropped out those, so that now you can see there are some samples that are coming out here and some out on the side as well. So we did have some outliers. Right now this tends to be done basically by eye: you kind of look at it and say, hmm, those look a little bit odd, particularly way out here.

Reasons for loss of heterozygosity: well, cancers can do it, but when you're looking at germline cells, decreased heterozygosity usually is due to genotyping error. Increased heterozygosity hadn't been worried about terribly in the past, and we learned a harsh lesson from GAIN that probably what this is telling us is sample mix-up or contamination. It turned out that there were about eight samples that actually had been contaminated with other samples in this particular GAIN study. We didn't pick this up initially, and then realized, hey, there's a real problem here, went back and looked at these, and you can see that these are very small numbers of samples. In fact, there probably are about eight of them here, and that was what the problem was. So all of these metrics, again, are evolving as experience accrues, and one of the things that we really want to do in GAIN is write up this experience and make it available to the scientific community.

These are intensity plots for genotyping data. I don't understand the chemistry, and I don't pretend to be able to explain it, but I know that what you basically produce for each one of the five hundred thousand or
million SNPs that is typed is a cluster plot like this, where, if there are two alleles of the SNP, you have the allele A frequency plotted down here and the allele B frequency plotted here. As you might imagine, this would be the AA homozygote, because there's no B frequency to spread this out. This would be the BB homozygote, because there's no A allele there. And then this guy is the heterozygote, and you can understand why it might be difficult to type the heterozygote, because you do have some bleeding over into this group. These X's are things that are not typed. Basically, this is all automated; you can't sit and look at five hundred thousand of these. There are a variety of calling algorithms, probably about eight or ten of them out there, but the companies sort of settle on one that they use, and here the calling algorithm decided, I'm not going to call these, they're just too close to the border. These ellipses are the boundaries of what the algorithm will call, but as you can see, sometimes it doesn't call even something that's inside of an ellipse. So that's what these look like.

This is a very nice-looking one, although there aren't that many AA homozygotes, since allele A is actually the T variant, and here, lots of homozygotes, nicely separated from the heterozygotes. Here is a little bit tougher one. There were no AA homozygotes, so this is a rare allele, but there is a heterozygote cluster and there's a homozygote cluster for the common allele, and again, the no-calls. But here you've got much more of a mess. So this is a badly clustering SNP, badly performing, and probably most labs would repeat this one and try to get the intensity to be much closer along the B line, without anything in the A-line intensity. And here
you've got maybe these are the homozygotes and maybe these are the heterozygotes, but it's really very difficult to tell. What's recommended now, having gone through many of these studies, is that whenever you see an associated SNP, something that you want to call at 10 to the minus seventh or whatever level it might be, you look at these plots and make absolutely sure that they look good, that you have three distinct clusters. That means you don't have to look at all 500,000 or a million, but you do have to look at maybe 25 or 50. If you're carrying through 25,000 of them, that makes it a little bit tougher, and most people wait until they've gone through multiple stages before they start looking at cluster plots.

And just one kicker in this process. This is a paper published by Clayton et al. in 2005 that actually colored their SNP intensities by whether a sample came from a case or a control. So the red are cases and the blue are controls. You can see the reds kind of clustering down here, and it's even difficult to see them there. And then this is just a different way of looking at it, a normalization of the intensity data, and you can see that there is a systematic difference between cases and controls. This could arise by different treatment of the case and control DNA samples.
They tried very hard to control for that and make sure that the protocols were exactly the same. They were done in the same laboratories at the same time on the same plates and all that, and they still ran into this, and basically concluded that it probably had something to do with the collection at the site. Happily, this is something that can be adjusted for, so you can calculate vectors associated with caseness and controlness, at least for a given SNP, and correct for it. But it is a cumbersome problem and one that needs to be looked for.

Another thing that can introduce systematic error is if you've amplified the samples in one group and not in the other, or amplified some of the samples and not others. Amplification is when you have a very, very small amount of DNA and you use polymerase chain reaction, PCR, to make as much of it as you like. Now, you don't always get good copies of that, and it often gives you artifacts, and sometimes they look like this. So the laboratory that does this testing should be aware of these problems and should discuss them with you, but just so that you're aware of them.

I wanted to get a little bit into the problem of unknown relateds, or cryptic relatedness. Tom mentioned yesterday population stratification, and looking at a population that's made up of a mixture of different groups. What's shown here is from the CGEMS study, actually a little experiment that they did where they mixed together a whole bunch of groups of different geographic ancestry, and they started with the HapMap samples. A lot of times when people do these kinds of analyses they start with three population groups, samples that are known to cluster tightly and to distribute out differently from each other. So this is the Yoruba from the HapMap, the CEPH population, and then the Asian populations. They also typed an African-American group, and as one might expect they sort
of fell in the mid-range, in this principal component analysis, between the Yoruba and the CEPH population. They also typed a Native American and a Latino group, who tend to have more Asian-ancestry alleles, or at least to cluster more with Asian-ancestry groups. You can see where the Native Americans actually clustered quite tightly here, and the Latinos spread out a little bit more.

And then they did a third principal component. It's a shame that Kang isn't here, because he could explain this much better than I can, but anyway, what one does is to look for vectors that are basically separating the samples in the most effective ways, essentially. You can do this up to, you know, hundreds of principal components, but after the first several you tend to run out of steam. So, it sounds like we have some competition here. I hope you can all still hear me.

Okay, so in doing this, you notice that the CEPH population for the third principal component actually moved in quite a bit, so they got kind of pushed in. The Asians and the Yoruba stayed pretty much the same; the Yoruba shifted over a little bit. And here are your African Americans, where you'd expect them, sort of between these two groups. Now the Latinos tend to take off in this direction, between the Asians... and when you go on to the fourth and fifth principal components, you have this tremendous clustering of samples that you really can't separate out, and then you have a few down here that seem to be driving this. But it's not clear, and they tried going to a fifth principal component and it still didn't do much, and they were kind of wondering, you know, what the heck is going on. So in looking at this group, they looked very carefully, particularly at these samples that seem to be outliers and that would be driving the vector calculation, and what they found actually was that many of these people were related.
So these were two parent-child groups, and then there were half sibs here, and these were half sibs all along through here, who were all basically unsuspected in this sample. When they then randomly chose one person from each of the related groups, they were able to separate out their clusters very nicely. This was kind of a surprise to many of us. It was presented at one of the last GAIN meetings by Gilles Thomas, and it's something that you need to be aware of: related people do end up in your studies, even if you don't expect them to, and you may not always pick them up.

What they pointed out was that in the studies that they were comparing, this again from CGEMS, they had multiple studies around the world: the ACS, the PLCO, then this is a European study, and another Northern European, and another European study. You wouldn't think that there would be people in common across those studies, but as it turned out, certainly in the US studies, there was a sib pair here that was participating in both the Health Professionals study and the ACS. There were seven people, we call them socially conscious people, who were participating in both studies. There were sib pairs in the PLCO, and three people in both the PLCO and the Health Professionals. So social consciousness is a risk factor for unexpected twinning in your data. And then sib pairs here and a father-son pair here.

So this does happen. It's probably in your studies, if you're not aware of it. You will find it with genotyping data, and if you go looking for a structured population when you have two people that are that closely related, it really points the vector at them. It can also do other things in terms of clustering, when you're trying to cluster your SNPs and your data, that really can mess up your analysis.

So, summary points for genotyping quality control: sample checks are done for identity, for
gender error, and for cryptic relatedness. Sample handling differences can usually be adjusted for, but you have to be aware of them. And doing an association analysis is often the quickest way to find your genotyping errors. You'd like to think that that's how you find your real signals, but it's not.

One of the things that we tried to do in GAIN was actually to make all of the data, the association results, available to everybody at the same time. We wanted GAIN to be as forthright and as public a community resource as possible. In doing that, we basically said to the principal investigators and the genotyping labs, we're not going to tell you who's a case and who's a control or what SNPs are associated with what; we'll just show you the data blinded, as it were, or masked as to SNP identity. It made it very, very difficult for us to be able to separate out these things, and certainly to do quality control and figure out who was related and who wasn't. We decided that blinding was really not the best way to go about this, so we now have a sort of back and forth with the investigators who provided the samples, and we allow a very short time for that, because obviously quality control can go on for months and months. But it is a challenge.

Rare SNPs are the most difficult to call with these kinds of platforms, so just be aware of that, particularly as we get into rarer and rarer SNPs as we do more and more sequencing. And inspection of genotyping cluster plots really is crucial.

Okay, once you have your data cleaned up, what tends to be done is to look at the association statistics. Taking an easy example, this again from the Easton study, you basically take your chi-square statistics and you plot them along, and there's an expected distribution.
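For readers who want to see how the expected axis of such a Q-Q plot can be built, here is a minimal Python sketch. It assumes one-degree-of-freedom chi-square statistics and the common (i - 0.5)/n plotting position; both are assumptions made for illustration, the function names are hypothetical, and the observed values at the bottom are made up.

```python
from statistics import NormalDist

def expected_chi2_quantiles(n):
    """Approximate expected values of n sorted chi-square (1 df) statistics
    when there is no association at all.

    A chi-square with 1 df is a squared standard normal, so its quantile at
    probability p is the square of the normal quantile at (1 + p) / 2.
    """
    normal = NormalDist()
    quantiles = []
    for i in range(1, n + 1):
        p = (i - 0.5) / n  # plotting position (an assumption, see lead-in)
        z = normal.inv_cdf((1 + p) / 2)
        quantiles.append(z * z)
    return quantiles

def qq_points(observed_chi2):
    """Pair each sorted observed statistic with its expected value."""
    observed = sorted(observed_chi2)
    expected = expected_chi2_quantiles(len(observed))
    return list(zip(expected, observed))

# Made-up statistics, for illustration only.
for expected, observed in qq_points([0.2, 3.1, 0.9, 1.4, 12.5]):
    print(round(expected, 3), observed)
```

Points that sit on the line where expected equals observed are behaving like pure chance; points that peel away upward at the high end are the ones worth a closer look, whether they turn out to be artifact or real signal.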
I didn't label this, sorry. So this is expected chi-square, and remember, if you have no association whatsoever and you just keep choosing samples randomly, you'll end up with the chi-square distribution, the normal distribution squared. And this is the observed chi-square. So if your observed population basically follows the chi-square distribution, that is, there's no association, you would expect all of your points, because you're just lining up your chi-squares, all of them to follow along this little gray line here, which is the identity line. If they don't fall on that line, it means that they're not following that distribution for some reason, which could be genotyping artifact, or it could be that there's a real association there that's sort of pulling them upward. And what you see in this black line here is that indeed there are some departures from the chi-square distribution, and some of them, like this one here, are relatively dramatic from what would be expected in the chi-square distribution. These red dots are corrections for the population structure in this population. So what tends to happen is, if you have differences in ancestry between cases and controls, they will inflate these statistics. You can correct for that. One way of correcting for it is what's called a genomic control, or lambda, value. We won't go into this in detail, but you just basically divide all of your statistics by whatever the lambda value is, and that was done here, and that's what you end up seeing. So this one, actually, for Easton didn't look all that dramatic, but then when they started doing the replications of their top SNPs, it did sort of pop out some results. You'll also see people publish tables like this, where they just basically lay out a table of the observed, the adjusted for that genomic control, and the expected values by the significance level. So at 0.01 to 0.05, you'd expect to see 934 based on the chi-square
distribution, and they observed 1,239, corrected down to this level. So, you know, a modest excess of modestly significant, or really not very significant at all, SNPs, and then going down through the increasing levels of significance until they got to 10 to the minus fifth, and they had 13 times as many. They would have expected to see one and they saw 13, which suggested that they actually did have some important associations outside of chance. You tend to see these tables much less commonly; people like to look at pictures, and so you tend to see QQ plots much more commonly. This is a very nice one from Hafler et al. looking at multiple sclerosis, and you can see the identity line here is this red line. This gray line is their initial associations, so their expected distributions and then what they observed, and you notice this dramatic takeoff. Multiple sclerosis has long been known to be associated with the MHC, the multiple, sorry, major histocompatibility complex alleles, which are very, very diverse and are very heavily genotyped on these platforms. So you have lots and lots of SNPs, it's actually not all that many but it looks like a lot, that are sort of taking off for very high chi-square values. If you then take out the major histocompatibility complex SNPs and just plot what's left, you end up with this here, and you're still having some strong departures from the expected distribution. So you need to be aware that sometimes people will take out a particular locus, a locus that's already known to be associated, and kind of see what's left. This was done for prostate cancer in the Icelandic study again. Here they show them on a different scale.
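
The QQ plot and the genomic control correction described above can be sketched in a few lines. This is a minimal illustration, not the method used in any of the studies mentioned: it assumes 1-degree-of-freedom chi-square statistics, and it assumes lambda is estimated as the median observed statistic divided by the median of the null distribution, which is one common estimator but is not specified in the talk. All function names are made up for the example, and the data are synthetic.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

def chi2_1df_quantile(p):
    """Quantile of the 1-df chi-square, i.e. a squared standard normal."""
    return _STD_NORMAL.inv_cdf((1 + p) / 2) ** 2

def expected_chi2(n):
    """Expected sorted chi-square values for n null tests (x-axis of a QQ plot)."""
    return [chi2_1df_quantile((i - 0.5) / n) for i in range(1, n + 1)]

def genomic_control_lambda(observed):
    """Assumed estimator: median observed statistic over the null median (about 0.455)."""
    return median(observed) / chi2_1df_quantile(0.5)

def gc_correct(observed):
    """Divide every statistic by lambda; a lambda below 1 is left alone (an assumption)."""
    lam = max(genomic_control_lambda(observed), 1.0)
    return [x / lam for x in observed]

# Synthetic demonstration: null statistics uniformly inflated by 20%,
# standing in for inflation caused by ancestry differences.
null_stats = expected_chi2(1001)
inflated = [1.2 * x for x in null_stats]
lam = genomic_control_lambda(inflated)
corrected = gc_correct(inflated)
print(round(lam, 3))  # 1.2
```

Plotting sorted observed statistics against `expected_chi2(n)` gives the QQ plot; points on the identity line follow the null distribution, and after dividing by lambda the synthetic inflated statistics fall back onto it.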
So here's your identity level; the blue is corrected, the red is uncorrected. You have, you know, very strong associations shown here. But knowing that chromosome 8, that 8q24 region we mentioned yesterday, is known to be strongly associated with prostate cancer, they basically took out all of chromosome 8 and then looked again, and still you have a departure from the expected level even after you correct. Most of these kind of fall out once you correct, but you still have a few SNPs that are outside the expected distribution. And this is the QQ plot for myocardial infarction that I showed you yesterday. I think I showed it to you yesterday, but it's from the Samani study in the New England Journal that was part of the Wellcome Trust Case Control Consortium. And here you can see there's quite a strong departure, a number of them that are departing from the expected distribution, and these actually are all of the SNPs that had a p-value of less than 10 to the minus seventh, which they sort of declared as their initial threshold. And when one looks at the actual plot, now arrayed along the chromosome, those 30-some SNPs are these here. So this is a nice plot because they're all sort of, you know, hanging right on top of a particular chromosome. And one can then look at not only this level of significance; this group is also departing from what you would expect in the expected distribution, and that is basically corresponding to this level of association.
So these were SNPs that the Wellcome Trust group said, you know, they're not decisive in our study, but they are sort of interesting. And as we explained yesterday with SNP number 24,000 in the initial association ending up being one of the top SNPs in a replication study, they said, you know, we certainly would want to look at these as SNPs of interest, as it were, in a subsequent study. And then there's this area here that has a modest departure from the expected distribution, and that is probably all of this here, and what those most likely are is SNPs related to structure of that population. So, differences in ancestry that probably are just noise, but one needs to be aware of them and adjust for them. And then this is the plot from that study. These are the 39 SNPs that they found to be strongly associated on chromosome 9, and here's the one that I mentioned yesterday, 3049, known by its nickname. So this is the most strongly associated SNP. And this is just another study that they did, a replication study, and here again is what I showed you yesterday, the Boston to Providence sort of map distances, and you can see that here you've got one linkage block. So maybe these SNPs are independent of the associations with these, because you do have a recombination hot spot here. And these are all sort of clues that you can get as to where the kind of operative or important SNP might be. Something that I think Tom mentioned at one point yesterday was the winner's curse, which is the tendency for initial studies to overestimate the effect, the magnitude of an association. And this happens in all kinds of studies. It was, you know, described most colorfully, I think, in genetic studies. And it's shown here for what was probably the very, very first genome-wide association study. It was done with only 20,000 SNPs, so we tend not to count it in our lists of studies, but it was actually Ozaki et al., in a Japanese study of myocardial
infarction, looking at lymphotoxin alpha, and you'll see in their study they found a stronger odds ratio than any of the others found, 1.71, quite significant, and then in subsequent replication studies these all sort of fell back toward essentially no association. When a meta-analysis was done of all these studies, it was significant, sorry, it was not significant, 0.98. And probably this was a spurious association that looked good because, you know, when one does lots of these you end up picking the extremes, and those are the things that get published. So again, a little publication bias, a little winner's curse. Getting very briefly into gene-gene interaction kinds of studies: this is a genome-wide scan looking at Alzheimer's disease, 861 cases, 600 controls. This is a little bit hard to see, but you probably can pick out this one red line going up here. This is on chromosome 19, and it's the APOE locus, and it's very, very strongly associated with Alzheimer's disease. This was known before the study was done. It had been identified in linkage studies, as we talked about yesterday. And as you can see, the p-value here is probably about 10 to the minus 40th. So a very, very strong association. But, you know, is there something else in here or isn't there?
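
Odds ratios like the ones quoted above come from a two-by-two table of carrier status against case status. As a minimal sketch, the function below computes the odds ratio with a log-scale (Woolf) confidence interval. The counts are hypothetical and are not taken from any study in the talk, and the function and argument names are made up for the example.

```python
from math import exp, log, sqrt

def odds_ratio_ci(case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed, z=1.96):
    """Odds ratio and approximate 95% confidence interval from a 2x2 table."""
    odds_ratio = (case_exposed * ctrl_unexposed) / (case_unexposed * ctrl_exposed)
    # Standard error of log(OR): square root of the summed reciprocal cell counts.
    se_log_or = sqrt(1 / case_exposed + 1 / case_unexposed
                     + 1 / ctrl_exposed + 1 / ctrl_unexposed)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts: the same odds ratio from a small and a tenfold larger study.
or_small, lo_small, hi_small = odds_ratio_ci(20, 80, 10, 90)
or_large, lo_large, hi_large = odds_ratio_ci(200, 800, 100, 900)
print(f"small study: OR {or_small:.2f} ({lo_small:.2f} to {hi_small:.2f})")
print(f"large study: OR {or_large:.2f} ({lo_large:.2f} to {hi_large:.2f})")
```

Both tables give the same point estimate, but the small study's interval spans 1 while the large study's does not, which is why small cell counts lead to wide intervals and why a single striking estimate needs replication.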
What was done by this group then was, yes, sorry, there it is, was to then stratify and look just at the people who were carrying APOE4, so sort of get rid of the impact of APOE4. They actually stratified into people with it and without, and now of course the y-axis is vastly expanded, but you can see that there are a couple that do sort of pop out as being associated with Alzheimer's disease, particularly here in the E4 carriers, suggesting that there might be some interaction. And these were not associated in the E4 non-carriers, and the sizes of the groups were, you know, reasonable enough that you would expect power was not an issue in the other group. So they then looked at the odds ratio of late-onset Alzheimer's disease associated with this particular SNP, 3115, in the GG homozygotes by their E4 status, and found that people who did not carry any of the E4 alleles had no association of Alzheimer's disease with the SNP, and those who were carriers of the E4 allele had almost a three-fold increased risk if they also were homozygotes. These are small numbers, and they don't give you the numbers, but you can see the confidence interval is relatively generous. And in looking at everyone together, there still was an association with this particular SNP. So all of that sort of coming together to suggest that there may be some interactions with this particular variant, which is in GAB2. I'm not familiar with it, and it's being looked into by the Reiman group in terms of what that interaction might be. This is a genome-wide scan for age-related macular degeneration. I think I showed this yesterday, as it was probably the first, in March 2005, you know, the earliest truly genome-wide scan, at a hundred thousand or so. They found actually two of their SNPs that went above their genome-wide cutoff level. It turned out that this one was genotyping error, and it went away when they looked more carefully at the genotype.
So, you know, one out of two ending up being genotyping error. One can calculate population attributable risks based on odds ratios and prevalences from these studies; that was done here. And you can see the association here. These are two SNPs that were sort of right next to each other. You couldn't tell that on the plot, but they calculated these for both of them. Here's the more strongly associated SNP. The frequency was quite high, and the odds ratio for a dominant model was also quite high. So population attributable risk, which is essentially a function of the prevalence and the odds ratio, was very high for the homozygote and even for the heterozygote. Sorry, for the heterozygote; the homozygote odds ratio was higher, but obviously the prevalence was lower, so the attributable risk was lower. So the feeling was that this may even be, you know, almost a single-gene disease. These odds ratios have come down somewhat in replication studies, so it now looks like it's about two or so, but it still accounts probably for a large amount of this disease. What was interesting about this variant was that it was then taken forward into the Nurses' Health Study, and various environmental factors that might be interacting with this very strong risk factor were examined. And these are data Schaumberg et al. published. These are data that are available in any cohort study and that could be looked at for any of these SNPs, you know, if they type the SNPs, and this is work that, you know, is begging to be done and really needs to be looked into. So what they did was just stratify based on obesity, less than 30 or greater than 30 BMI, and use the group that had the, actually I believe this is the variant allele that's protective, and the ancestral allele is the one that's the risk allele. But at any rate, this group being the comparison group, you can see that as you carry more copies of the H allele your risk goes up whether
you're lean or obese, but your risk goes up more if you're obese. And similarly, if you're a non-smoker you have increased risk; if you're a smoker you get considerably more increased risk, based on both smoking and on the carriage of this allele. So there's an interaction here. Neither of these was quite statistically significant; they didn't reach a 0.05 level, but they were certainly suggestive. And again, this is work that is not going to be done by geneticists. It is work that could be done by epidemiologists. And at times when we sort of raise the specter of, don't you think you should be looking for gene-environment interactions, very often the answer we get back is, yes, but, you know, there are multiple comparison problems. And then they think about it and say, wait a minute, you did, you know, a million SNPs; I'm only doing five environmental factors, or a thousand environmental factors. It's still not going to be nearly as bad. So that's work that you should just be aware needs to be done and could readily be done. One of the nicest examples that I've seen of gene-environment interaction has to do with some work that Jose Ordovas did on the hepatic lipase, or LIPC, genotype in relationship to HDL. And basically these are smoothed plots, as you might expect, and what he showed was a relationship between fat intake and HDL that varied by the genotype at this particular locus. So basically, in the TT genotype, the more fat you eat the lower your HDL, and in the CC genotype, the more fat you eat the higher your HDL. Now, these are ecologic data. They're not interventional data, so, you know, forgive me for implying causality, but that was what was inferred, essentially.
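
The pattern just described, HDL falling with fat intake in one genotype and rising in another, amounts to genotype-specific slopes. A toy sketch follows. The intercepts and slopes are invented purely so that the lines cross near 30% fat; they are not estimates from the LIPC study, and the flat CT line is a simplifying assumption.

```python
# Hypothetical (intercept, slope) pairs for HDL as a linear function of
# percent of calories from fat; values are illustrative only.
HDL_MODEL = {
    "TT": (59.0, -0.30),  # HDL falls as fat intake rises
    "CT": (50.0, 0.00),   # assumed flat for simplicity
    "CC": (41.0, 0.30),   # HDL rises as fat intake rises
}

def predicted_hdl(genotype, fat_pct):
    """Predicted HDL for a genotype at a given fat intake (toy linear model)."""
    intercept, slope = HDL_MODEL[genotype]
    return intercept + slope * fat_pct

def genotype_contrast(fat_pct):
    """CC minus TT predicted HDL at a given fat intake."""
    return predicted_hdl("CC", fat_pct) - predicted_hdl("TT", fat_pct)

for fat in (15, 30, 45):
    print(fat, round(genotype_contrast(fat), 1))
```

In this toy model the CC minus TT difference is negative at low fat intake, essentially zero near 30%, and positive at high intake, so the apparent genotype effect depends entirely on where along the exposure range a population sits.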
What's interesting about this is, say you looked at this middle band, at people in a population who were eating about 30% fat, which is about the average American diet. You would conclude that there was no association between LIPC genotype and HDL level, if that was the question that you were asking. Whereas if you looked at a population with very low fat intake, such as from developing nations, you would conclude that the TT genotype has high HDLs, the CC has low HDLs, and that this is co-dominant, with the CT being in the middle. If you were then to look at a population with a very high fat intake, such as Scandinavia, you'd conclude that no, it's not the TT that's associated with HDL, it's the CC, and it's actually a dominant effect, as you can see here in the heterozygote, and then the variants, I'm sorry, the TT has the lowest HDL levels. So this is how you can, you know, tend to miss these kinds of associations or get inconsistent relationships between genotypes and phenotypes, and we need to be aware that this happens all the time and is rarely looked for. So gene-environment interactions can be an explanation for inconsistency in associations. And I did come across another example of this. This was endotoxin exposure and allergic sensitization by CD14 genotype, and again, if you looked at the relationship of this genotype to probability of sensitization to this particular allergen, you'd conclude that there was no relationship with genotype if you had a sort of middle-of-the-range endotoxin load. If you had a low range, you'd conclude one thing about the CC homozygote, and if you had a very high range, you'd conclude the opposite about the CC. So gene-environment interaction can be quite important, and it's something we should be pursuing. So I think I'll close by just saying there are challenges in studying gene-environment interactions, as you might imagine. The genes actually are pretty easy to measure.
That's why we can measure so many of them at such low cost. The environment is very, very difficult to measure, and, you know, hopefully it's getting easier with more automated measures, but it's something that the Environmental Health Sciences institute is investing a fair amount of money in, and I know a lot of work has been done here on that as well. Variability over time: it's hard to argue that the DNA sequence varies over time. Whether it's turned on or turned off probably does vary with time; otherwise we would all keep growing and keep developing and that. But for the most part the sequence has low to no variability; the environment obviously can often have high variability. Whoops. And recall bias: there's probably none in the genes; it's certainly possible in the environment. Temporal relationship to disease, as I mentioned yesterday, is pretty easy for the genes; it's kind of hard for the environment. And just to close with being aware of one's environment: "I guess I'll have the ham and eggs," to the surprise of the chickens and pork here. So thanks, and I'll be happy to answer any questions. I have to say I was delighted to see that it was sort of rainy and gray today, because I felt so guilty about keeping you all in on a beautiful sunny day yesterday. So, yes? When you're talking about sample handling differences, where along the chain are you talking about? Anywhere, really anywhere. And, you know, some of that may have to do with the participants themselves. So samples can be different from young people versus old people, just in the way, you know, your DNA might be isolated, and certainly in the transformability of it and all. But really any step along the way, and I think there is an investigation going on in many of the laboratories now trying to figure out what are the steps, you know, what are the things that really perturb it a lot.
But granted that we can't really figure out what those steps are, at least look forward and try to adjust for it if you can, if there are systematic differences. And then, you mentioned genotyping errors. Is that on the platform itself, or? It's probably calling error more than on the platform. You know, it's basically that the chemistry didn't work well enough to be able to separate out the intensity of the two alleles. Now, there are other things that could cause genotyping error. What if you have three alleles? That would be one. Or if you have a null allele, you're not going to have any intensity for that person. It's not clear how to count that; you know, it's deleted, basically, so you don't have that SNP. Or if you have copy number variants, you may have four or five or six copies of it, and where does that show up in your intensity? And there are algorithms being attempted to sort of come up with that. But most of it is genotyping error and clustering error. Could we look at the principal component analysis, the one a few slides back? Are you sure you want to go back there? Yeah. Okay. I'm really bad at explaining principal components, but I'll do my best. So don't look at the screen if you tend to have photic seizures, of course. So this one, or the previous? Before that. Yeah, first and second. Okay, that's fine. Okay. Um, so what kind of variable was used to construct this principal component? This is allele frequency, so it would be frequency of, and what I believe the principal components does is pick out the SNPs that have the most differences between groups. Should it use a few alleles, or all of the alleles? I believe it picks, you know, whichever alleles are the most different between groups, and it could be one, but it's probably many of them. Okay. All right.
Thank you. Yeah, and, you know, the nice thing about principal components is, because they're linked, you know, they're in LD, you would get a bunch of them that would sort of come down with one. Yeah, I guess it picks up several alleles, because here there are several principal components. Oh, there are many, yeah. Yeah, I mean, it can go out to, you know, a hundred principal components or more, and then you start, you know, separating out sort of clans and, you know, towns and things like that. Okay. Thank you. Sure. You just mentioned that gene-environment interaction can be an explanation for the different results in the different studies, and of course through yesterday and today, over and over, it was mentioned that we need replication studies. So how do you know, then, if we know that several cohorts do not collect the same environmental factors? So, for example, the Nurses' Health Study shows a strong positive association and nobody else shows that this association was real there. So how do you account for these differences? Because then you cannot make a statement, yes, this is a false association, therefore it's not. Yeah, no, it's a huge problem. I mean, obviously we can't make all the cohort studies collect the same information. You'd like to be able to at least have some comparability between them, or collect some kind of watered-down version, and the problem of course is then it's watered down. And so are you missing a lot of the variability within there?
I think what tends to happen when there are inconsistencies is people just dismiss them, and that I personally think is a mistake. You know, there may be some very important information there. The problem is, you know, when you have a million possibilities, you want to at least, you know, distill down the ones that you really think are really interesting or potentially important and try to pursue those. But you have to be, you know, selective. Right. And this goes back to, then, wouldn't it be necessary, and I know that it's already out, the idea of having this huge cohort that your institute was thinking about. Therefore, because of these issues, wouldn't it be necessary for the future to have this huge cohort? Yeah, we think it's really necessary. We don't see how we're going to get sort of to the bottom of this without a very large cohort that is well phenotyped. The large cohorts that are going on now are not that well phenotyped, and the environmental characteristics that are collected are even poorer. But yeah, we think it's necessary.