So what we thought we'd talk about next is basic measurement approaches: some of the technology, and the ways these things are measured and reported, just to get some of the lingo together. We're going to talk about measuring genetic variation with a variety of different measures, listed here and in your handout, then a little bit about linkage disequilibrium and why it's important in these measurements, and then shift a little to talk about familial resemblance and family history. I'm a big Gary Larson fan, and this one is: "Hey, what are you looking at, buddy? You want trouble? You found it." And, understanding only German, Fritz was unaware that the clouds were becoming threatening, as you can see. So Tom has just thrown a fair amount of terminology at you, and I'll throw a little bit more. Really, a lot of the differences in communication between epidemiologists and geneticists are merely because of language difficulties.
When we first started trying to measure genetic variation, there weren't very good measures of it. Certain things were known to be genetic, and among them were blood group markers, because they clearly clustered in families and were inherited in families. There are enzymes in that group as well. One of the very first linkage studies, where linkage means looking for co-inheritance in families of a trait and a genetic marker, was this one from the fellow I actually trained with when I did my PhD, Alec Wilson, looking at the catechol-O-methyltransferase gene, COMT, which is a gene related to adrenergic signaling. There were only 25 polymorphic marker systems, and they describe here that they measured COMT activity in five large families. These were very large families from Ohio, 518 individuals. Then they tested associations with 25 genetic markers, including ABO, the Rh blood group, and a variety of others.
And there were only 25 across the entire genome. They found a LOD score of 1.27, which at the time was thought to be quite respectable. LOD stands for log of the odds. We won't go into it; Tom's going to do linkage a little bit later. But anyway, 1.27 with only 25 markers was actually pretty respectable, along with a small estimated recombination fraction, meaning that the marker and the presumed trait locus were close together for this particular enzyme. So this actually worked, which was exciting.
Moving from that relatively rapidly into about the 1980s or so, there were restriction fragment length polymorphisms. We were just talking about those in response to the question. These come from bacterial endonucleases that chop a DNA sequence at a certain point: they'll find a string of DNA, say CCGAT, and wherever they see a CCGAT, they chop the DNA. That's probably the way bacteria insert things into their own and other bacteria's genomes, which allows them to evolve. But for whatever reason, they're there, and they define polymorphic marker loci that can be detected as differences in the length of DNA after you digest it with these endonucleases. Depending on where it chops, you may get a longer or a shorter piece of DNA, and you can then use that to establish linkage relationships in pedigrees. This is an example from one of the first papers to describe it. Here's a string of DNA, actually two different strings of DNA. And you have your two (this does not want to stay on), your two places where endonuclease B can chop, right here. And here's endonuclease A: it may chop here and here in this particular person, but that CCGAT may be here and over here in this particular person. And so then when you go to run these on a gel, you chop them up and you can label them.
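A minimal sketch of the digest step just described, in Python. The two toy sequences, the function name `digest`, and the choice to cut at the first base of each recognition site are illustrative assumptions; real restriction enzymes cut at a fixed offset within their own specific recognition sequence.

```python
def digest(sequence, site):
    """Cut `sequence` at the start of every occurrence of `site`.

    Returns the fragment lengths in order along the sequence.
    Assumption: the cut falls at the first base of the site.
    """
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    # Drop zero-length fragments (a site at the very start of the sequence).
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


# Two hypothetical chromosomes from one person. They differ by a single
# base that destroys the second CCGAT site on the second chromosome.
chrom_1 = "AAAA" + "CCGAT" + "TTTTTT" + "CCGAT" + "GGGG"
chrom_2 = "AAAA" + "CCGAT" + "TTTTTT" + "CCGTT" + "GGGG"

fragments_1 = digest(chrom_1, "CCGAT")
fragments_2 = digest(chrom_2, "CCGAT")
print(fragments_1)  # three fragments
print(fragments_2)  # two fragments, one of them longer
```

The differing fragment lengths are what would show up as different bands on the gel, which is all the marker needs to be.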
And you see that this particular person, if these were both the chromosomes of one person, would have two different fragments here, suggesting that they have a polymorphism there, whereas they have the same site for endonuclease B and you'd only see one fragment. This was the basis for RFLP-based measurement of genetic variability: a very laborious, very challenging process in which you had to find all of these endonucleases and then actually chop up the DNA with them. What was foreseen then, and certainly came to pass, was that they could be used simply as genetic markers for any trait that segregates in a pedigree. "Segregates" means that it's inherited in the ways that Tom showed, either dominant inheritance or recessive or whatever. And such a procedure would not require any knowledge at all of the biochemical nature of the trait or of the nature of the alterations in the DNA responsible. All you're doing is putting little signposts in the DNA and then trying to find them based on the size of the fragment you're able to detect.
This was used very successfully in a variety of traits. It was used to identify the neurofibromatosis gene. Barker and colleagues in 1987 looked at 15 Utah kindreds and showed that the gene responsible for neurofibromatosis was located near the centromere of chromosome 17, the middle part, the part that attaches to the spindle apparatus and allows the chromosomes to separate during cell division. This is an example of it. It's kind of a nice one, but you'd need many, many of these in order to come up with a LOD score. This is a family where mom has two chromosomes that have the same polymorphism. Dad has one of each. All of the kids are affected, and so is dad. And basically you can see there's one band here at the 2.4 and a second band in everybody but the person with the 1.1 variant, who was not affected.
So this is just demonstrating co-segregation of the disease with the A2, the 1.9 kilobase allele, and not with the A1 allele, in each of four affected offspring. As I say, you'd need to do this in many, many people in order to be confident in it. So those were RFLPs. They were cumbersome, difficult to work with, and there weren't very many of them across the genome.
Tom mentioned variable numbers of tandem repeats, minisatellites and microsatellites. I haven't been able to find anyone who can explain why these are called satellites, but regardless, minisatellites are repetitions in tandem of a short motif, maybe six to a hundred base pairs, spanning about half a kb to several kb. This really opened the way to DNA fingerprinting. It's still used in forensic science; the Genome Institute and NCBI were actually involved in identifying remains from the 9/11 disasters, among other things, and these are still used in forensic databases. They provided the first highly polymorphic, multi-allelic markers for linkage studies and were associated with many interesting features of human genome biology and evolution. There are a lot of these across the genome. They're sort of curiosities at this point, but one that's quite well known, I think, to cardiovascular epidemiologists is the 5 kb Kringle IV repeat. Kringle is a name given to a particular region of a protein that goes in loops, and it looks like the Danish pastry that's called a kringle. You may see them around here in the morning. At any rate, it's found in the apolipoprotein(a) protein and in plasminogen, so these are common in cardiovascular epidemiology. This is just an example of that. Here's the gene for apo(a), as cDNA, the complementary DNA. What you do is find an RNA. It's very easy to pick up a messenger RNA by that poly-A tail that Tom mentioned.
If you have a column of, basically, T's that the A's bind to, you just run your mix of things that might include a messenger RNA down that column, and the A's will stick to the T's, and you can pull them out. This is probably how the cell does it too, just in a much more elegant way. Then you can make a complement to that RNA, which is much more stable, and that's called a cDNA. It's just a way of looking at structure. So here's a Kringle IV repeat, and there can be anywhere from 1 to 37 of these in humans. There's also a Kringle V region that tends not to be repeated. What's shown very nicely here is a run on gels from a variety of people. This one has a 12 repeat, here's the 12 here, and on the other allele a 24, and there it is there; or 13 and 25, I think; 14; and on up. You can just see them laddering up here, which I thought was a nifty picture. So that's what that looks like. And this just shows the molecular weight, which is related to the number of Kringle repeats, as you can see here. Here's the number of Kringle repeats; it's actually a ratio to that single Kringle V repeat. As the number goes up, the molecular weight goes up, and as molecular weight goes up, the lipoprotein(a) levels go down. Lp(a) has been associated with coronary disease. It's not entirely clear how that association works or why, but at any rate, this was a nice example of it.
Also sometimes called variable number of tandem repeats are microsatellites, which are much shorter. They're two to six base pair motifs; most of them are actually di-, tri-, or tetranucleotides, so two, three, or four bases, repeated anywhere from 20 to 50 times. These are highly polymorphic in a population, and they were extremely useful for mapping and linkage studies in families. You may be familiar with the Marshfield Clinic, which produced the Marshfield map. There were similar maps, the deCODE map and a number of others.
They placed about 400 of these microsatellites across the genome and provided the primers so that you could test them in your own studies. And these could be highly automated. So the National Heart, Lung, and Blood Institute and the Center for Inherited Disease Research at NIH both funded very, very large linkage studies, not only in humans; there was a dog map and a couple of other animal maps as well. This was used in great abundance up until probably five years ago or so. In fact, CIDR retired its microsatellite platform just last year, and there were sighs of relief or sighs of sadness, depending on how you look at it. So these were used for linkage studies, and they produced graphs that looked a lot like this one, from my former colleague Dan Levy in the Framingham study. Basically, one had one of these markers maybe every 10 megabases or so, every 10 million bases, if you had 400 of them across the genome. And you really didn't need them any more frequently than that, because in studying families, particularly smaller families that are closely related, you don't get any additional information in this interval; families share such large pieces of their chromosomes. So once you've put in 400 markers, you really don't get much more independent information from 800 markers or 1,200 or 1,600 across the genome, though when you want to look at a specific region you well might, particularly in unrelated people. But anyway, this is what microsatellites did for us. Unfortunately, a lot of these studies really didn't come up with much in the way of genes, and some new tricks were needed. "High above the hushed crowd, Rex tried to remain focused. He couldn't shake one nagging thought: he was an old dog and this was a new trick." So it was time for some new tricks in this field. And the new trick, as Tom mentioned, was single nucleotide polymorphisms.
These had been identified, sort of discovered along the way as other polymorphisms were identified. They were thought not to be terribly useful, because the dogma had been that you needed something highly polymorphic in a population, meaning that most people in this room would have two different copies, and those two copies would be likely to be different from those of the person sitting next to them, and different again from the next person. So there was lots of variability. Well, most SNPs are biallelic, so there are only two possibilities: it's either an A or a T, as you can see here, a C or an A, or a C or a T here. Most of the rest of the genome, 99.9% of it, is the same, but in just these couple of spots you have a little bit of a difference, a single base pair spelling change in your DNA. And how could that possibly tell you much of anything, people said, unless you measured thousands of them, or maybe tens of thousands or hundreds of thousands? The technology to do that was not available at the time these were first identified. Of course, the technology has caught up and actually far surpassed our ability to understand it, but now we have the technology to measure these and analyze them. What was needed was some way of mapping the relationships among them. The linkage maps that Marshfield and deCODE and others put together, they were able to put together because they had large families that they could follow; they could genotype them and look at segregation of their markers throughout those families. With these markers, families really wouldn't help you, because so often the alleles would be shared among family members. You really needed to look across unrelated people.
But just to give you an idea of what this looks like, here's a generic chromosome, and here's a segment of it that contains a gene. Your generic gene has these red things that are exons, and there are maybe some SNPs in the exons, and there may also be some SNPs in the introns in between, or in the promoter or untranslated regions on either end. Usually, as we mentioned, there are more SNPs in those regions than in the exons, because variants in the exons tend not to be as well tolerated by natural selection. And then there are these patterns of association among them, and these triangles tend to throw people. I know when I first saw them it was like, what in the world are these things? You see them a lot in diagrams: here are some stretches of DNA or genes, and then you see these triangles, labeled with various numbers and that sort of thing. Really, we've all been looking at these for a very long time; we just didn't realize it. These are essentially correlation matrices. If you've ever gotten mileage tables from the AAA, you ask how far it is from Boston to Providence, and it's 59 miles, and Boston to New York is 210, Boston to Philadelphia, et cetera. Well, say that instead of putting these numbers in, you color code them, so that the cities that are close together are dark red and the cities that are far apart are bright white. You could color code them like this, turn them on their side, make them into squares, and there's your linkage diagram, sorry, your LD diagram. So all this is is a relationship among various SNPs, and when you see these, don't let them throw you. It's really just Boston to Providence when it's nice and dark red like that.
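To make the mileage-table analogy concrete, here is a small Python sketch that builds the upper triangle of a pairwise table for three SNPs, using the squared correlation between alleles as the entry. The eight toy haplotypes, the 0/1 allele coding, and the function names are illustrative assumptions, not data from the slides.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two 0/1 allele vectors.

    Assumes both SNPs are polymorphic in the sample (nonzero variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov * cov / (var_x * var_y)


def ld_triangle(haplotypes):
    """Return {(i, j): r2} for every SNP pair with i < j.

    Like a mileage table, the matrix is symmetric, so only one
    triangle needs to be stored or drawn.
    """
    columns = list(zip(*haplotypes))  # one tuple of alleles per SNP
    return {
        (i, j): r_squared(columns[i], columns[j])
        for i in range(len(columns))
        for j in range(i + 1, len(columns))
    }


# Hypothetical haplotypes: rows are chromosomes, columns are SNPs.
# SNP1 and SNP2 always travel together; SNP3 is independent of both.
haplotypes = [
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]

ld = ld_triangle(haplotypes)
for (i, j), value in sorted(ld.items()):
    print(f"SNP{i + 1}-SNP{j + 1}: r2 = {value:.2f}")
```

Coloring each entry from white (near 0) to dark red (near 1) and rotating the triangle gives the kind of LD plot described above.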
So what that meant then is that one tag SNP can serve as the proxy for many, many SNPs. Here you have stretches of two chromosomes in one person, two in another, and two in another, and you can see that these white places are where everybody is the same, and then there are some polymorphisms. For instance, here's SNP 3, which is very closely related to SNP 4. Every place you have a G in SNP 3, you have an A in SNP 4. Every place you have a C in SNP 3, you have a G in SNP 4. And likewise, or rather in contrast, in SNP 5, sometimes when you have an A in SNP 4 you've got a G in SNP 5, sometimes when you have an A... I'm sorry, these are perfectly well correlated as well. So these are a block, as are SNP 2 and SNP 1, and so these form a linkage block. This is a little hard to see. Yeah, so here you have an A and there's a G here; sometimes you have a G and there's a G here. So knowing SNP 4 doesn't tell you a lot about SNP 5. But looking at SNP 5 and SNP 6, they actually are very closely correlated, as are SNP 6 and SNP 7, and they form another block. So these are just linkage blocks of SNPs that travel together and could be measured together, and then sometimes you have one that's just kind of out there by itself. And so, taking away the intervening sequence that doesn't contribute a whole lot of information, you could just pick one of these SNPs and you'd get all of the information that was in between. So you just pick one.
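One way to sketch the "just pick one per block" idea in Python is a greedy selection: keep choosing the SNP that covers the most not-yet-covered SNPs at some r-squared threshold. This is a generic greedy set-cover heuristic, not necessarily how any particular tagging tool works, and the SNP names, block layout, r-squared values, and the 0.8 threshold are all assumptions made for illustration.

```python
from itertools import combinations


def pick_tag_snps(snps, r2, threshold=0.8):
    """Greedy tag SNP selection.

    `r2` maps frozenset({snp_a, snp_b}) to their pairwise r-squared;
    missing pairs are treated as uncorrelated (r2 = 0).
    """
    def tagged_by(snp):
        # A SNP always tags itself, plus anything correlated above threshold.
        return {
            other for other in snps
            if other == snp
            or r2.get(frozenset((snp, other)), 0.0) >= threshold
        }

    untagged = set(snps)
    tags = []
    while untagged:
        # Ties are broken by order in `snps` (max returns the first maximum).
        best = max(snps, key=lambda s: len(tagged_by(s) & untagged))
        tags.append(best)
        untagged -= tagged_by(best)
    return tags


# Hypothetical layout: one block of four SNPs, one block of three,
# and one SNP that sits out there by itself.
snps = [f"snp{i}" for i in range(1, 9)]
blocks = [["snp1", "snp2", "snp3", "snp4"], ["snp5", "snp6", "snp7"]]

r2 = {}
for block in blocks:
    for a, b in combinations(block, 2):
        r2[frozenset((a, b))] = 1.0  # perfect LD within a block (assumed)

tags = pick_tag_snps(snps, r2)
print(tags)
```

With perfect LD inside each block, any member of a block would serve equally well as its tag; the greedy rule simply takes the first one in list order.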
I picked the one with the prettiest color, but you could pick whichever one you want. Similarly, you could just pick one here and you'd still get all of this intervening information. You can stick those together, and the sequence of those, which are called tag SNPs because they tag that whole area, is also known as a haplotype. Maybe 35% of your population has this particular haplotype, 30% has that one, 10% has this one, et cetera. Then you can identify different types within a population and use those to look at association with various traits.
There are a number of ways of estimating the correlation between SNPs. The two most common are D prime and r squared. Lewontin's D is shown here. It's just the probability of the two, say, ancestral alleles traveling together and the two variant alleles traveling together, minus the probability of the variant allele and the ancestral allele traveling together. In order for the variant and the ancestral to get hooked up together, you have to have a recombination event there, and in general the further apart SNPs are, the more likely there is to be a recombination event. So if this doesn't happen very often, D is very big. Then there's D prime, and I confess I've forgotten the formula for max D, but D prime is just D corrected by that constant. One of the problems with this measure is that it tends to overestimate linkage disequilibrium, particularly for rare alleles, because you're looking at the probability of a crossover event measured across populations. If the alleles are very rare, the probability is going to be low that there's a crossover event, just because the alleles are rare. Whereas a simple correlation coefficient, r squared, is actually a much better, more reliable measure, and there's a better discussion of this in Devlin and Risch.
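For reference, the standard two-locus definitions can be written out as a short Python sketch. Here D is the frequency of the AB haplotype minus the product of the A and B allele frequencies, D prime is the absolute value of D divided by its maximum possible value given those allele frequencies, and r squared is D squared divided by the product of all four allele frequencies. The haplotype frequencies in the example are made up to illustrate the rare-allele problem.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D', r^2) from the four haplotype frequencies.

    The four frequencies are assumed to sum to 1, and both loci are
    assumed to be polymorphic.
    """
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB
    d = p_AB - p_A * p_B  # equals p_AB * p_ab - p_Ab * p_aB

    # Largest |D| possible given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))

    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r2


# Hypothetical example: a rare allele A (5%) that is only ever seen
# on the same chromosome as the common allele B (50%).
d, d_prime, r2 = ld_measures(p_AB=0.05, p_Ab=0.0, p_aB=0.45, p_ab=0.50)
print(f"D = {d:.3f}, D' = {d_prime:.2f}, r2 = {r2:.3f}")
```

In this made-up case D prime comes out at 1, complete disequilibrium, while r squared is only about 0.05, so typing the common SNP would tell you very little about the rare one.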
So D prime varies from zero to one. Zero is when they're completely in equilibrium, one when they're in complete disequilibrium, and when D prime is zero, typing one SNP gives no information at all about the other SNP. But as I mentioned, it doesn't account for allele frequencies, and r squared is the preferred measure. When r squared is 1.0, two SNPs are in perfect LD. So every time you see, say, an A at one SNP, you see a G at the other; the allele frequencies are identical for both SNPs, and typing one SNP provides complete information on the other. That's when you have an LD of 1.0. You might have an LD of 0.98, perhaps because the allele frequencies aren't quite the same, but for the most part they travel together.
So what can LD do for us? It's actually very, very useful. It can mess you up as well as really being helpful. In design, it's used to estimate the theoretical power to detect associations, because if you knew that two SNPs were correlated with an r squared of 1.0, you'd know that your power would be the same measuring SNP A as measuring SNP B. If, on the other hand, your r squared is only 0.5, your power to detect an association with SNP A is going to be much less if you're measuring SNP B, because they're not well correlated, so you're essentially adding some noise. It also helps you evaluate the degree of completeness of your sampling and choose the most informative genetic variants to genotype. And just note that the sample size needed increases by a factor of about one over r squared to achieve the same power to detect an association with a SNP that is not quite as tightly correlated with the one you really want to measure, which you hope would be the disease-causing SNP. So I realize that went by a little fast. Any questions on that, the LD concept?
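The one-over-r-squared rule of thumb is easy to put into numbers. The Python sketch below is only that approximation; the baseline of 1,000 subjects is an arbitrary assumed figure, not a number from the lecture.

```python
import math


def required_sample_size(n_direct, r2):
    """Approximate sample size needed when typing a proxy SNP.

    `n_direct` is the sample size that would give the desired power if the
    causal SNP itself were typed; `r2` is the r-squared between the proxy
    and the causal SNP. Uses the rule of thumb N_proxy ~ N_direct / r2.
    """
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in the interval (0, 1]")
    return math.ceil(n_direct / r2)


# Assumed baseline: 1,000 subjects would suffice if we typed the causal SNP.
n_perfect = required_sample_size(1000, 1.0)
n_half = required_sample_size(1000, 0.5)
print(n_perfect, n_half)
```

So a proxy with an r squared of 0.5 roughly doubles the number of subjects needed for the same power.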
All right, so what you'll often see in genome-wide association studies is basically a plot across the genome. One of the nice things about DNA is that it's a linear molecule, so you can just line up all the SNPs as they occur on the DNA. What's shown here is for a group of British cases and controls with coronary artery disease, and then German families with coronary disease, and you see the association statistics here. What's generally plotted on the y-axis is the minus log of the p-value, just because it makes them easy to relate to. So a p of 10 to the minus 2, or 0.01, would be a 2 down here. A p of 10 to the minus 10 would be a 10 up here. And these, as you can see, are very strong associations: 10 to the minus 16th, 14th, 16th, et cetera. Then you'll see this linkage block here. Remember, this is just your LD block, Boston to Providence: these things are very close together and they travel together. So if you were looking at, say, these two SNPs, they travel together and they're not going to give you much independent information. Up here, for example, these seem as though they may be in slightly different blocks, and certainly these are in different blocks from those. So if you were trying to pick SNPs to type in a follow-up study, you might want to type those that are in different LD blocks.
One of the neat things about genetics is that it is constantly changing, and things that were held to be God's solid truth last year no longer are. One thing that was widely known and widely taught was that recombination happens at random across the genome; there's no rhyme or reason to it, it's a totally random event. That is clearly not the case. Here you can see there's been a recombination event here, but this block tends to be pretty much intact, as does that block.
This is just shown in this family study, and shown here in the HapMap, where there were many more SNPs typed and many more people examined. But what's become very clear is that there are hotspots of recombination. So recombination is not a random event; it happens in particular regions much more often than in others, and that really threw people off when they were trying to map genes and figure out where they were located based on linkage information. This is another similar example of the kind of statistics you get out of these studies. Again, it's plotted as the minus log 10 of the p-value, and in this particular region there are three genes: the interleukin-12 receptor B2, the interleukin-23 receptor, and then a hypothetical protein. When they say hypothetical protein, what they mean is that there's a region of the genome called an open reading frame, which could be coding for a protein. Basically, it doesn't have a stop codon for a while, so that's a good sign that it probably codes for a protein. Yes, sir? Okay, so this was the SNP typing done in this particular study, where they only typed a relatively small number of SNPs in this region, so you'll notice that the blocks are bigger. This same region was typed much more densely in the HapMap; there are about three million SNPs in the HapMap across the genome, and here there were probably only 300,000 or so, and there are more people. You can still see the same blocks. They're not lined up very well, and that was a mistake of the editors, but you still see the same blocks there. Okay, great.
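The minus log 10 scale used on the y-axis of these association plots is simple to reproduce. A minimal Python sketch, with the helper name `neg_log10` chosen here for illustration:

```python
import math


def neg_log10(p_value):
    """Convert a p-value to the height plotted on an association plot."""
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in the interval (0, 1]")
    return -math.log10(p_value)


# The p-values mentioned in the talk, from modest to very strong.
for p in (0.01, 1e-10, 1e-16):
    print(f"p = {p:g} plots at a height of about {neg_log10(p):.1f}")
```

Smaller p-values plot higher, so the strongest associations stand out as the tallest points.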
Okay, so here you have these three genes, this hypothetical protein and then these other two, and here's your association signal, and you're thinking, well, gee, it kind of looks like it's in that gene. But if you then look at the LD patterns (oops, there are your genes) you can see that there are actually two blocks of linkage disequilibrium. They're not real great, I mean, they're not real solid, but they're certainly there, and it's pretty obvious that it's probably not this gene that's associated with the signal, nor this hypothetical protein, but probably something in here, in these two LD blocks. So LD can be very helpful for narrowing down an association region. These are plotted in different ways. Sometimes you'll see people plot D prime against r squared. Back in the earlier days, you know, way back in 2006, people were used to the D prime measure, which is shown here in blue, and weren't as used to the r squared measure and didn't like it, because it didn't make as pretty pictures as the blue one. So you sometimes see them plotted together. This is TCF7L2. It's the strongest genome-wide association signal found for type 2 diabetes to date, and here the gene is shown, this is the direction of transcription, and this is how the various SNPs are associated. And this is a similar plot of linkage disequilibrium, now in the three populations studied in the HapMap. I'll talk about the HapMap a little more later, I think, but what they did was look at the Yoruba people from Ibadan, Nigeria, a population of African ancestry, and African ancestry populations are very old. If you follow the out-of-Africa hypothesis, which is really no longer a hypothesis, it's pretty well established, most human variation was in Africa and remained there, and a small piece of it left and went into Europe and Asia and colonized the Americas.
So populations of recent African ancestry tend to have less linkage disequilibrium, because they're an older population and there's been more time for it to break up than in younger populations. The CEU is the CEPH population, a European ancestry group, and these are the Han Chinese and the Japanese from Tokyo, an Asian population, and they have had less time for their LD to break up. So you can see these triangles are a little bit denser in these two populations than they are in the Yoruba. You see that over and over again in populations of recent African ancestry, and we'll show you in a bit how useful that can be.
What was desired, then, was to produce a HapMap to do more efficient association studies in unrelated people. We wanted to use just the density of SNPs needed to find associations between SNPs and disease. You don't want to type any more than you have to, but you don't want to miss any regions that have a disease association. The goal was really to produce a tool to assist in finding genes affecting health and disease, recognizing, as I just mentioned, that populations differ in their degree of LD. Recent African ancestry populations have shorter stretches of linkage disequilibrium, so you need more SNPs for complete genome coverage in that group. SNPs were really a gateway, then, to genome-wide association studies. Tom has mentioned those, and we'll be talking about them a lot. In fact, a lot of the perspective you're getting from Tom and me comes from the fact that genome-wide association is all the rage, and it's all the rage because it's working, where many of the previous methods of interrogating the genome really didn't work for identifying genetic variants. That's likely because, particularly for complex diseases, you're dealing with genes of very small effect, whereas linkage studies worked great for Mendelian diseases, where the genes have a very large effect.
So SNPs are much more numerous than the other kinds of markers I mentioned, and they're much easier to assay. Genome-wide studies attempt to capture the majority of genomic variation, which is about 10 million common SNPs, those present in about 5% or more of the population. This variation is inherited in groups, as I mentioned, so you don't have to test all 10 million points. The blocks are shorter, as I mentioned, the less closely people are related, so you need to test more points in that case. Now we can do studies with hundreds of thousands of markers, and this was the impetus for developing the HapMap. It was published in Nature in 2005, but the data were actually made available almost as they were produced. As soon as they were QC'd, they were made available through the HapMap website, and they were used for many, many genomic discoveries, including the TCF7L2 example that I showed you. The expanded HapMap, of over 3.1 million SNPs, was published in 2007, last year. These again are the common SNPs that were identified and put into linkage patterns. At the same time, and perhaps stimulated by the HapMap, genomic technology improved dramatically. This is a slide from my colleague Stephen Chanock at the NCI. Back in 2001, we thought we were driving a really hard bargain if we could get a single SNP genotype for about a dollar, so here's 10 to the second, cost per genotype in cents, American cents. So back in 2001, with the TaqMan assay, which was the gold standard at the time, a dollar a genotype was really good. We were getting requests at the NIH from people who wanted three and four dollars, because they weren't using efficient platforms.
That was one of the reasons we set up some of the large-scale genotyping services that we did, because the genotyping could be done much more efficiently. Over time these costs came down. These are the various platforms and the various producers, and you'll notice also that the numbers of SNPs genotyped went up. In fact, the flexibility of the platforms went down a little bit too, because you basically had to buy into whatever 10,000 SNP platform a particular manufacturer was providing, or 100,000 SNP, or whatever. Early on, when these things were expensive, people didn't want to measure 100,000; they just wanted to measure 10 or 5 or maybe 50. But over time this paradigm has shifted, and the cost has continued to come down. I haven't updated this slide in a very long time, but believe me, it continues to look like this. The million SNP chip was introduced by both of these companies about six to eight months ago, and the cost of those is running around $500 to $600 now, so truly dramatically increased capacity and decreased cost. What that means is that in 2001, if you wanted to type all 10 million SNPs, which is what you'd have to do since you didn't have the linkage disequilibrium patterns, at a dollar a SNP a 2,000-person study would cost roughly the budget of the entire National Institutes of Health, which wasn't likely to happen. In 2008, we can type about a million SNPs at a cost of about 0.05 cents each, so about $500 per person for a million SNP chip, or about a million dollars for that study. That's still a good piece of change, but it's manageable, whereas before it really was not. This is just an overview of the coverage of the various more recent platforms.
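The 2001 versus 2008 cost comparison above is straightforward multiplication, sketched here in Python using the figures quoted in the talk (10 million SNPs at a dollar each, versus a million SNPs at 0.05 cents each, in a 2,000-person study). The function name is just an illustrative choice.

```python
def study_cost_dollars(snps_per_person, dollars_per_genotype, n_people):
    """Total genotyping cost: SNPs per person x cost per genotype x people."""
    return snps_per_person * dollars_per_genotype * n_people


# 2001: all 10 million common SNPs at about $1 per genotype.
cost_2001 = study_cost_dollars(10_000_000, 1.00, 2000)

# 2008: a million-SNP chip at about 0.05 cents ($0.0005) per genotype.
per_person_2008 = study_cost_dollars(1_000_000, 0.0005, 1)
cost_2008 = study_cost_dollars(1_000_000, 0.0005, 2000)

print(f"2001 study: ${cost_2001:,.0f}")
print(f"2008 chip, per person: ${per_person_2008:,.0f}")
print(f"2008 study: ${cost_2008:,.0f}")
```

That works out to about 20 billion dollars for the 2001 scenario against about one million dollars in 2008.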
The Affymetrix GeneChip 500K was used for the Wellcome Trust Case Control Consortium, which we'll talk about at some length, I think, and in several of the other studies that were reporting out in early 2007. It gave relatively poor coverage at an r squared of 0.8. The question being asked is what proportion of the SNPs in the genome you are covering at an r squared of 0.8 or better, and in the Yoruba it was only 46%; in the European population and the Asian population it was a little bit better. Then there's the SNP Array 6.0 (I left out 5.0, sorry), and these numbers are much, much better. Similarly for the Illumina platforms these numbers went up and up, and the Perlegen 600K has about these kinds of numbers. So we're getting very, very good coverage now, and it's only continuing to improve.
Something just to be aware of is that the polymorphism literature can be a little bit difficult to follow, because polymorphisms are named in different ways. Sometimes the polymorphism is named for the amino acid change: the angiotensinogen gene M235T is the methionine to threonine change, I believe. It can be named for the nucleotide change: here, I forget what this is, the angiotensin receptor, I believe, and this is a nucleotide change, an A to C change in the cDNA, the complementary DNA we talked about, at position 1166. It could be in the promoter region: this is a minus 6, and usually when you're numbering promoters you count upstream of the initiation site, so it has a negative sign. It could be named for a restriction enzyme site, so these are various restriction enzymes that cut the DNA in different places. It could be named for the gene product, such as ApoE2, a particular protein produced by the ApoE gene. And there are a number of legacy systems, particularly for the major histocompatibility complex. That's the immune system locus used in typing for bone marrow donation and that sort of thing; it's a very, very polymorphic locus, and it has a legacy system of naming that goes way back.
It could be from reference SNP numbers; these are from dbSNP, which Tom mentioned to you. The reference SNP is sort of the consensus sequence; the submitted SNP is what's submitted by whoever submits something to dbSNP and says, we found a new SNP, here it is, and here's our ss number. And as Tom mentioned, good sources for this information are OMIM, HUGO, and the UCSC Genome Browser, which actually is a neat one. If you haven't looked at it, we won't show it to you here, but you can Google UCSC Genome Browser, that's how I find most things genomic, and if you put in a gene, I tend to remember ApoE because it's cardiovascular, and just ask it to show you the segment of the genome around ApoE, it will show you all of the SNPs in the region, it'll show you the conservation in various different species, and a whole bunch of other things, so it's really pretty cool. I don't have time to go into other genomic technologies. One to be aware of, that's sort of coming on the horizon and will probably drive genome-wide association out of business, is sequencing, to measure variation at every point in every gene or candidate region in dozens to hundreds of people, to find all of the functional variants. That's the way that it's used now. We anticipate that within probably not too many years the thousand-dollar genome, as it's been called, will be a reality, which means we can sequence an individual's genome for about a thousand dollars. Remember that the first genome project probably cost about two and a half billion dollars, so that's a several-orders-of-magnitude improvement in cost, and those costs are coming down day by day. Gene expression is measuring changes in messenger RNA, which is the transcription part, in cases and in controls or in response to stimulation, and you'll see some expression studies. Epigenetics is to measure changes on top of the DNA, that's what the epi part means, that either turn the DNA on or off or at least make it more available or less available for transcription. So,
depending on how DNA is methylated, the polymerase, RNA polymerase, may not recognize a site as a transcription start site and may kind of skip over it and then not transcribe that. Or the DNA may be wrapped around histones, which are the proteins that kind of bind it up into chromatin, and it may be wrapped so tightly or in such a way that it's not accessible to unwinding to then be transcribed. That's what histone deacetylation does; that can turn genes on and off. So we're not gonna talk about those very much. So let's just pause for a breath for a second, and this is, 'I never realized we'd have to know so much geography,' and you may not have realized you'd have to know quite so much molecular biology, but that's probably the most of at least genetic structure and function that we'll need to know. So just to summarize on genotyping points before I get to familial information: there's been unbelievably rapid progress from a small number of blood group markers to more than 10 million SNPs, CNVs, structural variants, sequence variants, and the technology is continuing to change. It's one of the challenging things about this field. I haven't talked at all about copy number variants; they're sort of the latest, greatest new thing, and they basically are being typed through SNPs, so I won't go into them much, but we can talk about them if you like. And as I mentioned, there's more to come in lecture four on genome-wide association studies. Quality control is a major issue and we'll be talking about that as well. But I did wanna talk a little bit about familial resemblance. This may be a group of gentlemen... whoops, no video signal, that's not good. So, familial relationships. Okay, basically there are a couple of ways of looking... if I'm gonna touch it anyway, let's see, come on, touch screen to enlarge, yes. So the trait is more similar among related than unrelated persons, makes sense, that would be resemblance. And clustering is often a measure of the risk of disease in the relative of somebody who has it
being greater than the risk in the relative of somebody who doesn't have it, or of people in the general population. This has been called the sibling relative risk; I like to call it the relative relative risk, or Risch's lambda sub S, as it's also referred to. One can also look at distributions of a continuous trait. This doesn't have to be in related individuals, but it's also called mixtures of distributions or commingling analysis, where, say, you find two or three means in a population, so instead of a nice distribution with a single mean you see like a big group and then a smaller group and then a smaller group. That suggests that maybe there's a major gene that's producing each of those three. You don't often see those kinds of things, and when you do they're not necessarily related to genetics, but in cholesterol measures, for example, people with heterozygous familial hypercholesterolemia will give you a bump in kind of the middle of the distribution with a long tail, and then those who have the homozygous state will be way, way out here, but a little bump in that too. So that's another way of looking at them. This is an example of relative risk as a sibling relative risk, and it's actually a risk of a good thing, living to age 90, at various ages, depending on whether you had a sibling who was a centenarian or a sibling who had died at age 73. Shown here is that in people who were age 64 who had a centenarian as a sibling there was really not any greater chance that they would live to be age 90, but as they got older there was much greater risk, and particularly when they got up into their 80s they were much more likely to make it to age 90 if they had had somebody before them to whom they were related who had made it to age 100. So that's a nice example of a relative risk. You can also find these with larger families; it's easier to at least assess relative risks in larger families. The group in Iceland is blessed by having a relatively small country that has not had a lot of in migration and out
migration and does have a total national obsession with genealogy. They absolutely love genealogy; they can all trace their ancestry back to, like, the 10th century or so. When they meet each other they say, oh, I knew your grandmother, she was my uncle's schoolteacher. So anyway, this is a truly representative pedigree of people with atrial fibrillation, going, as you can see here, six generations, with the various affected individuals shown. And this allowed us to then look at the risk ratio; these were basically prevalence ratios of atrial fibrillation in first-degree relatives, in second-degree relatives, third, fourth and fifth, and you notice that this decreases in almost a halving, exponentially, which is very consistent with the inheritance of a major gene. And in fact Arnar and others then published a genome-wide association study of atrial fibrillation just last year and showed that they found a genetic variant related to this. So sibling relative risks are one way of looking at these for discrete traits. For continuous traits you can look at correlations among relatives. This is when Galton, I think, one of the earliest geneticists, looked at relationships among relatives; he studied height and showed that basically an offspring's height is the midpoint of the two parents' heights, and one can regress that, basically, so you can regress one relative's value on the other in just a simple regression analysis. Shown here, the height of the offspring is the midparent mean, via a beta coefficient, plus the population mean, and then twice this parent-offspring correlation is an estimate of heritability, or the proportion of variance in the entire population that's explained by presumably genetics, probably some shared family environment as well. If the trait is under genetic control you expect the correlations among closer relatives to be greater than those among distant relatives, and here are some familial correlations after Wendy Post
et al. in hypertension. Spouse correlations are often used as sort of a control for familial correlations. We generally assume in the US that spouses are unrelated, so if there's a high spouse correlation, that suggests that shared environment may be more important than genetics. But in 855 pairs the correlation between spouses was 0.05; the expected would be zero. For parent-offspring pairs it was 0.15; the expected, if it was a single gene that was causing this, would be 0.5, because parents and offspring share exactly half their genes. Siblings share on average half their genes, their variants, and the correlations here were similar, suggesting that there might be some environmental factors as well that are bringing this down. And avuncular pairs, which are niece-uncle, nephew-aunt, et cetera, were smaller than that, and that would be expected as well. So this is suggestive; it's not real strong, but it's some suggestion of familial correlations for a continuous trait. And as I mentioned, assessing the familial and genetic nature is generally done by looking at heritability. It's often designated as either a capital H or an h squared, or sometimes sigma squared G over sigma squared P, which is the proportion of the phenotypic variance, sub P, explained by the genetic variance, sub G, and I just reiterated that here. It's both a population-specific and an environment-specific parameter, so it changes from population to population depending on how much environmental influence there is. If there's more environmental influence adding to the total phenotypic variance, this proportion is going to go down; if you can keep the environment constant, everything is gonna look genetic and so this proportion will go up. Keep in mind that its value does not indicate the role of genes or variants in any specific individual, but it allows you to sort of predict the expected degree of familial aggregation of a trait. And it was anticipated that traits that had high heritability should prove fruitful in
identifying trait-related genes. Probably the trait with the highest heritability that's known is height. Height actually did not lend itself very well to identifying genetic variants or genes in linkage studies, but actually has been really a goldmine in genome-wide association studies. And just another way of looking at this: percent of variance explained for angiotensin-converting enzyme activity, ACE activity, in fathers, mothers, and siblings, and these are just the major gene effect affecting this and the proportion of variance explained. And just sort of to point out that up until now we really hadn't found any genes at all, but even those that we've found really don't seem to explain the vast majority of the heritability that had previously been identified. So height, 90% heritability; the variants found to date explain only about 3% of that. Does that mean that there are many, many, many more variants to be found, or does it mean that environmental influences haven't been taken into account as well? It's not quite clear. Type 2 diabetes has a lambda sub S, or the risk to your sibling if you have diabetes, of about threefold, three- to fourfold. So far the variants that have been identified have a lambda sub S of only about 1.07. C-reactive protein: in the Reiner and Ridker papers that were recently published they've estimated about 10.5% of the variance explained by the variants that they identified. I'm not sure that I trust that; that seems awfully high, and the total variance is 30 to 50%. This needs to be replicated; these are new studies. And in a recent psoriasis study, for example, the lambda sub S in siblings is four to 11, maybe about seven or eight on average. There were about nine variants in this particular paper that were at 1.3; if you were to multiply all of those out, if you had each one of them, you might be explaining a lambda sub S in the eight to nine range. So it seems as though you're
getting more and more of the variance explained. These are also newer and newer studies and I suspect that they won't replicate. Keep in mind that the first estimates that you get of a relative risk for any risk factor, whether it's smoking or whatever, tend to be overestimates, because you've had some variability in order to be able to find that estimate, and we'll talk about that in a bit as well. Tom had asked me to comment just briefly, and that's all I'll do, on Hardy-Weinberg equilibrium, because he'll be talking about it a fair amount in the next talk. Remember that he talked about Mendel's second law, that the occurrences of two alleles of a SNP in the same individual are two independent events, and those basically segregate separately. There are ideal conditions in which an equilibrium is established and maintained among them; that was described by two, actually, epidemiologists, Hardy and Weinberg. And those conditions are random mating, which we generally do not have in the U.S.; no in- or out-migration; no inbreeding; no selection, that is, equal survival of the offspring; no mutation; large population sizes; and the gene frequencies are equal in males and females. Very few of these conditions actually hold, but they're not all that critical for estimating Hardy-Weinberg equilibrium. And if alleles big A and little a of a given SNP have frequencies p and one minus p, then the expected frequencies of the three genotypes, and probably all of us learned this in high school, are p squared, two pq, that is, two times p times one minus p, and one minus p, quantity squared. And this is a very useful equation to test. It used to be used to sort of identify whether there was selection pressure against one genotype or another; these days it's actually more likely to indicate genotyping error, particularly because heterozygotes on the current platforms are much tougher to type than the homozygotes, so what you tend to have is fewer heterozygotes than you would expect by Hardy-Weinberg equilibrium. So it's worthwhile
keeping that one in mind. I think that's about where I'm at. So keep in mind that familial clustering is an indicator of possible genetic influence. It's just a hint; it doesn't necessarily mean that there are genes at play. It may overestimate the genetic component due to either poor assessment of the environment or poor adjustment for shared environment among families. Methods for assessing it include twin studies, parent-offspring correlations, sibling or relative relative risk, and percent of variance explained. And the genes that we've identified so far for complex diseases really explain only a tiny fraction of heritability, and that unexplained heritability has been called the dark matter of complex disease genetics. So I think I'll stop at that point, and I believe there's a question there in the back. So thank you. Questions?
I'm just curious about the...
So I think they'd like you to use the microphone, I'm sorry. We wanted to tape this; we're actually not live webcasting it, but we wanted to have it available for posterity, so that when Martha or others ask, can you give a course, we could say, look at our website.
I'm just curious about the height. I mean, you said it's a goldmine, in the genome-wide... why is it such a goldmine? Is there any...
Oh, it's a goldmine because there are like 20 different variants for it now, but each one explains a very, very, very small proportion of the variance. So variants, t-s, and then variance, c-e. Yeah, so it's done very well. Diabetes has been another biggie. Crohn's disease has come up with 15 or 20 or so, but again they don't explain the heritability that has been estimated, and my personal belief is that we've overestimated the heritability; we're not accounting for the shared environment nearly well enough. But that's just my belief.
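The Hardy-Weinberg check described in the talk (expected genotype frequencies of p squared, 2pq, and q squared, with a shortfall of heterozygotes hinting at genotyping error) can be sketched roughly as follows. This is a minimal illustration, not the method used in any particular study: the function name `hwe_chi_square` and the genotype counts are hypothetical, and the test shown is the simple one-degree-of-freedom chi-square goodness-of-fit test, whose 5% critical value is about 3.84.

```python
# Minimal Hardy-Weinberg goodness-of-fit sketch for one biallelic SNP.
# hwe_chi_square and the counts below are hypothetical, for illustration only.

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Return (p, expected_counts, chi2) for observed genotype counts."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    q = 1 - p                         # frequency of allele a
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    # Skip genotype classes with zero expected count (monomorphic SNP).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, expected, chi2

# Hypothetical counts with fewer heterozygotes than expected.
p, expected, chi2 = hwe_chi_square(380, 440, 180)
print(f"allele frequency p = {p:.2f}")
print("expected counts:", [round(e, 1) for e in expected])
print(f"chi-square = {chi2:.2f} (compare with about 3.84 for 1 df at 5%)")
```

With these made-up counts the heterozygote class is below its expectation and the statistic exceeds the critical value, which is the pattern the talk associates with possible genotyping error rather than selection.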