Essentially, this course is what I wanted to accomplish during my sabbatical, and some of my introduction here will, I hope, put into perspective where the epidemiologist stands in a field that is moving as fast as genetics. It's kind of like the two ways to interact with a steamroller: you can either be under it or driving it. Certainly prior to this, I felt like I was under it. So let me just give you some perspective. In terms of conflicts of interest, neither I nor Dr. Manolio has any interest in any of the genomics companies, et cetera [unclear]. So let me put this into perspective. Some of you have been editors of journals, some of you have been on study sections, and some of you have been in the audience at meetings, and this is what happens, okay? This is a genome-wide association study of Crohn disease. It looked at 300,000 SNPs in 540 patients with Crohn disease and 928 controls. This is an epidemiology study, okay, a classic case-control study. The analysis confirmed two genes that had previously been seen, one on chromosome 1 and one, CARD15, on chromosome 16. And then there was this other region on chromosome 5 with multiple SNPs associated. This was replicated in another study with 1,266 cases, 559 controls, and 428 trios. The newly associated SNPs were located in a gene desert, 270 kb proximal to a prostaglandin receptor gene whose expression looks like it was controlled by this region. And there you are. Are you comfortable yet? Certainly as I sat on or chaired study sections, there were a whole lot of words up there, and a whole lot of technology that got them to those words, that I didn't have a clue about. The last time I took a genetics course was 1971, okay? And I think our geneticist friends would probably suggest there's been a little bit of activity in genetics since 1971.
So this is where I'm coming from: I'm a classically trained cardiovascular geneticist, an epidemiologist, sorry. And so with this new genetic epidemiology, population genomics, what we want for this course is to answer some of the questions you might have about what all of this means and what the strengths and weaknesses of this kind of research are. So who's the target audience? "Bummer of a birthmark, Hal." Well, the target audience really is investigators, epidemiologists or population-based investigators, who would benefit from being familiarized with the developments in the theory and methods of human genetics and genomics. In your handout, you'll see the learning objectives for this course. There will be eight lectures, and there will be a number of case studies based on some of the papers in the literature. As I'll mention in a moment, hopefully we'll have plenty of time for discussion. This is being webcast by the NHGRI, and we want to thank Larry and Maggie for joining us for that, so that it will be archived if there are other needs to go back and look at the material. Let me just say that, as a teaching tool and also a research tool, one of the things we've relied on as a repository is something relatively new over the past few months. This is the NHGRI catalog of genome-wide association studies. This is an activity by several of our colleagues in Dr. Manolio's Office of Population Genomics. It basically takes all publications reporting genome-wide association studies, beginning in March 2005, in which the platform used at least 100,000 single nucleotide polymorphisms, a density of that magnitude. These came from your typical search of the literature for these studies, and you can see the variety of types of data that are all cataloged there. So we've put the website there for you, and this is a real asset if you want to go and really look. Now, it's always dangerous to say how many there are in there.
I will tell you, as of September 30th, there were 109 studies, if you count all the Framingham and a couple of others where there were multiple studies in one. But by now it's probably closer to 150 or so, as this field moves along very rapidly. So what I want to do in this first lecture is to provide an overview of the course, which I've just done. I'd also like to review the structure and function of the human genome, just so that we can get some of those words and terms right; review the patterns of inheritance, which are obviously going to be thrown around and will be models for some of our epidemiologic investigations of genetic causes of disease; talk about some types of genetic variation, or mutations, which we would consider as potential causes of disease; and then at the end identify some online informatics resources that describe the genotype and phenotype of human genetic variants. A couple of them, I think, should certainly be on the favorites button on your computer. So let's start by talking a little bit about basic genetics. Genetics is the science of inheritance; it's about 100 years old. The genetic code, of course, begins with Watson and Crick describing the double helix, and that is a little over 50 years old. And the field of genomics, as coined by Roderick, is perhaps a little over 20 years old; as he defined it, it is the field concerned with the structure and function of the entire DNA sequence in a population or individual. You can see some of the variations on the definition, but the idea is that genetics is the science of inheritance, and genomics is the subfield that looks at the whole genome. Now what we're really involved with here is taking an immensely complex biological system and trying to home in on those variations in the system that are important in causing disease. So of course we have 46 chromosomes.
We believe that there are 20,000 to 25,000 genes in there, functional genes in the sense of producing a gene product or a protein. There are about three billion base pairs, that's haploid, six billion if you count each base. And of these, 99.9% are the same between all of us in this room. However, if you've got three billion base pairs, 0.1%, one in a thousand, still leaves a lot of variation, on the order of three million positions. And so that's where the money is in terms of identifying those variants that may be related to disease. So we really have, according to this figure: "Simmons, your department has lost another number two needle, and I want you to find it." So there's the haystack, and it looks like it's a number two needle. And so this is essentially the challenge: finding, within the entire genome, those variations which are important for human health. Well, here's the haystack. We had a little conference this afternoon and we decided that there are about six feet of DNA in every cell. So can you imagine how many cells you have? Each of you has enough DNA to go from here to Bethesda, or maybe further. And so here it is, all of this DNA pulled out from the chromatin, and you can see literally feet and feet and feet of this genetic material, which is made up, of course, of a chain in which the two strands are hooked together by these base pairs: adenine, thymine, guanine, and cytosine, which are hooked onto a deoxyribose sugar with a phosphate on it, and then these are linked together. And when they say the three prime and the five prime end, that is to orient you as to which way the DNA is going for purposes of, say, transcription, etc. The five prime end here is where the phosphate is, and the three prime end here is where this hydroxyl group is.
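To make the five prime and three prime orientation concrete, here is a minimal Python sketch. It assumes the standard Watson-Crick pairing rules (A with T, G with C), which the lecture does not spell out, and the sequence `top` is a hypothetical toy example rather than anything from the slides. It shows that the partner strand runs in the opposite direction, so writing it 5' to 3' means reversing as well as complementing.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand_5_to_3: str) -> str:
    """Return the partner strand, also written in the 5' -> 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand_5_to_3.upper()))


top = "ATGGCC"  # hypothetical toy sequence, read 5' -> 3'
bottom = reverse_complement(top)

print(f"5'-{top}-3'")
print(f"3'-{bottom[::-1]}-5'")  # partner strand drawn antiparallel, base under base
print(f"5'-{bottom}-3'")        # the same partner strand, written 5' -> 3'
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check on the pairing table.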
And here you can see the deoxyribose hooked together with phosphates, and these form a chain; then these bases come off and link by these hydrogen bonds to another chain, and the whole thing wraps up in the double helix that Watson and Crick described. The importance of these hydrogen bonds, obviously, is that they are relatively weak bonds which can separate and go back together, and that's important because that's how DNA replicates. So what we're talking about is little variations, literally within a single base pair unit, affecting the structure of what they encode, and that in fact going on to cause disease. What you'll see in the literature, certainly in the genome-wide association study literature, is a variety of terms: exons, introns, and a whole bunch of regulatory elements, promoters, the poly-A tail, enhancers, silencers, and the control regions. And the point is that each of these has a meaning. Exons are the parts which actually encode, usually in several different segments, the actual parts of the gene product or protein. Introns are between these exons, kind of spacers, which are then spliced out during the processing of the genetic material. And then there are a variety of regulatory elements which turn the gene on and off. So what we have here is this promoter, which is a place that can be acted upon to start transcription, which occurs here, and then these blue bars are the places within the gene which are actually coding for the protein. In between are these regions which are transcribed but, as you'll see later, then spliced out. Then you have a codon which says stop, and this poly-A signal here at the end which stops transcription, with this whole region back here. And these are a variety of genes. So here is the breast cancer gene, beta-globin.
And you can see that the structure of these is really quite variable, and that is part of the truly elegant variation in how genes are structured. The story does not get any simpler with transcription, where you have your gene transcribed. There's an initiation sequence, a promoter sequence, where transcription starts and RNA is formed. The RNA then goes through a process where these intron areas are spliced out, leaving your RNA ready for translation out in the cytosol, where the ribosomes make the protein products. So the point is that you have a lot of places to start and to finish, and the structure and the splicing are a lot more than just a little code that is read out in some way; there are a lot of other elements that could be varying and that could alter how that gene product is produced. We're going to talk about this later with, for example, the GWAS that I showed you on Crohn's disease: there's a variant out here in the gene desert. Well, there don't seem to be any known exons out there that encode anything. What is happening with this polymorphism? Again, you can see that many of these regulatory elements, et cetera, may be where that variant is acting, and it's just important to understand that some of these other things act there and could be targets for variation. So the genetic code, attributable to Watson and Crick, is those sequences of the four DNA bases that are translated in triplets. What you have is 64 triplet codons, each of which encodes an amino acid, with the exception of some stop codons. So this is the genetic code: the 64 variations of a triplet, each position of which could be one of four bases. There are only 20 amino acids that make up proteins, which are encoded by this. So obviously, with 64 codons and 20 amino acids, more than one codon will encode some of the amino acids. And three of the codons are stop codons, which basically terminate translation of the RNA.
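The counting argument above (64 triplets, 20 amino acids, three stop codons) can be checked with a short Python sketch. The codon table is the standard genetic code written in the compact TCAG ordering; the variable names and the toy sequence passed to `translate` are hypothetical and only for illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every possible triplet (4 * 4 * 4 = 64) to its amino acid or stop signal.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
CODONS_PER_AMINO_ACID = Counter(aa for aa in AMINO_ACIDS if aa != "*")


def translate(coding_dna: str) -> str:
    """Translate a coding-strand DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        aa = CODON_TABLE[coding_dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(len(CODON_TABLE), "codons")
print(len(CODONS_PER_AMINO_ACID), "amino acids")
print("stop codons:", STOP_CODONS)
print("codons for leucine (L):", CODONS_PER_AMINO_ACID["L"])
print(translate("ATGGCCTAA"))  # toy sequence: Met, Ala, then stop
```

Because 61 coding triplets map onto 20 amino acids, most amino acids have several codons, which is what makes silent substitutions possible.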
Now there are a variety of opportunities to have variants in there. One could argue that some of these variants are deleterious and may be called mutations, but this can occur at the level of the whole genome, at the level of the chromosome, and at the level of the gene. You can see examples here: obviously trisomy 21 is actually having three copies of chromosome 21, so that is chromosome missegregation. Chromosome rearrangement frequently occurs in cancer cells. And we're going to talk a lot today about single nucleotide polymorphisms, which are single base pair changes, but there are a lot of others that both I and Terry, in the next lecture, are going to talk about in terms of variation. So the whole thing becomes extremely elegant but also extremely complicated, and I think our job really is just to give you some notion of the level of variability we have here. Now when we talk about single nucleotide substitutions, there are a variety of terms you'll also see in descriptions of study results. Synonymous or silent mutations are ones where you change a base, but the new codon still codes for the same amino acid. So even though there is variability, there's no change in the amino acid produced and really no effect on the amino acid sequence. It is a variant, but it doesn't do anything. Then there's the missense or non-synonymous mutation, in which that one base pair in fact changes the code for the amino acid, and that amino acid sequence change can then of course have effects on protein structure and function, etc. The nonsense mutation creates a termination codon, so the protein stops early. And the termination mutation destroys the termination codon, so reading continues past the normal stop, and it may affect not only the gene being encoded but perhaps also what lies beyond it.
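The four substitution categories just described can be sketched as a small classifier over codons. This is an illustration under stated assumptions, not a variant-annotation tool: it uses the standard genetic code, the function name `classify_substitution` and the label "stop-loss" (for what the lecture calls a termination mutation) are choices made here, and the example codons are picked only to exercise each branch.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop stays stop): silent
    if alt_aa == "*":
        return "nonsense"     # an amino acid codon becomes a stop codon
    if ref_aa == "*":
        return "stop-loss"    # the stop codon is destroyed (termination mutation)
    return "missense"         # one amino acid is replaced by another


examples = [("GAA", "GAG"), ("GAG", "GTG"), ("TAC", "TAA"), ("TAA", "CAA")]
for ref, alt in examples:
    print(ref, "->", alt, ":", classify_substitution(ref, alt))
```

The second example, GAG to GTG (glutamate to valine), is the classic sickle cell substitution in beta-globin, included as a familiar missense case.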
So these are some terms you'll hear, and when they say "we believe that the defect here is a missense mutation" in such and such, you'll understand what that means. So obviously this is the genetic code. This would encode a series of amino acids which would be translated into protein. You can see where you could make a single substitution, so your SNPs, single nucleotide polymorphisms, relate to this. You can also have deletions or insertions, which of course would cause a frameshift: since these are read in triplets, all of the downstream triplets will be different, and the amino acids translated here will be totally different from the amino acids translated here, perhaps causing a dysfunctional protein. So when they're talking about deletions and insertions, that's what they're talking about. Now Terry's going to talk more about these indels, the deletions and insertions. Some of them involve only a short segment of DNA and perhaps only two alleles, where the deletion or insertion either occurs or doesn't. Some of them are much more complex. One term would be the short tandem repeat polymorphisms, where you have a short nucleotide unit repeated quite a number of times, and this can give rise to a number of alleles. Variable number tandem repeats are larger nucleotide repeat units which again can be repeated hundreds of times. And Terry's going to give you some practical examples of something that we in cardiovascular epidemiology, for example, study: lipoprotein(a). Something that I think is perhaps the next area of particular interest and concentration is the copy number variants. These are larger segments of DNA which are duplicated and have a few alleles, and the important part here is that some of the newer genotyping technologies are going to be able to start measuring these so they can be included in genome-wide association studies.
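The frameshift point made earlier in this passage, that a one-base deletion changes every downstream triplet while a substitution changes at most one, can be shown with a short Python sketch. The sequence `original` and the helper names are hypothetical toy choices, and the codon table is the standard genetic code.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna: str) -> str:
    """Read complete triplets from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        aa = CODON_TABLE[coding_dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def delete(seq: str, start: int, length: int = 1) -> str:
    """Remove `length` bases beginning at 0-based position `start`."""
    return seq[:start] + seq[start + length:]


original = "ATGGCCGAAGTGTAA"                    # toy gene: ATG GCC GAA GTG TAA
substituted = original[:3] + "A" + original[4:]  # single-base substitution, G -> A
one_base_deleted = delete(original, 3)           # frameshift
three_bases_deleted = delete(original, 3, 3)     # in-frame deletion of one codon

for label, seq in [
    ("original", original),
    ("substitution", substituted),
    ("1-base deletion", one_base_deleted),
    ("3-base deletion", three_bases_deleted),
]:
    print(f"{label:16s} {seq:16s} {translate(seq)}")
```

In this toy case the substitution alters one amino acid, the three-base deletion removes one amino acid and leaves the rest intact, and the one-base deletion rewrites everything after the deletion point, including the loss of the original stop codon.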
So currently we're doing a lot with SNPs in particular, and I think this is one of the next areas that we're going to be able to measure. I want to talk briefly about inheritance. This is something that you should have had in your basic biology or human genetics course, but it is important to go over the terms again as we go forward. Mendel's principles of inheritance: the first, obviously, is segregation, which is that a pair of alleles, a pair of genes for any particular trait or variants encoding the trait, separate, and only one allele passes from each parent to the offspring, and which allele passes is random. That goes back to Mendel's experiments with his peas. The second is that traits encoded by different pairs of alleles are inherited independently of each other unless they're genetically linked. We're going to talk about that, because the concept of linkage is an important part of our ability to measure quite a bit of the human genome without measuring each and every base pair. We're going to talk a little later about OMIM, the catalog of Mendelian traits, Online Mendelian Inheritance in Man. At least as of last summer, this was the number of autosomal genes listed in that catalog; there are X-linked genes, a few Y-linked genes, and mitochondrial genes. We'll talk about the inheritance patterns of each of these, but you can see that most of them are autosomal and X-linked, and each of these has its own inheritance pattern.
So what we've classically been talking about, say up to, what would you say, the late 1990s, perhaps maybe earlier than that, would be more of a Mendelian disease pattern: conditions which are caused almost entirely by a single major gene. So when we talk about a Mendelian disorder, it's one of those major gene disorders, and they express themselves as manifesting in only one or two of the three possible genotype groups. When we test for Mendelian inheritance, the diseases are being inherited in this way, in the predominant forms, and we all know our favorite Mendelian diseases: familial hypercholesterolemia, Marfan syndrome, a variety of the classically described diseases with dominant and recessive patterns. More current, in terms of the questions of the day, is this common disease, common variant idea, in which common diseases such as diabetes, cardiovascular disease, the cancers, etc. are not caused by a single major gene but are rather attributable to a limited number of variations which occur in, say, 1% or perhaps 5% or more of the population. So you have multiple forms which are not rare variants but relatively frequent variants, which when put all together would explain the genetic basis for some of our most common afflictions, and I think this is where the genome-wide association studies and a variety of our current activities have focused for the last 20 years or so. I think you're all familiar with autosomal dominant inheritance, and this is what it looks like. About 50% of the offspring are affected. There can be male-to-male transmission. A person who's phenotypically normal does not transmit it to their offspring, since the gene should be fully penetrant, and men and women are affected equally. So this would be the autosomal dominant pattern, and again I'm sure you all have the diseases in your own area of specialty or interest.
There can be complete and incomplete dominance, just a couple of terms that are used. Complete dominance is where the phenotypes are indistinguishable whether the person is heterozygous or homozygous, whether one or two copies of the gene are present. An example of that would be Huntington's disease: whether you have one or two copies doesn't matter. You are affected by this debilitating late-onset neurologic disease. Incomplete dominance is where the phenotype is more severe in the homozygous than in the heterozygous state. I run a lipid clinic, and we see a lot of patients with familial hypercholesterolemia; those are almost all heterozygotes, with cholesterol in the 300 to 500 range. The homozygotes we don't usually see, because they show up as children with very high cholesterol, 1,000, 1,500. They're very rare, and they also do poorly unless they're treated with liver transplantation or other heroic means. This is an example of incomplete dominance, where the homozygotes are even more severely affected than the heterozygotes, although every heterozygote is affected as well. A couple of other terms. Autosomal recessive, of course, are the genes that will frequently be seen in inbred populations. I did my training at Johns Hopkins, and Victor McKusick did many studies in the Amish community there. You'll still see in the literature a variety of other founder populations studied, because this is where some level of consanguinity will have occurred, and this allows the recessive trait to appear. So if it appears in one family member, it's most likely to appear in a sibship, because obviously there are two carriers up here, each heterozygous for the recessive gene. Males and females are affected equally. The parents of the affected children are asymptomatic. Again, there is increased consanguinity, and this is why some of these inbred populations are particularly involved with these disorders.
And so of the offspring of two heterozygous parents, such as here, 25 percent are likely to be affected, 50 percent would be carriers, and 25 percent would be non-carriers, just by chance. So that's your autosomal recessive pattern. With X-linked inheritance, and again this is X-linked dominant, both male and female offspring of females carrying the gene have a 50 percent risk of inheriting it. So for this affected woman, about half of her children would inherit the phenotype. There is no male-to-male transmission, because obviously males get their X chromosome from their mother. The number of women affected is much greater than the number of men, because obviously women have two X chromosomes, and daughters have to get an X chromosome from their father, which means all daughters of an affected man would be affected. And there can be some variation in severity in women, because of course that X chromosome can be inactivated. That's the Lyon hypothesis, lyonization: inactivation of one of the two X chromosomes occurs. Vitamin D-resistant rickets is an example of this. Thank you. X-linked recessive is where only males are affected. So here we have the carriers identified. You have an affected male here, and all of his daughters would be carriers. Then from this carrier woman, and the cases are related through the carrier females, you would have the male cases, with the one X chromosome that they have coming from this particular parent. Hemophilia A or red-green color blindness would be an example of that.
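The 25/50/25 split for two carrier parents, and the X-linked recessive pattern, both follow from Mendel's principle of segregation: one allele drawn at random from each parent. A minimal Python sketch of that enumeration follows. The allele labels ("A", "a", "XH", "Xh", "Y") and the function name `cross` are hypothetical conventions chosen here, and full penetrance is assumed.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(mother, father):
    """Return {genotype: probability} for one child of the two parents.

    Each parent is a pair of alleles; one allele is drawn at random from
    each parent, and every combination is equally likely.
    """
    counts = Counter(tuple(sorted(pair)) for pair in product(mother, father))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Autosomal recessive: two heterozygous carriers ("a" is the recessive allele).
recessive = cross(("A", "a"), ("A", "a"))

# X-linked recessive: carrier mother and unaffected father ("Xh" carries the variant).
x_linked = cross(("XH", "Xh"), ("XH", "Y"))

# X-linked recessive: unaffected non-carrier mother and affected father.
affected_father = cross(("XH", "XH"), ("Xh", "Y"))

for label, result in [
    ("two autosomal carriers", recessive),
    ("X-linked, carrier mother", x_linked),
    ("X-linked, affected father", affected_father),
]:
    print(label)
    for genotype, prob in sorted(result.items()):
        print("   ", "/".join(genotype), prob)
```

The third cross reproduces the statement above that all daughters of an affected man are carriers while none of his sons inherit the variant, since sons receive his Y chromosome.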
Y-linked inheritance, again, is relatively less common. Only males are affected: obviously all sons, but no daughters, of affected men are affected. So as to rule out other kinds of inheritance: it's not X-linked, since male-to-male transmission occurs, and it's sex limited, because it cannot pass through unaffected females. And then the final kind of inheritance, and something we're not going to have a lot of time to talk about, is inheriting another kind of genetic material, particularly from your mother, and that is mitochondrial DNA. This has been of great interest in the tracking of ancestry, particularly through the maternal line, the origins of racial groups, etc. This inheritance is matrilineal, so even though this individual would be affected through mitochondrial DNA from his mother, he does not pass it on. The mother passes it on to all of her children, both male and female. So having said that, one might say that the rules of inheritance look quite tidy. In fact, it's not really so. We have sporadic mutations: you can see no evidence of there being any genetic issues in the family, and all of a sudden you'll have a case, and so the suggestion would be that this is a sporadic mutation. There is heterogeneity: as you'll see with many of these disorders, there will be several gene defects which can be causative, and each of them can differ somewhat in severity. In familial hypercholesterolemia, for example, there are literally dozens of forms, probably six or eight mechanisms, in which the individual has the same phenotype, so within that phenotype there actually is some variability in the disease. There can be non-penetrance, where a disorder simply isn't expressed. Familial combined hyperlipidemia is a good example, quite a common lipid disorder: if the person gains weight, or has diet or exercise issues, they are much more likely to express that gene than
otherwise. You can also have late-onset conditions, like Huntington's or Alzheimer's disease, which may have a strong genetic component but may not look genetic because of deaths, etc., prior to the age of onset. And you can have sex-limited and sex-influenced phenotypes, in which you'll have differences in the phenotype by sex. So the point here is that autosomal dominant, X-linked recessive, etc. are obviously general rules of inheritance, but there are these other things that affect what's happening. So just to comment on this common variant, common disease hypothesis. I've just shown you the various things that could be going on genetically, with single nucleotide polymorphisms in exons, in regulatory elements, in several genes, in several exons. The important part of this, though, is that their prevalence is common enough to start to explain a common disease. And then on top of that, of course, all of these could be interacting, and I think Terry's going to talk a little about gene-gene interaction, which is a whole other kettle of fish. The other problem is that all of these could obviously be interacting with environmental exposures. So you end up with a disease model which is very complicated, and I think this should be part of the bailiwick of epidemiologists, who are generally used to teasing out these issues. All we have now is a lot more complicated data in which to tease out some of these complex associations. Just to give you an idea, this is a genome-wide association study of lupus, systemic lupus erythematosus: 720 women with lupus and 2,300-plus controls, with over 300,000 single nucleotide polymorphisms assessed across the genome. Then, really almost by requirement, there were two replication studies, which in total constituted 1,846 female cases and 1,825 female controls. This was, as I said, a study entirely of women. And what they found was at least 17 single
nucleotide polymorphisms which were associated at the p less than 2 times 10 to the minus seventh level of probability. Each of these would have a gene frequency of 5% to, say, 50%. And when you put them all together in a logistic regression model, looking to see how many of these really are independent of the others, since maybe one is just the neighbor and they're all just co-associated, you come up with these six models which have independent relationships, either susceptibility or protective, all significant at somewhere between the 10 to the minus seventh and 10 to the minus 18th level. The C statistic is 0.67, and they apparently would account for about 15% of heritability. So what you're now talking about with lupus, and those of you who are clinicians who have taken care of lupus patients know that many times you'll see a family history, you get the idea, and we'll talk about the heritability of lupus a little bit later, is really quite a number of gene variants which look like they play a role in this disease and act independently of each other. So many of these genetic models, at least to my eye, are very complicated and are obviously a challenge to ferret out. In the last couple of minutes, what I want to do is just to say that that was kind of a semester-long genetics course in about 30 minutes. One of the important things about that basic genetics is that our informatics resources have taken all of that complexity and made it accessible online. I think I've used the wrong colors here, but basically these are now all available through the National Center for Biotechnology Information of the National Library of Medicine. Terry and I took a day-long course on bioinformatics. I think we were the geriatric set in that course; everybody was more facile with computers than we were, by a long shot. But the NCBI has a variety of
resources, and OMIM in particular, and perhaps GenBank, are something you can very readily go to and look up the polymorphisms, etc., that were maybe in a paper, or maybe identified in one of your studies, or maybe something that you would like to have measured in one of your studies in which you've had the wherewithal to collect DNA and informed consent, etc. So we just wanted to list those, and we're going to talk a little bit about each one of them. OMIM is an outgrowth of Victor McKusick's catalogs of Mendelian Inheritance in Man. It was always a lot of fun to go to clinic with Dr. McKusick. All of his patients were average height; unfortunately, they were either four feet tall or seven feet tall, because his two areas of study were achondroplastic dwarfs and Marfan syndrome. So on average everybody was about five and a half feet tall, but nobody was in between. So OMIM is a catalog of human genes and genetic disorders, and I think Terry and I would agree that if you want to know one online resource, this is the one. It's got concise information on most of the human conditions which have a known genetic basis, pictures, full citations, etc. Other materials are available in GenBank, and some of these, at least for my simplistic computer skills, are truly amazing: collections of all the publicly available DNA and protein sequences, not only in humans but down to the roundworm, etc. Again, a huge number of records, with a description of the sequence and the references, and this is all generated by the submitter. It's kind of like a Wikipedia kind of activity, where each of these is put in by a submitter. This is a little bit different from RefSeq, in which each molecule in the sequence is described. Again, RefSeq is non-redundant, whereas the other one is redundant. It's linked to the nucleotide and protein sequences in GenBank, and it's updated by NCBI staff. dbSNP identifies the 6.2 million currently known single
nucleotide polymorphisms, and this is obviously changing every day. We believe there are up to 10 million of them, and over 200,000 of them are within genes. So these are related, and that's another resource for you. So finally: from 1900 to the present, human genetics and Mendelian disorders, still current. Some of the things I talked about are obviously still relevant to the practice of clinical genetics, etc. From 1953, obviously, molecular genetics, structure and function, building on what we've been talking about. From 1980 to the present would really be the description of gene variants and candidate gene studies, including in the epidemiologic literature. And from about 2003 to the present, sequencing of the entire human genome, and now more recently the whole human genome. They all build on each other; they all build on the structure and function of the human gene, and at some point, certainly in the plausibility parts of your discussions, you're going to need to talk about it. But in the end, here's the problem: you're drinking from the fire hose. A concept that has come up in genetics quite a bit is too much data, and obviously one of the things we need to tell you about is how to measure some of these things so you can boil the data down to something that we can all study in our study populations.
Okay, I think we have time for questions, which is what this says. Actually, Phil, why don't you ask your good question again?
So what's thought to be the purpose of introns?
I guess this is a religious question. Nobody quite knows. It used to be thought they were just sort of junk DNA that's in there and just has to be spliced out. It probably has to do with the stability of the messenger RNA and how it moves from the nucleus, out through the nuclear membrane, and into the cytoplasm to be translated, because it seems as though changes in the introns actually are not silent and they do
affect both the way that the exons splice and the speed with which transcription and translation occur. So it probably has to do with folding of the transcript, essentially, but I don't know, and I don't know that there are many who know exactly what it is.
I think it is interesting to look at the intron structure of some of the known genes. It's really quite variable. If you looked at that BRCA gene, the exons were very small and there's a whole bunch of them, and there are a lot of introns. You just wonder if it also has something to do with evolutionary flexibility, in other words the ability to have variation, because you can then mix and match all these little bits of protein rather than relying on just one big chunk. But I don't know the answer either.
And clearly the introns also allow for alternative splicing, so that you can take out different pieces of the exons. So maybe you'd have the first five exons, and then you might skip six or seven and go to eight, and the way that the cell knows to do that probably has a lot to do with the molecular structure of that molecule, which the introns help to define.
Having read all these GWASs, it was almost like, in the beginning, there was kind of an apology if someone found something in an intron, as if this can't be real. By the time two years went on, it didn't seem like they were saying that anymore. They'd seen this so often that it was really something that was very important in terms of variation. Another question?
Do the SNPs in current genome-wide studies come from the coding regions or from the introns and other non-coding regions?
It actually varies. The answer, I think, is that they come from both. I haven't seen a study which has taken, say, literally hundreds of SNPs and
classified them; maybe you know about that. But there are substantial numbers.
Actually, dbSNP has done that, and there are, as you say, substantial numbers in exons, but there are also substantial numbers in introns and non-coding regions as well. With the early platforms, everybody believed, as I think you're suggesting, that the only thing that could possibly be important is the exons, so let's just focus on coding SNPs, cSNPs as they were often called. And there actually were platforms developed that were just coding SNPs and nothing else, 10,000 or 20,000 of the coding SNPs. The current platforms are developed in a different way, and we'll talk about it a little bit in a minute, in order to capture the breadth of variation regardless of where it is in the genome, basically based on how many other SNPs are associated with the particular SNP you're measuring. So what the HapMap did was to define those patterns of association among the SNPs and ask the question: if you only had to measure one, which would be the best one, the one that would describe maybe 10, 15, 20 other SNPs? And so those can be in coding regions or non-coding regions; they're anywhere in the genome, depending on the chemistry of the particular platform. Some of them are in places where endonucleases actually chop the DNA; the Affymetrix platform is built that way. Others are just assays that are developed for the best tagging SNPs, essentially. But if you look at one of the SNP platforms, a small minority of the SNPs on that platform will be in exons, and of all those listed in dbSNP, one of the smallest groups is the exons.
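The idea of choosing the one SNP that best describes its neighbors can be sketched in Python. This is an illustration only, not the HapMap procedure or any platform's actual design: the haplotypes are hypothetical toy data, the pairwise measure used is the squared correlation r² between two biallelic SNPs (a common measure of linkage disequilibrium), and the 0.8 threshold is an assumption introduced here.

```python
# Hypothetical toy data: 8 chromosomes (rows) typed at 4 SNPs (columns), alleles 0/1.
HAPLOTYPES = ["1111", "1110", "1111", "1100", "0001", "0000", "0001", "0000"]
SNPS = [[int(h[i]) for h in HAPLOTYPES] for i in range(len(HAPLOTYPES[0]))]


def r_squared(a, b):
    """Squared correlation between two biallelic SNPs across haplotypes."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no information about another SNP
    d = p_ab - p_a * p_b  # the disequilibrium coefficient D
    return d * d / denom


def best_tag(snps, threshold=0.8):
    """Return (index, tagged) for the SNP that tags the most other SNPs."""
    def tagged_by(i):
        return [j for j in range(len(snps))
                if j != i and r_squared(snps[i], snps[j]) >= threshold]
    best = max(range(len(snps)), key=lambda i: len(tagged_by(i)))
    return best, tagged_by(best)


index, tagged = best_tag(SNPS)
print("best tag SNP:", index, "tags:", tagged)
```

In this toy data the first two SNPs are perfectly correlated, so typing either one captures the other, while the fourth SNP is uncorrelated with the rest and would need to be typed separately.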