Glad to see everybody turning out at 9:30 in the morning here in downtown DC to have what I think is going to be a pretty interesting day, because from the perspective of many of us, we are witnessing a real avalanche of exciting data coming out of a new approach to trying to understand the genetics of common disease. This is an area that has frankly been pretty frustrating until recently: the power of the methods we were using to identify hereditary factors in diseases like diabetes or coronary artery disease was quite limited, and therefore there was relatively little progress. An unfortunate number of claims based on candidate genes didn't hold up when validation was attempted in follow-up studies. All that is changing, and I think you're going to see, and have already seen over the last month or so, a real deluge of reports from people who have applied this new approach to common diseases. I think as reporters you're going to be challenged to try to figure out exactly how to interpret these: what the significance is, and what the impact is likely to be on the future of medicine. We planned this workshop all of two and a half weeks ago, sort of realizing that this was coming and getting a number of queries from some of you about interest in having more of the scientific background assembled in one place. We thought it would be timely to try to do this, and I really want to thank Larry Thompson and his staff for putting together, in an incredibly short period of time, a program of this sort: finding the space, sending out the invitations, and beating on all of us to come up with our presentations, which I think were complete by about midnight last night, so they're very fresh. 3 a.m. was the last, okay. I'm still working on mine.
I do also want to thank the speakers, many of whom have very busy lives in their jobs at the Genome Institute, for setting aside quite a chunk of time to come and meet with you. We're excited about this, and we hope this will be an opportunity for a very interactive day. Each of the speakers has been asked to present for no more than 20 minutes so that there's plenty of time for discussion and questions. We also want to make ourselves available to you at the breaks and after today: if you're looking for information about how to interpret these kinds of studies, I think all of us would welcome those kinds of inquiries in the future. So I just have a few words of introduction, and then we're going to go into the first presentation from Emily Harris. But just to set the stage a little bit: what's the fuss about, anyway? The sequencing of the human genome, which as you know was completed in April of 2003, was basically focused on a reference DNA sequence, the 99.9% of the genome that we all share. While that gave us an enormous wealth of information about how biology works, it didn't shed a whole lot of light on how variations in that sequence result in individual risks of disease. We know that virtually all diseases have some hereditary contribution; we know that simply by observing that family history is often the strongest risk factor for diseases like cancer or heart disease or diabetes. The reference sequence was a wonderful foundation, but not for that part. So we really had to develop a rich catalog of genetic variation, and develop the tools to survey it in people with diseases, in order to see where the variations lie that play a role in that kind of risk.
Essentially we've been trying to develop the tools to find the ticking time bombs in all of our genomes. We all have them; various estimates have been made about how many, and I don't think we really know, but certainly each of us walks around with dozens. Generally these are common variations which may have no significance unless you get a certain set of them that pushes you over a threshold, especially if the environment comes along as an additional risk factor and nudges you into a disease situation. These ticking time bombs for common diseases like asthma or hypertension, though, each carry a relatively low risk. These are not like Huntington's disease or cystic fibrosis, where the gene mutation is almost deterministic. For common diseases, these are going to increase your risk by maybe 20%, and otherwise you won't know they're there, because you may very well not see the consequences if you don't have a sufficient collection of them. So how to find these ticking time bombs has been the struggle of the field for quite a long time. Most of the variation in the human genome, and you'll hear more about this from Larry Brody, is of this simple sort where a single letter is different, and of course we call these SNPs, or single nucleotide polymorphisms, diagrammed here as a simple C instead of a G in two different versions of the same chromosome. Probably something like 90% of human variation is of this type. The other 10% is more complicated, involving either small or even large insertions and deletions, which we have somewhat more difficulty measuring right now. So most of the studies that you're going to be reading about, and that we'll be talking about today, will be focused on SNPs as the major source of variation in the genome and how it might correlate with disease. And the idea of doing a genome-wide association study is a very simple one, and isn't that nice.
This isn't like linkage, where you have to have LOD scores and all sorts of other complex concepts about how to analyze a family. This is really a very simple idea. You find individuals with the disease and you find individuals who don't have the disease; here colon cancer is the example. Of course you want to be sure that people with the disease really have that disease and not something else that looks like it. And you want to be sure that unaffected people, to the extent that you can, are really unaffected, and not just that you've missed the diagnosis. That's always a problem for an adult-onset disease, because your so-called unaffected people might actually become affected later on and you don't have any way of knowing that. That's going to dilute your power a little bit, but we live with that. Then you're going to want to test all of the SNPs in the genome, if you're going to be thorough about this, to see whether there are ones like SNP B, where there is a skewing in the appearance of the two different spellings, here color coded orange and blue, so that the people with colon cancer have more of the orange spelling than the unaffected do. Most of the SNPs you test are going to be unrelated to the disease, so they're going to look like SNP A, where the proportions of the two different alleles of the SNP, as we call the two different spellings, are the same in the affected and the unaffected.
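The case-versus-control comparison just described can be sketched as a test on allele counts. This is only an illustration: the counts are hypothetical (1,000 cases and 1,000 controls, two alleles per person), the function name is made up for this example, and the chi-square test is one common choice for a 2x2 table, not necessarily the analysis used in any particular study.

```python
from math import erfc, sqrt

def allele_chi_square(case_orange, case_blue, control_orange, control_blue):
    """Chi-square statistic (1 degree of freedom) and p-value for a 2x2 allele-count table.

    Assumes every row and column total is non-zero.
    """
    n = case_orange + case_blue + control_orange + control_blue
    row_cases = case_orange + case_blue
    row_controls = control_orange + control_blue
    col_orange = case_orange + control_orange
    col_blue = case_blue + control_blue
    # Shortcut formula for a 2x2 table: n * (ad - bc)^2 / (product of the margins)
    cross = case_orange * control_blue - case_blue * control_orange
    chi2 = n * cross ** 2 / (row_cases * row_controls * col_orange * col_blue)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical "SNP A": same allele proportions in cases and controls
snp_a = allele_chi_square(1000, 1000, 1000, 1000)
# Hypothetical "SNP B": orange allele modestly more frequent in cases (55% vs 50%)
snp_b = allele_chi_square(1100, 900, 1000, 1000)

print("SNP A: chi2 = %.2f, p = %.4f" % snp_a)
print("SNP B: chi2 = %.2f, p = %.4f" % snp_b)
```

With identical proportions the statistic is zero; with a five-point difference in allele frequency across 2,000 people the skew is already hard to explain by chance for a single SNP tested in isolation.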
Now already this cartoon has committed a variety of serious distortions, particularly by showing you only 10 cases and 10 controls. I think you could determine, without even being particularly mathematically inclined, that you would occasionally see something like SNP B not because SNP B was involved but just by chance. If you're going to test hundreds of thousands of SNPs, as you will see we're going to, then once in a while, just as when you flip a coin 10 times it will sometimes come up nine heads and one tails, you'll see something like SNP B, and you shouldn't be misled by that. So despite Mark Twain's comments about statistics, we need statistics in this particular kind of analysis in order not to leap to conclusions about positives that are really false positives. The consequence of that is you need very large numbers of cases and controls. The other violence that this cartoon does to the concept of a genome-wide association study is that it shows you a rather drastic difference for SNP B between the colon cancer and the unaffected individuals: nine oranges in the colon cancer group and only one in the unaffected. In general that difference is going to be much more subtle, because again you're looking at factors that play only a modest role in disease risk. Instead you might have seen five in the colon cancer group and four in the unaffected; that might be the biggest difference you'd expect to see, which again tells you that you have to have very large numbers in order to assess whether that means anything or whether it's just noise.
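The coin-flip point can be made concrete with a little arithmetic. This is a rough sketch under stated assumptions: 300,000 stands in for "hundreds of thousands" of SNPs, 0.05 is a conventional single-test significance level, and dividing by the number of tests (a Bonferroni-style correction) is one common way to guard against chance hits, not something the talk itself prescribes.

```python
from math import comb

def prob_exact_heads(flips, heads):
    """Probability of exactly `heads` heads in `flips` tosses of a fair coin."""
    return comb(flips, heads) / 2 ** flips

# Chance of nine heads and one tail in ten flips of a fair coin
p_nine_of_ten = prob_exact_heads(10, 9)

# Assumed figures for illustration only
n_snps = 300_000   # "hundreds of thousands" of SNPs tested
alpha = 0.05       # conventional per-test significance level

# SNPs expected to pass the per-test level by chance if none is truly associated
expected_false_positives = n_snps * alpha

# Per-SNP threshold that keeps the overall chance of any false positive near alpha
corrected_threshold = alpha / n_snps

print(f"P(9 heads in 10 flips) = {p_nine_of_ten:.4f}")
print(f"Expected chance hits at p < {alpha}: {expected_false_positives:,.0f}")
print(f"Corrected per-SNP threshold: {corrected_threshold:.2e}")
```

A result that happens about once in a hundred tries for one coin will show up thousands of times across hundreds of thousands of SNPs, which is why the per-SNP bar has to be set so much higher and why so many cases and controls are needed to clear it.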
Now in the past, people wanting to conduct this kind of experiment pretty much had to satisfy themselves with picking some candidate genes, hoping that they picked wisely, finding SNPs in those candidate genes, and then trying to see whether any of them looked like SNP B. That candidate gene strategy is pretty much the entire literature of association studies for common disease for the past many decades, and some of those were very successful. The idea that HLA, for instance, is associated with type 1 diabetes or with a variety of other autoimmune diseases: that's an association study that's held up very well. But a lot of association studies focused on candidate genes have not fared so well. The problem has been that when you're going after a candidate gene, you are essentially committing the same kind of blunder as the guy who lost his keys after a night at the bar, came out realizing they were somewhere on the street, and limited his search to one place, namely under the lamppost, because that's where he could see. Of course the keys, sadly, are often not where you want them to be. So the candidate gene approach for diseases like diabetes or cancer or heart disease, where frankly we don't know enough to know a good candidate gene when we see one, has often come up empty, or has come up with something that was thought to be keys and turned out actually not to be keys. So candidate genes have also had a checkered career of false positives.
Obviously that's not what you'd like to do; you'd like to sample the entire genome. Just five years ago, when people talked about doing this, if you actually went through what you would have to do, it was pretty daunting. In 2002, if you were going to propose getting beyond the candidate gene and looking at the whole genome, you would need to have a catalog of essentially all of the common SNPs in the genome, and there are about 10 million of them. That's a useful number to keep in mind, so 10
million places in this 3 billion base pair genome where there are common variations, where the less common spelling, the less common allele, is present at least 1% of the time. You'd want to test all of those. First you've got to find them, of course, and then you want to test all of them. Again, you want to test a lot of people: a thousand cases and a thousand controls, as we'll see in some of the examples today, is sort of a minimum if you're really serious about finding variations in common disease, because you're not expecting to see big effects. Then you would have to take each of those DNA samples and, using technology you'll hear about, genotype each DNA sample for each of those 10 million SNPs. So we'll do the math: 2,000 samples times 10 million SNPs would be 20 billion lab tests, or genotypes, and in 2002 a genotype cost you about 50 cents even in a very good lab. So it would have cost you 10 billion dollars to do the kind of genome-wide association study that you're now seeing all around us. It was totally out of the question in 2002, just so far out of the question that the idea we're now doing it, just five years later, is pretty astounding.
So what happened? Well, one thing that happened was this project called the HapMap, which many of you wrote stories about when it was published in October 2005. This was an international collaboration involving six countries and more than a thousand scientists, all working together. Basically the plan was to lay out the landscape of genetic variation across the whole genome: first of all to build up that catalog of SNPs, because before HapMap came along we only knew of about 2 million of them, and now we're pretty close to the 10 million that we wanted to see. But the plan was not only to catalog them but to take advantage of something about variation across the genome which turns out to be incredibly useful and time-saving, and that is that SNPs don't travel independently of their neighbors. These SNPs tend to be clumped together in terms of which particular spelling
you're going to find in neighborhoods, so that if you have tested this SNP and there's a SNP next door, you can probably predict which spelling it will also have on that same chromosome. This, as you'll hear more about, is a phenomenon that geneticists have labeled, in their inimitable way, with a term that almost nobody loves: linkage disequilibrium, or LD. What does that mean? Linkage equilibrium would mean that two SNPs that were next to each other were truly random as far as their association with each other, so that knowing the spelling at this one would tell you nothing about the spelling at that one. Linkage disequilibrium means that they're not random; they're actually correlated. That turns out to save you a ton of work, because the neighborhoods over which this linkage disequilibrium operates are actually pretty good sized, on the order of 20,000 or 30,000 base pairs, so something like 30 or 40 SNPs will all be traveling in lockstep together on a chromosome. That means that if you know that, and you know the boundaries of those neighborhoods, which is what HapMap tells us, you can pick two or three SNPs in that neighborhood, and if you test those you can infer what all the others would have been without actually having to measure them. So the two or three you test are essentially proxies for all the rest.
Well, that saves you a lot of work, and you'll see how much in a second here. In 2007, instead of the 10 million SNPs we would have to do if it were not for this phenomenon of linkage disequilibrium, a carefully chosen set of 300,000 SNPs is enough to stand in for all the rest at a pretty high degree of coverage. You're essentially covering 85 to 90% of the genome as if you'd actually done all of that work. That happens to be true for European samples and Asian samples; you need more for African samples, for reasons that will become clear. So that's pretty good: we just
saved a factor of 30. But if you remember our 10 billion dollar cost, that was still not going to make this affordable. You still have to collect your cases and controls, you still have to do the genotypes, and you still have a lot of genotypes. The other thing that happened, and Larry Brody will mention a bit about the technology, is that in part because of HapMap, but mostly because of just really remarkable ingenuity on the part of private sector genotyping platform developers, the cost of a genotype has dropped in five years from 50 cents to about an eighth of a penny. So when you put that all together, what used to be a 10 billion dollar project is now less than a million. That makes it actually quite within reach, especially because if you'd really collected a thousand cases and a thousand controls and done all that clinical work, you would have spent more than a million dollars. That now means that the genotyping part of this is not the cost driver anymore; the clinical work is the cost driver.
Now many groups had already done the clinical work over the course of the last many years, anticipating that a day might eventually come when this kind of approach could be taken, and they are plunging in with wild abandon, from studies on diabetes, to the Framingham study, which is doing this kind of thing, to studies on schizophrenia, on autism, and a long list of others that are in the works. Essentially, hopefully without being too grandiose, what this is doing is allowing you no longer to be relegated to looking under that single lamppost but to light up the whole street by having this kind of technology now in hand. You don't have to know the answer and you don't have to guess the right candidate gene; you can now systematically and comprehensively survey the genome and find what's there. And so if you have enough cases and controls, and you've done the matching properly, which is important because you don't want to be finding things that have nothing to do with the disease and everything to
do with the fact that your cases and controls were actually different in terms of their ancestral origins, and you apply the technology appropriately and get quality data, and you do your math right, you should be able to find variations that are associated with common disease. And that is in fact what is happening. Again, what is being found are variations that have relatively modest effects on disease risk, maybe increasing the risk in somebody who carries the risk spelling by 20%, not multiplying it by 10. But each one of those then points you towards a pathway you didn't know about. I think one of the things that's coming out of this early phase is that most of the genes people are discovering for common disease are genes that they never would have guessed had anything to do with that particular illness; they would not have been on anybody's short list of candidate genes. So that's a brief introduction to the topic, with much more detail to be filled in during the course of the day.
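The cost arithmetic quoted in the talk can be reproduced in a few lines. All figures are the speaker's round numbers (1,000 cases and 1,000 controls, 10 million common SNPs, 50 cents per genotype in 2002, 300,000 chosen SNPs and about an eighth of a penny per genotype in 2007); the variable names are only illustrative.

```python
# Study size quoted in the talk: a thousand cases and a thousand controls
samples = 1_000 + 1_000

# 2002: genotype every common SNP directly
common_snps = 10_000_000
cost_per_genotype_2002 = 0.50                  # dollars
genotypes_2002 = samples * common_snps         # 20 billion genotypes
cost_2002 = genotypes_2002 * cost_per_genotype_2002

# 2007: a carefully chosen set of SNPs stands in for the rest via linkage disequilibrium
chosen_snps = 300_000
cost_per_genotype_2007 = 0.01 / 8              # an eighth of a penny, in dollars
genotypes_2007 = samples * chosen_snps         # 600 million genotypes
cost_2007 = genotypes_2007 * cost_per_genotype_2007

# Savings from needing fewer SNPs (the "factor of 30") and from cheaper genotyping
snp_reduction = common_snps / chosen_snps
price_reduction = cost_per_genotype_2002 / cost_per_genotype_2007

print(f"2002: {genotypes_2002:,} genotypes, ${cost_2002:,.0f}")
print(f"2007: {genotypes_2007:,} genotypes, ${cost_2007:,.0f}")
print(f"Fewer SNPs: about {snp_reduction:.0f}x; cheaper genotypes: about {price_reduction:.0f}x")
```

On these round numbers the genotyping bill falls from 10 billion dollars to about 750,000 dollars, consistent with the "less than a million" figure, with the drop in price per genotype contributing far more than the reduction in the number of SNPs.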