Good afternoon, ladies and gentlemen, and welcome to the Nature Press Briefing. My name is Faye and I'll be your coordinator for today's conference. For the duration of the call you'll be on listen-only; however, at the end of the call you will have the opportunity to ask questions. If at any time you need assistance, please press star zero on the telephone keypad and you'll be connected to an operator. I'm now handing you over to Ruth Francis to begin today's call. Thank you.

Hi everybody, and welcome to the Nature and Science press briefing concerning two papers: firstly, "A map of human genome variation from population-scale sequencing", which will be published by Nature this week, and a second paper, "Diversity of human copy number variation and multicopy genes", published by Science. We've also got on the phone today the SciPak director, Cassie Wren. Before we begin, can I remind you all that the papers and this press briefing are subject to our usual embargo of 1800 London time, 1300 US Eastern time, tomorrow, Wednesday the 27th of October.

First of all we're going to hear from three of the authors of the papers. We've got first Dr Richard Durbin of the Wellcome Trust Sanger Institute, then Evan Eichler, Professor at the University of Washington, and we're also going to hear from Professor David Altshuler of the Broad Institute of MIT and Harvard. We also, for the purposes of the Q&A session, have Dr Lisa Brooks from the National Human Genome Research Institute. So we're going to start off with some comments from the three speakers and then we'll go to questions. I'm handing over to Richard now.

Thank you, and hello everybody.
So 10 years ago the draft human genome reference sequence was published, but we know that individual genomes differ, and a main focus of human genetics is to identify which of these differences, or genetic variants, contribute towards our tendencies to disease. In the last 10 years DNA sequencing technology has advanced dramatically, so it has become feasible to systematically sequence many people to find genetic variants and build a catalog which we can use as a basis for investigations into disease genetics and into which variants may be functional. So a few years ago an international consortium, the Thousand Genomes Project, was founded to carry out this plan to produce a catalog of genetic variation, and in this week's issue of Nature we are publishing the results of the initial pilot phase of the Thousand Genomes Project.

Already, just in the pilot phase, we've identified over 15 million genetic differences by looking at 179 people. Over half of those differences haven't been seen before, and so these have already provided a more complete catalog of variation than was available previously. An example is that if you look at one person's genome, amongst the three million variants which that individual will have, over 95 percent of them would be present in our catalog.

Just as important as discovering the variation, this has been a real shift in how we can approach human genetics. We've developed, in conjunction with the manufacturers of the machines, who were also part of the project, methods for using sequencing effectively and efficiently in human genetics. We've tested three different approaches and are taking two of them forward to the full-scale project, which is already underway, to produce a deeper and broader catalog which will study 2,500 people, and I think David will come back to this at the end.
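As a rough illustration of the "over 95 percent" figure just mentioned, the calculation is simply the share of one person's variants that already appear in the catalog. The sketch below is a minimal Python example under stated assumptions: the variant representation `(chromosome, position, reference allele, alternate allele)`, the function name `catalog_coverage`, and the toy data are all hypothetical and are not taken from the project's own pipeline.

```python
from typing import Iterable, Set, Tuple

# Assumption for illustration: a variant is identified by
# (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def catalog_coverage(individual: Iterable[Variant], catalog: Set[Variant]) -> float:
    """Return the fraction of an individual's variants already in the catalog."""
    variants = set(individual)
    if not variants:
        raise ValueError("individual has no variants")
    known = variants & catalog  # variants seen before in the catalog
    return len(known) / len(variants)


# Toy data (hypothetical): a five-variant catalog and one person carrying
# four variants, three of which are already cataloged.
toy_catalog: Set[Variant] = {
    ("1", 1000, "A", "G"),
    ("1", 2000, "C", "T"),
    ("2", 1500, "G", "A"),
    ("2", 3000, "T", "C"),
    ("3", 500, "A", "T"),
}
toy_person = [
    ("1", 1000, "A", "G"),
    ("2", 1500, "G", "A"),
    ("3", 500, "A", "T"),
    ("4", 750, "C", "G"),  # not in the catalog: new or private to this person
]

coverage = catalog_coverage(toy_person, toy_catalog)
print(f"{coverage:.0%} of this person's variants are already cataloged")
```

With real data the same ratio would be computed over roughly three million variants per person against a catalog of about 15 million.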
So the paper primarily describes the new sets of variants we find and the methods that we use, but it also describes analyses that we can perform now that we have a more complete dataset and which were previously not accessible. There are a number of key points which are picked out in the paper and in the press release. In particular, we can see that each individual is carrying a significant number of deleterious mutations, maybe 250 or 300 genes which have defective copies. We can also look at the effects of recent evolution on the human genome and the effects of natural selection around genes and between populations.

But this is in fact only a start. A key property of the project is that all the results are being made publicly available on the web, just as the original reference genome was, and people both within the project and outside are already beginning to use the data for many different approaches. In particular, because we're looking at all the DNA when we do DNA sequencing, rather than just looking at known places of variation as has been done in GWAS studies, it's possible to examine more complex types of genetic variation. Evan Eichler, who is going to talk next, and his group have been using the Thousand Genomes data for a particular analysis of copy number variation. So it's over to Evan.

Okay, hello everyone, this is Evan Eichler. I'm going to discuss the companion piece that was published in Science. I'm really going to pick up where Richard left off, talking about an effort to try to extract more information from the roughly 15 percent of the genome that has been very difficult to assay and is generally described as inaccessible. The work I'm going to describe is primarily the work of two third-year students, Peter Sudmant and Jacob Kitzman, at the University of Washington. So the focus of the Science piece, like I said, is to pick up on looking at more difficult regions of the genome. We focus
specifically on the copy and content of duplicated genes, and these have been particularly difficult because of their repetitive nature. There are roughly a thousand genes within the human genome that I would argue have been largely inaccessible to traditional genetic study as a result of that repetitive nature.

So what did we do in this paper? We essentially take all the data from the Thousand Genomes Project, in addition to about a dozen other genomes that are not part of the Thousand Genomes Project, and we remap it using our own computational methods. When we do this we essentially assay two properties of that data. We measure the read depth as an indicator of how many copies there are of a given gene family; this can range from zero copies to dozens, in some cases many dozens of copies. And then we identify unique sequence tags, a total of 4.1 million of them, that allow us to distinguish one copy from each of its nearest neighbors. This provides us the ability to assay both the copy and content for any region of the genome, and in principle any gene in the genome, really I think for the first time. What's different about this paper, in addition to where we're focusing, which is a much narrower focus than the Thousand Genomes Project per se, is that we're looking at individual-level variation as opposed to variation at the population level.

So what did we find? For us it was particularly exciting. We think the veil has been lifted in terms of a whole new level of genetic diversity, and this is really about copy number differences over gene families. We find three things which I think are worth highlighting. First, we find that the most copy-number-variable genes in the human species map to historically duplicated regions of the genome, so you can think of these almost as accordions of the genome, expanding and contracting in terms of their copy number. When we look at the four populations that we have access to
in this study from the Thousand Genomes Project, we show that when we compare populations there is more genetic difference between the populations in these particular regions, at least in terms of copy number, than in unique regions of the genome. And when we compare these roughly 159 humans that we've analyzed to date to the great apes, we have the ability, I think pretty clearly, to identify the genes and the gene families which have expanded specifically in our lineage of evolution since we separated from chimpanzee and gorilla. What we find here, even though the numbers are quite small, is a particularly tantalizing set of genes that are important in terms of neural development and neuronal migration, and we want to focus on these going forward as potential candidates for helping to define some aspect of the human condition.

So what do we plan to do going forward? I think there are two really big things that are exciting in our lab, and hopefully to others as well. By developing these methods I think we can now explore the functional properties of what I call the untouchable genes of the human genome. We can look at expression differences, we can look at changes in terms of methylation, we can look at changes in terms of chromatin configuration. The second, and perhaps the most important, is that now that we can assay the copy and content of these genes, we can begin to do association studies for these gene families, which have often been difficult to assay previously, and look for particular associations with phenotypes such as disease and disease susceptibility. Thank you. I'll hand it over now to David Altshuler.

Yes, hello. Thank you for calling in and for the chance to speak with you. I've been asked to comment briefly on the application of the project and its relevance for medical research. I'll start just by saying that it's clear from the history of genetics of humans, but also many
model systems, that following the genetic contributors to disease can be a powerful tool to discover new clues about the genes and underlying biological basis of diseases, both rare and common. It's also clear that, with evolving technology and given the true complexity of the genetic basis of disease, a more complete understanding will require knowing the entire genome sequence of individuals and of populations, and the routine deployment of this information in medical and clinical research.

From this standpoint of application to medical research, the key challenge in the first phase of discovery is to be able to distinguish those genetic variants that contribute to each disease from a background of many millions per genome that are not involved in that particular disease: a problem of a needle in a haystack. In addition, it's important technically to avoid false positive discovery, claiming there are variants that aren't there, or missing important variants, as Evan was saying, that you otherwise couldn't assess.

So I think there are three ways in which this project contributes. The first is that, as Richard mentioned, the project has been a laboratory in which many members of our field have worked together on the methods for sequencing whole genomes. These individuals have come from many countries; they've come from academic institutions and also private companies, and the work has been done in a public-private partnership, with all the data available in the public domain and the methods available. This, we hope, has contributed greatly to increasing the efficiency, the accuracy and the broad availability of these methods so that they can be deployed in many quarters.

Second, while each of our genomes contains many millions of variants, and some of these are unique to each individual, and I'll turn to those in a second, it's also the case, as our paper describes and as has actually been clear for many years, that most of the variation in each individual's genome, in
each single person's genome, is common variation, meaning that it is shared by people who are apparently unrelated in the general population. So one approach that the project is enabling is to systematically, accurately and efficiently test these so-called polymorphisms, variants seen in apparently unrelated individuals, for their relationship to disease. An early first draft of this approach was what are known as genome-wide association studies, built on an earlier map of genome variation built by the SNP Consortium and the HapMap Project. Those genome-wide association studies tested only very common genetic variants, those with high frequencies of five and ten percent and above, and those studies have been successful in identifying hundreds of new clues about the genetic basis of disease. The Thousand Genomes Project makes this approach much more complete and much more powerful by going down to much lower frequencies, covering a broader range of populations, and providing much more complete data in each frequency range in each population.

Second, some diseases are caused not by variants that are common or that are shared by individuals but by variants that are very rare: spontaneous, even, arising in an individual as in the case of a cancer, or very rare, arising in a recent ancestor, the mother or the father or a grandparent of someone. Those will never be contained in a catalog. However, if DNA sequencing is used to discover all the variation in such individuals or families, the first step in analysis under that model of rare genetic variants is to know which of the millions of genetic variants are in the category of common variants, which can be studied with the other approach conceptually, and which are very rare. It turns out that that requires a catalog of which are common, or else all one gets out of those sequencing studies in rare genomes is millions of candidates and no real way to hone in on the ones that you care about. And so already the earliest data from
the Thousand Genomes Project has been used in multiple published papers in which individuals were sequenced. In one case from Evan's institution, and there are other cases as well, they sequenced a family and they were able to quickly hone in on the disease mutations. If you read that paper and others like it, the first step is to look in the public database and the Thousand Genomes database and say which are the common variants: we're going to set those aside for study under a different genetic model, and we can hone in on the small number that are unique to each individual. That's the second way.

So there are three ways, actually, in which we think the project contributes to medical research. One, the methods, data and standards that have emerged from this public-private partnership. Two, a much more complete database of polymorphisms of all sorts, single base changes, copy number variants as Evan told you about, and other kinds of structural changes, that can be tested systematically for disease. And third, if you will, a lookup table to allow people whose diseases might best be studied under this different genetic model of rare variants to focus in on those.

I'd like to make one final comment before turning it over to questions, which is just a personal perspective on genetics in medical research. It's clear that disease is influenced by inheritance, but also by environment, by behavior and by chance. Moreover, it's clear, and has been clear for a long time, long before the current era, that most diseases are influenced by many genes, and moreover by many variations in many genes. It is not a simple problem. Nonetheless, we continue to drive as a field towards using DNA technology, and we do this to understand or to eliminate disease. We do this despite knowing that each success is just the first step towards the biological investigation of that disease. It's not an answer; it just frames the hypothesis. I think the reason we do this is because we live in a time when one of the
great scientific opportunities is to use this new ability to read DNA in our population and to follow up the inherited portion, and it is a technical ability and a conceptual ability that I think has great fruit. But we want to be very careful as a project not to suggest that this framework project is itself medical research, because it is simply a foundational tool; it is not being done in disease samples. Nor do we want to suggest that there are any easy answers that will come. But we do believe that in the long run this is a very valuable and promising approach to learn new things about the basis of disease, and that if we do that as a field, and then biological follow-up occurs, this has promise to contribute to improvements in the long run in human health.

Thanks very much, David, and thanks to Richard and Evan also. So we're going to go over to questions now. Just to remind you that we've also got Dr Lisa Brooks from the NHGRI on the phone if anybody wants to ask her anything.

Thank you. Ladies and gentlemen, if you'd like to ask a question, please press seven on the telephone keypad. If you change your mind and wish to withdraw your question, please press seven again. You will be advised when to ask your question. We have a question from the line of Alok Jha from the Guardian. Please go ahead.

Hi, this is Alok Jha from the Guardian. I've just got a couple of questions about numbers, actually, which I hope you could clarify for me. You mentioned that you have a database now of 15 million SNPs, and there are also some other numbers in the papers about the number of genetic changes an average person carries; that's between 250 and 300. I just wondered if you could tell me how this compares to what we knew already, just to put into context how much further this new set of papers takes us in working out diversity between humans.

Well, this is Richard answering. So the number of total variants per person is actually about three million,
but the number of genes which appear to have complete loss-of-function defects in them is the 250 to 300. Those numbers are not that new, but what we've done here is look at many people. Each variant that you find in an individual will only be in some members of the population, so the 15 million number is a substantial increase over previous lists and means that we can cover more of the variants present in any one individual. In fact the project as a whole is moving onwards from the 200 or so people studied genome-wide in the pilot and has now sequenced over a thousand people, so data on a thousand people are available, and we can already see that this number of 15 million is going to go up and we're going to get a deeper and richer picture going forward. So that number is increasing and will at least double over the next year.

So this is David Altshuler. Let me just add one other thing to that, which is a number that I personally find useful as a way to think about this, because after a while, with millions of variants, whether it's one million or ten million or fifteen million, it's hard to have context. But here's a number that I think is easy for people to understand: this idea of the number of genetic variants in each individual. What we mean by that is, if a DNA sample was taken from a reader or any of us and we sequence it, we just define as the variation in that genome the sites at which the copy one inherited from one's mother and the other copy of the genome, inherited from one's father, differ. That's the number Richard gave: there are some three million or so differences between those two copies. Another way of thinking about it is the differences between one copy of an individual's genome and the human genome reference sequence that came out of the Human Genome Project. That number is the
same; it's just a comparison of any two chromosomes, and it's about three million letters, or one in a thousand letters, because there are three billion letters, so one in a thousand of three billion is three million.

So now the question is how much our knowledge of that variation that we might find in each individual has increased over time. If we go back ten years, to where Richard started, at the announcement of the human genome sequence: if we had sequenced any person's genome and asked what fraction of the variation in that genome was in the public database, it would have been five percent or less. In other words, 95 percent of what was found in the next person's genome was new, not previously seen. At the time of HapMap, maybe five years ago, that number was on the order of 40 or 50 percent. The technology wasn't available at that time, but we now know that if you had sequenced a person's genome and asked how much of the variation in that person was in the database, for SNPs, these single base changes, it would have been 40 or 50 percent, and much less for the kind of structural variants that Evan was telling you about.

If you look today, for single nucleotide polymorphisms in the pilot data of the Thousand Genomes Project, 95 percent of what you find in the next person's genome is already in this database. It is still lower for the kind of variants that Evan is studying, because they're harder to study. And we see these numbers going to 98 percent or something like that for many types of variation. In other words, by the time the Thousand Genomes Project is done, if each person had their genome sequenced, the vast majority, greater than 95 percent and maybe as much as 98 or 99 percent, of the variation in that person would already be in the public database and therefore could be referenced back, and then perhaps one or a few percent of the variation would be
unique to that individual, not in that database.

Alok, did you want to follow up on that?

No, that's fine, thank you.

Okay, thanks. Next question please.

Thank you. We have a question from the line of Tina Saey from Science News. Please go ahead.

Yes, I wanted to follow up on that just a bit. So how many SNPs are currently in the database? And then I have a question about the loss-of-function genes.

So, coming out of the pilot project there are just over 15 million SNPs, 15 million 275 thousand. These are not exactly the same as the set of variants which are in dbSNP, which come from all other projects. I don't have a number for the total union across all research available today.

And then, so you say that each person has about 250 to 300 genes that are basically inactivated in their genome. Do those tend to be some of the same genes, or would they be unique genes in each person? And what does that say about disease risk and how necessary these genes are?

So there's a mix. Some of those genes are commonly inactivated and may be genes which are not strongly required. But the category of these loss-of-function variants, which includes premature stop codons, frameshift mutations and changes in splicing so that the exons don't get put together properly, is substantially enriched for rarer variants, for things that are present perhaps in just one of the 179 people. That indicates two things. First, that many of these are probably functional, and that the reason why they're not more common is that they've been selected away; there's negative selection, or purifying selection, against them. It also says that we are each probably carrying more or less private, certainly rare, defective copies of genes. They don't necessarily lead to disease directly in the individual. They may. We have two copies
of most genes, one from the mother and one from the father, and so if one is defective you're just a carrier. That can sometimes have a phenotypic effect.

I was just going to add that there's been much speculation over the years about the relative contribution of different kinds of genetic variants to the basis of disease, but ultimately these are questions that can only be answered empirically, because all the speculation is based on different assumptions about how the genes and pathways and biology of disease work. What's truly exciting is to live in a time when we have the methods and data to ask those questions empirically. The answer to your question, for example, about what role these genes that are inactivated play in disease can be rather straightforwardly evaluated now, not just because of the Thousand Genomes Project but through the general approach of identifying them and relating them to patients and disease. So rather than speculate, I think we'd say we are helping to create a foundation to answer that question, and anyone who does speculate, I think, is speculating; there's no way to know short of doing the experiment. What's exciting is that we are now going to have the tools and methods, increasingly, to just answer those questions through actual empirical data rather than speculation, and people are beginning to do that.

Yes, absolutely. And just to follow up on the last part of what Richard Durbin was saying: when we talk about a gene being inactivated, it's one of the two copies of the gene that the person has. So for most people one version doesn't work, these loss-of-function variants, but generally the other version, which the person inherited from the other parent, will work.

So these would be recessive mutations, then?

Well, if there's one copy of them and the other one works, and the person is perfectly fine because of it, then they are recessive. Yeah, no, but most of the variants that we're seeing
in this way are for carrier status of diseases known as recessive diseases, although it's not always the case. In fact, in very few examples has there ever been a large-scale study of people who are carriers to know if they have any disease risk.

Okay. One of the surprising things is that disease genes, meaning those that cause rare, strong genetic diseases, have largely, not in every case but in most cases, only been studied in people who have those diseases. So the broader population relevance of having a variant in those genes is actually not well characterized, because it hasn't been possible to do studies of this sort until now. And so when you sometimes read about someone who had a part of their genome or their whole genome sequenced and was found to have a mutation that causes a disease, and yet they're apparently healthy, that's more a statement about it not having been possible previously to do that kind of empirical study than a shocking finding.

This is Ruth. I'm just going to remind the speakers, if you can, to say who you are when you speak; I think I've got a bit confused. But next question, please.

Thank you. The next question comes from the line of Nicholas Wade from the New York Times. Please go ahead.

Hi David, can I ask you to clarify the cutoff point between common and rare variants? I think from what you just said it was five percent in the HapMap days. So when you've reached the point of having 98 percent of variation in your database, what will be the definition of a SNP at that stage?

Sure. There are a couple of ways to answer this question. First of all, there's the classical definition, sort of in the textbooks, of what is a common variant, or what's known as a polymorphism, and that number is one percent or higher in the population, and
in the era of HapMap and the SNP Consortium, given the technology that was available at the time, the frequency to which that resource went was down to between five and ten percent, and the first set of genome-wide association studies were of variants of five to ten percent and above. Things below that frequency, down to one percent, would still be classical polymorphisms; that's the term that's been used. But I think actually the most meaningful conceptual framework is the one that I used in my opening comments, which is that there are two types of variants. There are those that are present in apparently unrelated individuals, inherited from shared ancestors, that can be cataloged and then studied for relationship to disease, and that goes down to somewhere below one percent. And then there are ones that are unique to each family or individual, which I think would be well called rare or private mutations, and those are much, much less frequent. The numbers I cited, to answer your question, about 98 or 99 percent, are for a frequency of one percent and above.

So with that 98 or 99 percent in your database, what percentage of the, quote, missing heritability do you believe you will have captured?

I have no idea, because that calls for speculation. I really consider it completely a question of empirical data, and I think that frankly there's been too much speculation. What we know is that there are rare variants that influence disease, because that's been documented for decades, and we know that there are common variants that influence disease, because there are documented examples. I think a complete understanding is probably many years in the future, where we've explained all the inheritance, if we ever will. But I think the promising and exciting thing is that we are actually, for the first time as a field, making progress at least in honing in on the many genes that contribute to
each disease.

But you may just want to comment: if a major part of the burden is borne by these very rare variants, which I see as mostly sort of spontaneous mutations, then it could be that we will come up pretty empty-handed, eminently rational as this approach is.

Our catalog might help us. I mean, I can't...

Well, here, this is Evan. I think this idea of a divide in the sand that we have in terms of percentage is really, to be honest, totally bogus. What we want to be able to do is look at genomes comprehensively, not just as a function of the frequency of small variants but in terms of the full spectrum of variation, ranging from single base pair changes to insertion-deletion events to larger CNV events, so copy number variant changes. And I can't imagine that we would come up empty-handed. If we just look at things, for example, from the copy number variation perspective, at this point today, with the technology that we have, we can pretty much diagnose about 20 percent of kids with mental retardation or intellectual disability as having a large copy number variation event. Sometimes it's sporadic; sometimes it's been inherited for a very few generations. I think there's a lot of yield to come from studies of genetic variation, but the point that we have to keep our eye on is that we need to actually divorce ourselves from this idea of frequency and move to a really comprehensive assessment of all forms of genetic variation. And I think if we do that we'll be able to actually link... That's where we are right now: with the technologies that are coming online to sequence genomes, to comprehensively assay structural variation, to sequence exomes, I believe that this is the era where we're going to actually make huge inroads in terms of understanding the genetic basis.

I think the last point, and I made this in my opening comments, is that I think that
the Thousand Genomes Project, as we mentioned, each of us in our own way, contributes in three different ways, one of which is methods and standards, as part of a public-private collaboration of companies and academics all working together to make the technology work for any application. I think that that is a contribution under any disease model. And then I think that, as Evan said, we want to be comprehensive, and the exact deployment of the resource will differ depending on the different approaches people take. But I actually think that the documented value of the Thousand Genomes Project for medical research just in the first year is more under the model that Nick Wade just mentioned, of rare genetic variants. It's a necessary step in how you interpret a rare variant model to know the common variants. So, just so people aren't confused, this project is not about any given model. It's about what Evan said, which is a comprehensive approach. It's not about any particular underlying model; it's about testing all the models.

Thanks, everyone. Next question?

Yeah, the next question comes from the line of Michael Price from the Washington Examiner. Please go ahead.

Hi, yes, my question is for Evan. I'm getting away from the diseases a little bit. When you were talking about the genes associated with neural development, I was wondering if you could let me know if those were related to specific regions of the brain, or processes or connections or anything like that, and how you tied that back to neural development.

Yeah. So what we did is we actually did a formal test to see which genes were enriched, or which genes had been specifically duplicated, in the human lineage, and we used some of the programs that have been developed to look at gene classes, and neural development came up significant, but only modestly so. But when you actually look at the specific examples they're
actually quite striking. As an example, there's this gene known as SRGAP2. It's duplicated specifically in the human lineage. We don't know the function, or we actually don't know the role, or even if the duplicated copies are totally functional, but what we do know is that there are multiple copies of SRGAP2, and the parent gene SRGAP2 is in fact important in terms of neuronal migration in the cortex. There's a beautiful paper published in Cell about a year ago that looked into this, and it's really the spatial-temporal expression of SRGAP2 which dictates how far neurons actually migrate in the cortex and then begin their kind of arborization or lateralization. To me it's just an anecdote, and you've got to be careful not to get too excited about these anecdotes, but when you start looking at all the genes that we find that are duplicated, and pretty much often fixed in all humans that we look at but not so in the chimp or gorilla, we see a significant number of genes that would be particularly sexy candidates for, I don't know if the right word is human mentation, but neurogenesis. So this is something that we'd like to pursue, and I think it's an understudied area. Where people have been looking at human evolution, they've been focusing mainly on, again, the tractable regions of the genome, that 90 that's easily a...

Thank you. Next question?

Thank you. Ladies and gentlemen, please be reminded that if you do wish to ask a question, please press seven on the telephone keypad now. And a final reminder: to ask a question, please press seven on the telephone keypad now. We have no further questions coming through.

I'll take the time then to say thanks to everybody for dialing in to today's briefing, and thanks to the speakers for their time. If you need any more information, please don't hesitate to contact us. Once again I want to remind you of the Nature and Science embargoes on these papers, which is 6
p.m. London time, 1 p.m. US Eastern, tomorrow, Wednesday the 27th of October. There'll be a recording of this briefing on the Nature Press site, and Science will make that available to you under embargo. Many thanks.