Okay, why don't we go ahead and start? We're going to have a program project update on 1000 Genomes now by Lisa Brooks.

Okay, thank you. I think it's been about a year or so since we've had an update on this project. We just had a project meeting at Cold Spring Harbor. I think most of you have seen the goals before: we're trying to find basically most variants, SNPs and structural variants, in the populations being studied. I'll be discussing the timeline as we go, but the first goal has been reached: all of the samples have been collected. This is from 26 populations, 2,500 unrelated samples plus a lot of their kids. So we're quite happy about that; it worked out.

There are three phases to the project, so I'll talk about those in order. For the first phase, the last form of the data was released in March. The paper is being written up and should be submitted in early June. It's based on 1,092 samples from 14 populations, with whole-genome light-coverage sequencing, exome coverage, and genotype data. For the phase one results: first, we found a lot more variation in the human genome. There are 38 million SNPs, 2 million indels, and 14,000 deletions, and most of these are novel. As you know, the common variants are found first; the deeper you look, the more people you look at in any population, and the more additional populations you look at, the more rare variants you find, and so more of them are novel. These are high-quality calls. What's a real advance in this data set is that the different types of variants are being integrated onto haplotypes, so that researchers who want to impute into GWAS studies, for instance, when they take their genotype data and use the 1000 Genomes data, will get not just SNPs but indels and deletions as well. These are on common haplotypes, and there are lots of new tools being developed.

These data are on the Amazon Web Services cloud. This was the poster child for big data for NIH, but we'll take any cloud where it's not an exclusive arrangement. The variant calls are much smaller data sets and those are easily available. The large data sets, for people who want to develop methods or look at global analyses that need the BAM files, the underlying sequence reads, that's 200 terabases; those are extremely large data sets.

Okay. One of the major results concerns imputation. The phase one data set covers almost 1,100 people. If you have a GWAS data set and you want to use the 1000 Genomes data, then even though you did genotyping in the GWAS, you use the 1000 Genomes data to impute the variants that you didn't see but that are most likely there. Imputation from the complete data set of almost 1,100 people is better than imputation from just the populations related to the one being looked at. So there's enough sharing of variants and haplotypes across populations that imputation is still useful.
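To make the imputation idea concrete, here is a toy sketch on made-up data. It is not the project's actual method, which relies on HMM-based tools such as IMPUTE2 or MaCH; it only illustrates the core intuition that untyped alleles can be filled in from the reference haplotype that best matches a sample at the sites that were genotyped.

```python
import numpy as np

# Rows are reference-panel haplotypes, columns are variant sites (0/1 alleles).
ref_panel = np.array([
    [0, 1, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
])

typed_sites = [0, 2, 4]              # sites present on the GWAS genotyping chip
sample_typed = np.array([0, 1, 0])   # the sample's alleles at those sites

# Hamming distance to each reference haplotype, computed at typed sites only.
dists = (ref_panel[:, typed_sites] != sample_typed).sum(axis=1)
best = int(np.argmin(dists))

# Fill in the untyped sites from the best-matching reference haplotype,
# keeping the observed alleles at the typed sites.
imputed = ref_panel[best].copy()
imputed[typed_sites] = sample_typed
print("imputed haplotype:", imputed)
```

Real imputation averages over many possible haplotype matches probabilistically rather than copying a single best match, which is why a larger, more diverse reference panel improves accuracy.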
Here's an example; the results are very similar for coding versus non-coding variants. The big blocks from left to right are frequency classes: the very rare variants are here, then somewhat rare, not so rare, and the common variants are here. The dark parts are the variants seen only within populations, the medium parts are ones shared between populations on the same continent, and the light part is what's shared among continents. As you can see, common variants are mostly found in all populations, on all continents, and this is a reflection of our shared history coming out of Africa and spreading all over the world. Even very rare variants can still be found shared across all populations, but clearly the very rare variants are much more population specific: they arose after people spread around the world, so they arose in specific places.

Here is allele sharing by frequency, which is kind of interesting; I'm showing you a few of the sorts of results in the paper.

Please speak into the microphone, otherwise the web people can't hear you.

Oh, I'm sorry. So I'm showing you some of the results from the paper, not all of them, just some interesting highlights. Again on sharing: the variants found in the British, except for the very rare ones, which clearly reflect recent population history, are likely to also be present in the Luhya, whereas the reverse isn't true; variants in the Luhya are less likely to be found in the British. So again, it's a reflection of our population history. People left Africa, with a lot of genetic drift and founder effects, and didn't take all the variation with them.

Here's an example of admixture, I believe from Carlos's group. We have the Puerto Ricans, the Colombians, and the Mexicans, and the amount of European, African, and Native American ancestry in each of these populations. We've included admixed populations; many populations in the world are admixed, and so we're getting some very nice ancestry deconvolutions. They're also developing methods to go across the genome for individuals and assign local ancestry.

Loss-of-function variants are something there's a lot of interest in, and I should say, of course, these are putative loss of function. We're simply looking at the variants and finding, for instance, a stop codon pretty early in the coding sequence. The assumption is that the protein product is not made, but it's not based on any phenotype, because of course we don't have phenotype data. As a class, these are much more likely to actually be nonfunctional, even though you don't really know that about specific ones, except maybe a stop codon at the second codon or something. There are thousands of nonsense, splice-disrupting, frameshift, and large loss-of-function variants. Now, what's been discovered, especially among the ones that are homozygous, is that they're in such crucially important genes as olfactory receptors, where it turns out it doesn't really matter: if a person's sense of smell is a little different from another person's, that's not lethal. Because of the way natural selection works, the ones at higher frequency, especially ones at high enough frequency to actually have homozygotes, really aren't that bad for you. So we talk about loss-of-function variants, but we have to recognize there's been a real selective filter.
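For a sense of what "putative loss of function" means operationally, here is a minimal, hypothetical sketch that flags variants by an annotated consequence (VEP/snpEff-style Sequence Ontology terms) and the relative position of a premature stop. The variant records and the 90% cutoff are illustrative only; the project's actual annotation pipeline is more involved, and as noted above none of this says anything about phenotype.

```python
# Consequence terms treated as putative loss of function.
PUTATIVE_LOF = {"stop_gained", "frameshift_variant",
                "splice_donor_variant", "splice_acceptor_variant"}

def is_putative_lof(variant, max_relative_position=0.9):
    """Flag LoF-type consequences, ignoring premature stops that fall in the
    last 10% of the protein (less likely to abolish function)."""
    if variant["consequence"] not in PUTATIVE_LOF:
        return False
    if variant["consequence"] == "stop_gained":
        return variant["relative_position"] <= max_relative_position
    return True

# Hypothetical annotated variants: consequence plus relative protein position.
variants = [
    {"id": "var_1", "consequence": "stop_gained", "relative_position": 0.05},
    {"id": "var_2", "consequence": "missense_variant", "relative_position": 0.40},
    {"id": "var_3", "consequence": "splice_donor_variant", "relative_position": 0.70},
]
print([v["id"] for v in variants if is_putative_lof(v)])
```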
Okay, phase two. We're not going to make a paper out of this. The sequence data have already been generated; it's almost 1,700 samples from 19 populations, and Broad is simply doing the calling of the variants. The project really wants to focus on developing the methods for mapping and calling variants. Even though this is a lot better than the pilot was, the project has a very active analysis group that's developing a lot of analysis methods. Instead of going through all the work of calling again, and then having to do that again for phase three, they decided to do a very simple, straightforward calling approach and actually spend the time improving the methods. And of course the methods from this project are used in many other projects, so it's a very worthwhile investment.

Phase three is the final phase. It's all the remaining samples, so that's going to be 2,500 unrelated samples. This includes 500 samples being sequenced deeply by Complete Genomics, including 161 trios, meaning the kids of those unrelated parents, so there will be 2,661 samples eventually. The sequencing is planned to be done by December of 2012, variant calling by the spring of 2013, and an integrated paper that summer. So that is the end of 1000 Genomes.

But of course, people are thinking about what other resources are needed. WashU has certainly been a proponent of working with the Genome Reference Consortium, because when you talk about the reference sequence, there certainly are errors, and when you have 2,500 sequences, some of which have been done very deeply, you have a lot of information to correct problems with the reference. You can find more sequence, and if there are little bits of sequence that haven't been found, that can cause errors in variant calling, so simply finding more sequence is useful; I believe about 35 megabases have been found so far. You can fix errors, provide alternative assemblies, and provide entire haplotypes rather than chimeric ones. There's also consideration of what resources would be needed to help disease studies. Okay, any questions on the project?

This is a fantastic project. It's so important to what we're doing. I'm really, really pleased to see that 1,000 genomes became 2,500, or 5,000 depending on your point of view, and it's just so useful in so many different ways. So congratulations on a great project. A scientific question: you may have shown it, but I didn't see it, the total number of loss-of-function mutations, those were fairly severe things?

They're what?

Fairly severe loss-of-function mutations. You're making guesses based on your early stop codons, et cetera.

Yeah, nonsense and splice, and the splice ones are easy to predict, and frameshift.

For the missense ones, obviously there are a lot more missense. Was there some analysis, or has there been analysis done, on those that are likely to be loss of function based on evolutionary constraint and amino acid properties?

Yeah, the FIG group, the functional interpretation group, has actually been quite active. There's a lot of overlap with the ENCODE people, who of course know about function. So there's been a lot of work trying to understand some of the functional consequences; there are analyses of which types of gene functions are more or less variable or disrupted by these sorts of variants. So there's been a lot of looking at function in various ways.

I want to throw something in there. I think one of the directions NHGRI ought to go, and this is a little off base, but maybe it's not, is large-scale efforts to then test those functions. It's one thing for us to identify the variants; it's another thing to maybe even know that they're in a promoter or an enhancer or in a coding region.
But then it's yet another thing to take those and do something more than just a little bit of sampling to test whether that actually is changing the function. We have knockouts in mouse, obviously, and we have some other things, but I don't think we have a big effort on something like that, and I'm hoping we can talk about that one day.

Yes, we've certainly talked about that a bit, and people have discussed exactly that point, that we actually need to do validation work on these predictions.

I didn't want to use that word because I think it's more than validation. It is validation partly, but really we need to figure out how to do it; I'm not sure we know how to do that.

You alluded to it, but do you have an actual number for how many instances you see of homozygosity for these presumed loss-of-function variants?

The Daniel MacArthur paper found about a hundred or so loss-of-function variants per person, and I believe about 20 to 30 of those were homozygous. And there was a real over-representation of things like olfactory receptors.

In addition to the sequence, were there any cell lines established from the samples?

Yes, absolutely. All of the samples have cell lines, and the consented people agreed to have the samples used for things like expression analysis, though not, unfortunately, iPS. Cell lines are available for all of these from Coriell. And there are more kids than are actually being studied in the 1000 Genomes Project, so people who want to look at heritability of cellular and molecular phenotypes can do that.

So, just to again congratulate you on a great presentation, and obviously as part of 1000 Genomes. One of the things that was brought up at the last meeting and is on everyone's mind is: what's next? What's the next big plan that we should have? I don't know if we have a meeting coming up where these issues are going to be discussed, but I think it's something we should think about, or perhaps get a group on council to put together, to really integrate the things that we're seeing and the questions we have about how we're going to create the public resources to go from the biology of genomes to clinical interpretation. I think 1000 Genomes and the efforts we've invested in so far need to be thought about in terms of how, long term, to make that happen. It's more of a comment than anything else, and it'd be great to get folks' sense around the table about what the future of these kinds of big efforts should be. In part because this is the only publicly available large-scale data set that everybody has access to, right? It gets downloaded so much more than anything in dbGaP precisely because it's open, and it really is what catalyzes so much science even outside of the consortium. I think it would be a real mistake not to continue to invest in this kind of area. Any comments or thoughts in particular about that idea? One thing I will say is that everything moves very quickly here, and if attention is suddenly not paid when things start to dissipate, people will move on to other things; you will dismantle consortia and interactions that have been set up. Maybe that's fine, maybe they'll move on to other things and that'll be important, but I think you want to take stock for a moment before the dismantling starts.
There's obviously an issue of where those funds would come from and so forth, but if we're not strategic, then naturally things will disperse. Yeah, Ross.

At the risk of stating the obvious, surely there are human populations that have not been well sampled here, and maybe one of the issues is what could be gained from looking at additional populations versus the cost. I think it needs discussion.

I think in addition to the additional populations, I'm really intrigued by the connections between the variant information and the functional effects of those variants, which would mean more studies on the samples in which there's now been a big genomic investment. That's a very intriguing list, and you could make an even larger list if you included the non-coding regions and the different types of structural variants.

So just to follow up one more time on the cell lines and the consent issue: these are blood-derived cell lines, and they're consented for anything you can do with such a cell line?

Almost. They're consented for variation and sequencing sorts of things, and for things like RNA and protein expression. When we were developing these consents in 2001 to 2004, we didn't think of iPS for some reason, so they're not consented for that.

And that would be excluded then? You could not make iPS cells out of the established cell lines from the 1000 Genomes participants?

We've talked about that, and we've decided not to. We could talk some more, but it doesn't look good.

I think the other issue is that if there are additional populations that are going to be sampled, at that point the consents could perhaps be changed to allow that, if that's one of the things the community believes would be really beneficial, for example.

I'm not sure why this is such an interesting set of variants, because in heterozygous form you're going to find a lot of loss-of-function variants, and there's no association with disease because you don't know that about these thousand or several thousand people. So I'm not sure why you would want to put a lot of effort into genes that are probably functionally redundant in the genome. It would seem to me that if you really want to understand interesting genes that have loss-of-function variants, you would do it in disease populations, not in random populations, because most of these are not interesting.

Before we go to Pearl, does somebody want to argue that point? It's a fundamental point, so I think it's worth discussion before I move on to Pearl.

Yeah, when we think about this set of variants, or just the set of polymorphisms found by 1000 Genomes, many of the things that are now associated with traits are part of those catalogs, right? So if we can begin to get an understanding of their functional characterization, both in terms of expression and impact on protein structure, those may well be the variants that then lead us down the road to linking with disease, particularly if...

Well, I have no question that knowing all the variants in the genome is important, but you understand their importance by looking at their association with disease. What I don't understand is why people are saying we now need to focus on these loss-of-function variants, because to me they're the most uninteresting set of genes in the genome, unless they're associated with disease.
Of course, 1000 Genomes samples come from people who were well enough to sign a consent form, but just like everybody else, they're going to get all sorts of diseases. Part of the whole point of 1000 Genomes is a two-step process. We're finding, not completely, and especially for the very rare variants very much not completely, a whole bunch of variants that are actually out there in people, in the populations, and some of the functional analyses have shown that the functional ones are really quite rare, so there's been selection against them. As a class, selection against them implies they really do have phenotypic effects. So there's this two-stage process: using 1000 Genomes to find a whole bunch of variants, and then, just like any GWAS, using the 1000 Genomes data in some sort of disease study to help you figure out what's going on.

One could argue for having, for example, this collection of 2,500 cell lines when an interesting polymorphism is found. Mike Boehnke and his T2D-GENES project, for example, might have an interesting polymorphism that's segregating in 1000 Genomes. You could then order up those cell lines and do functional experiments on them that couldn't necessarily be done on the patient samples that were collected.

Yes, but if you wanted to understand the loss of function, you wouldn't go back to the cell line; you could just do a knockdown experiment. So I still don't understand. Look, I completely understand the value of variants in amino acids or other regulatory sequence and their association with disease. I just question the value of putting a lot of effort into understanding the function of genes that have loss of function coming out of this experiment.

I guess the one point I'd make is that your point is well taken: you restrict the number of whole-organism phenotypes you can tie to variants when they're in the heterozygous state like this. But a huge number of the variants in this data set are non-coding, many of which will act in cis and could create phenotypes that can be studied in cis. So back at the level not of the whole-organism phenotype but of the biology of how gene regulation works: there are huge numbers of elements that have been identified in the ENCODE project, for example, that have been mapped to some rough region. I think it would be very interesting to know how sequence variants that map into those regions impact the sorts of gene expression and molecular phenotypes that you can score in cis in cell lines. It's a biology-of-genomes and gene-regulation issue, but it's uniquely addressable by this kind of data.

Many of you have heard me say this before: I agree with you. I think this is one of the compelling reasons why, as we think about selecting genomes to be sequenced in the future, we should pick ones that are as well phenotyped as possible. I would argue that things like eMERGE and PAGE, and some of the places where we actually have pretty robust, deep electronic health record information behind the samples, at least provide a lot of additional information that could be used to help us tease some of that apart. So I couldn't agree with you more that we need to sequence well-phenotyped samples, and that's obviously a critically important component of where sequencing needs to go.
I think what is really unique and important about these public, community resources is that anybody can get access to them. You don't have to be an eMERGE investigator, and you don't need to be in the lockdown mode of where so much of this data resides. We were, for example, part of the big NHLBI ESP project, as investigators tasked to do population genetics, where everybody was in agreement that we were going to collect exome sequencing data on these 7,000 people and put it in dbGaP so we could all share it, and it took us a year and a half to get the consents in place just to share that data within the consortium. Whereas for 1000 Genomes, the data are immediately out there and available. So if giving up whether an individual has or doesn't have diabetes frees the sample so that folks could use it, knowing there's some polymorphism at 20% frequency linked to diabetes that you could study in the cell lines, I think that makes it a really compelling resource. But I think we should explore whether there are ways to have your cake and eat it, too. For example, we've developed a streamlined data use agreement in eMERGE, and we're testing it. We've got new pediatric sites coming on; we brought on two new sites this year, and we'll see how well that data use agreement transfers. Everybody's anxious to figure out how you can reduce the barriers to reuse of the data, because that's key.

We should have a meeting on that.

So 1000 Genomes is a great project, and at the risk of sounding really unimaginative, I just wanted to emphasize the point that simply expanding it is important. There is real value in the "brute force", and I'm using air quotes here, population-based type of sampling that this does. I completely agree phenotyping is really important, but there's value to be gained from large-scale sequencing like this, and we're going to need much bigger numbers. A thousand or a few thousand genomes is still really pretty small if you're going to try to infer or come to statistical conclusions about the mutational burden tolerated by different genes, et cetera, and those are the kinds of things that can come out of simply scaling this up.

Let me make one last comment, on your point, Jim. Some of the most exciting things that have come out of 1000 Genomes have been where investigators have taken the cell lines, gathered expression data, and tried to make eQTL maps. But because of the limitations, those are single investigators who can do 80 lines, maybe 100 lines, right? Those are woefully underpowered experiments. So imagine if we could take all of these lines and characterize them really systematically, so that we can do for eQTL what 1000 Genomes has done for sequencing. It's the power of that resource that then enables an investigator who finds an interesting association to go into an already systematized catalog. We all use it, it's EBV-transformed cell lines and so on, but at least it is one systematic resource that is available for all to use and could really be catalytic.
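As an aside, the eQTL mapping being described amounts to regressing expression on genotype dosage for each variant-gene pair. Here is a minimal sketch on simulated data; the sample size, effect size, and variable names are illustrative only, and real pipelines add covariates, expression normalization, and genome-wide multiple-testing correction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100                                   # e.g. on the order of 100 cell lines

genotype = rng.integers(0, 3, size=n)     # allele dosage (0/1/2) at one variant
expression = 0.4 * genotype + rng.normal(size=n)   # simulated expression levels

# Simple linear-regression eQTL test for one variant-gene pair.
slope, intercept, r, pval, se = stats.linregress(genotype, expression)
print(f"effect size per allele = {slope:.2f}, p = {pval:.2e}")
```

The underpowered-experiment point above follows directly: with only 80 to 100 lines, only fairly large expression effects at common variants reach significance once you correct for the millions of variant-gene pairs tested.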
Okay. Well, thank you, Lisa, it was a good discussion. One final point to set up Anastasia is that the analysis methods are finally including the X chromosome.

So this is an update on a concept clearance we brought to council in May of 2011, where we received a recommendation from council to put it out as a notice instead and then follow the field and see how it progressed. As you may remember, the GWAS catalog now contains over 1,500 genome-wide associations, at a level of P < 5 × 10^-8, for over 200 traits. However, even now there are still only about eight hits that have been found on the X chromosome, which is one more than there were a year ago. Last year we brought a concept clearance for using the whole chip, which proposed to support broader utilization of genome-wide association data, looking primarily at the X chromosome as well as the Y chromosome, mitochondrial DNA, and CNV analysis. The goal was to look at this underutilized data and facilitate a more comprehensive analysis in genome-wide association studies, to develop and validate new methods for analyzing these data, and to put forward new analytical strategies and statistical methods that would be widely available to the community.

There were a number of recommendations made at council a year ago. The first was to go forward with publishing a notice rather than putting out an RFA for this initiative, and to try to encourage applications to come in through the unsolicited mechanisms as R01, R03, and R21 applications; to present these findings at a number of meetings and get the word out; to write a paper on the findings we have and publish it in a prominent journal; to return to council in about a year and present what we've found so far, which is what we're doing now; and to remove the CNV part, because there was a feeling from council that CNV analysis was a little more advanced than the other data types, the X chromosome, the Y chromosome, and the mitochondrial DNA.

So we have issued a notice. We did remove the CNV part from the concept, but the notice contains basically the same idea that was brought forward in the concept clearance, and the goals are fundamentally the same: to facilitate this type of analysis in these underutilized data types. We're hoping for applications that would analyze existing GWAS data for X chromosome, Y chromosome, and mitochondrial variants, focus on data sets that have not previously been analyzed, and develop new methodologies and disseminate them to the field in general. We'd really like this to also include diverse populations. So far we've gotten our first batch of applications through peer review and are hoping a couple of them will come out of that and be funded. We've given a number of presentations over the past year: at the Genomics of Common Disease meeting, at HGV, and at the American Society of Human Genetics meeting, and internally at the Director's Forum quarterly meeting. And we have been discussing an X chromosome commentary with a journal that we're hoping will be published soon.
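One reason X chromosome data so often go unanalyzed is that the standard autosomal test does not carry over directly: males are hemizygous, so genotypes need a different dosage coding, and X-inactivation models and the pseudoautosomal regions also need attention. Here is a minimal sketch, on simulated data, of one commonly used convention (coding male genotypes 0/2); this is only an illustration of the issue, not the methods the notice calls for.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
is_male = rng.integers(0, 2, size=n).astype(bool)

# Raw allele counts: females carry 0/1/2 X copies, males carry 0/1.
raw = np.where(is_male, rng.integers(0, 2, size=n), rng.integers(0, 3, size=n))

# Dosage coding: double the male allele count (0/2) so one male allele is
# treated like a female homozygote; females stay 0/1/2.
dosage = np.where(is_male, raw * 2, raw)

phenotype = 0.3 * dosage + rng.normal(size=n)   # simulated quantitative trait
slope, intercept, r, pval, se = stats.linregress(dosage, phenotype)
print(f"per-allele effect = {slope:.2f}, p = {pval:.2e}")
```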
And then, just to give you an update on the numbers: this is the data we presented to council a year ago, looking at the percentage of genome-wide association papers in the GWAS catalog that analyzed the X chromosome, and we've now completed several more months' worth of analysis. As you can see, the trend is approximately the same, with about 33% of the papers looking at the X chromosome, so it looks like this isn't something that has really changed over time. Even in the following year, we still have only eight hits in the GWAS catalog that show up as significant by GWAS catalog standards. We really feel this is an area that still needs some stimulus to bring forward new methodologies that are accepted by the community, and we hope to be funding some grants in this area soon. If you have any questions, I will take those now.

Yes. As, I think, the noisiest one against this a year ago, I guess I was wrong. Thanks for bringing the data.

Anything else? Anyone? Okay. Thank you.