Okay. Thank you. Thank you, Gil. So we're moving on to describing the 1000 Genomes data sets, including the raw sequence data, the mapped sequence data, and the variant calls. Gil described the three coverage strategies for which we collected data in the project and the goals of the various pilots.

Moving on to the samples: the exon pilot, which used the largest number of samples, 697 in all, represented seven different populations, three from Asia, two from Europe, and two from Africa. The other two pilots used subsets of these samples. The low coverage project used samples from four of these populations, and of course the trio project used two trios, one European and one African trio.

Gil also talked about the data sets, and how, using these data sets, we were basically able to compare deep coverage in a small number of samples; low coverage but genome-wide shotgun sequencing of close to 200 samples; and going very deep in a small fraction of the genome, in fact a fairly small fraction of the exome, about 5% of the human exome, in the exon pilot, from a large number of individuals.

No matter where the data came from, the data processing was done in a uniform fashion. This was what's called a re-sequencing project, in the sense that we used the existing human reference sequence to organize all the data. That is, once the reads came off the machine, they were mapped to the reference genome, after which we called variants, both single-base or short insertion-deletion type variants and structural variants, which will be covered in a separate presentation by Yan.

The mapping itself was successful, but we must realize that although next-generation sequencing technologies enable very cost-effective and fast high-throughput sequencing, they cannot do everything. It turns out that there are limitations in terms of what fraction of the genome is reachable. On the top you'll see a figure from the 2001 human genome paper showing that a large fraction of the human genome, nearly half of it, consists of repetitive DNA. We are able to sequence a large fraction of the repetitive parts of the human genome, but not all of them. It turns out that 80-some percent of the human genome was sequenceable with the technologies that we used in the pilots, and there is some fraction of the genome that we really weren't able to touch and weren't able to assay for variation discovery. This is important to keep in mind when we're considering how complete the resource is: there were limitations, but we were able to assay most of the genome.

There were other challenges after the read mapping step is done, and some people are familiar with these; these are somewhat technical details. During library construction and sequencing there are a number of amplification steps, and because of that you sometimes end up sampling clonal copies of a single sequenced fragment, which introduces systematic biases into variant calling and needs to be dealt with (a small sketch of one way to flag such duplicates follows below). There are also limitations to the alignment technologies. Some examples include situations where an insertion-deletion polymorphism in the sequenced sample can throw off the alignment algorithms, and we now have specific methods that can clean that data up.
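To make the duplicate problem concrete, here is a minimal sketch in Python of how clonal copies of a single fragment might be flagged before variant calling. This is an illustration only, not the project's actual pipeline; the read records and field names (chrom, start, is_reverse, mapq) are hypothetical stand-ins for what an alignment file would provide.

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Flag reads that look like clonal copies of the same fragment.

    Reads sharing chromosome, start position, and strand are grouped,
    and only the best-mapping read in each group is left unflagged.
    """
    by_position = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["is_reverse"])
        by_position[key].append(read)

    for group in by_position.values():
        group.sort(key=lambda r: r["mapq"], reverse=True)
        for dup in group[1:]:            # every copy beyond the best one
            dup["is_duplicate"] = True
    return reads

# Toy example: two reads starting at the same position on the same strand.
reads = [
    {"name": "r1", "chrom": "chr1", "start": 1000, "is_reverse": False, "mapq": 60},
    {"name": "r2", "chrom": "chr1", "start": 1000, "is_reverse": False, "mapq": 37},
]
for r in mark_duplicates(reads):
    print(r["name"], "duplicate" if r.get("is_duplicate") else "kept")
```

In practice the production pipelines used dedicated duplicate-marking tools; the point of the sketch is simply that duplicates are identified positionally rather than by sequence identity.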
There are also more subtle points, such as base quality values that are not well calibrated as they come off the machine, and again this introduces systematic biases. All these things can be dealt with, and the project developed methods to deal with these issues. Once the alignments are up to standard, the next step is to actually detect the variants, and in this presentation I'm going to be talking only about SNP variants; the structural variants will be covered in a separate one.

SNP calling with current methodologies is done considering the data for all the individuals simultaneously, so data in this case would be available from a number of individual samples. A typical method would be a Bayesian method where we calculate genotype likelihoods, or data likelihoods, based on the data collected from each individual. We are able to aggregate these data and use prior knowledge about, for example, the allele frequency distribution and the frequency of polymorphic sites, the polymorphism rate in the human genome, and use this information to make better inferences as to where the variant locations are; these are the SNP calls. We are also able to infer the genotype of each sample based on the data; these are called genotype calls. Gil referred to these in his presentation, and I'm going to come back to genotype calling a little bit later. (A small sketch of this kind of Bayesian calculation appears below.)

A very important aspect of the 1000 Genomes Project was that it was able to benefit from a number of competent analysis groups. Each project data set was analyzed by at least two, and often more, different groups developing quasi-independent methods, which could then be compared at various checkpoints, and the data releases have tended to benefit from the best of each method and from merging the SNP calls from the different methods. This is just one specific example from the part of the project, the exon pilot, in which my group was primarily involved, but similar pipelines implementing steps that deal with the difficulties I was describing have been implemented by the other groups.

So what were the results? Well, this is a fairly dense table looking at the variant calling results from each of the three 1000 Genomes pilot projects. The raw data was pretty staggering, and now it's even much bigger: multiple terabases of sequence. We found between 5 and 10 million SNPs for the trios and about 15 million from the low coverage samples, and remember, these are both whole-genome data sets, and over 12,000 in the exon pilot, which, again, looked at a very small section of the genome, about 5% of human genes. It turns out that many of the variants we found were novel. The novelty rate was highest in the exon pilot, because with a large number of samples we were able to discover very rare variants, and we know that these are underrepresented in current databases. In addition to the SNPs, and these are just the results, I'm not going to go into the details, well over a million short insertion-deletion type variants and over 15,000 larger deletions were also found. We were able to detect, at base-pair resolution, a large number of structural variant breakpoints, where a large structural variant actually starts and ends, and we were able to discover new variation types that are very interesting in their own right, for example mobile element insertions in large numbers, doubling and tripling the existing catalog of these variants. It is important to mention that the project standard was to have a false discovery rate better than 5%.
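Here is a minimal sketch, in Python, of the kind of Bayesian aggregation described above: per-sample genotype likelihoods are combined across individuals under a Hardy-Weinberg genotype prior, and a prior polymorphism rate decides whether the site is called a SNP. The model is deliberately simplified (a small grid of candidate allele frequencies and an assumed prior of 0.001) and is not the project's actual caller.

```python
import math

THETA = 0.001  # assumed prior probability that a given site is polymorphic

def site_posterior(genotype_likelihoods, freq_grid=(0.01, 0.05, 0.1, 0.25, 0.5)):
    """genotype_likelihoods: one (hom-ref, het, hom-alt) likelihood triple per sample."""
    # Evidence under the non-variant model: every sample is homozygous reference.
    log_null = sum(math.log(liks[0]) for liks in genotype_likelihoods)

    # Evidence under the variant model: best candidate alternate allele frequency,
    # with a Hardy-Weinberg prior over genotypes at that frequency.
    log_alt = -math.inf
    for f in freq_grid:
        hw = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
        ll = sum(math.log(sum(p * l for p, l in zip(hw, liks)))
                 for liks in genotype_likelihoods)
        log_alt = max(log_alt, ll)

    num = math.exp(log_alt) * THETA
    den = num + math.exp(log_null) * (1 - THETA)
    return num / den   # posterior probability that the site is a real SNP

# Toy data: three samples, each with likelihoods for (hom-ref, het, hom-alt).
samples = [(0.9, 0.1, 1e-4), (0.05, 0.9, 0.05), (0.98, 0.02, 1e-5)]
print(round(site_posterior(samples), 3))
```

The per-sample genotype with the highest posterior weight under the chosen allele frequency would, in the same spirit, give the genotype call for that individual.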
And all pilots met this criterion, which means that when you look at the variant calls that we produced, you can be certain that 95 out of 100 variants will be real, so the false positive rate will be very low.

We have talked about calling genotypes, determining whether an individual is homozygous for the reference allele, heterozygous, or homozygous for the alternate allele. Now, from first principles, this should be very difficult for low coverage data, because sometimes there are no reads available from a given individual. I was very, very surprised to see that often very accurate genotype calls were made for samples for which there was not a single read aligned at a given location, and that's because of the power of linkage disequilibrium-based methods; a small illustration of why this matters appears below. This figure compares the accuracy within the same samples between the exon pilot, which was deep sequencing, and the low coverage pilot, and you can see that the accuracy is actually very, very high for the low coverage samples, despite the often missing information.

The flip side of accuracy is sensitivity, or power, that is, what fraction of the sites you discovered and what fraction of the sites you missed. We were able to do multiple different comparisons within and outside the project, for example looking at SNPs from HapMap, and comparing the low coverage to the exon project. You can see that if you have at least seven or eight chromosomes carrying the variant allele in your collection, you basically discovered that SNP 100% of the time, even with low coverage sequencing. It turns out that with deep coverage you can do better: your sensitivity extends down to two alleles, and you discover even a large fraction of the singletons. Again, that's because of the deep coverage. It's slightly difficult to see the numbers here, but there are a very, very large number of novel variants found in the project, multiple millions of novel variants, which is a large contribution of this project to the known polymorphic sites in humans.

Now, these are population-level quantities. What about per individual? For each individual in this project, we found about three to four million variants, including 10,000 to 11,000 non-synonymous changes, a couple of hundred in-frame indels, about a hundred premature stop codons, about 50 splice-site disruptions, and 50 to 100 recessive disease-causing mutations present in a database of human disease variation.

Comparing again between the various pilots: as I mentioned, the exon pilot deployed deeper sequencing, 25x and higher per sample, and because of that we were able to discover more at the low end of the allele frequency spectrum, and that's exactly one of the most important reasons why the exon pilot was carried out, so we can see what we're missing with the low coverage sequencing at low allele frequencies. The low coverage was about 2 to 6x per sample, and the high coverage was 25x and up; there were some samples with 100x coverage in the exon pilot. So, yes, at least I can speak for the exon pilot: there was a significant difference in coverage between 454 and Illumina, the two platforms that were used, but it turns out that there was very little difference up to 20x, so the coverage distributions look almost identical up to 20x, and it didn't really affect the discovery rate of the SNPs. Both platforms were equally suitable for discovery.
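To see why genotype calling from low coverage data alone is difficult, here is a small back-of-the-envelope calculation in Python, assuming a simple Poisson model of sequencing depth: at 2 to 6x mean coverage, a noticeable fraction of positions in a given individual have no reads at all, which is exactly where the linkage disequilibrium-based methods fill in the genotypes from other samples.

```python
import math

def prob_no_reads(mean_depth):
    # Under a Poisson model, P(zero reads at a site) = exp(-mean depth).
    return math.exp(-mean_depth)

for depth in (2, 4, 6, 25):
    print(f"{depth}x mean coverage: {prob_no_reads(depth):.1%} of sites with no reads")
```

At 2x roughly one site in seven has no data from that individual, while at 25x essentially every site is covered, which matches the contrast between the low coverage and exon pilots described above.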
Just the last slide about the exon pilot, which of course is close to my heart: there really is a huge excess of rare variants in this data set, which allowed us to really look below 1% allele frequency, but I don't have time to describe the actual science of these analyses. One slide: the 1000 Genomes data supports structural variant discovery from the next-gen sequencing data; again, details to come. And one thing that I couldn't resist putting in here: these are two completely different variation types, SNPs and mobile element insertions, such as Alu insertions or L1 insertions, and from the same exact data, from the same individuals, we could calculate and compare heterozygosity. I think that's the real power of these data sets that we are exploiting.

What are the data types that are delivered by this project? The reads in FASTQ format, and Steve, I believe, will be talking a little bit more about data formats; the alignments, that is, reads mapped to the genome, in the standard SAM/BAM format; and the variant calls in what's called VCF, or variant call format (a small sketch of the VCF layout appears below). Very importantly, there are tools available, developed by a number of groups, that you can download and use to analyze and manipulate the 1000 Genomes data. For example, you can take the BAM files, the alignments, and calculate various metrics, calculate coverage, discover your own variants, and so on. There is also a set of tools called VCFtools that can manipulate the variant calls in their own particular format. And you can use viewer programs, I just showed one example here, the Gambit viewer from my group, but there are others, for example the IGV viewer from the Broad Institute, to actually look at the data sets. It's pretty efficient browsing now.

Moving on, so this was the pilot and these were the tools. One slide about where we are right now. The current status is that we have analyzed 629 samples, and this supports 25.5 million variant calls: very good progress, a very low missed-SNP rate, and very high quality. Gil talked a little bit about the second phase of the project, expanding the number of samples to 1,100 and then beyond 2,500. It will consist both of low coverage whole-genome sequencing, 4x or perhaps a little higher per sample, which will give us an even more complete catalog of genomic variants down to about 1% allele frequency, and, in addition to that, full exome sequencing on all the samples, which will go below 1% allele frequency in the exome. And finally, I want to acknowledge the 1000 Genomes Project and all the participants who contributed to this work.

Yes, there is time for questions. You probably have to yell, because it's kind of difficult to see with the lights on.

The URLs for the tools that you listed didn't show up well in the handout. Will they be made available for us later on?

I believe that the presentations will also be posted and accessible so people can download them; they should be better accessible that way. They will be posted at genome.gov, which is the NHGRI's website, and these will come online in about a week.

I have a question about the exon pilot data set. You mentioned that you did not identify short indels in it, just SNPs. I'm surprised if you didn't see, for example, in-frame deletions or things like that in the exon data set. I believe there are reports in the literature of such types, so it's surprising that you didn't see that.

We did see that; I just didn't talk about it.

Okay, because it's not reported in the table that we have here.
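Since the variant calls are distributed as VCF, here is a minimal, self-contained sketch in Python of reading the fixed columns of a VCF file. The file name is hypothetical, and real analyses would more likely use VCFtools or a dedicated parsing library; the sketch just shows the tab-delimited layout (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, followed by per-sample genotype columns).

```python
def read_vcf(path):
    """Return a list of simple dicts, one per variant record."""
    records = []
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):           # skip header and meta-information lines
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, ident, ref, alt, qual, filt = fields[:7]
            records.append({
                "chrom": chrom,
                "pos": int(pos),
                "id": ident,
                "ref": ref,
                "alt": alt.split(","),          # alternate alleles are comma-separated
                "qual": qual,
                "filter": filt,
            })
    return records

# Hypothetical usage:
# for rec in read_vcf("pilot_release.vcf"):
#     print(rec["chrom"], rec["pos"], rec["ref"], "->", ",".join(rec["alt"]))
```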
There were about 100 short insertion-deletions in the exomes, and some of these were in-frame; most were at low allele frequency, as you would expect from these most likely functional variants.

Oh, okay. Thanks. Just one question. I heard that more samples will be available in the future for a full exome; does that mean these 1,000 genes or all the exons?

Meaning the full exome, so all the human genes, well, all the ones for which capture arrays exist.

So you will use capture arrays for all the exons together?

Yes, yes. There are, I believe, at least two, and maybe a larger number of, full exome capture arrays for the human genome available today that people are using.

You spoke about a global target of under 5% for the false discovery rate in the project. Can you discuss a little bit how you determined that?

That was done with experimental validation, in which multiple different strategies were used. There was direct Sanger-based resequencing, and also genotyping. These showed that, depending on the pilot, the numbers were slightly different, but they were all above 95% accuracy. The sensitivity, of course, was not experimentally determined, because that is generally much more difficult; it was instead estimated from comparisons to other projects with overlapping data.

Well, that's a good question. That's anybody's question. I can't really answer that. HGMD is probably the best currently available data set. We did not follow up on these in that regard. This was not a disease project, primarily. This was a project that set out to assay human genetic variation on a population scale, not in a disease context. Those are, of course, interesting questions, but there will be other studies that answer them.

Any other questions? In that case, moving on to the next presentation, Steve Sheriff.