Okay, thanks very much, and thanks to everyone who's stuck with us to the end. After hearing, I think, some really nice summaries of the huge amount of data the project has generated and the different ways the community can get access to it, we'll finish with a discussion of some specific examples of how the 1000 Genomes Project can be used in disease association studies. Obviously, in addition to the really interesting biological insights coming from the data itself, the project hopes that using the data more broadly will let us gain all kinds of insights across all kinds of different projects.

To frame the question: there have been a lot of talks at this meeting about doing genome-wide association studies of complex disease, and we can frame the purpose of those studies as two parallel goals: to explain the heritability of disease, and to understand the biology of disease. Eric Lander put this really nicely in the opening talk, in that sometimes we think we'll crack heritability first and then understand the biology, but really the two are at the very least happening in parallel, and possibly the biology will crack first. In terms of how these two goals will eventually be used in a translational sense, heritability will hopefully be useful for things like predicting disease and predicting prognosis, whereas understanding the biology of disease might be useful for developing treatments. These are obviously very forward-looking things, but the 1000 Genomes Project data can be used towards both goals, and I'll mention examples of each.

For the first goal, understanding the heritability of disease, the statistical technique of imputation is the principal way we use 1000 Genomes data. For biological understanding, the relevant use is annotation in a very broad sense, by which I mean annotating a region of the genome and cataloguing all of the places where there's variation, so that we can use that variation to interpret the signals we see in association studies.

So, starting with imputation, and I'll give a bit more technical detail in a moment: this is some data from HapMap, actually, from before 1000 Genomes. The x-axis shows allele frequency, going from rare to common SNPs, and the curves on the y-axis show, essentially, the improvement you get from having more and more reference samples available. Imputation is basically using a reference population to predict genotypes in some target GWAS data. The key point is that for common variation we already do a very good job of imputation using the relatively small number of samples available in HapMap and the 1000 Genomes pilot, but increasing the number of reference samples really increases the accuracy of imputation for low-frequency variation. So this is the key observation: to accurately impute low-frequency variants, we need a large number of reference samples.

Imputation can be done with a number of programs; I've listed the three I'm most familiar with, and there are a few others. They include IMPUTE, the second version of which is really optimized for modern imputation problems, developed in Oxford by Bryan Howie and Jonathan Marchini; BEAGLE, developed by Brian and Sharon Browning, previously in Auckland, New Zealand, and now at the University of Washington; and finally MaCH, and its follow-on program minimac, developed by Gonçalo Abecasis' group in Michigan.

Each of these operates on a similar principle, which I've alluded to and which Gil mentioned at the beginning of the evening: you have a panel of reference samples that are extremely dense in the number of variants for which they have genotype data (in the case of 1000 Genomes, of course, the complete sequence of the reference samples), and you use the correlation between those variants to predict that entire set of variation into a GWAS-type sample where you only have a small subset, say a few hundred thousand or maybe a million variants.
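To make that principle concrete, here is a deliberately toy sketch in Python. It is not what IMPUTE2, BEAGLE, or MaCH actually do internally (they use hidden Markov models over the reference haplotypes and produce probabilistic genotype dosages), and all of the haplotypes and genotypes are invented; the matching is reduced to a simple mismatch count just to show the shape of the problem.

```python
# A deliberately toy illustration of the imputation idea: match a sample's
# sparse GWAS genotypes against pairs of dense reference haplotypes, then
# fill in the untyped sites from the best-matching pair. Real tools
# (IMPUTE2, BEAGLE, MaCH) use hidden Markov models over the reference
# haplotypes and report probabilistic dosages; all data here is invented.
from itertools import combinations_with_replacement

# Reference panel: phased haplotypes over five sites (0 = ref, 1 = alt).
reference_haplotypes = [
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
]

# One GWAS sample: diploid genotypes (0/1/2 alt-allele counts), typed only
# at sites 0 and 3; None marks sites the genotyping array did not cover.
gwas_genotypes = [1, None, None, 2, None]

def impute_naive(genotypes, panel):
    """Pick the pair of reference haplotypes whose implied genotypes best
    match the typed sites, then copy their alleles at the untyped sites."""
    typed = [i for i, g in enumerate(genotypes) if g is not None]
    best_pair, best_mismatch = None, None
    # Enumerating haplotype pairs is what makes this grow quadratically with
    # panel size; the pre-phasing idea discussed later avoids exactly this.
    for h1, h2 in combinations_with_replacement(panel, 2):
        mismatch = sum(abs(h1[i] + h2[i] - genotypes[i]) for i in typed)
        if best_mismatch is None or mismatch < best_mismatch:
            best_pair, best_mismatch = (h1, h2), mismatch
    h1, h2 = best_pair
    return [g if g is not None else h1[i] + h2[i]
            for i, g in enumerate(genotypes)]

print(impute_naive(gwas_genotypes, reference_haplotypes))
# -> [1, 0, 1, 2, 1]: sites 1, 2 and 4 filled in from the matched pair
```

The expensive step, searching over combinations of reference haplotypes against unphased genotypes, is the reason the computational cost discussed next becomes an issue as reference panels grow.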
This is still a very computationally heavy-duty process, especially as the reference sets get bigger and bigger. Here's an example we did using the 1000 Genomes pilot data. We took the Wellcome Trust Case Control Consortium (WTCCC) data, a very big first-generation GWAS with 16,000 samples, and we generated a combined SNP and indel reference panel from the 1000 Genomes pilot data. It's worth pointing out that although this is becoming routine practice, aligning all of the data to the same strand, making sure the alleles are correctly set up, and getting the files into the correct formats can still be a big headache. I'll mention in a few minutes that the project is working to make these kinds of files available in as usable a format as possible, but it's worth being aware that, for the most part, it's still not a push-one-button-and-go operation.

We ran these analyses with the IMPUTE v2 factory default settings. You can tweak those settings as well, although I think the recommended defaults are certainly a good place to start. You can't do this kind of analysis on the entire genome at once, so you have to split the genome into chunks, and to keep good accuracy you actually split it into overlapping chunks so that you don't lose anything at the chunk boundaries. You can then submit each chunk as a job on a computing cluster. Even these small chunks are relatively memory-hungry, so we need machines with between 4 and 6 GB of memory, and the entire process takes about two CPU-years of compute. That is quite a lot; we do have a big cluster, so it doesn't take two actual years of wall-clock time. The cost scales approximately linearly with sample size, so if you scale back from the 16,000 samples we were working with, which is obviously a huge data set, it comes to roughly one to two hours per individual; you can imagine that a couple of thousand individuals would take a couple of thousand compute-hours. So, again, this is becoming a routine process, but it's still worth pointing out that you need some serious computational hardware to do it on a large scale; the sketch below gives a feel for the bookkeeping and the budget.
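Here is a small Python sketch of that bookkeeping: splitting a chromosome into overlapping chunks, one cluster job per chunk, and estimating the total compute. The chunk size and overlap are arbitrary illustrative values, not the settings we actually used, and the per-sample cost of about 1.1 CPU-hours is simply back-calculated from the figures quoted above (roughly two CPU-years for 16,000 samples).

```python
# Sketch of the chunked imputation workflow: split a chromosome into
# overlapping windows, submit one cluster job per window (4-6 GB of memory
# each, in our hands), and budget the total compute. The chunk size and
# overlap below are illustrative values only.

def make_chunks(chrom_length_bp, chunk_bp=5_000_000, overlap_bp=250_000):
    """Return (start, end) windows covering the chromosome, each extended to
    overlap its neighbours so accuracy does not drop at chunk boundaries."""
    chunks, start = [], 1
    while start <= chrom_length_bp:
        end = min(start + chunk_bp - 1, chrom_length_bp)
        chunks.append((max(1, start - overlap_bp),
                       min(chrom_length_bp, end + overlap_bp)))
        start = end + 1
    return chunks

def cpu_hours(n_samples, hours_per_sample=1.1):
    """Whole-genome imputation scales roughly linearly with sample size."""
    return n_samples * hours_per_sample

chunks = make_chunks(51_000_000)              # a ~51 Mb chromosome -> 11 jobs
print(len(chunks), chunks[0], chunks[-1])
total = cpu_hours(16_000)
print(f"{total:,.0f} CPU-hours, i.e. about {total / (24 * 365):.1f} CPU-years")
```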
There are some interesting methodological developments happening that may help with this. The idea of pre-phasing can save a lot of time. Again, speaking simplistically, the imputation algorithms are looking for a haplotype in the reference panel that is very similar to the skeleton of genotypes in my GWAS set, and then filling in the missing data from the matched reference haplotype. The typical setup for these imputation algorithms today is to work from the GWAS genotypes directly, so at each position you just have the diploid genotype. To match those genotypes against the phased haplotypes in the reference panel, the software has to consider different possible phasings of the genotyped samples and then match each of those against the possible reference haplotypes, which is computationally very intensive.

Instead, you can phase your GWAS data in advance. You go from having a GWAS data set of genotypes to pairs of phased haplotypes, and you save that result. What you can then do, again in a very simplified description, is the much simpler job of matching the individual phased haplotypes in your target set against the reference, and that's much faster: roughly speaking, instead of the work growing with the square of the number of sequences being considered, it just scales linearly. Also, by saving this information, you can keep imputing into new reference sets in the future as they become available, much faster than starting from scratch each time. This approach is implemented via program options in IMPUTE v2 and BEAGLE, and, as I mentioned, there is a separate program called minimac that does this for MaCH.
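Here is the earlier toy example reorganized around pre-phasing. The target haplotypes are simply assumed to be already phased (in practice that phasing step is done once by the imputation or phasing software and saved), and the two reference panels are hypothetical releases, included only to show that the saved haplotypes can be re-imputed against each new panel; the matching now considers one reference haplotype at a time rather than pairs, which is where the speed-up comes from.

```python
# A toy version of the pre-phasing idea: the GWAS samples are phased once and
# saved, and imputation against any reference release is then a simple
# haplotype-to-haplotype match, linear in the panel size, rather than a joint
# search over haplotype pairs. The two reference panels are hypothetical
# releases with invented data.

reference_2010 = [            # e.g. a pilot-based panel
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
]
reference_2011 = reference_2010 + [[1, 1, 0, 1, 0]]   # a later, larger release

# One GWAS individual, already phased at the typed sites (0 and 3 here);
# None marks sites absent from the genotyping array.
target_haplotypes = [
    [0, None, None, 0, None],
    [1, None, None, 1, None],
]

def impute_prephased(hap, panel):
    """Copy untyped alleles from the single closest reference haplotype."""
    typed = [i for i, a in enumerate(hap) if a is not None]
    best = min(panel, key=lambda ref: sum(ref[i] != hap[i] for i in typed))
    return [a if a is not None else best[i] for i, a in enumerate(hap)]

# The saved, pre-phased target haplotypes are re-used unchanged against each
# new reference release as it appears.
for panel in (reference_2010, reference_2011):
    print([impute_prephased(h, panel) for h in target_haplotypes])
```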
In terms of reference data sets, what most groups doing imputation into GWAS have used in the past are the HapMap 2 and HapMap 3 data sets. These go from 270 samples in HapMap 2 up to over 1,000 samples in HapMap 3, but with a relatively small number of SNPs compared to what's been discovered in 1000 Genomes: between 1 and 2 million SNPs in these reference panels. What 1000 Genomes enables right now is a much greater number of variants, although in a relatively small number of samples, the 179 pilot samples. As I mentioned, we've generated a merged set of SNP and indel calls, so you can simultaneously impute SNPs and small indels, more than 10 million variants in total. We just heard that there are genotypes for deletions, though not yet for all of the structural variation, and the project is aiming to form integrated call sets with SNPs, indels, and each class of structural variant that has genotypes, so that you can run one imputation analysis across all kinds of variation together instead of treating each class separately. Eventually you'll get a predicted genotype in your disease samples at every kind of variant discovered in 1000 Genomes, and we think that will be a really useful thing for people.

Right now you can download the VCF data from 1000 Genomes, as has been mentioned, and you'll then have to convert it to the formats required by each imputation program. I'll just point out that Jonathan and Bryan in Oxford have converted the pilot calls to their IMPUTE format, so you can download ready-to-use files directly from their website, which I think is very helpful, and they plan, with some lag behind the project, to release formatted calls with each new release of project data. Indeed, as we go forward, the full 1000 Genomes Project will eventually have many, many samples as well as many, many variants, and we'll continue to release these phased, combined sets of SNPs, indels, and structural variants. It's worth pointing out that while I think these will be really useful, each of those classes of variation has very different properties in terms of how confident we are in the genotypes. SNPs we're pretty good at calling right now, and they generally have very high confidence; indels are getting very good but aren't perfect; and for structural variation there's still a lot of development in calling genotypes. So these kinds of joint data sets will, I think, come with very large, bold caveats saying that you have to be careful interpreting the output depending on the kind of variation you're looking at. It will be good to get this into the public domain, but again, remember that not all variant calls are created equal.

I'll finish with a couple of examples. Here's a GWAS plot for Crohn's disease; anyone who's worked on GWAS will be very familiar with these. The x-axis is position along chromosome 22, and the y-axis is the significance of association with disease, each point being a SNP. The points are colored by source: green dots were genotyped in the original project; red dots were imputed in a HapMap 2-based analysis; and the blue and gray dots are SNPs that couldn't be imputed from HapMap 2, either variants already in dbSNP or variants newly discovered by 1000 Genomes. You'll see two strong peaks on this plot, one at 28 megabases and one at about 42 megabases; the red line is genome-wide significance. Most of the signal is in fact coming from the gray and blue dots, which were only discovered in 1000 Genomes. The hit at the 28-megabase position was completely missed in the original WTCCC study as well as in a 2008 meta-analysis of Crohn's disease; the p-value there was larger than 10^-4, so it wasn't followed up. It's worth pointing out that the key SNP from 1000 Genomes that really makes this a significant association has only a 3% allele frequency in Europeans. It's this class of low-frequency variation that 1000 Genomes is only just beginning to let us get at.

Now, you might say this hit could be a false positive, but in fact we're just about to publish an enlarged meta-analysis of Crohn's disease with over 20,000 total samples, which absolutely confirms that this is a true Crohn's disease association. However, the hit SNP in that analysis, which used HapMap-based imputation, is at 13% allele frequency. So in essence, with a much larger set of samples, we found a signal at a SNP that is probably only very weakly in LD with the causal variant: it's at 13%, rather than the likely causal SNP at 3%. This is really exciting, because in this particular instance, had we had the 1000 Genomes reference, we might have been able to find this gene with just the original WTCCC samples of about 5,000 instead of having to do the much larger meta-analysis.
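Since the 13% versus 3% contrast is really a statement about LD, here is a small Python helper showing the calculation. The r² formula and the upper bound are standard population-genetics results; the 3% and 13% frequencies are the ones quoted above, and the "fully linked" case is an invented contrast.

```python
# Helpers for the LD point above: r^2 between two SNPs from haplotype and
# allele frequencies, plus the standard upper bound on r^2 when the SNPs have
# different allele frequencies. The 3% and 13% figures are the ones quoted
# for the Crohn's example; the "fully linked" case is an invented contrast.

def r_squared(p_ab, p_a, p_b):
    """r^2 = D^2 / (pA(1-pA) pB(1-pB)), where D = pAB - pA*pB and pAB is the
    frequency of the haplotype carrying both minor alleles."""
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def max_r_squared(p, q):
    """Largest possible r^2 between SNPs with minor allele frequencies p and q,
    reached when every copy of the rarer allele sits on a haplotype that also
    carries the commoner one."""
    p, q = sorted((p, q))
    return (p * (1 - q)) / ((1 - p) * q)

print(f"max r^2, 3% vs 13% variants: {max_r_squared(0.03, 0.13):.2f}")        # ~0.21
print(f"r^2, two fully linked 3% variants: {r_squared(0.03, 0.03, 0.03):.2f}")  # 1.00
```

Roughly speaking, the expected association signal at a tag SNP scales with its r² to the causal variant, so even under perfect co-occurrence a 13% tag can capture only about a fifth of the signal carried by a 3% causal variant; that is why imputing the low-frequency variant directly from 1000 Genomes makes such a difference here.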
Now, as a word of caution, these hits aren't leaping out all over the place, and I've chosen a very interesting example. We also have the peak to the right, at around 40 megabases, and that one isn't supported at all in the meta-analysis. So it is still difficult to work out all of the subtleties of doing imputation-based analysis with 1000 Genomes. It's worth pointing out that the very top blue SNP there is in fact an indel, which might again imply that the indels aren't quite as reliable as the SNPs yet.

I'd also like to mention this idea of validation and gold standards. The project did a lot of genotyping to validate the sequence calls, and this is really useful, but it's also important to keep in mind that I don't think either technology is always the gold standard. What I've shown here is genotyping data: the x- and y-axes are the intensities of the two alleles at this particular location, allele C on one axis and allele T on the other, from the genotyping experiment. Calling genotypes is basically just coloring in the dots, so you could color one homozygous cluster blue and the heterozygous cluster green. What we've done here, instead of using the calls from the genotyping experiment, is to color the genotyping intensities with the sequence-based calls. In this example you can see the three populations from the low-coverage data, and they agree essentially perfectly: the variant doesn't exist in the African or Asian populations, there are a number of heterozygous individuals in the European population, and the sequence calls, shown in green, are perfectly concordant with the genotyping intensities. You probably can barely see it, but there is one green dot in the YRI box; this is a rare variant, a singleton discovered in the sequencing, which is also nicely genotyped. That's important going forward, because we obviously want accurate data on rare variation, and it's reassuring that at least some of the time both sequencing and genotyping achieve high accuracy for rare variants.

In other circumstances the sequencing clearly makes a mistake. Here, the variant was discovered in the Asian samples, and there the sequence genotypes are exactly concordant with what we'd expect from the genotyping. But you can see that the variant is actually quite polymorphic in the African population; there are not just heterozygotes but also non-reference homozygotes. Yet the sequencing didn't discover it at all, so in this case, within that population, the sequencing missed the variant. You can also see the opposite happening. This one is a little hard to see because the genotyping intensity data isn't showing much signal; it's all rather compressed over to the left side. But if we color the dots based on the sequence calls, there's a pretty clear striping of blue reference homozygotes, green heterozygotes, and a couple of red non-reference homozygotes. So here the sequencing has pretty clearly picked up a rare variant, but for whatever reason the probe used in the genotyping experiment hasn't been able to distinguish the genotypes very well. So as we try to build gold-standard reference sets, it's really a combination of interrogating both the sequence and the genotyping data that lets us know what the truth is.

Switching now from using imputation to discover new associations, we can also use the wealth of new variation in 1000 Genomes to annotate existing GWAS results. For example, this plot shows different classes of functional variation (non-synonymous, stop, splice, and the HGMD disease mutations) and their enrichment or depletion relative to neutral variation across derived allele frequencies. At the right-hand side of the graph, at high derived allele frequencies, these functional variants are, as we'd expect, much less common; they sit below the dashed line, which is the rate of neutral variation at that frequency. At the left-hand side we see an enrichment: these kinds of variants are much more likely to be at very low frequency.
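For anyone who wants to reproduce that kind of comparison, here is a minimal sketch of the underlying calculation: bin variants by derived allele frequency and take the ratio of the functional class's frequency spectrum to that of a putatively neutral class. The bin edges and frequency lists are invented; in practice the frequencies would come from annotated 1000 Genomes calls.

```python
# A minimal sketch of the enrichment calculation behind that kind of plot:
# bin variants by derived allele frequency (DAF) and take the ratio of a
# functional class's frequency spectrum to that of a putatively neutral set.
# The bin edges and frequency lists are invented example data.

def spectrum(dafs, bin_edges):
    """Fraction of variants falling into each derived-allele-frequency bin."""
    counts = [0] * (len(bin_edges) - 1)
    for f in dafs:
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= f < bin_edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts) or 1
    return [c / total for c in counts]

def enrichment(functional_dafs, neutral_dafs, bin_edges):
    """Per-bin ratio of the functional to the neutral spectrum; values above 1
    mean the functional class is over-represented at that frequency."""
    func = spectrum(functional_dafs, bin_edges)
    neut = spectrum(neutral_dafs, bin_edges)
    return [f / n if n > 0 else float("nan") for f, n in zip(func, neut)]

bins = [0.0, 0.01, 0.05, 0.25, 1.01]              # rare -> common
nonsyn = [0.002, 0.004, 0.01, 0.03, 0.02, 0.3]    # invented example data
neutral = [0.002, 0.04, 0.2, 0.5, 0.7, 0.9]
print(enrichment(nonsyn, neutral, bins))          # enriched in the rare bins
```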
So we need projects like 1000 Genomes to discover these interesting functional variants, because they're hugely skewed towards being low frequency. There are some cases where a GWAS hit SNP is really strongly correlated, say r² greater than 0.9, with a functional variant, and that can be a really useful smoking gun for nominating that functional variant as a possible causal allele. But there's still a lot left to be discovered, and a lot of evidence suggesting that functional protein-coding changes aren't the whole story in explaining GWAS hits, because fewer than 10% of GWAS hit SNPs have a strongly correlated (r² greater than 0.9) coding SNP in the 1000 Genomes data.

So, to finish: 1000 Genomes is definitely going to become the default reference panel; it will incorporate all the information that's currently in the HapMap data and then some. In addition to imputation, which the project will support by releasing these reference data sets, it also enables the discovery and annotation of variants, the standardization of file formats across the many different types of data the project generates, which we've heard a lot about, and the development of genotyping products. Companies like Illumina are now building arrays with 2.5 or 5 million SNPs and indels, all generated from the 1000 Genomes data, and those are going to enable a new generation of studies. And finally, there is coming to grips with the subtleties of the data: I mentioned how the different types of variation are at different states of maturity; this will keep evolving over time, and the project is really going to push on generating high-quality data sets.

I'll finish there. Obviously most of this work was done by the huge project consortium. Gonçalo, Brian, Bryan, and Jonathan developed the imputation methods, which have been really useful to us; James helped with some of the functional annotation; and Luke in my group really did a lot of the work on the imputation. And there we were at ASHG last year; I don't think our picture this year will be quite as nice. Thanks very much.

Sorry, could you repeat that? So the question is whether there's any plan for the individuals sequenced as part of 1000 Genomes to get some benefit from their data. These individuals are completely anonymized; there's no way to connect their sequence back to them. So no, no direct benefit, just the indirect benefit that the world will get from learning something new about medicine. Yeah.

Okay, two questions; I was hoping you could help me understand a few things. What is the overlap, if any, between individuals from HapMap and the 1000 Genomes data? Were the same individuals used? Are there some that were the same and some that aren't? So the overlap is very substantial. Are all of them in HapMap 3? So most of them come from the HapMap itself, and I think that all of the 1000 Genomes pilot individuals, at least, are in the HapMap. The populations for the full 1000 Genomes Project are actually quite a bit more diverse than just the HapMap, so many of those are brand-new individuals. Okay, and the second question is, why aren't all HapMap SNPs represented in the 1000 Genomes data? That's a good question. This really gets to calibrating the sensitivity and specificity of discovering variation in the sequence data: there are real variants, such as ones that were actually genotyped by the HapMap, which were simply missed by the low-coverage sequencing. It doesn't find everything.
That being said, we're hoping in the future to apply many technologies to these samples: big genotyping chips, high-coverage exon sequencing, low-coverage whole-genome sequencing, and array CGH experiments for structural variation. We hope eventually to release combined data sets that take the best of every kind, so that it really is as good a guess as we can make at every position in the genome. That would eliminate the problem you're describing, because the variants missed by the sequencing would be filled in with the genotyping results.

So, for those of us eager to start using data from the full project as an imputation reference panel, can you clarify the expected timeline of when we'll start to see the first wave of phased VCF files from the larger sample sizes? We did just discuss this before the meeting started, and the goal is to pace the future releases so that on the one hand there won't be too many, but on the other hand useful data gets out soon. So the goal is that very soon we'll release an interim set of variant calls from about 650 samples; the timeline for getting phased reference sets from that is probably early next year. A bigger release is then aimed to be generated, with preliminary analysis, for the Cold Spring Harbor Biology of Genomes meeting in May of next year, and then I think another, hopefully big, release by late 2011. Okay, I think everyone's ready for the bar.