All right, good evening. Thanks to everyone for coming out tonight. What I'd like to do in the next five or ten minutes is give a brief overview of how to access the data, both on the project website and in the other resources we have at NCBI and EBI. I'll be followed by Paul, who will give a nice introduction to the browser they've developed. Before I go into how to get the data, I thought I would review a bit of what Gabor said: there are three primary data types generated by the project. We have sequence, which is in FASTQ format. As you can see here, for those who care, it's a tag followed by a fragment of sequence, followed by base quality scores in an ASCII-coded system where the characters represent Phred-based quality scores. The SAM/BAM format is for how the alignments actually look. If you're looking at the text version of the data, which is SAM, you would see a file like this; this is a very small shot, just the alignment of one read, where you have some header information at the top in red, and alignment information in blue with a CIGAR string, followed by sequence and, again, the quality scores. These are very large files. Both the FASTQ sequences and the SAM files are now approaching tens of terabytes of data. So if you're going to download this data, there's a real investment you have to make, both in hardware infrastructure for disk and in bandwidth at your institution, to make sure you can do this in a timely fashion. But again, as Gabor and Gil alluded to, we want to make these data useful to a broad variety of people. So we also report the variant calls, which are the summarization of the differences that pass the QC filters, or, in raw versions, could also include potential variants that don't pass the QC filters.
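As a concrete illustration of the FASTQ encoding just described, here is a minimal sketch in Python. The sample record and read name are made up, and it assumes the Sanger-style ASCII offset of 33 for the Phred scores; other encodings with different offsets exist in older data.

```python
# Minimal sketch: parse one FASTQ record and decode its base quality
# scores from the ASCII encoding described above.
# Assumes Sanger-style encoding (ASCII offset 33); the record is made up.

def parse_fastq_record(lines):
    """Return (read_id, sequence, phred_scores) for one 4-line record."""
    header, seq, plus, quals = lines
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("malformed FASTQ record")
    scores = [ord(c) - 33 for c in quals.strip()]  # ASCII char -> Phred score
    return header[1:].strip(), seq.strip(), scores

record = [
    "@read_001",   # tag line (hypothetical read name)
    "GATTACA",     # sequence fragment
    "+",           # separator
    "IIIIHH#",     # ASCII-encoded qualities
]
read_id, seq, scores = parse_fastq_record(record)
print(read_id, seq, scores)  # "I" decodes to 40, "H" to 39, "#" to 2
```

The low trailing score in the example shows why the quality string matters: callers typically down-weight or clip such bases.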
But right now, for clarity, we're showing in the release files all of the things that look like real variants. So you have a list here; this is the VCF format, and again, the links are on these slides, as in Gabor's. In this format you have chromosome, position, and identity, if there's a dbSNP rs number it's available, as well as the alternate allele information. In the files with genotypes, you have tab-delimited genotype columns, one per individual; in the no-genotype version there are only eight columns in the file. These are very compact and provide an easy-to-digest, easy-to-parse representation of the data. Again, this is a standard format that you can look up.

Having reviewed those three types of data, I thought I'd take you on a quick tour of where the data is available on the FTP site and in the browser. 1000genomes.org is the URL you should start with. It's hosted by EMBL-EBI, and the data behind it is mirrored at both NCBI and EBI. The link that's most important is probably the Data link; that's where you go to get access to all of the data. That will take you to a new page that reviews the data release policy, the conditions of use, publication authorship, and what kind of credit the consortium would appreciate for use of the data. Scrolling down the page, you'll see a review of all of the file formats and conventions I just described, and then, further down, how to access the data. The important links are these two: the FTP sites at EBI and NCBI, which are mirrored every 24 hours, so you can go to either site and you'll see the same FTP organization. I'll walk through that in a moment, but I'd also like to point out that if you really are trying to mirror all of the data, I would strongly encourage you to consider using the Aspera technology.
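To make the column layout concrete, here is a small sketch of parsing one VCF data line: the eight fixed columns, then (in the genotype version of the file) a FORMAT column followed by one column per individual. The example line and sample names are illustrative, not taken from the release files.

```python
# Minimal sketch: split a VCF data line into its eight fixed columns plus
# per-sample genotypes. Column names follow the VCF convention described
# in the talk; the data line itself is a made-up illustration.

FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line, sample_names=None):
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED, fields[:8]))
    if len(fields) > 8 and sample_names:      # genotype version of the file
        fmt = fields[8].split(":")            # e.g. "GT" or "GT:GQ"
        record["samples"] = {
            name: dict(zip(fmt, value.split(":")))
            for name, value in zip(sample_names, fields[9:])
        }
    return record

line = "1\t10583\trs58108140\tG\tA\t100\tPASS\tAA=G\tGT\t0|0\t0|1"
rec = parse_vcf_line(line, sample_names=["NA12878", "NA12891"])
print(rec["ID"], rec["samples"]["NA12891"]["GT"])  # rs58108140 0|1
```

The no-genotype files stop after the INFO column, which is why they stay so compact.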
This is a much faster, UDP-based transmission protocol than FTP, about 10 times faster; it will get up to 300 megabits per second if you have the infrastructure in place. It will really help if you're downloading the alignments and the FASTQ; if you're just getting the other files, the VCF files, it's not so important.

Going into the FTP site, I'll just point out a few of the core paths that are useful to know about. First, there are READMEs up here, kindly maintained by Laura Clarke at EBI, that document in really nice detail the organization of the FTP site and the formats of the files. There is a folder called data, which is where all of the alignment and sequence information is posted by sample ID; this would be the Coriell NA number. For all of the roughly 1,100 samples we have now, that's where you can get all of the data that's been sequenced, and it's updated nightly. We have pilot data, which holds the release sets that have just been released with the Nature paper; so if you're trying to find the data sets of the VCF calls, the indels, the SVs, everything else, that's in this pilot data folder. Release is a folder for our past releases; if you're interested in historical data from 2008, 2009, or the early 2010 material, that's where we put the archived releases. And finally, for those of you that want to be on the bleeding edge, Gil alluded to the fact that we make all of our work products publicly available: there's a technical folder, and in that folder you can see our latest unreleased products, the working files the project is working on right now. It's caveat emptor: don't ask the help desk why things may or may not work there, because this is really the cutting edge, where the project shares data about what we think the next release might look like. It's there if you want to look at it, but it's not supported as a release product until it's in the release folder.
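The folder layout just described can be sketched as a small path helper. The four top-level folder names come from the talk; the exact root path ("/vol1/ftp") and the underscore spelling of the pilot data folder are assumptions to check against the READMEs on the site.

```python
# Sketch of the FTP layout described above as a small path helper.
# The top-level areas (data, pilot_data, release, technical) come from
# the talk; FTP_ROOT is an assumed value, worth verifying in the READMEs.

FTP_ROOT = "/vol1/ftp"

def ftp_path(area, *parts):
    """Build a path under one of the core top-level folders."""
    areas = {"data", "pilot_data", "release", "technical"}
    if area not in areas:
        raise ValueError(f"unknown area: {area}")
    return "/".join([FTP_ROOT, area, *parts])

# Per-sample sequence and alignment data live under data/<Coriell NA id>:
print(ftp_path("data", "NA12878"))
# Released pilot call sets (SNPs, indels, SVs) live under pilot_data:
print(ftp_path("pilot_data"))
```

The same paths work against either mirror, since the EBI and NCBI sites share one organization.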
So beyond the website, where else is the data available? dbSNP is the primary archive for the SNP and indel data coming out of the paper. The SNPs were put in Build 132, which was released a few months ago, and right now dbSNP is up to 23.6 million uniquely mapped SNPs; these are the single-base, class 1 variants. The deletions from the paper will be coming out in Build 133, and I believe there are about 1.1 million indels that will be loaded. And then there are other categories of things in there that the project isn't really generating.

Something we've done to help develop the VCF file format: I'm pleased to announce that we've taken all of dbSNP and dumped it in VCF format. So there's a file at this link where you can go and get a file of all of the content of dbSNP, organized by mapped position; it would be everything that's uniquely placed. There's a nice annotation in column eight, the INFO column, where we annotate tags for every SNP, and the URL at the top points to an Excel spreadsheet that describes the 47 tags we're using in VCF to capture the clinical information. So SNPs that are in OMIM, or known to be clinical or diagnostic, or on a genotyping kit, or in LSDBs, they're in there. All of the functional information for a SNP, whether it's in a coding region, an intron, an exon, a UTR, a 3D structure, a splice site, that's all in there. The sequence annotation, if there are oddities about how the SNP aligns to assemblies, if it aligns to the reference but not to Celera, or vice versa, all that kind of nuanced detail is there. Information about genotyping: the platforms, whether SNPs are on genotyping platforms, whether there are conflicts, where the same SNP has been genotyped at different times in the same person and is inconsistent, we tag all of that in dbSNP. So if you're looking for a quick way to get the entire contents of dbSNP in one file, that's the place to go.
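As a sketch of how you might pull those annotations out of column eight, here is a minimal INFO-column parser: VCF allows both `key=value` tags and bare flag tags, so both forms are handled. The tag names in the example are invented for illustration, not the actual documented tags from the spreadsheet.

```python
# Minimal sketch: split a VCF INFO column (column eight) into a tag
# dictionary, as you'd need to pull the clinical/functional annotations
# out of the dbSNP dump. The tag names below are hypothetical examples.

def parse_info(info):
    tags = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        tags[key] = value if value else True   # flag tags carry no value
    return tags

info = "GENEINFO=BRCA1;CLN;VC=SNV"   # hypothetical annotation string
tags = parse_info(info)
print(tags["GENEINFO"], tags["CLN"])  # BRCA1 True
```

With the tags in a dictionary, selecting, say, only the clinically flagged SNPs from the full dump becomes a one-line filter.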
The contents of Build 132 have been put out on all of our RefSeq annotations. RefSeq is an NCBI product covering chromosomes, mRNAs, and proteins, and we collaborate with EBI to make the LRGs, which are the clinical sequence records for the clinical reporting standard. So variations are annotated in all of those places.

The next thing I'd like to say is that if you're trying to find sequence data by sample, we have a resource called BioSample at NCBI, which you can find from a pull-down menu here at the top. If you put in the tag "1000 genomes", and you can add pilot one, two, or three, it will return all of the samples that were actually sequenced as part of the project, with their Coriell NA number. In this example I did pilot two, and these are the six individuals from the two trios; likewise, there are the 179 from pilot one and the 757 from pilot three. You can click on any one of those samples, and that will take you to a sample summary record, which points back to dbSNP for all the content for that person and, maybe more importantly, to SRA, to get all of the sequence and alignment data for a particular individual. So that's an easy way to jump in and pull data by sample ID.

The last slide is to say we have collaborators at Amazon who have been trying to develop cloud-based computing platforms, and we're happy to say the 1000 Genomes Project has been an enthusiastic partner in putting the pilot data into the Amazon cloud. So for those of you that are pursuing cloud solutions, there's an S3 address for finding the data, and there's also an XML summary. Right now the data are chunked into five-gigabyte pieces because of a file size limit at Amazon that I understand might be going away in the next few months, and then we'll be back to whole chromosomes. With that, I'd like to stop, maybe take a few questions, and then turn it over to Paul. Thanks.
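For the S3 route, a bucket listing comes back as standard S3 `ListBucketResult` XML, which can be parsed with the standard library. The snippet below is a made-up illustration of that generic format, with hypothetical key names; it is not actual project output, and the real bucket address and summary should be taken from the slide.

```python
# Sketch: parse an S3 bucket listing (XML in the standard
# ListBucketResult format) to find the data chunks and their sizes.
# The listing below is a made-up illustration, not real project output.

import xml.etree.ElementTree as ET

NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"

def list_keys(xml_text):
    """Return (key, size_in_bytes) pairs from one listing page."""
    root = ET.fromstring(xml_text)
    return [(c.findtext(f"{NS}Key"), int(c.findtext(f"{NS}Size")))
            for c in root.findall(f"{NS}Contents")]

listing = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>1000genomes-example</Name>
  <Contents><Key>pilot_data/chr1.part1.bam</Key><Size>5368709120</Size></Contents>
  <Contents><Key>pilot_data/chr1.part2.bam</Key><Size>1073741824</Size></Contents>
</ListBucketResult>"""

for key, size in list_keys(listing):
    print(key, size)
```

The five-gigabyte chunking mentioned above is visible in the sizes: once the object-size limit goes away, a listing like this would show whole-chromosome files instead.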
And then please speak into the mic so the recorded audience can hear.

So, the QC filter that you applied to the VCF file? The QC filter column allows anybody running a SNP calling algorithm to write their QC information; PASS is the default string for things that passed a caller's standard. It's not a rigorously defined column right now. Do you have documentation for that? I think the methods documentation in the paper would explain how each caller applied QC to their call sets.

Hi, I downloaded the VCF from dbSNP maybe a few weeks ago; it was labeled, I think, September 30th, but at least at that time it was only about 10% of dbSNP, and I interpreted it as just the 1000 Genomes-related component. But you implied it's the entire dbSNP, so did it change, or did I misread it? As of 2 p.m. today, the full file came up, so this is a new, fresh-off-the-press release.

A final question: do I understand that there's now more detailed allele frequency data in this compiled dbSNP data set you're describing? The version that we put out today does not have allele frequency data in it. The genotypes that we have in dbSNP are there by population, but that doesn't include all SNPs. We have allele frequency data right now on about 50% of dbSNP; a lot of the legacy data didn't have allele frequency, but we think that will get upgraded as this next set of 25 million SNPs comes in. Our goal is to try to match dbSNP releases with the major build releases of this project and then pull in all the genotype and frequency data to keep it in sync. That would be very helpful, thank you.

Yes, sir? Quite clearly this data is huge. Is there any thought to making it available on physical media, or for people to ship physical media to you, if you wanted to trade bandwidth for latency?
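To make the PASS convention from that answer concrete, a pass-only filter over VCF data lines can be sketched like this; the example lines are invented, and real call sets will carry whatever caller-specific failure strings each group chose.

```python
# Sketch: keep only VCF data lines whose FILTER column (7th field) is
# "PASS", skipping header lines. The example records are made up.

def passing_records(vcf_lines):
    """Yield data lines that passed the caller's default QC standard."""
    for line in vcf_lines:
        if line.startswith("#"):              # skip header/meta lines
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[6] == "PASS":
            yield line

lines = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\tAA=A",
    "1\t200\t.\tC\tT\t10\tq10\tAA=C",   # failed a caller's q10 filter
]
print(len(list(passing_records(lines))))   # 1 record passes
```

Since the column is not rigorously defined across callers, anything stricter than a PASS check means reading each caller's methods documentation, as the answer above notes.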
We're happy to talk about it. That, in our experience, can be a lot slower, and it doesn't scale well for us, because it means a lot of pairwise conversations with hundreds of people and their drives: did we get the right hardware to the right person? We encourage the Aspera solution if you have the capacity to do it, and if you don't, then you should probably just write to info at 1000genomes.org, and Paul and I could try to find a solution.

Another question: clearly there's a lot of value in the pipelines used to analyze this data as well, and given the huge size of the data compared to the actual size of the informatic component, is there any thought to releasing a server image or something of that nature that would provide a working environment in which the data was analyzed, as well as the raw data itself?

I think there are some groups that are considering putting their pipeline methods into the cloud, so they could be run in an Amazon environment or something like that. I know just from our personal experience at NCBI, there was a lot of infrastructure optimization that had to go in to get this pipeline to run, and I don't know if an image would translate effectively. The performance is really tied to the hardware characteristics of your server cluster and your compute farm, and we'd have to explore that. So I think each method developer has their own goals about how to release their product, and I'd just say that they've been encouraged to do that. Technology transfer is one of the goals of the project: we have a pipeline at NCBI, Sanger has a pipeline, the Broad has a pipeline. So you could maybe talk individually to them; the project won't speak for the pipelines as a whole.

Anything else? All right, well then, I'd like to introduce Paul Flicek, my colleague at EBI.