Thank you very much, Paul. It's my pleasure to present the structural variation analyses on behalf of the structural variation consortium of the 1000 Genomes Project. This obviously relates to a particular class of variants in the genome, and I'd like to start this tutorial by briefly presenting our view of what we define as a structural variant in the human genome. We, and I'm speaking for a heterogeneous group of 10 different research teams that have been involved for almost three years now in identifying structural variations in the 1000 Genomes Project pilot data, define these variants as polymorphic rearrangements of the genome from 50 base pairs up to hundreds of kilobases in size.

As the 1000 Genomes Project initially maps reads against the reference genome, the way we define structural variations from the start is by relating them to the reference. So a deletion ends up being a variant, present in a sample sequenced by the 1000 Genomes Project, that corresponds to a missing chunk of DNA relative to the reference. An insertion, on the other hand, is a chunk of DNA that is inserted in the individual of interest. An inversion is a sequenced segment that is flipped in orientation, and a tandem duplication and a dispersed duplication signify different sorts of duplications depending on the context of the variant. We are also looking for what the field refers to as copy number variants, namely loci in the genome that show fairly extreme variation, with regions differing strongly in copy number.

How striking this variation can be is shown in the right panel of the slide. It shows results from a study I was not involved in personally, by Perry et al., that looked at the AMY1 gene locus and found quite striking copy number variation in this locus, with some individuals displaying copy numbers up to 20 and others displaying only three copies per person. AMY1, to give you some background information on that gene, is actually a salivary amylase, and Perry et al. could draw a link between the genotype, that is the actual copy number of the gene, which is expressed in the saliva, and human dietary preferences, in particular starch consumption.

So relating to the work we carried out as a group over the last two years, a very important component of structural variation analysis in the 1000 Genomes Project was method development. When the project started, methods to detect structural variation from sequencing data were in their infancy. So we developed methods and began detecting variants largely with four different conceptual rationales that are summarized on this slide. The first rationale is read pair analysis, or paired-end mapping, in which we used the fact that the 1000 Genomes Project sequenced, for most of the sequencing reads, both ends of a fragment for which the approximate size was known. We could relate those read ends to the reference, and from the way the ends mapped onto the reference we could then decide whether a structural variant was present in the sample. If the ends mapped with approximately the expected distance and orientation, we did not infer an SV at that position; however, if the reads mapped too far apart, we identified deletions. On the other hand, if one end of a read pair mapped onto a mobile element, this allowed us to infer polymorphic mobile element insertions in the human genome.
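To make the read-pair rationale concrete, here is a minimal Python sketch, not the consortium's actual caller: it flags read pairs whose mapped distance is much larger than the expected insert size as candidate deletions. The function name, input layout and thresholds are illustrative assumptions.

```python
# Minimal sketch of the read-pair (paired-end mapping) rationale; illustrative only.
def classify_pairs(pair_distances, expected_insert=400, expected_sd=50, z_cutoff=4):
    """pair_distances: list of (leftmost coordinate, mapped distance) per read pair."""
    candidate_deletions = []
    for start, distance in pair_distances:
        z = (distance - expected_insert) / expected_sd
        if z > z_cutoff:  # pairs mapping too far apart suggest a deletion between the ends
            candidate_deletions.append((start, distance - expected_insert))
    return candidate_deletions

# Example: one stretched pair among normally spaced ones
pairs = [(1000, 410), (1200, 395), (1500, 2400)]
print(classify_pairs(pairs))   # -> [(1500, 2000)]: a ~2 kb putative deletion
```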
We also used the fact that we could analyze the relative orientation of the read pairs, and this allowed us to identify tandem duplications. Some people in the audience may know that read pair analysis can also be used for other types of variants, and there was just a question as to whether inversions have been ascertained by the 1000 Genomes Project. In the initial phase of the project we made the conscious decision not to look for balanced variants, which are harder to validate at appreciable scale than unbalanced variants. So in the first pilot phase of the project we did not look at inversions, but we identified mobile element insertions, tandem duplications and deletions.

So read pair analysis was one of the conceptual approaches we used to identify structural variants. The second one is read depth analysis, in which we closely assessed where reads mapped onto human chromosomes; this allowed us to identify deletions as regions in which far fewer reads mapped than expected, and duplications as regions in which many more reads mapped than expected. The third conceptual approach was split read analysis, in which we used the fact that we could align some of the short sequencing reads generated by the project directly across structural variations by gapped alignments. This allowed us to identify many deletions, in particular small ones. Last but not least, in the trios, that is in the genomes that were sequenced at very high coverage in a whole-genome sequencing fashion, we could identify several novel sequence insertions by an assembly rationale.

I'm showing you in the next slide how these approaches, which are quite different conceptually, often identified the same variant, using an example locus in the human genome, a deletion on chromosome one. The position of the deletion is displayed in purple in the upper panel of the slide, and the two panels below display sequencing results from the trio project. The middle panel corresponds to an individual of Nigerian ancestry that does not carry the deletion, so all the gray spots on this panel indicate reads that were mapped with a specific mapping quality. The yellow line shows the read depth along the chromosome for this individual. You'll notice that the individual below appears to carry the deletion: the read depth drops significantly near the breakpoints of the deletion, so read depth could actually be used to identify the deletion. Furthermore, stretched read pairs frame the deletion, and a split read could be identified by gapped sequence alignment, again spanning this deletion. This is a nice example for educational purposes showing that different rationales can come to the same conclusion and identify a deletion in a particular sample, whereas in most instances we found that the different rationales for structural variant detection were actually identifying different types of variations. Read depth analysis was very well suited for identifying fairly large variants, as one may expect: the larger the variant, the more reads map into it, allowing us to make a statistical ascertainment of an event. On the other hand, split read analysis was most suited for identifying small variants.
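As a rough illustration of the read-depth rationale, here is a hedged Python sketch, again not the project's pipeline: reads are counted in fixed-size windows and windows with strongly reduced or increased counts are flagged. Window size and thresholds are assumptions for illustration.

```python
# Illustrative read-depth scan: deletions as windows with far fewer reads than
# expected, duplications as windows with far more reads than expected.
def read_depth_scan(read_starts, chrom_length, window=1000, low=0.5, high=1.5):
    n_windows = chrom_length // window + 1
    counts = [0] * n_windows
    for pos in read_starts:                 # one entry per mapped read start position
        counts[pos // window] += 1
    expected = sum(counts) / len(counts)    # mean depth per window
    calls = []
    for i, c in enumerate(counts):
        if c < low * expected:
            calls.append((i * window, (i + 1) * window, "deletion-like"))
        elif c > high * expected:
            calls.append((i * window, (i + 1) * window, "duplication-like"))
    return calls
```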
Furthermore, and this is very important for you, the audience, as potential users of the data, the precision with which structural variations were analyzed, that is the precision with which we gained knowledge of the exact start and end coordinates of the variants, also differed between the approaches we used. As you can imagine, for split read analysis and for assembly the precision was very high, so those approaches map structural variations to their precise breakpoints, and we can record the genomic coordinates as the structural variations are discovered. For read pair analysis, you are shown in the middle panel the actual breakpoint precision, which we assessed by relating discovered variants back to validated and assembled variants; the precision was lower, in the order of several dozen base pairs. For read depth analysis it was even lower than that, as one would expect. So this is important information for being able to use the data.

I'm going to summarize in this slide the actual extent of structural variation data that we released together with the 1000 Genomes pilot paper that came out in Nature last week. We released structural variation calls for two pilot projects, the low coverage and the trio pilots, that is the whole-genome sequencing pilots. The amount of raw data and the samples studied were covered in presentations earlier today. We identified 11,000 deletions in the trios and 15,000 in the low coverage individuals, and we furthermore identified about 6,000 mobile element insertions across both projects. For those types of variants, deletions and mobile element insertions, we inferred based on gold standards that we achieved a sensitivity of approximately 80% in the high coverage trios, and lower sensitivity in the other individuals that were sequenced at lower coverage. So those were fairly well covered. Yes, there was a question relating to the sizes of the deletions we discovered. We discovered deletions 50 base pairs in size and larger, up to hundreds of kilobases, and in fact nearly into the megabase size range in some instances.

All right, so deletions and mobile element insertions were discovered at high sensitivity. We discovered tandem duplications and novel sequence insertions at lower sensitivity, which relates to difficulties in identifying these variants compared to the other two classes, and also to the fact that we used a lower number of callers for these two variant types. Right, so the question was whether there were overlapping positions between the trios and the low coverage project. Yes, there were many overlapping events; we identified to a large extent the same variants in both projects. Shown below is the number of structural variants for which we identified breakpoints: slightly more than 50% of the structural variations we released have precise breakpoint sequences. That means they were either discovered by split read analysis or by assembly, or they were discovered by another approach, read pair or read depth analysis, and later validated by a targeted assembly approach, the TIGRA assembler developed at WashU, which enabled us to enrich our set quite strikingly for structural variations for which we achieved breakpoint resolution.
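A sensitivity estimate of this kind is typically computed by asking how many gold-standard variants are recovered by the call set. The following is a small, hedged sketch of one common way to do this, using a 50% reciprocal-overlap criterion; the criterion and function names are assumptions, not the consortium's exact procedure.

```python
# Hedged sketch: sensitivity of a call set against a gold-standard set of intervals,
# where a call "recovers" a gold-standard variant if they reciprocally overlap >= 50%.
def reciprocal_overlap(a, b, frac=0.5):
    """a, b: (start, end) intervals on the same chromosome."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    ov = max(0, end - start)
    return ov >= frac * (a[1] - a[0]) and ov >= frac * (b[1] - b[0])

def sensitivity(calls, gold_standard):
    found = sum(any(reciprocal_overlap(g, c) for c in calls) for g in gold_standard)
    return found / len(gold_standard)
```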
Furthermore, apart from the discovered variants, we released a fairly large set of genotypes for structural variations. However, similar to what we released in terms of discovered structural variations, our release so far has the caveat that we only released genotypes for deletions along with the pilot paper. As we speak, a release of mobile element insertion genotypes is being prepared, so for this second type of variation we are in the process of releasing genotypes as well; for tandem duplications and novel sequence insertions, no genotypes have been generated by the project so far.

The slide here relates to deletion genotypes. A method by Bob Handsaker et al. called Genome STRiP has been used to generate deletion genotypes for about 14,000 deletions. We closely examined the concordance of these deletion genotypes with previously published array-based genotypes, in particular with about 2,000 deletions that had been genotyped in a paper by Conrad et al. that came out about a year ago, and we found the concordance to be in the order of 99%. Genome STRiP, to give you some information on how this method operates, uses three of the features that I explained earlier in this talk for discovering structural variations, namely read depth, read pair and split read analysis. Using the deletion genotypes inferred by Genome STRiP, we could relate deletions to SNPs and identify, for 81% of the common deletions that we generated genotypes for, at least one HapMap SNP with a very strong association, with an r-squared greater than 0.8. This means that a large portion of the deletions we genotyped are actually taggable by nearby SNPs and could therefore in the future be imputed into association studies. Again, relating to what was said earlier in this presentation, they could in a sense be genotyped for free, even with present genotyping arrays. Yes, right, so the question is whether we have data on how well they are tagged by 1000 Genomes SNPs. This analysis has only been carried out recently, and my answer is no, I don't have data on 1000 Genomes SNPs for this.

So I'd like to spend the remaining five minutes of my talk discussing in more detail the specific data formats relating to structural variations. Along with our structural variation release, we released data in the VCF format, which was mentioned in several of the earlier talks today and which provides the specific possibility to represent structural variation calls. Furthermore, as a huge portion of the work we carried out in the last three years was approach development, we decided as a group to also release what we call master validation tables, which represent a fairly raw data format in which the various raw pieces of information that we collected over the last years can be traced back; I'll come back to that a bit later. We also released structural variation breakpoint information as text files in FASTA format, and we released specific information relating to the fact that relating structural variations to primate genomes allowed us to infer the ancestral state of structural variations by comparison with, for instance, the chimpanzee, macaque or orangutan genomes.
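To illustrate what "taggable with r-squared greater than 0.8" means in practice, here is a small Python sketch that computes r-squared between a deletion's allele dosages and a nearby SNP's dosages across samples. This is purely illustrative and not the project's code; the dosage encoding (0/1/2) and the threshold are stated assumptions.

```python
# Hedged sketch: linkage between a deletion and candidate tag SNPs via squared
# Pearson correlation of allele dosages (0/1/2) across the same samples.
def r_squared(deletion_dosages, snp_dosages):
    n = len(deletion_dosages)
    mx = sum(deletion_dosages) / n
    my = sum(snp_dosages) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(deletion_dosages, snp_dosages)) / n
    vx = sum((x - mx) ** 2 for x in deletion_dosages) / n
    vy = sum((y - my) ** 2 for y in snp_dosages) / n
    return (cov * cov) / (vx * vy) if vx and vy else 0.0

def is_taggable(deletion_dosages, candidate_snps, threshold=0.8):
    """True if any nearby SNP tags the deletion with r^2 above the threshold."""
    return any(r_squared(deletion_dosages, snp) > threshold for snp in candidate_snps)
```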
So I'm going to start with the VCF format and won't spend too long on this slide, as we have already heard a lot of details on VCF. The important key messages from this slide are that there are useful tools available for processing VCF files, that the files can easily be converted into Excel spreadsheets, and that Perl modules have now been released that allow you to browse through VCF files. How such a VCF file looks from a structural variant perspective is shown on this slide, and I'd like to briefly walk you through it. Although it appears cryptic at first, what can be retrieved from the VCF file is the precise position of a variant if it is known; there is a VCF representation for structural variants for which the breakpoints have been identified, but there is also an IMPRECISE tag available, shown in the lower panel of the slide, for variants for which the exact breakpoints are not known. Apart from the position of the variant, if the variant breakpoint is known we report the actual reference allele sequence as well as the alternative allele; otherwise, instead of reporting the alternative allele, a tag such as DEL indicates that a deletion has been found at this position, and corresponding tags indicate, for instance, that a mobile element insertion or a tandem duplication has been identified. I'm repeating the question: if there were a duplication of two megabases, would we put the two-megabase duplication in there? With the present specification of VCF, as far as I know, we would. Presently our release set is very much enriched for deletions compared to duplications, and to the best of my knowledge we don't have duplications of that size in the data at present.

The master variation tables: the motivation for reporting these tables along with the data is shown in this slide. Through the master variation tables we have the opportunity to enrich the released structural variation data with specific validation results. We carried out PCR and microarray analysis at fairly large scale, and in the master tables, for each call, users can obtain information on the validation status of the variant, whether it was validated or not; the specific method used to identify the variant can also be found in the master variation tables, along with meta-information such as the precise breakpoint sequence. Examples of master variation tables that are available online are shown in this slide. I won't walk you through this slide in detail, but this information is present in your handouts and will also be posted online.

Last but not least, we are making information available on the ancestral state of structural variations as well as on the formation mechanism involved, and this relates back to something I said in my introduction. We defined structural variations initially as variants relative to the reference genome. Obviously there are other views on how to identify and relate to a structural variation, and many people would say a more appropriate definition of a structural variant is whether it occurred as a deletion or as an insertion relative to the ancestral genomic sequence at the time it arose. We used the BreakSeq classification algorithm to identify the plausible ancestral state, and also the presumable formation mechanism, of structural variations for which we had breakpoint-level data, and the format of the GFF tables reported by the BreakSeq tool is shown on this slide.
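For readers who prefer not to eyeball the raw files, the following is a minimal Python sketch of how such structural-variant records can be read. The INFO keys used here (SVTYPE, END, IMPRECISE) are standard VCF keys for structural variants; the parsing itself is deliberately simplified and is not one of the released tools.

```python
# Simplified reader for structural-variant records in a VCF file; illustrative only.
def parse_sv_records(vcf_path):
    records = []
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):                       # skip header lines
                continue
            chrom, pos, _id, ref, alt, _qual, _filt, info = line.rstrip("\n").split("\t")[:8]
            tags = dict(kv.split("=", 1) if "=" in kv else (kv, True)
                        for kv in info.split(";"))
            records.append({
                "chrom": chrom,
                "start": int(pos),
                "end": int(tags.get("END", pos)),
                "svtype": tags.get("SVTYPE", alt.strip("<>")),  # e.g. DEL, DUP, INS
                "imprecise": "IMPRECISE" in tags,               # breakpoints not exactly known
            })
    return records
```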
So, in addition to data that would also be available from the VCF file, such as the chromosome and the start and end coordinates, these tables report information relating to the ancestral state and the formation mechanism. Before closing my talk, and relating back to the presentation by Paul Flicek, it is now possible to display structural variations in the 1000 Genomes browser, making the data accessible to people who don't want to go through the process of deciphering VCF files or downloading the appropriate software to deal with VCF files. This is a slide that was also shown by Paul Flicek, relating to a variant on chromosome one that is displayed as a large deletion. And this slightly relates to the presentation that Jeffrey Barrett will give after mine: it is possible, with the reference deletion genotypes that we released, to impute deletions into genome-wide association studies. At least, by making these deletion genotypes available, we facilitate this for you. Deletions can be imputed into GWA studies using existing tools. I'm closing by showing you the links slide, which I see is a bit scrambled on this display but will hopefully not be scrambled once it goes online. I thank you very much for your attention. These are the people involved in the structural variation group of the 1000 Genomes Project. I'll be happy to take questions.

Thank you for this talk. I have two questions. The first one is, how well has the phenotype of the 1000 people been characterized, and the second question is, have you found any mosaicism? Right. So the individuals that we are analyzing are anonymized. Phenotypic information has been gathered in molecular terms by groups that have started to analyze some of the samples, in particular samples that had previously been analyzed by the HapMap project, by measuring RNA expression in these samples. Other groups have looked at transcription factor binding. That would be the answer to the first question. The second relates to karyotyping, whether the samples have been karyotyped and trisomies have been excluded prior to the analysis by the project. So, in some samples there will be cell line artifacts. Yeah, exactly. But they are still there. My question is really, have these people been looked at clinically? Do we know that they are normal individuals? These people are all anonymous, so this information would not be available. I think it has not been comprehensively analyzed.

I have a few questions. The first one is about the two children in the trio sets, for which we have data from three different platforms. In your analysis of structural variation, did you try to integrate these three data sets and see how well they support each other? Right. We did so, and in fact the data very often support each other. What is more important, the data showed us that, for instance, the different insert-size libraries used by the different platforms actually added value, in the sense that large insert sizes enabled us to identify variations at a different size scale than smaller insert sizes. Is this information kept somewhere in the VCF format? So, you may have noticed this on one of the slides that I showed: through the tables that we make available on the 1000 Genomes website, this information can be traced back. You can identify the sequencing platform that was used to identify a specific variant, or whether it was seen by more than one platform. Thanks.
The other question is, you mentioned this morning that you are, or will be, using BWA for the alignment. I know that for the first data set you used MAQ, but I guess the reason you're using BWA is that it's much faster. On the other hand, if I understand correctly, BWA doesn't consider the quality score; I just wonder if that will compromise accuracy in the alignment. Right. So another point that I did not mention in the tutorial, apart from testing various methods for assessing structural variation after the reads had been mapped to the genome, is that we also assessed the behavior of different mappers, and different structural variation algorithms actually benefited from different types of mappers. So apart from BWA and MAQ, mrFAST was also used as a mapper, Mosaik was used as another mapper, and one tool used BLAT as a mapper. So in a sense, to put it in simple terms, it depends a bit on what one is looking for. Split read analysis clearly benefits from a slightly more permissive mapper, which then enables additional examination of reads that were imperfectly mapped in the first mapping step. But I guess for the BAM format you are providing to us, would that be mostly done with BWA? That's correct.

I'm just wondering if your method for detecting structural variation is able to deal with overlapping or superimposed different kinds of structural variants, for example a deletion that sits on an insertion background relative to the reference, and if so, how that would be represented in the VCF format. Right. So our analysis pipeline does deal with these types of variations. For each and every tool that we use to identify structural variations, we closely assess breakpoint confidence values to then be able, in a second step, to merge structural variations that were discovered by different approaches. As long as they merge together, they are considered a single variant; if they don't merge, they are considered different variants. Okay. So I think that can be analyzed. Then an insertion with a deletion on it would be just an insertion, basically, of a different length. Right. If a deletion and an insertion are identified at the same location, and both approaches use the same reference for inferring the variant, these would indeed be reported as two different variants. However, and I was simplifying here, not all approaches use the reference genome as a reference; some approaches use the population as a reference to identify structural variation, which may then lead to a duplication call where another method would infer a deletion. Thanks.

When we use the imputation program to impute the common deletions, by common do you mean larger than 1% or 5%? Right. To my knowledge, the Genome STRiP software imputed all deletions with a frequency greater than 5%, using Beagle. 5%. Thanks. Was there a significant drop in precision and sensitivity for the heterozygous deletions in the CNV calls when moving from the low coverage to the high coverage samples? Right, can you please repeat the question? So was there a significant difference in the precision and specificity of heterozygous deletion calls when comparing the low coverage samples and the high coverage samples? Right. That's a good question.
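To illustrate the merging step described in the answer above, here is a hedged Python sketch: two calls of the same type from different approaches are treated as the same variant if their breakpoints agree within the combined breakpoint uncertainty of the two calls. The data layout and slack rule are illustrative assumptions, not the consortium's exact merging procedure.

```python
# Hedged sketch of merging SV calls from different approaches via breakpoint agreement.
def same_variant(call_a, call_b):
    """Each call: (svtype, start, end, ci) where ci is breakpoint uncertainty in bp."""
    type_a, sa, ea, ci_a = call_a
    type_b, sb, eb, ci_b = call_b
    slack = ci_a + ci_b                      # allow looser matching for imprecise callers
    return type_a == type_b and abs(sa - sb) <= slack and abs(ea - eb) <= slack

calls = [("DEL", 10050, 12030, 5),    # split-read call, precise breakpoints
         ("DEL", 10010, 12080, 80)]   # read-pair call, coarser breakpoints
print(same_variant(calls[0], calls[1]))   # -> True: merged into a single variant
```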
So in the low coverage samples we did indeed perform better for homozygous structural variations than for heterozygous ones, which simply comes from the fact that those genomes were sequenced at a depth of 2 to 4x, which makes it much easier to identify a homozygous variant than a heterozygous one. I have one other question. In one of the tables, it says that the number of novel sequence insertions identified in the trios was 174, and there were none identified in the low coverage samples. And the trios were done at 20x? The trios were done at 20 to 40x. Okay. So is it just because of the high coverage, or is it because there's more confidence due to the transmission information available here? Yeah, good question. So in this case the depth really was the major consideration, as assembly of a novel sequence insertion requires a lot of reads to be able to confidently assemble that insertion. So the assemblers were only run on the high-depth samples. Thank you for this discussion. We are now moving on to our next speaker, Jeffrey Barrett, who will talk about how the data can be used in association studies.