Okay, so I'd like to start by thanking the organizers for giving me this excellent opportunity to present our call for participation in the mutation calling benchmark exercise we've put together. This is mutation calling benchmark four, the fourth benchmarking exercise that TCGA has carried out. So I'm going to tell you a bit about how we've gone about setting this up, what the motivation is, and how to get involved.

Just briefly, to go over the mutation calling benchmark process in case anyone is unfamiliar: we start out by selecting pairs of tumor and normal BAMs, which contain short read alignments. These BAMs are then distributed to the participants in the benchmarking exercise, who call mutations and return them as VCF files. VCF stands for variant call format; it's just a standard, widely used method for expressing all varieties of mutations in a unified file format, and we really want to push people to express their mutations in it. The VCFs are then collected and compared for concordance and discordance across somatic calls, and we want to encourage people to submit germline calls as well. At the end of the day, what we get is a picture of where the field of mutation calling in cancer stands, and that's a really valuable thing.

Really briefly, to give you a little background and make sure we're all on the same page about the kinds of mutations I'm going to be talking about: I'm sure this is fundamental, but SNVs are single nucleotide variants, just single base-pair changes at defined nucleotide positions. Indels are short insertions and deletions, less than 100 base pairs. Larger rearrangements like insertions, deletions, duplications, inversions, and transductions are referred to collectively as structural variants, or SVs. And regions where genomes differ from a diploid copy number, that is, differ from an absolute allele count of two, are referred to as copy number variants, or CNVs.
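To make the classification above concrete, here is a minimal sketch of how a variant record might be bucketed into SNV, indel, or SV based on the REF/ALT allele lengths from a VCF-style record. The function name and the exact thresholds are illustrative assumptions based on the definitions in the talk, not part of any benchmark specification.

```python
# Illustrative sketch: classify a variant by REF/ALT allele lengths,
# following the definitions above (SNV = single base change, indel < 100 bp,
# SV = larger rearrangement). Thresholds and names are assumptions.

def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant as SNV, indel, or SV by allele length."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"                      # single base-pair change
    if abs(len(ref) - len(alt)) < 100:
        return "indel"                    # short insertion/deletion (<100 bp)
    return "SV"                           # larger structural variant

print(classify_variant("A", "G"))         # SNV
print(classify_variant("AT", "A"))        # 1 bp deletion -> indel
```

Real callers of course use richer VCF fields (SVTYPE, symbolic ALT alleles, breakends) to describe structural variants, but the length-based view captures the basic taxonomy used in this benchmark.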
And since this is benchmark four, clearly there were three other benchmarks prior to this. Just to go briefly through the history of benchmarking efforts in TCGA: benchmark one was single nucleotide variant calling on six pairs of whole genomes. Benchmark two was single nucleotide variant calling on 14 tumor-normal pairs of exomes. Benchmark three was again single nucleotide variant calling, on 25 pairs of exomes, this time with associated validation data, that is, deep sequencing data over selected regions to validate the presence of mutations. And what I'm calling for participation in today is benchmark four. In addition to single nucleotide variants, we're going to take indels, SVs, and CNVs into account, and we're going to do this on whole genomes derived from cell lines.

So why is it important that we do another benchmark? We've done three; why do another? Well, if we're going to accomplish the goal of comprehensively characterizing cancer genomes, TCGA has to get together and measure and set standards for the accuracy of mutation calls. Toward this end, in this benchmark we're being more comprehensive about the variety of mutations that we're considering. Like I said, in addition to single nucleotide variants, we want to extend this to indels, structural variants, and copy number variants, to cover the full spectrum of variation and evaluate how different mutation calling algorithms perform across these different types of somatic variants. And as I'll talk about in subsequent slides, benchmark four is really a controlled experiment. We have these cell lines, and we can take advantage of their clonality to do things like simulate normal contamination, so Gaddy's talk and Chris Miller's talk are great lead-ins for this, and we can simulate subclonal expansions by using spike-in mutations.
Spike-in mutations also give us the opportunity to evaluate false negative rates, since they give us a sort of ground truth, and that hasn't been possible in previous benchmarking efforts. And since the cell line genome data is publicly distributable, we can encourage wide participation both within TCGA and outside of it. For instance, we're reaching out to ICGC, and they're participating in this benchmark. Others outside of the cancer genome consortia who have an interest in mutation calling in this tumor-normal context are encouraged to participate as well.

Further, on this theme of why we're doing another benchmark: there's still a lot of discordance in the mutation calls that we get. Here's a representative example from a previous benchmark exercise. What's shown on this Venn diagram are calls on the same pair of tumor and normal BAM files made by the Broad Institute, by WashU, and by UCSC, and you can see the concordance and discordance in the diagram. The majority of mutations are concordant between at least two of the centers, but there's still a lot of discordance happening. This is important to take into consideration, since mutation calling is fundamental to cancer genomics; cancer genomics depends on the fidelity of mutation calling algorithms.

The samples that we're using to derive all of the BAM files we're distributing for benchmark four are based on two pairs of cell lines, HCC1143 and HCC1954. Both 1143 and 1954 are derived from breast tumors, and each has a paired normal sample, a cell line derived from blood from the same patient. All of these lines are available through ATCC, and the sequences we have for this benchmark are between 50X and 71X coverage, sequenced at the Broad Institute.
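The kind of three-way concordance comparison shown in the Venn diagram can be sketched by reducing each center's calls to a set of (chromosome, position, ref, alt) keys and intersecting the sets. The call sets below are made-up toy data, not actual benchmark results.

```python
# Toy sketch of three-way concordance: each center's VCF is reduced to a
# set of (chrom, pos, ref, alt) keys; set operations give the Venn regions.
# These call sets are invented for illustration.

broad = {("1", 100, "A", "G"), ("2", 200, "C", "T"), ("3", 300, "G", "A")}
washu = {("1", 100, "A", "G"), ("2", 200, "C", "T"), ("4", 400, "T", "C")}
ucsc  = {("1", 100, "A", "G"), ("3", 300, "G", "A"), ("5", 500, "A", "T")}

all_three = broad & washu & ucsc                              # called by everyone
at_least_two = (broad & washu) | (broad & ucsc) | (washu & ucsc)
private = (broad | washu | ucsc) - at_least_two               # single-center calls

print(len(all_three), len(at_least_two), len(private))        # -> 1 3 2
```

In practice the keys have to be normalized first (matching reference builds, left-aligning indels, splitting multi-allelic records), which is exactly the kind of problem that motivates standardized VCF comparison tooling.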
And as I'll talk about, and as I've mentioned, all this data is publicly distributed through CG Hub.

So here's what we want participants to do. There are three parts to this mutation calling exercise. The first part is pretty straightforward: we just want participants to compare the tumor cell line whole genome BAM to the corresponding normal whole genome BAM for both pairs of cell lines. This will establish a baseline under ideal conditions, since these are higher coverage genomes and, being cell lines, they're presumably clonal.

From there, we can use this clonal property of the cell lines to do interesting things. First, we can simulate normal contamination. In row A here, each one of these pie charts represents a BAM file that we've generated for the benchmarking exercise, where we've mixed the normal and tumor BAMs in various proportions to yield a 30X coverage BAM file. Over here we're simulating 5% normal contamination, and over here we're simulating 95% normal contamination. As has been alluded to in previous talks, normal contamination is an important factor in mutation calling fidelity.

In addition to simulating normal contamination, we can simulate subclonal expansion. The way we do this is by taking the original tumor BAM file and spiking single nucleotide variants and structural variants into a single allele. We can spike into a single allele selectively by using results from Scott Carter and Gad Getz's group's ABSOLUTE algorithm. By spiking in, we get a genetically distinct tumor BAM, and we can then mix that back in with some amount of normal contamination and some amount of the original tumor to simulate the presence of a subclone in the tumor.
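The mixing step above can be sketched as a back-of-the-envelope calculation: to hit a 30X target with a given normal contamination level, each source BAM is downsampled by a fraction proportional to its contribution. The input coverages here are assumed values for illustration, not the actual depths of the benchmark BAMs.

```python
# Sketch of downsampling fractions for mixing tumor and normal BAMs to a
# 30X target at a chosen contamination level. TUMOR_COV and NORMAL_COV are
# assumed coverages, not the real benchmark values.

TARGET = 30.0          # target coverage of the mixed BAM
TUMOR_COV = 65.0       # assumed coverage of the tumor BAM
NORMAL_COV = 55.0      # assumed coverage of the normal BAM

def mix_fractions(normal_contamination: float):
    """Fraction of reads to keep from each BAM for a given contamination level."""
    normal_frac = TARGET * normal_contamination / NORMAL_COV
    tumor_frac = TARGET * (1.0 - normal_contamination) / TUMOR_COV
    return tumor_frac, normal_frac

for alpha in (0.05, 0.50, 0.95):
    t, n = mix_fractions(alpha)
    print(f"{alpha:.0%} contamination: keep {t:.3f} of tumor, {n:.3f} of normal")
```

The two downsampled read sets then simply get merged; by construction the mixture's expected coverage is `tumor_frac * TUMOR_COV + normal_frac * NORMAL_COV = 30`.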
We've scaled that from a 1% subclone, at which it will be difficult if not impossible to detect mutations, up to 40%, which should be feasible. This normal contamination scheme and subclonal expansion scheme were generated for both pairs of cell lines, so in total we're doing six comparisons: this BAM versus the normal, this BAM versus the normal, et cetera. Altogether, we end up with 28 BAM files, which are distributed publicly via CG Hub. If you navigate to this URL here, you can download a public key, and you can use that key with GeneTorrent to grab the BAM files for the benchmarking exercise. Those of you who attended the CG Hub workshop yesterday evening should be familiar with this process. And many thanks to Chris Wilkes and the CG Hub team for helping us get these BAMs up and dealing with our requests to replace them and so on.

In addition to providing data with which we can comparatively evaluate the performance of mutation callers, benchmark four has also been stimulating the creation of new evaluation tools for VCF files.
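Coming back to the subclone dilution series, a rough illustration of why the 1% end is so hard: a mutation spiked into a single allele of a subclone at fraction f is expected to appear at a variant allele fraction of roughly f / 2 under a simplifying diploid assumption. This is an illustrative model, not the exact mixing math used to build the benchmark BAMs.

```python
# Sketch of the expected variant allele fraction (VAF) for a mutation
# spiked into one allele of a subclone, assuming a diploid locus and no
# normal contamination: expected VAF ~ subclone_fraction / 2.
# An illustrative simplification, not the benchmark's actual model.

def expected_vaf(subclone_fraction: float, ploidy: int = 2) -> float:
    """Expected allele fraction of a single-allele spike-in mutation."""
    return subclone_fraction / ploidy

for f in (0.01, 0.10, 0.40):
    print(f"{f:.0%} subclone -> expected VAF ~ {expected_vaf(f):.3f}")
```

At a 1% subclone the expected VAF is around 0.005, so at 30X coverage most sites carry zero supporting reads, which is why the low end of the dilution series probes the detection limit while the 40% end should be comfortably callable.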