Okay. Thank you, Ilya. So, multi-center mutation calling, what is that? It is a somewhat enigmatic topic, and possibly a mundane one, were it not for the fact that the mutation calls we generate underpin so much of what we do in TCGA. That makes this a necessary topic, and one that has itself undergone great evolution during the course of TCGA.

What I'm going to do for the next 15 minutes is review the approaches to somatic mutation calling and consider what it is to call mutations with one caller; talk about the benchmarking of somatic mutation callers that we did early on in the project; look at the early trials of three-center calling and the adoption of standards for it; and look at the current status of multi-center calling and new developments in mutation calling.

The whole trick to this game is to be able to distinguish error from the real biological variation that we seek. There are two sources of that error. Some comes from the sequencing machines themselves and is inherent in the base callers. Fortunately, most of this is randomly distributed, and the base calls come with calibrated Q values that enable us to distinguish truth from error, so we're able to filter most of it. Yet some of it escapes, because there's a fair amount of it across the 50 billion reads that we take for exome, and when these errors happen to be coincident, they are found at allele fractions that might be similar to what we're looking for in the very heterogeneous tumor environments that have been described in other talks. I'm not going to go into that in this talk. But then there is systematic error that comes from mapping and alignment ambiguities.
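As a rough illustration of the Q values and coincident errors mentioned above, here is a minimal sketch. It assumes the standard Phred scaling (error probability 10^(-Q/10)) and, as a simplifying assumption that is not from the talk, treats errors as independent with each of the three wrong bases equally likely. The function names are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q into an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_coincident_errors(depth: int, q: float, k: int) -> float:
    """Probability that at least k of `depth` reads show the same specific
    wrong base at one site, under an independent-error binomial model
    (an illustrative assumption, not a real caller's error model)."""
    p = phred_to_error_prob(q) / 3  # chance of one particular wrong base
    return sum(
        math.comb(depth, i) * p**i * (1 - p) ** (depth - i)
        for i in range(k, depth + 1)
    )


if __name__ == "__main__":
    # Q30 corresponds to roughly one error in a thousand base calls.
    print(phred_to_error_prob(30))
    # Chance that three or more reads at a 100x site agree on the same error.
    print(prob_coincident_errors(depth=100, q=30, k=3))
```

The per-site probability is small, but multiplied across every covered position in an exome it can still leave a number of coincident random errors at low allele fraction, which is the point being made in the talk.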
There's a lot of difficulty with 100-base reads and the structure of the genome that makes this a very tricky problem. It leads to high-quality errors, or rather high-quality base differences that actually reflect true base differences, but the read just happens to be in the wrong place, so they can easily be mistaken for mutations.

The way the current callers work is that they all have what I'll call, for lack of a better term, a truth engine that distinguishes real variation from sequencing error. These formulations, which are largely Bayesian or log-odds based, output tens of thousands to hundreds of thousands of events, which then have to be filtered. It's at this point that heuristics come into play. Each calling center applies heuristics in a slightly different way, and a very large fraction of the variation that passes those so-called truth filters then gets filtered out. The best documentation of this is in the paper by my colleague Kristian Cibulskis at the Broad, from their MuTect publication. What it shows is that at each sequencing depth you're filtering a significant fraction of the variants away due to these various characteristics, which are being heuristically applied. Up to 90 percent of the variation that came through the first step is now disappearing. This is going to be important as we go along, because I believe this is where most of the variation between callers emerges.

With the mutations that we collect from a single caller, we actually get very nice profiles of significantly mutated genes. This is one example from colorectal cancer, but now we have 10 or 12 tumors in which we've collected similar profiles, and the profiles end up making a lot of sense. You can place these mutations into pathways that describe what's going on in the tumor in detail that we've never seen before.
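The two-stage design described in this passage, a log-odds "truth engine" followed by heuristic filters, can be sketched roughly as follows. This is a simplified illustration and not the implementation of MuTect or any center's caller: the likelihood model, the function names, and the thresholds `min_alt_reads` and `min_depth` are all assumptions made for the example.

```python
import math


def site_log10_likelihood(bases, quals, ref, alt, f):
    """log10 likelihood of the observed bases if the alt allele is present
    at fraction f. Assumes Phred qualities q > 0 and independent reads."""
    total = 0.0
    for base, q in zip(bases, quals):
        e = 10 ** (-q / 10)  # per-base error probability
        if base == ref:
            p = f * (e / 3) + (1 - f) * (1 - e)
        elif base == alt:
            p = f * (1 - e) + (1 - f) * (e / 3)
        else:
            p = e / 3  # a third base can only be an error in this model
        total += math.log10(p)
    return total


def lod_score(bases, quals, ref, alt):
    """Log-odds of 'variant at the observed allele fraction' vs 'no variant'."""
    if not bases:
        raise ValueError("no reads at this site")
    f_hat = sum(b == alt for b in bases) / len(bases)
    return (site_log10_likelihood(bases, quals, ref, alt, f_hat)
            - site_log10_likelihood(bases, quals, ref, alt, 0.0))


def passes_heuristics(bases, alt, min_alt_reads=3, min_depth=8):
    """Hypothetical post-hoc filter; each center would choose its own rules."""
    n_alt = sum(b == alt for b in bases)
    return len(bases) >= min_depth and n_alt >= min_alt_reads


if __name__ == "__main__":
    bases = list("AAAAAAAAGG")   # 8 reference reads, 2 alternate reads
    quals = [30] * len(bases)
    print(lod_score(bases, quals, ref="A", alt="G"))
    print(passes_heuristics(bases, alt="G"))
```

In this toy site the log-odds score favors a variant, yet the hypothetical read-count filter rejects it. Two centers with different filter settings would disagree on exactly this kind of call.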
But unfortunately, if you go back and compare the calls that different callers are making on one or two tumors, you see a picture that is at first somewhat disturbing. This is where we were with the first benchmark, proposed by David Haussler back in about February 2011. Here on the left we have the somatic calls passing all the filters from each of three centers. You can see that the number of variants in the center is fairly small compared to the total variation discovered by the callers in the union set, and in particular a lot more seemed to be going on in the variation that was unique to any given caller than in the overlap.

Then, on to benchmark two, things didn't look a whole lot better. There, 310 was the intersect from four callers, and again you can see that, going around the outside, the number of events that were unique to each of these different callers was much larger. So this left the sinking feeling that there was a lot of true variation possibly being left on the table and a lot going unanalyzed.
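The Venn-style comparison described above (union, intersection, and calls unique to each center) can be sketched with plain set operations. The center names and variants below are made-up toy data, and representing a variant as a `(chrom, pos, ref, alt)` tuple is an assumption of this example, not the benchmark's actual format.

```python
from collections import Counter


def overlap_summary(calls_by_center):
    """Summarize agreement between call sets.

    calls_by_center maps a center name to a set of variants, each variant
    being a hashable key such as (chrom, pos, ref, alt).
    """
    support = Counter()
    for calls in calls_by_center.values():
        support.update(set(calls))  # count how many centers made each call
    n_centers = len(calls_by_center)
    return {
        "union": set(support),
        "intersection": {v for v, n in support.items() if n == n_centers},
        "unique": {
            center: {v for v in set(calls) if support[v] == 1}
            for center, calls in calls_by_center.items()
        },
    }


# Toy data for illustration only.
calls = {
    "center_a": {("1", 100, "A", "G"), ("2", 200, "C", "T"), ("3", 300, "G", "A")},
    "center_b": {("1", 100, "A", "G"), ("2", 200, "C", "T"), ("4", 400, "T", "C")},
    "center_c": {("1", 100, "A", "G"), ("5", 500, "G", "C")},
}
summary = overlap_summary(calls)
print(len(summary["union"]), len(summary["intersection"]))
for center, unique_calls in summary["unique"].items():
    print(center, len(unique_calls))
```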
David took this another interesting step further. From the next benchmark analysis, for each center being considered in the Venn diagram, he looked at the top 100 calls from each patient being studied. Looking in the top 100 calls of just the UCSC caller, there were 10 calls that the UCSC caller considered high quality that no other caller was seeing. Going around the horn, we go over to the WashU caller: looking at their best calls, there were 143 that nobody else had seen. For Baylor there were about 1,000. These are unvalidated calls, so even though the center is saying they're high quality, we don't really know. And in here were 55 from the Broad. So potentially a lot was not being turned up in the analysis of any given cohort by a single caller.

Similar results were obtained in a set of colorectal samples where UCSC and Broad made calls. You can see a lot more going on in the unique calls, and similarly the high-quality variants were numerous in the unique calls. So the conclusion from this was that the discordance between callers was obviously high, and that was pretty dismaying. The high-quality calls as defined by one caller were being missed by the others, so that particular discordance was distressing. The suggestion was that, if we did multi-center calling, we would at least ameliorate some of the possibilities of false negatives.

By the way, it's also the case that if you apply callers just to diploid analysis, where you have the tremendous benefit of an expectation of about 50-50 in the reference and variant alleles, which is a very powerful constraint on the data, and you look at the results from five different callers on diploid genomes, they agree amongst themselves on only 57 percent. So it's not just in somatic mutation calling that there can be difficulties, but in this business as a whole. My sense is that a lot of this, as I
said before, is coming from the fact that we use heuristics to help filter the data. With those differences, we're basically sampling a very large multivariate space and choosing different components of that space.

Nonetheless, going forward, we then began using three-center calling on each of the cancers. The results coming out of that, where we were now able to superimpose validation data, showed that at least in the overlap what was being called was extremely accurate. In the three-center overlap shown here, 99 percent of the variants validated, and even when just two centers made the call, the validation percentage was very high. Out in the uniques, the validation rates are much lower. Nonetheless, validation was occurring in the uniques, so the three-center calls can at least pick up false negatives; at least they're on the table for scientists to consider and accept or reject.

Here's a similar analysis of lung adenocarcinoma data done by the Broad. Once again there is a very high, 100 percent validation rate in the three-center overlap and lower validation rates around the outside. More recently, this is the kidney clear cell project, where we had 500 patients. This was the largest cohort at the time. Looking at validation within 177 of the cases, this shows the number of mutations that were validated and the percentage that were valid in each segment of the overlap. You can see a very high validation rate if at least two centers made a call, and a much lower validation rate in the unique data.

This led to two developments. The first was a meta-caller developed by Terry Speed. He demonstrated that this meta-caller was very highly accurate, so it leads to a recipe for taking the multi-center calling data and making it highly accurate. In order to achieve that high accuracy, though, it's necessary to have validation data to calibrate each of the callers that are involved,
but once that's done, you can make very accurate calls from the data. Unfortunately, prior to the marker papers coming out, we don't have the validation from all the centers. So one thing that could be done retrospectively now is to go back and, with the validation data available and the calls from each caller, apply his method and generate even more comprehensive mutation data sets.

The other thing this did was lead to a formalization of multi-center calling. We recognized that the mutation callers are improving overall, that different callers detect different events, and that the validation cycles were taking too long to allow expeditious publication of the marker papers. So we began using multi-center calling to provide a significantly mutated gene list that's based on the calibrated accuracies of calls made by two or more centers. This gave us a path forward and broke the conundrum we ran into with reviewers who never wanted to take at face value mutation calls that were provided without validation. So we accelerate the submission of the marker papers now. We don't abandon validation; validation is still brought in, but it can go on while the paper is under review, with validation requiring a second independent sequencing event.

Okay, so where are we now? The other thing that multi-center calling does is enable other potentially interested researchers to add their callers in and experiment with the development of new methods. So for the Adrenocortical Carcinoma project, which is underway and nearing a paper (I think in yesterday's session we saw a review of the work with this tumor), we've now got five centers calling.
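The two ideas running through this part of the talk, measuring the validation rate in each segment of the overlap and then keeping calls made by two or more centers, can be sketched as below. This is only an illustration with made-up data; it is not the meta-caller described in the talk, and the function names and variant identifiers are hypothetical.

```python
from collections import Counter, defaultdict


def support_counts(calls_by_center):
    """Count how many centers called each variant."""
    support = Counter()
    for calls in calls_by_center.values():
        support.update(set(calls))
    return support


def validation_rate_by_support(calls_by_center, validation):
    """Validation rate grouped by the number of centers making the call.

    validation maps a variant to True/False for variants that were tested
    in an independent sequencing experiment.
    """
    support = support_counts(calls_by_center)
    tested = defaultdict(int)
    confirmed = defaultdict(int)
    for variant, ok in validation.items():
        n = support.get(variant, 0)
        if n == 0:
            continue  # tested site that no center called; ignored here
        tested[n] += 1
        confirmed[n] += int(ok)
    return {n: confirmed[n] / tested[n] for n in tested}


def consensus_calls(calls_by_center, min_centers=2):
    """Keep variants called by at least min_centers centers."""
    support = support_counts(calls_by_center)
    return {v for v, n in support.items() if n >= min_centers}


# Toy data for illustration only.
calls = {
    "center_a": {"chr1:100:A>G", "chr2:200:C>T", "chr3:300:G>A"},
    "center_b": {"chr1:100:A>G", "chr2:200:C>T", "chr4:400:T>C"},
    "center_c": {"chr1:100:A>G", "chr5:500:G>C"},
}
validation = {
    "chr1:100:A>G": True,
    "chr2:200:C>T": True,
    "chr3:300:G>A": False,
    "chr4:400:T>C": True,
    "chr5:500:G>C": False,
}
rates = validation_rate_by_support(calls, validation)
consensus = consensus_calls(calls, min_centers=2)
print(rates)
print(sorted(consensus))
```

A calibrated approach would go further and weight each caller by its own measured accuracy rather than counting votes equally; the vote count here is just the simplest version of the idea.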
You can see from this that there are still large numbers of unique calls, but now, through the advancement and improved sophistication of the calling, the center is the most heavily weighted, so the overlap is looking much better. We still have a number of calls on the outside, and preliminary data that I'm sorry I can't show you here indicate that when we look in the RNA-seq data for the calls in the unique sections, there are hundreds of events out here that appear to be expressed and are therefore probably valid somatic mutations. So again, we continue to sample from a large space, and every new logic that is applied picks up more of that sampling. Most of what's out here are events at very low allele fractions, so they're subclonal events in general, although not all of them are. Some of them have just escaped one or more of the heuristic parameters that are used in filtering.

So now we have a second generation of mutation callers coming on. One of them is being developed by the genome center in collaboration with MD Anderson. It measures a distance per position per sample to reflect mutation evolution, and its uncertainty estimates are based on a Bayesian Markov model, so this method will come with a calibrated certainty akin to a Q value. A method called Viper is now being refined by WashU, and there's a MuTect version 2 on the drawing board. So the next round of mutation callers is going to have even better accuracy and better sensitivity at low allele fraction. I don't think we'll see these outer unique sections decrease at all, and that's a good thing, because that is going to be pulling in these very low allele fraction subclonal mutations that we want to have.

These new callers are being tested in the DREAM Challenge as we go through DREAM one, two, and now we're in three. The callers from the sequencing centers, you can see, are at the very top of the list in what is probably a statistical dead heat.
I think that speaks very well of the TCGA sequencing centers, and so we'll be looking forward to the final rendition of the DREAM Challenge, which will use real data instead of synthetic data.

So in conclusion, the TCGA paradigm for mutation discovery is improved by multi-center calling. This enables us to decrease the false negative rates. It delivers a set of somatic SNVs of calibrated accuracy, accelerates submission of marker papers, and stimulates development of new mutation callers by providing benchmarking on the fly. A formal meta-caller was developed, which may be useful in retrospectively refining mutation calls from TCGA tumor sets. Finally, one parting thought that I didn't talk about is that we're now starting to use mutation calling in a lot of different contexts. We have RNA fusions, we have structural variation. All of those mutation modes are likely to experience a similar phenomenon to what we see in the SNVs, so that needs to be checked, and probably multi-algorithm calling will be required there too. So I'll just end with my acknowledgments of my colleagues, who all contributed to this talk. Thank you.

Okay, one question, one burning question please.

For those multi-callers, are you using the same underlying alignment, like BWA or Bowtie?

Yes, they're using the same set of BAM files.

Okay. We observe that with different alignment software, like BWA or Bowtie, you may get some different mutation calls or mismatches.

Yes, well, that's going to be important for comparing mutation profiles across tumors, so it is necessary to go back, and that's being done in the ICGC/TCGA whole genome analysis now. But that is a very important point.

So does TCGA recommend BWA as the overall aligner?

I'm sorry, TCGA what?

Basically, I saw lots of data was using BWA, so is this like...
Well, I was talking about whole exome here. So, yeah, we were using BWA. Everybody's using BWA, but in different ways, actually.

Okay, I'd like to welcome our next speaker, Kyung-Fan Lehmann, who is from Memorial Sloan Kettering Cancer Center and will talk about extensive trans- and cis-QTLs revealed by large-scale cancer genome analysis. Please.