Good morning. I'll talk to you about the initial results of the ICGC-TCGA DREAM Somatic Mutation Calling Challenge, or I'll just say "the challenge." The origin of this, at least from my end, starts with the idea that any time we do something in cancer research, what we really want is to translate it into the clinic. And to a clinician, it often comes down to this: we have a number of genomic profiles that we tell them give new information about the tumor, those go into a bioinformatic black box that is supposed to generate an answer, and in the end it's supposed to cure cancer. From a fundamental perspective, that means a lot of what we have to do is give clinicians and patients confidence in the analysis that we do.

The problem is, as we all know, that in many cases you'll have the molecular profiles, but if we use different black boxes, we might get different results. To me this was perfectly demonstrated by a biomarker I developed during my PhD. It was a biomarker for non-small cell lung cancer, and Richard Simon took a publicly available data set and said, "I can't validate that biomarker. It doesn't work." That's okay; not every biomarker is supposed to validate. But interestingly, when my postdoc looked at it, we were able to replicate it. It took round after round of going back and forth with a world-class statistician to figure out where the difference was, and it turned out to be an incredibly subtle one that led us to have the same data set and what we thought was the same algorithm yielding different results. It was actually a slight versioning difference in the way we used our software. The only difference was the pre-processing, and yet it completely changed the conclusion of which biomarker worked and which one didn't.

We explored this in a bit more detail, and we found that if you just change the way you pre-process your data, the predictions change: in this heatmap, each row is a different data pre-processing, each column is a different patient, and you can see there is very little concordance in what we would predict as the right care for that patient. That, of course, tells us this is going to be true in every type of data we look at, and it holds true for breast cancer. In fact, for breast cancer it's pretty funny: if you just play with your algorithm, you can show that 74 percent of genes are prognostic in breast cancer. Essentially every single gene, depending on how you analyze your data.
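Going back to that pre-processing heatmap for a moment: the concordance calculation behind a plot like that can be sketched in a few lines. This is a minimal illustration with randomly generated stand-in calls, not the actual analysis:

```python
import numpy as np

# Hypothetical binary risk calls: rows = pre-processing pipelines,
# columns = patients (1 = predicted high risk, 0 = predicted low risk).
rng = np.random.default_rng(0)
calls = rng.integers(0, 2, size=(8, 50))

# Pairwise concordance: the fraction of patients on which two
# pipelines make the same call.
n_pipe = calls.shape[0]
concordance = np.ones((n_pipe, n_pipe))
for i in range(n_pipe):
    for j in range(i + 1, n_pipe):
        agree = np.mean(calls[i] == calls[j])
        concordance[i, j] = concordance[j, i] = agree

print(concordance.round(2))
```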
And so to try to resolve this, we put together, and "we" means myself, Adam Margolin at Sage, and Josh Stuart at UCSC, a challenge that would allow us to benchmark the methods for identifying somatic mutations in cancer. At the beginning, we set out and said we only care about accuracy. This isn't about efficiency, and it isn't about the most computationally clever or parallelizable algorithm; we just want to know how well we can do. We started off saying this should be only real tumor data. We backtracked on that a bit, but at the beginning we said we would start with a series of 10 tumor-normal pairs. We chose them from two tumor types that vary substantially in cellularity: pancreatic is very low cellularity, prostate is quite high. And we looked at raw and processed data, so we made available FASTQ files, BAMs aligned in a couple of different ways, and all of the clinical information and protocols that somebody might want to know. How did you extract the DNA? What age is the patient? Those sorts of pieces of information.

Access approval to make a data set like that publicly available turned out to be painful, so we organized with the ICGC DACO an application to streamline access to the data. Essentially, there's a template that groups fill out and get institutional sign-off on, and they can then receive ready access to the data; we're averaging six or seven days to grant access to the challenge data right now. We also sought and received an opinion from the Western IRB, to give confidence in the many countries which require IRB approval even for the use of challenge data like this.

But as we started to design the challenge, we also realized there would be a big benefit to having simulated data, for a couple of reasons. One is that it might draw people from outside the standard cancer genomics field into this work, somebody who wasn't already developing these algorithms. The way this works is to start with a genome, be it a cell line or a germline genome, and spike in a series of SNVs and structural variations using a tool called BAMSurgeon, developed by Adam Ewing at UCSC, who's another part of the challenge. We then split this into two, take a subset of the reads, call one the tumor, and spike in additional SNVs and SVs, again at the read level, to create an artificial tumor-normal pair.
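The read-splitting step of that workflow can be sketched roughly as follows. This is a minimal illustration using pysam with hypothetical file names; the actual SNV and SV spike-in, which BAMSurgeon performs, is not shown:

```python
import zlib
import pysam

src = pysam.AlignmentFile("germline.bam", "rb")
tumour = pysam.AlignmentFile("synthetic_tumour.bam", "wb", template=src)
normal = pysam.AlignmentFile("synthetic_normal.bam", "wb", template=src)

for read in src:
    # Hash the read name deterministically so both mates of a pair
    # land in the same output file, giving a roughly 50/50 split.
    if zlib.crc32(read.query_name.encode()) % 2 == 0:
        tumour.write(read)
    else:
        normal.write(read)

for f in (src, tumour, normal):
    f.close()
```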
There are five releases of these synthetic tumors, coming out every six weeks; the third one is active right now. The idea is to have increasing complexity, so the first one is relatively simple: it assumes a 100 percent cellular tumor, with no subclones or anything, and it's meant to give an upper bound for just how accurate our tools are going to be. It's quite easy to get access to the data: you submit an application through Synapse, the download uses GeneTorrent, and Google has provided $2,000 in cloud credits to every participant.

So the basic structure of the challenge is that we've got the human tumor data section and the simulated tumor section, challenges one and two, and for each we're looking at both structural variations and single nucleotide variations. Both are being assessed using balanced accuracy, the mean of sensitivity and specificity. And each individual simulated tumor is scored separately for structural and for single nucleotide variations, so there are a lot of sub-challenges and opportunities for people to get feedback and see how their algorithms work.

The challenge is being scored in different ways. For the real tumor-normal pairs, we're waiting until all the results are available, then doing a random selection of results, basically trying to balance across the intersections between groups, and those are receiving deep resequencing on an alternative platform; it's Ion Torrent, because the initial discovery was on Illumina. For the synthetic data sets, of course, ground truth is immediately known. This is the kind of timeline: we expect to have the final results around November for an announcement of the winner, and we'll be in the validation phase, with the contest closed for challenge entries, by July.

So far the results have been, to me, quite staggering. There have been 440 entries; in fact, that number is a few days old, and with synthetic three closing we're seeing something on the order of 10 to 15 entries a day. That's 440 different predictions of single nucleotide or structural variations on just three genomes. There are 270 registrants, with about one or two joining each day. And to me, maybe the nicest thing about this is that there are ongoing post-challenge submissions. After people find that they don't do well, they keep trying to see what happened and how they can improve their algorithm, and they continue to use these tumors as a kind of living benchmark. It has also greatly improved our ability to simulate reads and simulate mutations, maybe eventually getting us to the day where we don't need real tumors because we can accurately simulate their characteristics.

Now some preliminary results. The entries broadly reflect a single ROC curve. That's not exactly true, but if you take a look here, you can basically see a relationship between sensitivity and specificity that traces out a ROC curve. Some groups are clearly off it, but if you look at the optimum point in the top right corner, groups are essentially only moving out a tiny bit.

There was a surprisingly large chromosome bias. In fact, this plot doesn't even do it justice. What you'll see is that the winning algorithm in the first in silico challenge only wins on a subset of the chromosomes, and there's significant overlap, with other algorithms doing better on one chromosome than another. Similarly, there's bias from one chromosome to another in how accurately different groups call it, with some chromosomes, chromosome 11 in particular, having very low calling rates for reasons we haven't been able to assess yet. As I said, this plot doesn't do it justice: the top algorithms were the least prone to this chromosome bias, while the groups that finished roughly fifth through tenth or twelfth, the middle of the pack, showed a lot of chromosome bias in their analyses. It's unclear why that might be, but they did very well on some chromosomes and not on others, and permutation studies show much more bias than you would expect by chance alone.

There also turned out to be, maybe unsurprisingly, significantly different determinants of the different types of errors. False positives were dominated by differences in variant allele frequency and mapping quality. By contrast, mapping quality was still important for false negatives, but normal coverage was more important; in fact, normal coverage turned out to be more important than tumor coverage in this genome. Now, take that with a grain of salt, because this was the 100 percent cellular genome, so that may influence why we're seeing this result. To give you a feel for it, each of the columns here is a different algorithm, and these are the importance metrics of the different features we looked at. You can see, for example, that variant allele frequency, moving from the top algorithm on the left through to the right, turns out to be the most important determinant of accuracy for most of the groups. This was determined using a random forest-type ensemble approach, using its importance metric. Similarly, you can see that for most algorithms mapping quality is second most important, and so forth, while something like GC content or, for that matter, homopolymer regions was much, much less important in determining error rates.
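A rough sketch of that importance analysis, using scikit-learn's random forest; the feature table, column names, and error labels here are hypothetical stand-ins for the real per-site data:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-site table: one row per candidate mutation site,
# with the site-level features discussed above.
sites = pd.read_csv("site_features.csv")
features = ["vaf", "mapping_quality", "normal_coverage",
            "tumour_coverage", "gc_content", "homopolymer_length"]
X = sites[features]
y = sites["is_error"]  # 1 = false positive or false negative, 0 = correct call

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y)

# Mean-decrease-in-impurity importances, one value per feature.
for name, imp in sorted(zip(features, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```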
The other big surprise to us was surprisingly strong trinucleotide effects. This is the false positive rate, normalized to the genome as a whole, for different trinucleotide contexts across the genome, and it's very clear that there are certain peaks. That peak, in fact, is a CCG changing to a CTG. So maybe not entirely surprising, but there are several other peaks in the genome that are of great interest.
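The normalization behind that plot can be sketched as follows; the false-positive contexts and genome-wide trinucleotide counts below are hypothetical placeholders:

```python
from collections import Counter

# Hypothetical inputs: the trinucleotide context of each false-positive
# call, and how often each trinucleotide occurs in the genome overall.
fp_contexts = ["CCG", "CTA", "CCG", "ACG", "CCG", "TTA"]
genome_counts = {"CCG": 9_000_000, "CTA": 60_000_000,
                 "ACG": 8_000_000, "TTA": 70_000_000}

fp_counts = Counter(fp_contexts)
total_fp = sum(fp_counts.values())
total_tri = sum(genome_counts.values())

# Enrichment of false positives in each context, normalized to the
# genome as a whole: values above 1 mean the context is
# over-represented among errors.
for tri, n in sorted(fp_counts.items(), key=lambda t: -t[1]):
    rate = (n / total_fp) / (genome_counts[tri] / total_tri)
    print(f"{tri}: {rate:.1f}x")
```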
And the last thing, maybe the most promising thing here, is that the coding regions had substantially lower error rates than the rest of the genome. It's almost as if we had optimized our mutation callers to the coding regions. The top three groups were all perfect in the coding regions. There were a couple of recurrent false negatives, but the most recurrent false negative was shared by only 30 percent of groups. And there were only a small number of false positives; in fact, the first false positive only showed up at the group ranked number eight. So in short, the vast majority of groups in the top performing set did very, very well in the coding regions, and that actually suggests we need to be putting much more work into our validation of non-coding regions.

The last takeaway is that parameterization was really critical. Pretty much every single group was able to step forward just from the small amount of data and feedback we put up on the leaderboard: it allowed them, almost immediately, within a day or two, to reparameterize their algorithm and do a much better job of calling. In short, having a little bit of validation per tumor allows you to do a much better job of calling, and until we work out optimal parameterizations, it might almost be a useful standard part of pipelines.

So in summary, we're seeing surprising trends in error profiles, from the big end, chromosomal bias, to the very small end, trinucleotide bias, and we're looking at windows of the genome, sequence context, and things like that. It may be that normal coverage is more important than we've given it credit for, although take that with a grain of salt. And we're starting to identify the best methods for mutation prediction: for example, MuTect was the best scoring on the first two in silico data sets for SNVs, and DELLY and novoBreak respectively won the first two structural variant sub-challenges. Maybe the most important thing is that we hope this is starting to foment a community that will work on benchmarking these algorithms and carry this forward as the kind of living benchmark I talked about. I should thank the people on my team who did the pilot work, all of the challenge organizing team, especially Josh, Adam, and Katie in my lab, who produced much of the analysis you've seen here, and the funding partners. And maybe I'll close by encouraging everybody to submit: the more entries, the better we're able to look at these issues, and community participation is really what drives this. Thank you, I appreciate your time.

Yes, we're doing that. The question was, will we make the calls from different callers available? Yes, actually, that's what we're working on right now: to do that systematically for all of them and to start looking at ensembles, to see how the error profiles of different complementary callers could be used together. That's probably number two on our priority list.

The next question was, if somebody submits multiple calls, would we take that into account when looking at their performance? We're not penalizing them for that in any particular way. We only use a subset of the data available for the leaderboard, and it's certainly possible to overfit on the leaderboard; in fact, we've seen a couple of examples of teams that may have overfitted to the subset used for providing rankings. But we think it's a good thing to encourage teams to continually go in and improve their algorithms like that. Thank you.

Thank you very much. In TCGA we have the luxury of using fresh frozen samples, but in a real clinical setting it's often the case that we only have paraffin-embedded samples, so it's important to know the difference between these two settings. Our next speaker, Dr. Eric Zomoda from Nationwide Children's Hospital, will talk about the lessons from the genomic characterization of patient-matched frozen and formalin-fixed paraffin-embedded tissue.