Hi everybody, Nikolai Kazanov. I am a bioinformatics scientist at Compendia Bioscience, now a Life Technologies company. I'm very excited to share with you some of the work we've been doing over the past year with TCGA data, specifically our pan-cancer analysis of 3,000 exomes.

Broadly, Compendia's modest mission is to cure cancer with genomics data. We try to accomplish that in two ways with our biopharma customers: one, helping them get to novel therapeutic targets, and following that, defining the right patients who could potentially respond to treatments for these targets. We do this within the company by working tightly between the bioinformatics and translational medicine teams, and also by working with our customers to see what their goals are.

The general approach that Compendia has taken to accomplish these goals is to really be the world's repository of cancer genome data. We have about 10 years of experience working with microarray data, as well as capillary sequencing data, and we currently have the largest collection of copy number data, which helps us identify amplification and deletion events. But really, TCGA is critical for our effort to make this catalog complete. With the effort and scope of TCGA and the ambition of the data available, we can really expand and look at the long tail of these somatic aberrations, which will help us reach our goals.

The unique opportunity that TCGA presents for us is the mutation and fusion data that's available from the exome samples that have been collected. So this is what I'll focus on in my talk. I'll briefly go through our view of the challenges, as a company, of working with TCGA and the approaches that we take, and then show several examples from both the mutations and the fusions.

So first, preaching to the choir, the scale of the data is challenging. It's rapidly evolving. Terabytes of data to process.
If you look at the exome data, it's ready to eat up as much compute time as you can throw at it.

Heterogeneity. TCGA has done a great job, as you've heard all throughout today and yesterday, of compiling the data, getting it into one place, and trying to be as consistent as possible. But it's an evolving project. Different working groups still use slightly different methods and potentially different gene models. So there's still a question of data heterogeneity that you have to address if you want to do a pan-cancer analysis, for example.

As a company, one of our unique requirements is speed. We really want to give our customers the competitive advantage of seeing, as soon as possible, the novel discoveries that are somewhere in this data. So we really value getting the most up-to-date data, processing it, interpreting it, and delivering the results in a timely manner.

And there's also a question of method development. Many of the methods used in the TCGA analyses are still evolving, and that's great. But some of them haven't been published, for example, and so, to us, the burden in delivering really relevant events is to first identify the true positive events and then, out of those, identify the drivers. Just as an example, as MuTect beta testers, we actually ran it on several hundred TCGA samples to identify mutations. We knew that it generates a lot of false positives, but since the method hasn't been published yet, it was challenging for us to know exactly which filters had been applied to generate the nice curated TCGA dataset. So we developed some of our own quality-based filters and compared the data back to TCGA, and we were happy to find that our understanding of what quality mutation data is was comparable to that of TCGA. So we now both have more trust in the TCGA data and have our own method of calling mutations and producing comparable datasets.
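The comparison of in-house calls against the curated TCGA calls described above could be summarized along the following lines. This is only a minimal sketch: the talk does not say how the comparison was done, and the mutation key `(sample, chrom, pos, ref, alt)` and the function name `concordance` are assumptions made for illustration.

```python
def concordance(our_calls, reference_calls):
    """Compare two mutation call sets.

    Each call is a hashable key; here we assume a tuple of
    (sample, chrom, pos, ref, alt), which is a hypothetical choice.
    """
    ours, ref = set(our_calls), set(reference_calls)
    shared = ours & ref
    return {
        "shared": len(shared),
        "only_ours": len(ours - ref),
        "only_reference": len(ref - ours),
        # Fraction of reference calls that we also made.
        "recall": len(shared) / len(ref) if ref else 0.0,
        # Fraction of our calls that the reference also contains.
        "precision": len(shared) / len(ours) if ours else 0.0,
    }


if __name__ == "__main__":
    ours = [("S1", "3", 100, "G", "A"), ("S1", "17", 200, "C", "T"),
            ("S2", "7", 300, "A", "T")]
    curated = [("S1", "3", 100, "G", "A"), ("S2", "7", 300, "A", "T"),
               ("S2", "12", 400, "G", "T")]
    print(concordance(ours, curated))
```

Treating the curated set as the reference is itself a choice; calls found only in the in-house set are not necessarily false positives.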
So this method development process is a part of our work.

So here's our approach to addressing these goals. We use all three data sources extensively. For mutations, we use the Broad GDAC and the DCC, depending on which one has the most up-to-date data at the time. We also, as I mentioned, can process our own mutations and fusions, and we get the primary data, of course, from CGHub for those purposes. As CGHub beta testers, we actually developed quite a robust pipeline to ingest RNA-seq and DNA-seq data and do our own fusion calling. Just as an example, this summer we did fusion calling on several thousand samples, which is 50 terabytes or so of data and over a year's worth of compute time, and we managed to do it in about a week. So we have this great infrastructure set up to process raw data, but also to re-annotate and standardize the mutation data that's already called.

This gives us our comprehensive somatic aberration database, from which we can start doing integrative analysis and pan-cancer analysis and really get to the whole driver question: what are the new and exciting driver events? Our general approach to getting at these drivers is an iterative method development process, where we look at gold standard positives, meaning known validated mutations and known published fusions, and try to develop methods that call out events of a similar type from this catalog.

So here's a brief overview of our mutations pipeline. This is a snapshot from about the middle of this year, where we'd processed, I say 3,000, but it's actually 2,998 samples across 15 diseases. We obtained, as I said, the data from TCGA, a little over a million mutations, and it goes into our annotation pipeline.
The primary goal of the annotation pipeline is to define a consistent variant classification and variant position for these mutations, since variant position is important for us in looking for recurrence. This is the protein-level position of the mutation. We also filter to gene regions of interest, throwing out a lot of intergenic stuff. We're then down to about 700,000 mutations.

We classify mutations into three categories. First, we call out hotspot mutations. For us, recurrent hotspot mutations are those that occur in three or more samples in this data set. It's a very simple definition, but it works remarkably well, and most of these events are statistically significant. So if we see a mutation at the same position in three or more samples, it's a hotspot. We also call out the potential loss-of-function, deleterious mutations, such as frameshifts, non-stops, and such. Everything else falls into the third category of other mutations.

We then assign statistical significance to the hotspots and the deleterious mutations at the gene level. Our approach is different from MutSig, for example. We use both the relative occurrence and the enrichment of these mutations in the gene, but our approach is really driven towards classifying a gene into one of two broad categories: a potential gain-of-function gene or a potential loss-of-function gene. There's more detail on our scheme in the poster, if you're interested. This scheme was developed by looking at gold standard oncogenes and tumor suppressors and identifying what characteristics we see in them: do we see a certain percentage of hotspot mutations in those genes, and a certain enrichment of deleterious mutations?

So out of this 3,000-sample data set, we identified 107 potential gain-of-function genes and about 120 potential loss-of-function genes. This is on a pan-cancer level. There's also a mutation analysis that was done. So I'll highlight two examples.
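The hotspot rule described above (the same protein-level position mutated in three or more samples) and the per-gene hotspot and deleterious fractions can be sketched as follows. This is an illustration under stated assumptions, not Compendia's pipeline: the record fields (`gene`, `protein_pos`, `sample`, `variant_class`), the MAF-style labels in `DELETERIOUS`, and the choice to count a mutation as a hotspot before checking whether it is deleterious are all hypothetical. The gene-level significance test and the gain/loss-of-function cutoffs are not given in the talk, so they are left out.

```python
from collections import defaultdict

# Assumed MAF-style labels for potential loss-of-function mutations.
DELETERIOUS = {"Frame_Shift_Del", "Frame_Shift_Ins",
               "Nonsense_Mutation", "Nonstop_Mutation"}


def call_hotspots(mutations, min_samples=3):
    """Return (gene, protein_pos) sites mutated in >= min_samples samples."""
    samples_at = defaultdict(set)
    for m in mutations:
        samples_at[(m["gene"], m["protein_pos"])].add(m["sample"])
    return {site for site, samples in samples_at.items()
            if len(samples) >= min_samples}


def gene_summary(mutations, hotspots):
    """Per gene, the fraction of mutations that are hotspot or deleterious."""
    counts = defaultdict(lambda: {"total": 0, "hotspot": 0, "deleterious": 0})
    for m in mutations:
        c = counts[m["gene"]]
        c["total"] += 1
        if (m["gene"], m["protein_pos"]) in hotspots:
            c["hotspot"] += 1          # hotspot takes precedence (assumption)
        elif m["variant_class"] in DELETERIOUS:
            c["deleterious"] += 1
    return {gene: {"hotspot_fraction": c["hotspot"] / c["total"],
                   "deleterious_fraction": c["deleterious"] / c["total"]}
            for gene, c in counts.items()}


if __name__ == "__main__":
    muts = [
        {"gene": "PIK3CA", "protein_pos": 545, "sample": "S1", "variant_class": "Missense_Mutation"},
        {"gene": "PIK3CA", "protein_pos": 545, "sample": "S2", "variant_class": "Missense_Mutation"},
        {"gene": "PIK3CA", "protein_pos": 545, "sample": "S3", "variant_class": "Missense_Mutation"},
        {"gene": "PIK3CA", "protein_pos": 100, "sample": "S4", "variant_class": "Missense_Mutation"},
        {"gene": "TP53", "protein_pos": 200, "sample": "S1", "variant_class": "Frame_Shift_Del"},
    ]
    hotspots = call_hotspots(muts)
    print(hotspots)
    print(gene_summary(muts, hotspots))
```

Counting distinct samples rather than rows keeps a duplicated record for one sample from inflating a site into a hotspot.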
One you might be very familiar with, and one sort of novel one that comes out of the pan-cancer analysis.

So this is one of our predicted gain-of-function genes, which is good: a known oncogene with therapies in the works and a 16% mutation rate across the cancer types. In the bar chart here, pink is the frequency of these recurrent hotspot mutations, whereas gray is the other mutations that aren't recurrent or deleterious. So you can see that it's definitely enriched for hotspot mutations. Deleterious would be in blue, and you don't see any. And it's remarkable that in 14 out of the 15 cancers there are at least some PIK3CA recurrent mutations that we observe. On the bottom plot, the x-axis is the residue position and the y-axis is the individual occurrences that we see in this data set, colored by disease. You can see that many of these are known peaks and that they occur across disease types, so many of them are not disease specific. The E545, E542, and H1047 peaks are well known. But you can also see that we start seeing some smaller but perhaps also significant driver events, such as the E726 peak in the kinase domain, and these are really the events that our customers are looking to explore. Are these mutually exclusive? Or are they potentially also things to look for or to target in patients?

This is RAF1, or C-RAF, and it has a much lower frequency of occurrence, only 1% across the cancer types. But we do see it in several cancer types, and the fascinating thing about this example is that this serine in the second conserved domain is actually implicated in an autosomal dominant disorder. It's an inhibitory phosphorylation site, and when it's disrupted it causes RAS-MAPK dysregulation, and so there are developmental aberrations and mental retardation.
But this is a somatic example of this event, and we wouldn't have called it out if it wasn't for the pan-cancer analysis: as you can see, the colors of the dots in that peak are different. So only by combining across disease types can you see these events pop up and then drill into them to see if they're functionally significant.

So I'll move on to our fusions method. This is again a snapshot from the summer. We've since done about 3,000 more samples. This is from six diseases. Many of the diseases that we picked were ones where known fusions exist, so we could validate our methods. We implemented two fusion callers, deFuse and TopHat, so we have both single-end and paired-end calling capability. We did find that with default parameters many of these methods actually miss some of the known fusions because they filter them out; they're a little too aggressive. So we rolled back the filtering and devised our own filtering and classification scheme. We filter mostly for breakpoints in known gene regions, and we classify based on caller agreement: if the sample was processed by both callers, we'd like to see dual-caller validation, so both callers calling the fusion. There's also an evidence-backed scheme to filter.

We have about 127 priority fusion events in this data set. About half of those are actually known published fusions, which is really comforting to us, and of course the rest are potential novel discoveries that we hope are exciting for our customers.

So this one was a result that we like to see. The TMPRSS2-ERG fusion, mentioned earlier on: we saw it at the expected frequency, 30 out of the 53 prostate samples had it, with the expected fusion boundaries. Below is our plot of exon counts at the 5' and 3' ends of the breakpoint. So we're seeing the same isoforms. The really fascinating thing is to look at the expression.
This is TCGA expression data. In pink are the fusion-positive samples, where the fusions were called, and you can see that ERG expression is up because it's driven by TMPRSS2. The exon-level plot is also pretty cool: the diamonds are the predicted breakpoints, and you can really see elevated expression in the fusion-positive samples past the breakpoint. The blue samples are the ones where the fusion has not been predicted, and you can see the expression is pretty flat. Interestingly, there are a couple of cases which show high expression, and you can see the faint blue lines in the exon plot. These are potentially undercalled samples, so it will be interesting to go back and see why the callers missed those.

Now, we also looked at individual gene partners and their occurrence, and these are RET fusions, RET partner fusions. Of course the most dominant event was RET/PTC in thyroid. We found nine of these, but we also found RET fusions in breast and lung with pretty consistent breakpoints. So these are possibly exciting new observations of this fusion in other diseases. Again, looking at the exon-level plots, we can see corroborating expression evidence for these fusion events, and once again we have some events that may be undercalled.

So, some future directions for us. TCGA is going to be an important resource. We're going to continue to try to ingest as much data as we can and continue to make these pan-cancer, whole-scale calls, digging into the tail end of the somatic aberration landscape, so to speak. We're also moving in the direction of integrative analysis, gene- and pathway-level summarization, and various outcome analyses. We're trying to get more cancer types beyond TCGA to improve our pan-cancer analysis, and potentially mapping the drivers that we find back to model systems for better functional characterization.
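The potentially undercalled samples mentioned above, meaning fusion-negative samples whose 3' partner expression looks like that of fusion-positive samples, could be flagged with something like the sketch below. The rule used here (expression at or above the lowest fusion-positive value) is a hypothetical threshold chosen for illustration, as are the function name and the input shapes; the talk describes spotting these cases visually and does not give a rule.

```python
def flag_possible_undercalls(expression, fusion_positive):
    """Flag fusion-negative samples with fusion-like expression.

    expression: dict mapping sample ID -> expression of the 3' partner
        gene (for example ERG), in any consistent unit.
    fusion_positive: set of sample IDs where the fusion was called.
    Returns a sorted list of fusion-negative sample IDs whose expression
    is at or above the lowest value seen in fusion-positive samples
    (an assumed, deliberately simple cutoff).
    """
    positive_values = [value for sample, value in expression.items()
                       if sample in fusion_positive]
    if not positive_values:
        return []  # no positives to calibrate against
    threshold = min(positive_values)
    return sorted(sample for sample, value in expression.items()
                  if sample not in fusion_positive and value >= threshold)


if __name__ == "__main__":
    erg = {"P1": 9.0, "P2": 8.5, "N1": 2.0, "N2": 8.7, "N3": 1.5}
    print(flag_possible_undercalls(erg, {"P1", "P2"}))
```

A minimum-based cutoff is sensitive to a single low fusion-positive sample, so a quantile or a margin above the fusion-negative distribution may be a better choice on real data.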
And of course we're still a business, and we're going to be driven by what our customers' questions are and try to answer those as best we can.

I'd like to thank everybody at Compendia. We have a small but great team who make this possible. We'd also like to thank all of you. This wouldn't have been possible without the whole of TCGA, but specifically Ken Asha, Michael Noble and Christian, as well as the CGHub guys, Mark Deakins and Chris Wilkins, and Day 1 Ken for feedback on TopHat. Thank you very much.