I'd like to thank the TCGA and the organizers for giving me the pleasure of presenting the results that we've been collecting with Mike Lawrence on the pan-cancer data that is being analyzed by the TCGA right now. Many of you have probably seen this slide; it just goes over the basic notion of how genome and exome data are analyzed in many of these sequencing projects. The tumor and the matched normal are extracted from a patient and compared against each other to characterize the most important somatic genetic alterations that we tend to look at nowadays: single-nucleotide variants, indels, copy number alterations, translocations, etc. When we combine these, we get cohorts of patients, either in the same tumor type or, as you will see now, across tumor types, and we perform statistical analysis to see which events are significantly recurrent in the population, and then ask what these genes are, what the pathways are, and what selective advantage they provide in tumorigenesis.
This is a little bit of an overview of the data that has been collected to date. We, the TCGA, have contributed a large portion of it. As you can see, we cover about 20 to 25 tumor types with about 500 tumor-normal pairs for each, and we have many types of data for each: whole-exome sequencing, whole-genome sequencing, RNA-seq, SNP arrays and methylation data. The ICGC is also contributing a huge amount, up to 50 different tumor types with 500 tumor-normal pairs for each. In a nutshell, there is a huge amount of data. This is a flood that we have to handle: we have to take advantage of the power that it gives us, and we also have to deal with the complexity that comes with adding so much data together. The way this data is processed is something that Mike Noble will go over in his talk.
Basically, we go from sequencing data to a BAM file, which is the data aligned against the reference genome, then through quality control, and then to the characterization pipelines that are all in Firehose at the Broad. They detect all of the genetic alterations that we look for: mutations, indels, sample purity, copy number, rearrangements and pathogens in tumor data.
This is the overview of the pan-cancer data set. We have eight different tumor types: breast, colon, glioblastoma, kidney, lung squamous, ovarian, rectal and endometrial, for a total of 2,143 patients, which amounts to 436,755 coding mutations. That's an extremely large amount of data. We can see here the spread of mutation frequencies across tumor types: they vary, lung has a significantly higher mutation frequency than the rest of the tumor types, and within each tumor type the mutation frequency is also variable. We can also see the different mutation categories in the bottom panel. The C-to-A changes are the ones that are prevalent in lung squamous, and those are the typical signature of smoking. C-to-T transitions are prevalent across all tumor types, and those are largely mutations in the CpG context, which contribute heavily to the background mutation rate.
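As an illustration of the mutation categories and frequencies described above, here is a minimal Python sketch that folds each substitution onto the pyrimidine strand, flags the CpG context, and computes a mutation frequency per megabase. The function names and the input tuple layout are hypothetical; this is not the code used in the Firehose pipelines.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify(ref, alt, prev_base, next_base):
    """Label one single-base substitution, e.g. 'C>A' or 'C>T (CpG)'.

    All four arguments are forward-strand bases (hypothetical input layout).
    Purine reference bases are reverse-complemented so that every label
    starts with C or T.
    """
    if ref in ("G", "A"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        # Reverse-complementing swaps and complements the flanking bases.
        prev_base, next_base = COMPLEMENT[next_base], COMPLEMENT[prev_base]
    label = f"{ref}>{alt}"
    if ref == "C" and next_base == "G":
        label += " (CpG)"
    return label


def spectrum(mutations):
    """Count substitution categories over (ref, alt, prev, next) tuples."""
    return Counter(classify(*m) for m in mutations)


def mutations_per_mb(n_mutations, covered_bases):
    """Mutation frequency per megabase of sufficiently covered sequence."""
    return n_mutations / (covered_bases / 1e6)
```

With per-sample counts of this kind, a smoking-associated sample would be expected to show a large `C>A` share, while `C>T (CpG)` would appear in every tumor type.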
In order to deal with approximately 400,000 mutations and figure out which genes are recurrently mutated, we have been working on the MutSig pipeline for several years now. At this point the algorithm takes multiple factors into account. It calculates sample-specific, gene-specific and context-specific background mutation rates, so for each gene we try to estimate the background model just based on the number of mutations that are there. Then we look at the base-level evolutionary conservation of the events, we look at the positional configuration along the cDNA to see where the mutations are located and whether there are particular hotspots that they cluster in, and we also have a separate metric for truncating mutations.
Below is an overview of all the different tumor types that we analyzed. The number of significant genes varies from a couple of dozen to only a few, and in pan-cancer we actually detect a lot more. I'll present these published studies one by one, in the chronological order in which they were published; we can go over some of the genes that we find, and then we'll see how that is represented once we combine all these data sets together.
This is glioblastoma, which was published a long time ago already, in 2008. Here we find most of the genes that were published at the time as characteristic of GBM, like EGFR, PTEN, RB1 and PIK3R1 mutations. Soon after this paper was published, IDH1 was published as well, and we see that the mutations here are clustered in two hotspots, two sites, and there are 15 of them. Let me go over what the columns of this table mean.
In this table we have a list of genes, the number of mutations, the number of patients and the number of sites, and then the different p-values that the algorithm outputs: the background model p-value, the clustering p-value, the conservation p-value, and a p-value after we combine all of these metrics together. So this is for glioblastoma: we see that IDH1 is up on the list, and we have GABR1, and integrin alpha might also be interesting because it's clustered and also well conserved.
For ovarian, we see the basic highlight that most patients have TP53 mutations. By performing the clustering and conservation analysis we also managed to pull up SARC, which is at low recurrence, with only four mutations in four patients, but they're in two sites that are well clustered and well conserved.
For colon, as you've seen in the published data, we have the two major pathways, the Wnt pathway and the TGF-beta receptor pathway. We have FBXW7, APC and FAM123B; for the TGF-beta receptor pathway we have SMAD2, TGFBR2 and SMAD4; then we also have some PIK3CA mutations and BRAF V600E mutations, which also cluster and are very well conserved, as you can see here. So we managed to get most of the published genes, in order, at the top of the list. The same two pathways are implicated in rectal tumors as well, as you can see here.
For lung squamous we also managed to get a lot of the genes that were published in the manuscript. Among the most important are NFE2L2 and KEAP1, which are binding partners. Then we have NOTCH1 loss-of-function mutations, similar to those in the head and neck paper that also came out, and we have RB1 and MLL2. So as you can see, we're getting the same mutations that were published, and these are important pathways in their respective tumor types.
For breast we also get all the genes that were published; even though in this case we used different algorithms, we get largely the same genes, from each subtype of breast cancer: luminal A, luminal B, basal, etc. We get GATA3, RUNX1, AKT1, MLL3 and PTEN, and very interestingly we also get SF3B1, which was published in a chronic lymphocytic leukemia paper in the New England Journal and was also implicated in MDS. It's a splicing factor, and it is still being actively researched to find out which target it is mis-splicing.
For kidney cancer we have the important genes that were just talked about in the previous talk, PBRM1, VHL and BAP1. Here we have a much lower recurrence of TP53, and there is also an interesting gene, DNAH9, which is clustered and conserved, with 20 mutated sites.
With endometrial, the most interesting genes, apart from the ones in the pathways that we already know about, are NFE2L2, which is implicated in lung cancer, and SPOP, which is important in prostate cancer. NFE2L2 is clustered and conserved the same way it is in lung squamous, but we don't see its binding partner here, so it would be interesting to investigate exactly how this pathway functions in endometrial tumors.
Now, after putting all of these data sets together, we have a lot more power to detect genes that are not significant in any single tumor type because there is not enough power to detect them there. Once we combine everything, apart from the genes that are significant hallmark genes for each separate tumor type, as we go down the list we see genes that we can only detect by combining tumor types.
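The gain in power from pooling can be illustrated with a simple binomial tail test. This is a minimal sketch under loud assumptions: a single, uniform per-patient background probability and made-up patient counts. It is not the background model described in the talk, which is sample-, gene- and context-specific.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), computed in log space."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_upper_tail(k, n, p):
    """P(X >= k): chance of seeing at least k mutated patients by background alone."""
    return sum(math.exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))


# Hypothetical numbers, for illustration only.
BACKGROUND = 0.005          # assumed chance a patient carries a background mutation in the gene
PER_TYPE, N_TYPES = 250, 8  # assumed cohort size per tumor type and number of types

if __name__ == "__main__":
    # A gene mutated in 4 of 250 patients in every tumor type: weak alone, strong pooled.
    single = binom_upper_tail(4, PER_TYPE, BACKGROUND)
    pooled = binom_upper_tail(4 * N_TYPES, PER_TYPE * N_TYPES, BACKGROUND)
    print(f"one tumor type: p = {single:.3g}; pooled: p = {pooled:.3g}")
```

The same function also shows the flip side raised in the talk's conclusion: a gene mutated in 6 of 250 patients in one tumor type, but only at background level elsewhere (say about 15 of 2,000 in total), gets a smaller p-value in its own tumor type than in the pooled cohort.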
Here at the top of the list, and this is 150 genes, so there are five parts to this table, the top part shows the older genes that I just went over. They come from the different tumor types that we described and are hallmarks of each respective tumor type, up until we get to this part of the list, where we have MTM1 and HCN1, which are new, and then NFE2L2. Going forward we see another member of the hyperpolarization-activated cyclic nucleotide-gated channel family, and then beta-2 microglobulin. So we have a lot of genes that we wouldn't be able to find if we were analyzing all of these tumor types separately. ATM is also a famous tumor suppressor gene; it was found to be significant in CLL, in the New England Journal paper where SF3B1 was found as well, but we didn't find it significant in any of the individual tumor types that we analyzed. ERBB2 is also significant, as are other genes on this list. Here we have different transcription factors, like E2F1, which is very well clustered and well conserved, and we also have DCF7, and STK3, a serine/threonine kinase, which is not very recurrent in the set but is really well clustered and well conserved.
Here the genes are represented by the maximum percent mutated across the respective tumor types, and they are ordered that way in this table. We see most of the genes that we found significant by analyzing the tumor types separately: TP53, APC, PTEN, KRAS, PIK3CA, all of the genes that we've talked about in the papers published so far. This summarizes the recurrence results when we analyze all the tumor types separately. When we combine the data sets together, we get a table sorted by percent recurrence in the overall pan-cancer data set, and here we see the same genes up
on top that we get in all the papers, but now we also get genes that we think of as significant in one tumor type but that are also mutated in every other tumor type: DNAH9, FAT4 and MLL2. Then there are certain genes where we have to see whether they're real or not, like EYS and so on. NFE2L2 is only in lung.
To summarize these results: as I said, we have a different number of significant genes per tumor type, we have a lot of significant genes in the pan-cancer data set, and a lot of new genes come up. I only talked about a few of them, but you can see that there is a big list. Here are some that I just mentioned: family members that might be important; beta-2 microglobulin, which is part of an immune pathway, an antigen-marking pathway; and MTM1, which is a muscle cell differentiation molecule. We know that differentiation is an important mechanism in cancer, but there are still genes involved in it that we haven't found yet.
To conclude, I think there are two things that we have to keep in mind when combining tumor types. Combining tumor types gives us significantly more power to detect putative driver genes that we are underpowered to detect in each tumor type separately. On the flip side, it also dilutes the power to detect driver genes that are potentially important in individual tumor types: genes found at the bottom of the significance table, barely recurrent enough to be noticed by the analysis team in the respective tumor type, will not make it onto the final list once we combine all the data sets together.
Here are some future steps that we need to consider when we combine data sets, because this is a pretty complex problem. There are a couple of things we can do. We can incorporate other information about potential functional role, apart from conservation: there
is PolyPhen-2, there is MutationAssessor and there is CHASM, and Rachel Karchin will be talking about those in the next talk. We can perform the significance analysis on curated gene sets, which we have done before for different tumor types. We can extend this analysis to look at correlation and mutual exclusivity with MEMo, within and across tumor types. We can take into account the variable background mutation frequency across the genome, and we can also look at pathways by performing significance analysis on altered gene subnetworks, working with HotNet and PARADIGM as well, so it's important to collaborate with these groups. The other thing that's really important, as the previous speaker mentioned, is that integrated analysis has not been done yet, especially on these huge data sets where we get genes that are new and that are not significant in the respective tumor types.
With that I would like to conclude and thank, first of all, Gaddy Getz for spearheading this pan-cancer effort at the Broad, and then Matthew Meyerson, Stacey Gabriel, Levi, Eric Lander and Todd, who are the leaders at the Broad and help a lot with this analysis and with bringing about these ideas. I'd also like to thank our analysis team and our collaborators. Thank you.
Two related questions. How often do these look like gain of function versus loss of function, these rare ones that you're pulling out? And as a sort of corollary, do you see particular point mutations at particular amino acids showing up in multiple cancers very infrequently, or are these more often loss of function in many different ways?
For certain genes that we found in the last table that I showed, we have investigated, and they are really loss of function, but we have both cases. For DNAH9, for
example, we have hotspots, and for beta-2 microglobulin we have to see whether they are loss of function or not. But we haven't really looked at these genes closely; they're just fresh out of the computer and we have to go through them.
Just two more questions, then we'll have to move on.
Hi, I'm Angela from Harvard. Since you've now done the pan-cancer analysis, can you comment on which pathway had the most mutated genes across all the different tumors you've analyzed? And my second question is: are you going to look at the promoter regions in your whole genomes to look for significantly altered regions?
For the first question, from what I've been noticing, the most implicated pathways seem to be the TGF-beta receptor and the Wnt signaling pathways, whose mutations are at the top of the list, but we also have to see whether we can place the other, rarer genes that we just discussed into different pathways. For the second question, I'm not exactly sure how many whole genomes we have to analyze this, but we can use the flanking regions and the coverage in the flanking regions to see if there are any promoter mutations.
I'd like to know whether you have the percentage information: are these mostly site-specific mutations, or are they mostly related to alternative splicing? And I'd also like to know what kind of software you use to identify these kinds of mutations. Thank you.
Excuse me, can you repeat what type of mutations, the first one?
Yeah, so my first question is whether you have any information related to these gene mutations: are they mostly site-specific mutations, or are they like...
You mean the clustering?
Yeah, base changes, like alternative splicing.
So the first question is about the clustered mutations, right?
Yeah, related to that.
So we have an add-on to the MutSig algorithm that
jointly looks at the conservation and the clustering, and then we combine that metric with the p-value that we get from the different covariates that build the background model. That way we can pull up genes that are not as recurrent in the data set but are well clustered and have some sort of conserved hotspot that might be important in a pathway. That's how we can get genes like SF3B1, which is a splicing factor, with its canonical site, and we can get other mutations, in a lot of different kinases for example, this way.
Okay, thanks. And one more speaker.
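The last answer describes combining a joint clustering and conservation metric with the background-model p-value. The talk does not say which combination rule is used, so the sketch below swaps in Fisher's method, a standard way to combine independent p-values, purely as an illustration. The function name is hypothetical, and independence of the two p-values is an assumption.

```python
import math


def fisher_combine(pvalues):
    """Combine independent p-values (each in (0, 1]) with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has the closed form
        exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i   # (X/2)^i / i!
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Hypothetical gene: unremarkable recurrence, but a strong clustered and conserved hotspot.
    print(fisher_combine([0.2, 1e-5]))
```

Under this rule, a gene with a modest background-model p-value can still rank highly when its clustering and conservation p-value is very small, which matches the behavior described for genes like SF3B1.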