Okay. Well, thanks a lot. It's a real honor to be here and talk about the work we're doing at the Broad in the cancer genome analysis group led by Gaddy Getz. I'm going to talk about two subjects. One is the great diversity of tumor projects that we're seeing now and the spectrum of mutations in them. The other is the improvements we're making to our statistical methods for picking out significantly mutated genes. At the end, I'm going to try to tie it all back together.
We're at an amazingly exciting time now where, in TCGA and other projects, we've started to sequence thousands of exomes and genomes, and there are 20 or 30 tumor types listed here. This plot shows, for each cancer patient whose exome was sequenced, their total somatic mutation rate. You can see it varies over four or five orders of magnitude, from tumors on the left like leukemias and childhood tumors, which have very low mutation rates, all the way on the right to melanoma, lung cancer, and other tumor types associated with known mutagens, which have among the highest rates. So this already poses a big challenge: dealing with this vast range of mutation rates.
On top of that, it's not just that the total number of mutations increases. The types of mutations change as well. That's illustrated in the bottom panel here, where these six colors show the six fundamental kinds of mutations you can have. That is, the mutation can be at either a C:G base pair or an A:T base pair, and you multiply those two possibilities by the three different bases you can change to in each case. So let's zoom in on one of these tumor types, ovarian, which was one of the first TCGA tumor types. It happens to have a very representative mutation spectrum, which we call the vanilla mutation spectrum.
It's pretty much dominated by these C-to-T mutations, and it's fairly flat everywhere else. Again, these six colors are the six fundamental kinds of base changes. But within each square, it's broken down further by the context: the left neighbor and the right neighbor. You can see this group of four bars sticking up here. These are the CpG dinucleotides, well known to be highly mutable because of methylation. Per site, they have a much higher mutability than other bases. We see a very similar pattern in GBM, where it's even more dominated by those CpG transitions. These plots have been scaled to near the maximum, so here everything else has fallen down flatter in comparison.
It's a totally different matter when you go to lung cancer, where you now have these cyan-colored bars sticking up, which are C-to-A mutations. They don't have a particular sequence-context preference. This is lung squamous. When you look at lung adeno, that C-to-A component becomes even more important, maybe even the dominant part of the mutations there. In melanoma, the C-to-T mutations are the vast majority of mutations. But in contrast to the vanilla spectrum, the context doesn't matter here: it's just C-to-T transitions at any C.
Now here are some interesting new tumor types that are coming out of the TCGA pipeline. The first one is cervical, where it looks like the vanilla spectrum except that now you have this back row, the red, the cyan, and the yellow, sticking up. That represents contexts where the 5' base is a T, so these are TpC dinucleotides. This is the first tumor type in TCGA that we see this in. Cervical cancer is known to be associated with HPV, and there's also some suggestion in the literature that HPV infection is associated with this mutability of TpC dinucleotides. So this kind of makes sense.
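To make the six base-change classes and their flanking context concrete, here is a minimal Python sketch of how single-base substitutions could be tallied into such a spectrum. It is an illustration only, not MutSig code: the function names (`classify`, `is_cpg`, `is_tpc`, `spectrum`) and the toy input are hypothetical, and the choice to fold every event onto a C or A reference base is an assumption that mirrors the C:G / A:T grouping described in the talk.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify(left, ref, alt, right):
    """Fold a single-base substitution onto a C or A reference base.

    Arguments are the 5' neighbor, reference base, mutant base and
    3' neighbor, all read on the same strand. Returns (change, context),
    for example ("C>T", "ACG").
    """
    if ref in ("G", "T"):
        # Read the event from the opposite strand: complement every base
        # and swap the neighbors so the context is still written 5' to 3'.
        left, ref, alt, right = (COMPLEMENT[right], COMPLEMENT[ref],
                                 COMPLEMENT[alt], COMPLEMENT[left])
    return f"{ref}>{alt}", f"{left}{ref}{right}"


def is_cpg(context):
    """True when the mutated C is followed by a G (CpG dinucleotide)."""
    return context[1] == "C" and context[2] == "G"


def is_tpc(context):
    """True when the mutated C is preceded by a T (TpC dinucleotide)."""
    return context[0] == "T" and context[1] == "C"


def spectrum(mutations):
    """Count (change, context) pairs over (left, ref, alt, right) tuples."""
    return Counter(classify(*m) for m in mutations)


# Hypothetical toy input, not real data.
toy = [("A", "C", "T", "G"), ("C", "G", "A", "T"),
       ("T", "C", "G", "A"), ("G", "T", "A", "C")]
counts = spectrum(toy)
```

With six changes and four choices for each neighbor, this yields up to 96 categories. The CpG and TpC groups discussed in the talk are simply subsets of the C-reference contexts.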
We also see the same pattern in bladder, almost the exact same spectrum. So it's really tempting to speculate that bladder might also have a viral component, and that'll be interesting to see.
An alternate representation of this data, which Dr. Lander showed in his keynote this morning, is this radial plot, where the distance from the center reflects the total mutation rate and the angular position reflects the kind of spectrum. Now you see the lung tumors segregating, the melanoma has its own sector of the plot, et cetera. Up where the orange oval is, this is the TpC-dominated cohort with the cervical and some head and neck tumors, and now also the bladder set. If you focus on the sectors highlighted here, you can look at the bar graph for each of them. My pointer stopped working. Anyway, you can see that head and neck, cervical, and bladder are the ones with the back row standing up. Everywhere else, it's pretty much the yellow bar standing up, except for lung.
Okay, so now I'm going to move to the second part, which is how we go from these descriptive observations of mutation spectra to actually finding significantly mutated genes. This is the algorithm we've been working on for a few years, called MutSig, which picks out significantly mutated genes. It starts from a set of patients and the mutations observed in them, tallies them up, and does some statistics to figure out a score and a cutoff. The most naive version of MutSig assumes a constant background rate across the genome. Specifically, it assumes that there's no difference between sequence contexts, which we know isn't true from the previous slides. It also assumes that all the patients have the same mutation rate, or that we model them with an average across the patient set. And furthermore it assumes that all genes have the same uniform background rate.
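The naive model just described can be sketched as a one-sided binomial test: every covered base in every patient is treated as an independent trial with the same background mutation probability. The Python below is a hedged illustration under that assumption, not the MutSig implementation. The function names and the numbers in the toy example are hypothetical, and the talk does not specify which test statistic MutSig actually uses.

```python
import math


def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        # Work in log space so large n does not overflow.
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        term = math.exp(log_term)
        total += term
        # Past the mode the terms only shrink; stop once they no longer matter.
        if i > n * p and term <= total * 1e-18:
            break
    return min(total, 1.0)


def naive_gene_pvalue(n_mutations, covered_bases, n_patients, background_rate):
    """P-value for a gene under one genome-wide background rate.

    Ignores sequence context, patient-to-patient differences and
    gene-to-gene differences, exactly the simplifications the naive
    model makes.
    """
    trials = covered_bases * n_patients
    return binom_sf(n_mutations, trials, background_rate)


# Hypothetical numbers: 100 patients, 3 mutations per megabase background.
p_short = naive_gene_pvalue(5, covered_bases=1_500, n_patients=100,
                            background_rate=3e-6)
p_long = naive_gene_pvalue(5, covered_bases=100_000, n_patients=100,
                           background_rate=3e-6)
```

The same five mutations are far more surprising in the short gene than in the long one, because the expected count under the flat background scales with the number of covered bases.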
As I said, we know the sequence context matters. You might be looking at melanoma, where the vast majority of mutations are these UV-induced C-to-T mutations, and A-to-T mutations are much less frequent. If you pick these four genes, you might see that there's this constant background of UV-induced mutation, but if you have one gene, gene X here, that has the less common mutation, you might be tempted to give gene X more weight despite the fact that it has the fewest total mutations.
Analogously, if you look at patients, you might have two patients with very different mutation rates. Patient one might have a much lower mutation rate than patient two. Again, if you see that gene B is mutated in patient one, whereas nearly all genes are mutated in patient two, you might be more drawn to gene B as a potentially important cancer gene, even though it has the fewest mutations and is mutated in the fewest patients.
So now to the third item on the list as we make our algorithm more refined: what about genes? Genes themselves could have different mutation rates across the genome. Here are four genes in increasing order of number of mutations. How do you know which one is the driver? With the previous two cases, you could pick out by eye what was the background and what was the signal, but here it's much less obvious. So how do you know what the background rate of a gene is, and why is it important?
To expand on the point that Eric made in his presentation this morning, you could imagine two distributions that both have the same average mutation rate. In the first case, all the genes have the same mutation rate, and there you could draw a cutoff that is designed to have a false discovery rate of 1% or less, and that'll give you some hits.
And that's probably good if it accurately represents the situation you're dealing with. But you could have a slightly different situation where you have the same average mutation rate, but now a quarter of your genes have a background mutation rate that is twice that of the other genes. Now your distribution is a sum of two distributions, the low-mutation-rate genes in blue and the high-mutation-rate genes in red, and that gives you the fatter tail. If you use the same FDR cutoff here, you're surely going to be swamped by false positives from the distribution that has the higher mutation rate.
So how do we get a handle on this problem? First of all, how do we know it's a problem? Here I want to refer you to the great poster that Bryan Hernandez has, for those of you who haven't already taken a look at it. What I'm showing here is the lung data. You get 843 significantly mutated genes, which is a mind-blowing number. When you see that 146 of them are olfactory receptors, you start to realize that a lot of these genes are fishy. There are also the CUB and Sushi proteins, mucins, ryanodine receptors, and titin. These are common offenders that show up on a lot of lists, and it would be great to find out why they're here.
One thing we know is that expression has an effect on mutation rate. Genes that are more highly expressed have lower mutation rates, but that only accounts for about a one-and-a-half-fold difference between the highest and lowest categories. And when you look across the genome, you see that mutation rate varies by 10-fold or more. This is showing the noncoding mutation rate across chromosome 10. So we looked for covariates of this, and we were inspired by Shamil Sunyaev, who pointed out that replication time covaries with mutation rate in germline studies.
When we plotted replication time on top of this, we saw that it's not a perfect match, but the black curves and the red curves are very tightly correlated, and replication time does explain quite a bit of the variation we see. Olfactory receptors in particular replicate late. As one example, here's a cluster of olfactory receptors on chromosome 1 that's in a very highly mutated part of the genome, as measured by noncoding mutations, and also replicates very late. So maybe this is in a region of the chromatin that's pinned to the outside of the nucleus. The cell doesn't really care about it. These are maybe old genes that don't matter anymore, and they're left until the end of S phase, when the nucleotide pool is depleted. In any case, we probably want to be less excited about them. Here's the CUB and Sushi domain protein CSMD3. It looked like the replication time assay had saturated, so we can fix that and put it into our calculation.
The upshot of all this is that the mutation landscape is not flat, as the naive model assumes. It has structure: it's a high-dimensional space with covariates like replication time and expression level, and you want to learn for each gene what its actual background mutation rate is. We have a lot of data so far, but we don't have perfect data yet, and we don't know this accurately for every gene, especially some of the smaller genes. So the trick we use is to explore outward from the gene itself to its very close neighbors in the space, and look at how many silent mutations the neighbors have and how much that increases our confidence in our estimate of the gene's background mutation rate. Once we do that for the lung set, the improvement is really just tremendous.
We don't lose any of the known genes, and moreover they bubble up to the top of the list. We only have to look in the top 12 genes to find the top six known lung cancer genes, as opposed to having to go down to number 169 to find NFE2L2. Even better, the olfactory receptors drop way down to rank 181 and below, so the gene list cleans up to 52 genes, and we're a lot happier with this. It holds true in other tumor types, and it still works in the older tumor types that didn't have these problems, like prostate, which has a low mutation rate and was relatively well behaved without these new improvements. We still get the three significant genes there.
So to summarize and put it all together, this is the master equation that Gaddy wrote out on the board once, and you can see where these features play into it. The patient-specific mutation rate is Fp, the gene-specific mutation rate is Fg, and then learning the mutational spectrum characteristics and how they apply to the tumor types and the individual patients is the weighted sum here of the different mutational factors.
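The neighbor-pooling trick described above, estimating a gene's background rate from silent mutations in genes that sit close to it in covariate space, can be sketched as follows. This is a simplified illustration under stated assumptions, not the MutSig algorithm. The function name, the dictionary layout, the Euclidean distance on pre-standardized covariates, and the stopping rule (`min_silent`, `max_neighbors`) are all hypothetical choices made for the example. A pooled estimate of this kind could plausibly feed the gene-specific term Fg, but the talk does not spell out that step.

```python
import math


def pooled_background_rate(target, genes, min_silent=20, max_neighbors=50):
    """Estimate a gene's background rate by borrowing from similar genes.

    genes maps a gene name to a dict with:
      "covariates":   tuple of standardized covariates (for example
                      expression level and replication time),
      "silent":       number of silent mutations observed,
      "silent_sites": number of covered sites where a silent mutation
                      could have been observed.
    Neighbors are added nearest first until min_silent silent mutations
    have been pooled or max_neighbors genes have been used.
    Returns (rate, list_of_neighbors_used).
    """
    anchor = genes[target]

    def distance(name):
        return math.dist(anchor["covariates"], genes[name]["covariates"])

    neighbors = sorted((g for g in genes if g != target), key=distance)

    silent = anchor["silent"]
    sites = anchor["silent_sites"]
    used = []
    for name in neighbors[:max_neighbors]:
        if silent >= min_silent:
            break  # enough evidence already, stop borrowing
        silent += genes[name]["silent"]
        sites += genes[name]["silent_sites"]
        used.append(name)
    return silent / sites, used


# Hypothetical toy data: a small gene with too few silent mutations of its own.
toy_genes = {
    "SMALL": {"covariates": (0.0, 0.0), "silent": 1, "silent_sites": 1_000},
    "NEAR": {"covariates": (0.1, 0.0), "silent": 9, "silent_sites": 9_000},
    "MID": {"covariates": (0.5, 0.5), "silent": 30, "silent_sites": 10_000},
    "FAR": {"covariates": (3.0, 3.0), "silent": 100, "silent_sites": 10_000},
}
rate, used = pooled_background_rate("SMALL", toy_genes, min_silent=10)
```

The design trade-off is between variance and bias: borrowing from more neighbors steadies the estimate for small genes, but reaching too far pulls in genes whose true background rate differs.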
Once we get all this in, we can run MutSig on the pan-cancer set, and this is just a preview of that, where tumor types are listed as the rows and these are the top significant genes by MutSig. I've just labeled some of the known genes. I should also point out that this is not all validated data, so we definitely want to follow up on a lot of this before we make any excited claims about new findings.
With that, I just want to stop by thanking all the people who have helped with this at the Broad. They are really too numerous to even fit on the slide, but I especially want to thank the people who have been part of the MutSig team, including Petar Stojanov, Bryan Hernandez, Marcin Imielinski, Peter Hammerman, Greg Kryukov, Eran Hodis, and Chip Stewart, as well as Gaddy Getz for his unfailing leadership and inexhaustible stream of ideas, and all the other people at the Broad. Thank you all for your attention.
Chris.
Mike, hi. That's a very important topic, obviously. I want to ask you a question about the relationship between three things: the mutation count, which is what you observe in the data; the mutation rate, as in the probability of a mutation arising in a particular gene at a certain time in oncogenesis; and third, the proliferative advantage such a mutation confers to the clonal expansion, and hence the likelihood of observing the mutation in the final tumor. In other words, the observation is counts, or mutation density, and that relates on the one hand to the rate of that mutation occurring, but the proliferative advantage also comes in, which you might call a confounding factor. So I'm asking, how are you going to take that into account?
That's a great question. They are totally confounded together, and I don't know that we really have a great handle on that.
We tend to leap from the observation that a gene is frequently mutated to saying that therefore those mutations conferred a selective advantage, but it's surely not as simple as that.
Maybe you can add a term to the master equation.
Yeah.
Rule?
Hey Mike, a wonderful talk. I wonder if you have any plans to make your method available...
Yes.
...and the data?
Definitely, yeah. We're working with a few beta testers, or we're planning to. We're setting up a transfer of the software, and you should join.
Does that also apply to dRanger and the other tools that your group has developed?
Absolutely.
In the correction for the background mutation rate, do you only consider two factors, replication timing and expression level, or do you consider other factors? There are many other factors that are known: GC content, association with the lamina, and so on.
Great point. Dozens of factors, you're right. The model we're using right now has five factors, and GC content is one of them. Association with the lamina, that's a really great idea.
What are the five factors?
Expression level, replication timing, GC content...
And the other two?
Let's see if I can remember. I can't remember right now. Maybe I can talk to you later.
Okay.
We would love to get a data set on lamina association. That's a great idea.
Two more questions, please. Over there.
Very good talk. You introduced a gene-specific background for each gene. Now, we identify significantly mutated genes after we first identify mutations, but if we infer the mutations from sequencing data, we know each gene actually has different coverage. Some genes have very high coverage and some genes have very low coverage. Have you ever considered incorporating that sequencing coverage information into your model to inform the gene-specific background?
Oh yeah. That's been in there since the beginning.
We count up how many bases were successfully covered in the sequencing, and that's our denominator for all the calculations.
But what about the coverage itself, the depth? Some genes may have high coverage, but some genes only 20-fold. That's also a factor.
Oh, I see. You're talking about how well covered it is. Right now we use a cutoff, one at which we know we're about 80% powered to detect a mutation, and if we're better covered than that, we don't take that into account. It's a binary thing, but you're right, we should really move to using a probability of detecting a mutation.
Next question, please.
Hi. You described a spectrum of mutations for different types of cancer. Inside the spectrum for each type of cancer, do you see some sort of classification or partitioning of patients? Some types have, I don't know, 16 different types of mutations. Maybe these classify patients.
I think it could definitely help to classify patients. In the head and neck cohort, for example, some of the patients have that TpC dinucleotide signature that's HPV associated and some don't. When we looked, it actually did correlate with which patients had HPV sequences detected in their DNA. So that's one example where the spectrum can independently stratify the patients. Another example is in the colorectal set, where the hypermutated CIMP-positive cohort has a slightly different mutation spectrum too.
Thank you. I think it'll be interesting to see if there are any gene signatures that go together with that.
Yeah. Thanks.
Okay, very last question, please, since you already opened.
Quick one.
Yes.
I assume you're talking about somatic mutations. My question is, if you look at germline mutations, is it also true that olfactory genes have higher mutation rates?
Yeah, I think that is true.
Okay. Thanks again, Mike.
Thanks.
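The coverage exchange in the Q&A contrasts a binary rule (a base counts as covered once detection power reaches about 80%) with weighting each base by its probability of detecting a mutation. The Python sketch below illustrates that contrast with a deliberately simple binomial power model. It is an assumption-laden illustration, not the group's method: the function names, the default allele fraction of 0.3, and the requirement of three supporting reads are hypothetical, and sequencing error and coverage of the matched normal are ignored.

```python
import math


def detection_power(depth, allele_fraction=0.3, min_alt_reads=3):
    """Probability of seeing at least min_alt_reads mutant reads at a site.

    Models the mutant read count as Binomial(depth, allele_fraction).
    """
    p_miss = 0.0
    # Sum P(X = k) for the read counts that would fail to call the mutation.
    for k in range(min(min_alt_reads, depth + 1)):
        p_miss += (math.comb(depth, k)
                   * allele_fraction ** k
                   * (1.0 - allele_fraction) ** (depth - k))
    return max(0.0, 1.0 - p_miss)


def binary_coverage(depths, power_cutoff=0.8, **kwargs):
    """Count bases whose detection power clears a fixed cutoff."""
    return sum(1 for d in depths if detection_power(d, **kwargs) >= power_cutoff)


def effective_coverage(depths, **kwargs):
    """Sum of per-base detection probabilities (a fractional base count)."""
    return sum(detection_power(d, **kwargs) for d in depths)


# Hypothetical per-base depths for one gene in one patient.
depths = [0, 5, 10, 20, 60, 150]
n_binary = binary_coverage(depths)
n_effective = effective_coverage(depths)
```

Either quantity can serve as the denominator when converting mutation counts to rates. The fractional version gives partial credit to thinly covered bases instead of treating every base as fully covered or not covered at all.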