 Okay. Thank you. Thanks to the organizers for the opportunity to present our work. So we heard this morning that one of the challenges facing TCGA and other cancer genomics projects is distinguishing functional driver mutations from random passenger mutations that are also measured in the genome when we do high throughput genome sequencing. And of course, functional mutation is really a biological phenomenon and ultimately must be determined by experiment. So we can prioritize these experiments by looking for recurrent mutations, mutations that are present in more of the patients than we would expect by some model of chance or perhaps zoom out a little bit and look at genes that are mutated more than we would expect by chance. And so for the purposes of this talk, we'll sort of look at mutations at the level of genes and we'll assume that we've sort of carefully annotated the genes in our experiment and so that we have a mutation matrix where for each patient in each gene, I've indicated whether or not there's a somatic mutation that's present. And so then the question is, are there genes that are mutated more than we expect by chance? And so the standard way to do this is with a single gene test where we look at each column of the matrix and we ask, you know, do we see more ones in that column than we would expect? And then since we're performing all of these independent tests, we then have to go do some multiple hypothesis correction to correct for the fact that we did thousands of statistical tests. So this was the approach, simplified, but that was used in the first two TCGA papers and here are the results from these two papers with the list of significantly mutated genes. And as was described also this morning, these lists are pretty short. We don't find that many significantly mutated genes at a level of statistical confidence that we're happy with. And there's various reasons for this. 
Our statistical model may not be good, and certainly that's part of the reason; our passenger mutation rate may not be quite right. The data itself, of course, has false positives and false negatives in it. Maybe we don't have enough samples. I mean, 91 samples is a pretty small number to try to find recurrently mutated genes, especially given the mutational heterogeneity that's present in tumors. But there's also a biological reason, and that's that genes don't act on their own but act together in pathways or networks. And so cancer is sometimes called a disease of pathways. This was also appreciated in the original TCGA paper. In addition to finding single significantly mutated genes, there were tests of pathways that were done. And the standard approach here is to look at known pathways. So this figure shows a network, and within that network of genes, individual pathways were extracted, and you can then do a variant of the single-gene test, just looking simultaneously at multiple columns in this matrix. And again, the question is now: is this pathway, this group of genes, mutated more than we would expect by chance? And here we see the P53 pathway. This is sort of a schematic, but in the data, the P53 pathway was mutated in 87% of the patients, which was much higher than any single gene in that pathway.

Okay, but of course this approach has some limitations. We're only looking at the pathways that we know. We're ignoring the connections between the genes, the topology. We're just viewing these as columns of the matrix. And moreover, this idea that pathways are their own discrete units is somewhat of a simplification. This figure even shows that the pathways themselves are interconnected; this is sometimes called crosstalk. So what we've been doing is asking the question: we have a lot of sequenced genomes, and we're getting more and more.
So could we start to develop methods where we can look at combinations of mutated genes that are less biased by prior knowledge of pathways? So in going from known pathways, on the left of the picture, could we instead look at all combinations of genes and use no prior information? Of course, as we reduce the amount of prior knowledge about which combinations of genes we're going to look at, we increase the number of hypotheses that we have to test. So for example, if we wanted to test all possible groups of fewer than six genes, that's 10 to the 22nd hypotheses, and we would need a lot of samples in order to obtain any statistical significance. So we might look for some intermediate. Maybe we'd want to restrict our groups of genes to those that are on an interaction network, maybe a network constructed from a superposition of pathways. Even here, if we tried to exhaustively test every part of this interaction network, say by looking for subnetworks, we don't really reduce the number of hypotheses that much. So what we've been doing is developing algorithms that are in between, at different points of the spectrum. And I'm going to tell you very quickly about two of them. The first is called HotNet, which uses the interaction network and tries to pull out subnetworks of the interaction network that are mutated more than we expect. And the other is called Dendrix, which gets closer to this idea of looking at all combinations of genes. For both of these algorithms, we compute p-values in a robust manner, and I won't really get to describe the statistics in this talk.

So HotNet. The approach here is that we have a predefined interaction network. This could be some high-quality network where we take the textbook diagrams and pathways and superimpose them. It could be some noisy thing that includes whatever high-throughput data you want.
You take your mutation matrix, and the model is to find connected subnetworks that are mutated in more patients than you would expect. In doing so, now that we've moved to the network, there are really two considerations. It's not just the frequency of mutations that's going to determine which subnetworks we pull out. It's also the topology of the network. So to illustrate this briefly, on the left you see that we might, for example, have two genes that are mutated at moderate frequency that are connected in the network via a single path. And that's somehow more surprising to us. That's more of a clustering of mutations than if we had the same two genes at the same frequencies but connected through some gene of very high degree, a gene that was connected to many others. And this problem of having these different topologies actually comes up a lot in looking at cancer genes, because many cancer genes have very high degree in these interaction networks. So we need to account for both mutation frequency and topology. And the model we use for this is we actually think of mutations as sources of heat on the graph. So what we do is we heat up each gene, which is a node in the network, in proportion to its mutation frequency. And then we let that heat diffuse over the edges of the network. What this does is encode both the mutation frequency and the topology of the network in a single model. And so now what we have is a distribution of heat on the graph. We can then break up the graph and find significantly hot subnetworks. And this "significantly hot" requires a somewhat subtle statistical test, which again I won't describe, and I refer you to the paper for more details there. But we can get robust p-values and FDRs for testing this hypothesis of significantly hot. So we worked with the ovarian analysis group to apply HotNet to the ovarian data. And this was published earlier this year as part of the paper.
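To make the heat-diffusion idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the HotNet implementation: the function name `diffuse_heat`, the explicit time-stepping scheme, the step size, and both toy graphs are hypothetical choices made for this example. It puts heat on one mutated gene and compares how much reaches a second gene when the two are joined through a low-degree intermediate versus through a high-degree hub.

```python
def diffuse_heat(edges, heat, dt=0.05, steps=40):
    """Spread heat over an undirected graph with simple explicit time steps.

    edges: iterable of (u, v) pairs.
    heat:  dict mapping node -> initial heat (e.g. mutation frequency);
           nodes missing from the dict start at 0.
    dt, steps: hypothetical step size and step count; dt times the largest
           node degree should not exceed 1 for the update to stay stable.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    h = {node: float(heat.get(node, 0.0)) for node in neighbors}
    for _ in range(steps):
        # Net flow into each node: heat moves from hotter to cooler neighbors.
        flow = {v: sum(h[u] - h[v] for u in neighbors[v]) for v in h}
        h = {v: h[v] + dt * flow[v] for v in h}
    return h


# Toy graph 1: A and B joined through a low-degree gene M.
path_edges = [("A", "M"), ("M", "B")]
# Toy graph 2: A and B joined through a hub H that has 8 other neighbors.
hub_edges = [("A", "H"), ("H", "B")] + [("H", f"X{i}") for i in range(8)]

start = {"A": 1.0}  # one unit of heat on the mutated gene A
path_heat = diffuse_heat(path_edges, start)
hub_heat = diffuse_heat(hub_edges, start)

print("heat reaching B via path:", round(path_heat["B"], 3))
print("heat reaching B via hub: ", round(hub_heat["B"], 3))
```

Under these assumptions less heat reaches `B` through the hub, because the hub spreads its heat over many neighbors. That matches the intuition from the talk that two mutated genes linked through a high-degree gene are a less surprising cluster than two linked by a single short path.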
And running HotNet on the whole-exome mutation data and copy number data together, we found 27 subnetworks of the HPRD network, a network that contained 37,000 interactions: 27 subnetworks with at least seven genes, with a reasonably good p-value. And so here's a picture of them; this shows you the subnetworks, each in a different color. Some of them are connected to each other, some of them are more isolated in the network. And so what do you do with such a picture? Well, the first thing you do is go see if you've found anything that was already known. And one thing that fell out immediately when we looked at intersections with known pathways was that one of our subnetworks overlapped significantly with the Notch signaling pathway. And so here was the picture of Notch that appeared in the paper. And what you can see is that each of the genes in this pathway is not mutated at very high frequency. I guess NOTCH3 is mutated at moderate frequency. The others are mutated at fairly low frequency. So it's both the frequency and the interactions that are driving this prediction. In total, 12 of the 27 overlapped either a known pathway or a protein complex. And the others are novel predictions. Some look interesting. They're all published in the paper, and I refer you to the appropriate supplement.

So having looked at the interaction network, we decided to be a little bolder and see if we could just get rid of the interaction network entirely. Because interaction networks are noisy. And sometimes when I give a talk, people complain about them. So I said, well, let's get rid of it all. Get rid of the whole thing. Well, I said that there are too many hypotheses to test if we wanted to look at all combinations of genes. So what we do is impose some constraints on the sets of genes we're going to consider. And these constraints are driven by a couple of assumptions that are supported in the literature.
And one we've heard about already a few times in this workshop. So under the assumption that driver mutations are relatively rare compared to the passenger mutations, if there's a pathway that is going to be mutated in a patient in order for that patient to have cancer, then there's probably only one driver mutation that's necessary. Now, here we have to be careful what we mean by pathway. Some pathways are large, with hundreds of genes. We mean something maybe more targeted, a pathway as shown here. And there's various evidence of this. What this imposes then is a mutual exclusivity between the mutations. The black bars here indicate mutually exclusive mutations, and the red indicate co-occurring mutations. And you can see across this pathway there's lots of exclusivity and very few patients that have more than one mutation in this pathway. The second assumption is that if the pathway, the set of genes, is important, many of the patients will have a mutation in that pathway. So the pathway should have high coverage. There should be lots of patients with a driver mutation in that pathway.

So with these two assumptions, we then introduce an algorithm we call De novo Driver Exclusivity, or Dendrix for short. Just directly from the mutation matrix, we try to find sets of genes, columns. Actually, they should be rows; I've transposed the matrix to match the figure. So rows of the matrix. Here it's a contiguous set of rows, but that doesn't have to be the case. And we find them to meet these two properties. This turns out to be a computationally difficult problem. So we have a couple of algorithms for doing this, and we have some theoretical results that show that they perform well. And more recently, since the publication at the bottom of the slide, we've extended this with an alternative scoring metric. So I'm going to show you just a couple of examples of running these algorithms on some new data sets. The first is AML.
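The two properties just described, high coverage and approximate mutual exclusivity, can be folded into a single score. The sketch below is one simple way to do that, assuming a score of the form "patients covered minus overlapping mutations". The talk does not spell out the scoring function, so treat the function names, the score, and the toy data as hypothetical illustrations rather than the Dendrix implementation. The search is brute force, which only works on toy inputs; as noted above, the real problem is computationally difficult.

```python
from itertools import combinations


def coverage_minus_overlap(mutations, genes):
    """Score a gene set: reward covered patients, penalize co-occurrence.

    mutations: dict mapping gene -> set of patient ids mutated in that gene.
    genes:     iterable of gene names forming the candidate set.
    """
    genes = list(genes)
    covered = set().union(*(mutations[g] for g in genes))
    total_mutations = sum(len(mutations[g]) for g in genes)
    # Each patient hit by several genes in the set adds to the overlap.
    overlap = total_mutations - len(covered)
    return len(covered) - overlap


def best_gene_set(mutations, k):
    """Exhaustively find the size-k gene set with the highest score."""
    return max(
        combinations(sorted(mutations), k),
        key=lambda gene_set: coverage_minus_overlap(mutations, gene_set),
    )


# Made-up toy data: 5 genes, 6 patients (ids 1..6).
toy = {
    "G1": {1, 2, 3},
    "G2": {4, 5},
    "G3": {6},
    "G4": {1, 2, 4, 6},  # frequently mutated, but overlaps G1, G2 and G3
    "G5": {3},
}

top = best_gene_set(toy, 3)
print(top, coverage_minus_overlap(toy, top))
```

In this toy example the exclusive trio `G1`, `G2`, `G3` covers all six patients with no patient mutated twice, so it outscores any set containing the frequently mutated but overlapping `G4`.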
And so if we run Dendrix, we get several approximately exclusive sets, each with reasonably good statistical support. And with HotNet, we get a few subnetworks. And before I show you a few of these, I was instructed to say that this is unvalidated mutation data. So anything here is preliminary and subject to change. So running the mutual exclusivity, we get two sets, the two top-scoring sets. Actually, both have six genes in common. Four of those six are fusion genes. And this is a highly exclusive set. These fusion genes are mostly subtype-specific, so in some sense what we're doing is picking out the subtypes. And then extending that set of six in two different directions are the two other exclusive sets. I've shown you the schematic of exclusivity, and you can see that they're mostly exclusive, with some overlap. And together, the set of six, the blue at the bottom, covers about 25% of the patients. When you extend them, you get coverage up to 75% of the patients. Now looking at these two sets, there's a question of why they separated. And if you look across these two sets, what you see is that there's lots of co-occurrence between them, so many patients that have a mutation in more than one gene in the set. Moreover, the co-occurrence I'm showing here on the bottom, the dark red now, is co-occurrence across the two sets. The light pink is co-occurrence within the set. And so what you can see is that there's a lot more co-occurrence across the sets than within, as there should be, because these sets are exclusive. But those co-occurrences are actually spread across multiple genes. What we're seeing is an effect of the fact that, because we haven't done pairwise analysis but looked at larger sets, we can actually find these exclusivity and co-occurrence relationships that are not just pairwise but are more complicated, across multiple genes.

For HotNet, here's a view of five subnetworks.
Here are three of them that were enriched for either known complexes, the cohesin complex, the polycomb complex, or the KEGG AML pathway. Oh, it's a nice screensaver. Oh, it's back, okay. And again, the p-values are not particularly small, but we're digging into this large interaction network with a lot of noise. So pulling out these complexes is something we're pretty happy with.

Finally, we've done a quick run of these on the breast cancer data set. Again, several exclusive sets, all with pretty good p-values, and several subnetworks. Some of these appeared on a poster earlier today, which is probably still hanging up. And I'll just show you two of the subnetworks. The first one on the bottom, you probably can't read the gene names there, but the first two rows there are P53 and PIK3CA, which are fairly exclusive, but really they're driving the set because they have very high mutation frequencies. So if we remove P53 and PIK3CA, then the set on the bottom actually contains a really nice exclusive set, genes of moderate frequencies, including GATA3 and CTCF, and even I can't read the slide from here. So again, they're on a poster and you'll see them.

So that's the summary: trying to take a view of the data that's less biased by known biology and see if the data can just lead us to the interesting sets directly. There are the two algorithms that do that, and what we're working on is to bring in more data types, methylation's underway, gene expression will come, and then to do a little more pre- and post-processing of what we put into the algorithm and what comes out. We've been very naive about it. We just throw in the mutation data itself. We don't filter by subtype. We don't post-process. So all these things are add-ons that we hope will get us even more power.

So the acknowledgments: my colleagues Fabio Vandin and Eli Upfal at Brown worked together with me to develop these algorithms. Hsin-Ta Wu has done some of the analysis.
The Genome Institute at WashU for the AML data, Andy Mungall and others at BC Cancer Agency for the fusion gene data from the RNA-Seq, and the funding agencies. Thanks. Questions?

Have you seen any exceptions to your exclusivity assumption?

I mean, I'm not sure what an exception would mean. What do you mean by an exception? I mean, there's lots of gene sets that are not exclusive. There are gene sets that are exclusive, but when you look on the network, they're not interacting in any way that we can see. So that, in a sense, violates the idea that they're within a pathway, at least a known pathway.

That's probably a very naive question, but there must be things driving P53. I mean, mutations in P53. The driver mutations, in many cases, are gonna be before that, right? So how do you pick out those from this analysis? Maybe I missed the whole thing.

I mean, there's no temporal information here, right? We just get the mutations from the patients when they were sequenced. So we can't distinguish things that happened before P53 or after. Is that...

Yeah, I mean, because probably, right? If you got P53 in 80%, then...

Sure, yeah.

There are a lot of other things upstream that are doing that, so...

Yeah, and...

That's really what we want, isn't it?

Right, and that's why, in that last example on breast, P53 is a driver mutation, we know it, so we pull it out of the data and then we can start to get these more subtle signals. And I think doing that in a more intelligent way will allow us to pull out some of the things that are obscured by the high-frequency genes.

Next question.

Yeah, so, a good talk. So in the HotNet part, you use HPRD, which primarily contains protein-protein interactions.

Yes.

But we know a lot of interactions are genetic interactions rather than protein-protein interactions.

Sure.
So, I mean, did you ever try to combine different types of interaction in the analysis, rather than only using a protein-protein interaction network?

We have not. We've used various protein-protein interaction networks. We've used KEGG as a network. We've used iRefIndex, which is sort of a mishmash of a few with some curation. We haven't used a genetic interaction network. If there's a good one for human, we'd love to try it, but we haven't found a good source of that.

Then how much does your result depend on the protein-protein network, which we know is very biased and quite noisy?

Yes.

How does that really affect your result?

We've assessed this by running it on these different networks, and the results change a little bit, but they didn't change dramatically. Some networks seemingly give nicer results than others, at least in terms of the known biology, but yes, I mean, the network we're taking as information.

Other questions? Okay, thank you. Thank you.