All right. Well, thank you. Thanks for the opportunity to present our work today. One of the first analysis challenges when you sequence a new cohort of cancer patients is to identify significantly mutated genes: genes that are mutated in more patients than you would expect by chance. This is not such an easy task. There are many subtleties. A number of individuals in this room have developed very sophisticated methods for doing this statistical test appropriately on a gene-by-gene basis and have applied these tests to TCGA and other large data sets, thereby identifying a number of significantly mutated genes. I've put up a list from a few of the more recent TCGA studies. While these genes seem to be enriched for cancer-causing genes, they do not seem to be the complete list of cancer genes in these samples. These lists are relatively short. What I'm not showing you is the long tail of genes that are mutated at modest or rare frequencies but don't pass the threshold of statistical significance that one would use. Now, there are many reasons for this mutational heterogeneity, but one of them is that genes don't act on their own, but rather act together in pathways. And when you look at the mutations in known pathways, you do indeed see such clustering of mutations. Again, I'm showing you some examples from some of the TCGA studies, which show that we can recover some of the cancer-causing genes by knowing something about the biology and looking at how they interact with other genes. But as the data sets get larger, my group has been interested in developing ideas that allow us to drill a little deeper into the data: not to restrict ourselves to only known pathways, but rather to reduce the amount of prior knowledge we put into the analysis, thereby possibly identifying novel pathways or crosstalk between pathways, or perhaps using something about the topology of these interactions.
Of course, as we reduce the amount of prior knowledge and perhaps go to a noisy whole-genome interaction network, or even try to look at all combinations of genes and whether or not they're significantly mutated, we're going to face the statistical problem of increasing the number of hypotheses. So it will be very hard to obtain statistical significance when we couldn't even obtain it for a single gene. So we've developed a couple of algorithms that allow us to use some additional information and some tricks to get this statistical robustness. The first looks at subnetworks of an interaction network, an algorithm called HotNet; the alternative goes deeper into this no-prior-knowledge setting but uses the notion of mutual exclusivity. I'm just going to tell you about HotNet today, and briefly recap what the goal is. We assume that we've taken our mutation data and collapsed it to the level of genes. We have a whole-genome interaction network in a very simplistic format, where we've got genes or proteins, and an edge between two genes if there's some known interaction between them. We could of course use varying amounts of prior knowledge in constructing such a network; I'll tell you the ones we used in the analysis in a couple of minutes. The goal is then to identify connected subnetworks that are mutated in more patients than we expect. You can see how this generalizes the idea of looking at known pathways: if we think of this network as, say, the superposition of all known pathways, then we're simultaneously trying to pick apart these pathways while allowing for crosstalk or interactions between them, or perhaps novel subnetworks that we don't know a priori. In developing such a method, there are really two components, and the interaction between them needs to be dealt with. One is that genes can be significant on their own.
They have a mutation frequency, or perhaps some score, maybe some significance score that you use. The other is the topology of the interactions between them. So we'd like to identify, for example, cases where the individual genes might not be significant on their own: they're mutated at moderate or rare frequencies, but they're very highly connected. And we'd like to distinguish those from cases where we do have genes that are connected to each other, maybe even highly mutated or highly significant, but the connection is through some high-degree node, so it isn't so surprising that we would find such a connection. Biological networks have this sort of uneven topology. So we need to simultaneously account for both the mutation frequency or score and the network topology. One way we can model this has a nice physical interpretation: think of the mutations as sources of heat on the graph. The idea is that we have this interaction network, and we heat up the genes in proportion to their frequency, or their mutation score, or some notion of significance at the gene level. Here you can use a variety of different scores. That encodes the mutation part. Then, to encode the topology, we allow that heat to diffuse over the edges. This encodes the local topology, not just nearest neighbors but neighbors of neighbors, and it does so in a very continuous way. So now we have this distribution of heat on the graph. How do we get the subnetworks? Well, we need to break the graph apart, and the way we break it is by removing cold edges. Then we have to do a statistical test on top of that, which I won't describe but which has been published previously. So that's HotNet. In this study, what we wanted to do was apply HotNet in the pan-cancer setting, that is, to go across multiple cancer types.
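The heat picture described above (heat the genes, let the heat diffuse, remove cold edges, read off what stays connected) can be sketched in a few lines of Python. This is an illustrative sketch only, not the published HotNet procedure: it swaps in a random walk with restart as the diffusion step, and the function names, the `restart` and `delta` parameters, and the toy network are all assumptions made for the example. The statistical test on the resulting subnetworks is not shown.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected interaction network as gene -> set of neighbouring genes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def diffuse_from(adj, source, restart=0.5, iters=100):
    """Spread one unit of heat from `source` with a random walk with restart.

    Assumption: this stands in for the heat diffusion step in the talk.
    """
    p = {g: 0.0 for g in adj}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {g: 0.0 for g in adj}
        nxt[source] = restart  # a fixed fraction of heat returns to the source
        for g, mass in p.items():
            if mass:
                share = (1.0 - restart) * mass / len(adj[g])
                for nb in adj[g]:
                    nxt[nb] += share
        p = nxt
    return p


def hot_subnetworks(edges, heat, delta, min_size=3):
    """Remove cold edges and return the remaining connected subnetworks."""
    adj = build_adjacency(edges)
    # exchanged[s][t]: heat starting at s (scaled by its score) that ends at t
    exchanged = {}
    for s in adj:
        h = heat.get(s, 0.0)
        exchanged[s] = {t: h * a for t, a in diffuse_from(adj, s).items()}
    # Keep an edge only if at least `delta` heat flows in both directions.
    hot = defaultdict(set)
    for u, v in edges:
        if min(exchanged[u][v], exchanged[v][u]) >= delta:
            hot[u].add(v)
            hot[v].add(u)
    hot = dict(hot)
    # Connected components of the graph that is left.
    seen, components = set(), []
    for start in hot:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            g = stack.pop()
            if g not in comp:
                comp.add(g)
                stack.extend(hot[g] - comp)
        seen |= comp
        if len(comp) >= min_size:
            components.append(comp)
    return components


# Toy network (hypothetical): a mutated triangle A-B-C attached to unmutated D-E.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
heat = {"A": 0.1, "B": 0.1, "C": 0.1}  # e.g. mutation frequencies
print(hot_subnetworks(edges, heat, delta=0.005))
```

In this toy case the edge from C to D is cold because D carries no heat of its own, so the triangle is split off from the unmutated genes. Requiring heat in both directions is one simple way to avoid calling an edge hot just because one endpoint is heavily mutated.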
The idea was that in doing so we would like to find perhaps universal subnetworks that are mutated across all cancer types, maybe with roughly equal frequency; or perhaps ones that are mutated in only a subset of cancer types; or maybe even cases where the subnetwork itself is mutated across all cancer types but the individual genes within that subnetwork are mutated with some cancer-type-specific biases. We didn't know which, if any, of these things would arise, so we decided to go and look. The story I'd like to tell you for the next six minutes or so is that we took a very noisy interaction network, took all the mutation and copy number data, threw it into HotNet, and magically got the right subnetworks. Unfortunately, although the algorithm is nice, it doesn't yet perform magic. We can't take all the mutation data and copy number data; there's just too much noise there. So we have to do a little bit of filtering, and we try to do as little as possible. For single nucleotide variants, we don't have to restrict to just significantly mutated genes; we can go really far into this distribution, if you will. Here we used a cutoff of only 0.8% mutation frequency, and that removed some really low frequency mutations. Copy number aberrations, of course, are much, much trickier because they tend to be large: they include many genes, and identifying the target gene is a really difficult challenge. We ended up using GISTIC peaks that were predicted for the individual cancer types and then merged those together in a pan-cancer analysis. For the interaction network, though, we are able to use a very noisy network, as I'll show you. So with some modest filtering on the mutations, here's what we obtained. In total, 1,984 samples across nine different cancer types; here, colorectal I've combined as one, and breast cancer I've split into four expression subtypes.
So there's a number of samples of each type, and 765 genes after we remove these low-frequency mutated genes. You'll see some of the usual suspects have either high single nucleotide variant frequency or high copy number frequency there. But then there's also this big smear of genes that are at very low frequency. The first interaction network is a very noisy one. It's iRefIndex: 200,000 interactions over about 14,000 proteins. It incorporates a lot of different interaction databases, so it has a lot of false positives and presumably also a lot of false negatives. Note that there's no temporal context or subtype context here, so this really is a very noisy thing. But we were able to pull out 11 subnetworks with at least three genes. The p-value is no larger than 10 to the minus two; we haven't finished all the permutation tests, and it's presumably much lower than that. So here's a list of these subnetworks. I'm going to show you a few in the next few slides, but not surprisingly, there are some of the usual suspects: again, p53 signaling, PIK3CA, EGFR and other receptors, RB1. This is a DNA damage subnetwork that I'll show you. PTEN; cohesin, which I'll show you in a second; as well as a few others. Just to show you some of the data, this is a very dense plot. Here are the genes and their frequencies across the different subtypes, color coded by subtype, as well as whether it's a single nucleotide variant or an amplification in that sample. This is PIK3CA and RAS. What we also do is compute whether there's enrichment for individual cancer types in this subnetwork. These p-values you should take with a grain of salt, because we have thrown out some of the mutations when we filtered. But they do show at least some trend of cancer-specific biases in these subnetworks. And then we also test, conditional on the number of mutations in each cancer type, whether there are any gene-specific biases.
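One common way to ask whether a subnetwork is enriched for a particular cancer type is an upper-tail hypergeometric test on sample counts. The talk does not say which test was actually used, so the sketch below is an assumption, and the function name and the example counts (other than the 1,984 total samples) are made up for illustration.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: total number of samples
    K: samples belonging to the cancer type of interest
    n: samples with a mutation in the subnetwork
    k: samples counted in n that belong to the cancer type
    """
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Hypothetical counts: 1,984 samples in total, 200 of one cancer type,
# 150 samples mutated in the subnetwork, 40 of them from that cancer type.
print(enrichment_p_value(k=40, n=150, K=200, N=1984))
```

Because the mutation data were filtered before the analysis, a p-value computed this way inherits the same caveat as in the talk: it indicates a trend rather than a calibrated significance level.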
Here you see that NRAS is preferentially mutated in AML, while KRAS is preferentially mutated in colorectal, and PIK3CA in the luminal A subtype of breast cancer. I also wanted to note that there is a strong pattern of mutual exclusivity, but co-occurrence is also allowed here. In fact, although PIK3CA and the RAS genes are often thought to be mutually exclusive, there is some evidence that they do co-occur in particular samples, and we see that here in endometrial and colorectal. Here's the DNA repair subnetwork, the same type of story, centered on BRCA1. These two genes here are actually Fanconi anemia genes, which are genes that confer a genetic predisposition to this DNA instability disease. So it's nice to find them here: they're very rarely mutated, would not be found on their own, and seem to be mutated across cancer types, although there is some enrichment for breast and ovarian cancer in this subnetwork. Here's cohesin, which falls out. In fact, the subnetwork we get is exactly the cohesin genes and no more, five of them. This is the canonical cohesin complex involved in sister chromatid cohesion, which more recently also seems to be implicated in more general gene expression roles. We see, again, some gene-specific biases in AML and a little bit in GBM for individual parts of the cohesin complex, and perhaps some cancer-specific biases in different breast cancer subtypes for this complex. Another one, really fast: these polycomb group genes. This subnetwork has a few strange things that are driven by copy number, but what we do see here are these two components of the PR-DUB complex, BAP1 and ASXL1, with a little bit of gene-specific tendency for mutation in ASXL1. BAP1, of course, was recently shown to be an important gene in renal cell carcinoma, so not surprisingly, there is a kidney cancer enrichment for this subnetwork. We then went and did another run on an additional interaction network.
This one is a little more focused and tends to be literature-curated interactions, so there's a smaller number of interactions here, only about 40,000. We pull out a larger number of subnetworks, 20 with at least three genes. Again, a good p-value; here's the list of those. Many of them are the same or similar: not exactly the same, but a similar type of story. Here's one with ARID1A and PBRM1 that I didn't know before. A version of this is actually currently in KIRC, and it was identified mostly in kidney cancer. Notch was in the OV manuscript, identified by HotNet. And here's a nice one that's new: SLIT-ROBO signaling. Here we see both the SLIT ligand and the ROBO receptor, and this SRGAP1, the Rho GTPase-activating protein. So we're getting the complete part of this signaling network. There's not too much enrichment here by cancer type; it seems to be mutated across different types of cancers, except for AML, where there's hardly anything in SLIT-ROBO. Interestingly, we were doing this analysis a couple of months ago, without knowledge of a paper that of course many of you now know; some in the audience contributed to it. It came out in Nature just two weeks ago and highlights SLIT-ROBO signaling as important in pancreatic cancer. Again, some of these same genes, although that analysis also includes ROBO3. So we're pulling this out of the TCGA data types purely computationally, not in any directed fashion, and we view this as in some sense a validation of our computational predictions. With that, I'll conclude and just say that, in summary, we took HotNet, which had previously been applied to individual cancer types, and applied it in this pan-cancer setting. We uncovered some evidence for each of these types of things that could happen: a little bit of gene specificity, nothing of course as dramatic as a gene being exclusive to a cancer type, but perhaps some such tendencies.
Of course, this is only a first analysis. The mutation data needs a lot more QC. We're still struggling a little bit with copy number aberrations and how to handle those, and with getting better background models for the single nucleotide variants. We haven't done anything with expression, and of course it's important to incorporate expression; methylation we'd love to incorporate as well. But each of these data types itself requires some analysis; you can't just throw it all into the algorithm and pull out the magic. Dendrix, for mutual exclusivity, I didn't describe, so that will be something to come. I'll conclude with the acknowledgments. In my group, Fabio Vandin and Eli Upfal, my colleagues, contributed a lot to the algorithmic developments, and my students, Max Leiserson and Hsin-Ta Wu, have been doing a lot of the analysis. For the AML data, we've been working a lot with the AML group at WashU; I didn't describe the AML results in particular, but that curated data went into this analysis. The algorithms are available. We encourage you to download and use them if you'd like. We'd also be happy to collaborate and work with you on tuning them to your analysis. And with that, I'll conclude. Thank you.

Thank you, Ben. Questions? Matthew?

Sure. So, Ben, really fascinating talk. I think one of the really interesting things is this generation of new candidate pathways, which looks like it's corresponding in some cases to other recent studies. And I just wonder if you have a vision of how to move from definition to validation of these pathways through orthogonal data sets, which would both bring more power to the approach and also allow you to refine it.

In brief, I get asked that question, and I don't have a good answer. I'm not an experimentalist. We'd love to validate some of our predictions. We've talked to individuals about ways we might do that.
It's a tricky business because there's all sorts of crosstalk between these, and if we pick out one component, designing the experiment that would say it's somehow the pathway or the subnetwork that's important, as opposed to the individual genes, is a challenge. So you, I'm sure, could give a better answer than I could for how to do this, and I'd love to talk to you about it more offline. I wish I had a better answer.

One more.

Yeah, computationally speaking, I think this is a very nice model. It's very similar to when people analyze gene regulatory networks across different species, and also protein-protein interaction networks. Regarding the model, on the graph side: since this will be handling very large graphs, I think specific graph properties, like those related to small degree, or even a trivial approach, could be applied to improve the efficiency.

Sure. So, yes, the graph is large, but it's really not that large when you talk to people that do internet graphs and so on. The heat diffusion encodes a lot of the types of things you're talking about in terms of small degree, small worlds, power law, all that type of business. Many of the approaches that look at those types of graph topologies can be modeled in terms of heat diffusion or a random walk. They're really just different ways of looking at the same types of things.

Yeah, especially for the small subnetworks, like a trivial approach.

No, I mean, there is some issue of scaling that we're running into as the data sets grow: the difference between the highly mutated genes like p53 and the low-mutated genes gets larger and larger, and there do become some issues in how you deal with that difference in scale.
And that's something that's come out of this pan-cancer analysis that we hadn't really had to address before, when we had 200 samples.

Yeah, thank you.

Ben, thank you. I think, unfortunately, we have to move on.