I want to thank the organizers for inviting me. So I'm going to talk about our method called PARADIGM, which integrates multiple types of data on patient samples to infer what's going on in these cancers.

So as folks know, in TCGA we generate lots and lots of data, and it's often referred to as a flood. The Broad calls their system Firehose for an appropriate reason. And my point is that when you participate in these projects, you often want to do lots of different types of comparisons, from comparing expression to methylation to figure out why something is not expressed, to looking at copy number, expression, and methylation all together. This quickly gets out of control. You have lots of combinations of things you want to look at, and it can be overwhelming.

More importantly, when you're thinking about a gene and trying to figure out what's going on with that gene, is it active, is it not active? If you've got all these different pieces of data telling you different things, you feel like you're at this stoplight, and you don't know whether to go or not. Well, at least that's how I feel, and this is often how many of us feel. This is what it makes you want to do. This is your brain on all these types of data.

So our particular approach is to say, let's use a knowledge-based approach. And the analogy I like to use is that you're kind of like a detective or a car mechanic in this example, and each patient is a different accident, let's say. Something different went wrong, and some things are more serious than others here. And if you tried to do data mining on these car wrecks, if somebody handed you a ream of data, how fast the car was going, the direction, what people were saying, some of it relevant, some of it not, you're going to be better off if you use knowledge about how the system works. And I like Car Talk, so I'm showing Click and Clack here, right?
People call them because they know a lot about cars and can figure out and diagnose the problem. And this cartoon shows a radiator running off and the mechanics looking in the engine and saying, I know what the problem with the car is: you don't have a radiator. Now you laugh, but with this data set, it took a little bit of knowledge to know what was missing.

So in cellular systems, obviously we have put together at least some of the circuitry and the machines inside cells, and so we should use those. And I'm going to show you a system that defines a computational model to represent these types of systems. We benefit from all these efforts out there, and there are many I didn't list; that's the ellipsis at the end. We've drawn from Reactome, KEGG, BioCarta, NCI PID, many different institutions. And our favorite, of course, combines all of them: Pathway Commons from Memorial Sloan Kettering. And so we try to suck in all that data to learn something about what's going on in a cell.

So to motivate why we want to do this beyond just data fusion, think of a simple example. We've got a transcription factor and you're looking at the expression of the transcription factor. And there are, let's say, three different transcription factors shown here. You've got two that have high expression, shown in red, and one that has lower expression. And we know that expression isn't everything. And so it's almost a teleological argument. But how do you figure out whether something's working or not? How would you figure out that an enzyme's working, even if you had magic goggles and you could look inside a cell and see that it's bumping around and moving in the cell and chewing things up? You're going to look at its secondary effects. You're going to look: did it actually metabolize substrate? Or, if it's a kinase, did it actually phosphorylate a target? And for a transcription factor, is it turning on its targets, right?
And so that secondary evidence tells you something about the activity of the transcription factor. So in this case, you assume or you infer that the transcription factor's on, and that might confirm your expression evidence. In another case, you might see that, oh, well, the targets aren't doing anything downstream of the factor. And in this case, you would think it's off. Either post-transcriptionally or even post-translationally, we didn't activate this protein. Or it's not localizing correctly. Or there's a mutation blocking its function. Or its co-activators, right, aren't around. And on the reverse, you could have a low level of expression of a factor, and yet it's still enough to have potent transcriptional activity. So the argument here is that you want to look around the neighborhood to figure out what's going on in these things.

So that's one piece of the puzzle, to look at neighbors. And the other idea is this: in the previous example, we inferred that the factor was on because of its downstream targets. But suppose I ask Gaddy to give me GISTIC plots now. This is a different type of data, copy number data. And, just serendipitously, all the targets are amplified now. And so I could explain away that overexpression via amplification. And so I'm less likely now to think that the factor's on. Maybe I still think it's on at a level over my prior expectation, but it's not as high anymore, because I have another piece to explain the up-regulation of those targets via a cis-regulation type of machinery.

So to model those two pieces of information, we're also standing on the shoulders of giants here. There's been lots of development in the 80s and 90s and even currently, from seminal work in the field by Judea Pearl and Heckerman in the early 90s, and more recently by Daphne Koller, Nir Friedman, and Eran Segal. There's lots of people in this list.
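
The explaining-away argument in the transcription factor example can be sketched as a three-variable Bayesian network. This is only an illustrative toy, not the PARADIGM model: the variable names, the priors, and the noisy-OR parameters below are all made-up assumptions, chosen just to show that observing amplification lowers the belief that the factor is active while leaving it above the prior.

```python
from itertools import product

# Hypothetical priors and noisy-OR parameters (illustrative values only).
P_TF_ACTIVE = 0.3   # prior belief that the transcription factor is active
P_AMPLIFIED = 0.2   # prior belief that the targets are amplified
LEAK = 0.05         # chance the targets are high with no known cause
TF_EFFECT = 0.8     # chance an active factor alone drives targets high
AMP_EFFECT = 0.7    # chance amplification alone drives targets high


def p_targets_high(tf_active, amplified):
    """Noisy-OR: targets stay low only if every present cause fails."""
    p_low = (1 - LEAK) * (1 - TF_EFFECT) ** tf_active * (1 - AMP_EFFECT) ** amplified
    return 1 - p_low


def joint(tf_active, amplified, targets_high):
    """Joint probability of one full assignment of the three variables."""
    p = P_TF_ACTIVE if tf_active else 1 - P_TF_ACTIVE
    p *= P_AMPLIFIED if amplified else 1 - P_AMPLIFIED
    p_high = p_targets_high(tf_active, amplified)
    p *= p_high if targets_high else 1 - p_high
    return p


def posterior_tf_active(**evidence):
    """P(tf_active = 1 | evidence), by enumerating all eight states."""
    numerator = denominator = 0.0
    for tf, amp, high in product((0, 1), repeat=3):
        state = {"tf_active": tf, "amplified": amp, "targets_high": high}
        if any(state[name] != value for name, value in evidence.items()):
            continue  # skip states that contradict the evidence
        p = joint(tf, amp, high)
        denominator += p
        numerator += p * tf
    return numerator / denominator


if __name__ == "__main__":
    print("prior:", posterior_tf_active())
    print("targets high:", posterior_tf_active(targets_high=1))
    print("targets high and amplified:",
          posterior_tf_active(targets_high=1, amplified=1))
```

With these assumed numbers, seeing high target expression raises the belief that the factor is active, and adding the amplification evidence pulls it back down toward, but not below, the prior.
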
And I would recommend folks read this really nice review article by Nir Friedman in Science in 2004. So it's getting dated, but it's still a very nice read. So these Bayesian networks and probabilistic graphical models that they describe give us a very nice way of modeling lots of different data and dependencies. And we can learn something from data where we might have had a knowledge bottleneck before.

And so, just a simple example here. Let's go back to the diagram we had from the nice work from Sloan Kettering and the GBM study. We have an oncogene, MDM2, that is known to inhibit p53. So there are two parts to the system that we model. One has to do with the regulation of MDM2's activity. And the other part has to do with the interaction between it and p53. And just as a quick toy example of the model that we have: when you see our activities for genes, it's actually a little bit more of a rich representation that looks something like the central dogma for a gene. You have a certain number of copies in the genome. You can express it. You can have a certain level of protein and a certain activity in that protein. And all these variables are beliefs that you infer from data. And these little black boxes show you constraints that help you infer those beliefs from data or from other beliefs in the system. And you can propagate this information to infer something about a higher-level thing like apoptosis, or activities for these genes. And that's what we use for our downstream analysis.

And so the big picture looks like this: we take a cohort of patients and various types of data, run them through our pathway models, and then we produce one matrix that we can now do analysis on. So we don't have to think about all these different modalities anymore. We can just think about, is the gene active in this sample? And we provide this new matrix for analysis.

So for the ovarian study, the obvious signature here from the PARADIGM analysis was this FOXM1 signature.
So when we zoomed in on this, pretty much all the patients had an up-regulation of this known mitotic regulator, FOXM1. The slightly more interesting story about it is that it has two isoforms. One part feeds into proliferation. The other part feeds into DNA repair. And there are a lot of disruptions in the genome in all the ovarian samples. They're getting constitutive signaling through things like ATM and ATR, turning on genes like FOXM1 that, if they're not being spliced correctly, are promoting two very different, opposite kinds of things that you want to happen in a cell, both this proliferation switch and this DNA repair switch. So FOXM1 also regulates BRCA2, for example. So, a very interesting story surrounding FOXM1.

If you take the pathway activities and you try to define subtypes for the ovarian samples, then the good news there was that we could actually start seeing a delineation of meaningful subtypes. So this purple cluster shows you patients who have slightly better survival patterns than the rest of the patients.

We've recently worked on the colorectal paper led by Raju Kucherlapati and David Wheeler. And in this case, the story isn't so much FOXM1, but activated MYC throughout. And that's an interesting piece of information, as we've seen in the mutation data and other types of genomic perturbations. Wnt and TGF-beta signaling pathway genes are mutated, and those all impinge on this misregulation of MYC. And that also bears out in the pathway analysis.

And so one other type of analysis that we're doing with the pathways is that we can take two groups of samples or patients and look for markers of one subtype versus another, say, and then home in on subnetworks that are markers for a particular cohort. And we're working on this for the luminal versus basal comparison in the breast cancer model. Just to show you, this is the closest we get to the dreaded hairball, but you can see that there's...
So blue is more active in luminal, and you can see the expected sort of ER signaling pathways. And then you have some other intriguing pathways among the proliferative ones for basal, shown in red, like HIF-1-alpha, for example.

So the way we can use that hairball is to do something like a master regulator analysis, like Andrea Califano likes to do with ARACNe. You can look upstream, in this example, of a basal marker such as FOXM1, like I showed, and by a sort of chain of reasoning up the regulation hierarchy, you see that there's a polo-like kinase. And so the prediction there is that basal cells will be more sensitive to a polo-like kinase inhibitor, and this actually pans out in a cell line model, shown in Joe Gray's lab with his cell lines. So this plot here shows you sensitivity to a polo-like kinase inhibitor for basal and claudin-low cells contrasted against luminal cells. And the reverse is true as well. You can look up a marker for luminal, like a luminal hub, and in this case it was an HDAC, and so the prediction is that luminal cells would be more sensitive to an HDAC inhibitor, and that's what turns out to happen in these cell line models.

And you saw a nice example yesterday from Sam Ng. I'll just go through that really quick, because I wanted to show you one more result that Sam didn't have time to show. So he's developed a clever method where you run our pathway analysis twice: once where you connect the gene to its downstream targets and infer an activity for it, and once where you connect it to its upstream regulators and infer an activity. And then you just look at the difference to get what he calls the discrepancy in the activities that are inferred. And he showed you an example, sort of a positive control, for RB. You can see that in the mutated cases he's seeing a lower discrepancy, which corresponds to a loss-of-function event, and he showed you the pathway surrounding these things.
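
The two-run idea just described can be sketched in a few lines. This is a deliberately simplified stand-in, not Sam Ng's implementation and not the PARADIGM inference engine: `infer_activity` here is just a signed average over neighbor values, the gene neighborhoods and sample values are hypothetical, and the sign convention (downstream-based minus upstream-based, so that a negative score suggests loss of function) is an assumption chosen to match the description of the RB example.

```python
from statistics import mean


def infer_activity(neighbors, data):
    """Stand-in for pathway inference: signed average of neighbor values.

    `neighbors` is a list of (name, sign) pairs, where sign is +1 for an
    activating link and -1 for an inhibiting link.
    """
    return mean(sign * data[name] for name, sign in neighbors)


def discrepancy(upstream, downstream, data):
    """Difference between the two runs for one gene in one sample.

    Assumed convention: activity inferred from downstream targets minus
    activity inferred from upstream regulators.
    """
    from_downstream = infer_activity(downstream, data)
    from_upstream = infer_activity(upstream, data)
    return from_downstream - from_upstream


# Hypothetical neighborhood for a gene of interest.
upstream = [("regulator_1", +1), ("regulator_2", +1)]
downstream = [("target_1", +1), ("target_2", +1), ("target_3", -1)]

# Hypothetical samples: regulators say "on" in both, but in the mutated
# sample the targets behave as if the gene were off.
mutated_sample = {"regulator_1": 1.0, "regulator_2": 0.8,
                  "target_1": -0.9, "target_2": -0.7, "target_3": 0.8}
wildtype_sample = {"regulator_1": 1.0, "regulator_2": 0.8,
                   "target_1": 0.9, "target_2": 1.0, "target_3": -0.8}

if __name__ == "__main__":
    print("mutated:", discrepancy(upstream, downstream, mutated_sample))
    print("wild type:", discrepancy(upstream, downstream, wildtype_sample))
```

In this toy, the mutated sample gets a strongly negative score because its targets disagree with its regulators, while the wild-type sample scores near zero. A real analysis would replace the averaging step with the actual pathway inference.
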
So we've tried this for a few positive controls, and he showed you p53, and you can kind of squint and see that for the cases in red around the circle plot (tick marks are patients, sorry, I didn't mention that) you can see a lower activity being inferred. And so I asked Sam late last night, actually, can you please run this for the lung squamous results? And as you saw before, for NFE2L2, this known oncogene, he's getting a positive discrepancy. And there are 30 mutations in CDKN2A, and consistent with other deletions, homozygous deletions in CDKN2A, he's predicting loss of function. So that's interesting.

But now, the power is this: those are sort of the more frequent events, but you can now actually start drilling into some of these lower-frequency events. And there are some intriguing stories, I think, in there. And I wanted to just point out that some of the highest-scoring discrepant genes now are not the most frequent, right? So you actually have a hypoxia-inducible factor up here in seven samples; why would that be? And among these up here are going to be possible new targets that you could go after for your druggable genome, for example. So we even have a MAP kinase kinase up there that might be worthwhile. And on the other end of the spectrum, there are some other loss-of-function events that we might want to pay attention to.

So you might ask, what do you do if you don't have good pathway models for genes? How can you infer activity, or do these mutations mean anything? You can plot them against clinical information, and so this is just sort of an overview: you could show some phenotypic information against these pathway activities and infer a connection between mutations or phenotypes. And just really quickly, since I'm almost out of time, we've piloted this in the colorectal study, and you can cluster the mutations based on these signatures.
And you can see that APC and p53 tend to have the same correlations in the colorectal study, for example. And it confirms that APC mutations are correlated with MYC activity and, in this case, anti-correlated with the repressed targets of MYC. And on the other end of the spectrum, we have TGF-beta pathway mutations, so those cluster together. And in the middle, you have RTK and PI3 kinase pathway mutations. So the obvious idea here is, if you have a mutation in gene X, and it looks like it's associated with the same activities, possibly in different patients, perhaps it's also acting in the same pathway, based on this type of association analysis that Ted's doing.

And so I'm basically out of time, so I'm going to skip to the end. Obviously, we want to use these to look across multiple cancers. The pathway activities give us a way to do that, and we're working on pan-cancer analysis, the basal comparison to ovarian for the breast work, and so on.

So I hope I showed you that we have a nice model for integrating a lot of different data sets. We use knowledge about pathways, and we're trying to expand that with predicted interactions now. We can stratify patients with that, find predictive subnetworks, and so on, and use it, hopefully, to make predictions about more of these rarer mutations. And the inferences allow us to connect cancers across different data sets. And the last slide that I just skipped there was just trying to make the point that we can connect subtypes together and maybe get a clue about therapies.

So I wanted to just say a special thank you to the Broad team here. They've got PARADIGM working in Firehose, and this is not a trivial feat. A lot of these big network methods, by the way, take a lot of CPU time to run, so it's really nice that this is going to put the results in the hands of the public, actually. And so you don't have to go off and implement these yourself. And this is my group that worked on the integration analysis.
I've highlighted the work of the folks circled there, especially Sam Ng, who you saw speak earlier. And this is work in collaboration with David Haussler, who actually heads the whole team, and Chris Benz. And Jing Zhu ran a tutorial yesterday, and she runs the engineering staff. So thank you, and I'll take any questions. Sorry, I went a couple of minutes early.

Time for one quick question for Josh. Crystal clear. Nope, okay. Well, I'm sure you'd be happy to take it up over coffee if something emerges. So thank you, Josh.