So, I'm Director of Biocomputing here at OICR. I work on curated pathway databases and their application to cancer data analysis. What I'm going to do is basically give you a slideshow that's based on a review article that I helped to co-author with the members of the International Cancer Genome Consortium's mutation consequences working group, where we reviewed the state of the art for pathway and network analysis. I don't know if that pre-print was distributed to the students; I sent it to you. It's on the wiki? It's on the wiki. Okay, so there's a pre-print, submitted but not reviewed yet, up on the wiki for you to look at. Obviously it's an unpublished document, so don't distribute it. This talk is based on that.
Okay, so I'm going to talk to you about pathway and network analysis, building on what Veronique talked about earlier today. So why do you want to do pathway analysis? Well, the main reason is that when you do a genome-wide screen on a cancer cell, you typically end up with hundreds to thousands of individual alterations: copy number changes, changes in mRNAs, point mutations and indels in genes. It's a very complex scenario. You want to relate these changes to biology, and you start to run up against fundamental statistical problems of high dimensionality and multiple testing. Let's say you want to relate changes in a set of genes to patient clinical outcomes such as response to chemotherapy. If you're testing thousands of hypotheses, it's easy to find associations that occur by chance, and you don't know whether they are truly biologically significant. So what pathway analysis allows you to do is to reduce thousands of genes into tens or fewer of pathway alterations, which may reflect the underlying biology of what's being changed in the tumor cell. Then you can take those tens of hypotheses and relate them to clinical parameters and the behavior of the tumor in a much more statistically powerful manner.
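To put rough numbers on the multiple-testing point, here is a minimal Python sketch. The counts (5,000 gene-level tests versus 20 pathway-level tests) and the 0.05 cutoff are illustrative assumptions, not figures from the talk, and Bonferroni is used only as the simplest familiar correction.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of chance 'hits' if every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate near alpha."""
    return alpha / n_tests


# Hypothetical counts: one test per altered gene versus one test per pathway.
for label, n_tests in [("gene-level", 5000), ("pathway-level", 20)]:
    print(
        label,
        "expected chance hits:", expected_false_positives(n_tests),
        "Bonferroni cutoff:", bonferroni_threshold(n_tests),
    )
```

With fewer hypotheses, the per-test cutoff is far less stringent, which is one way to read the "more statistically powerful" claim.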
It also allows you to find meaning in the long tail of rare cancer mutations. I'm sure you've talked about this earlier and seen this in your own work. That is, in a typical cancer population you have a few genes which are very frequently mutated and then a long tail of increasingly rare mutations, many of which are incidental, not real, but there are some drivers hiding in there. How do you make sense of these rare events? If you can relate them to pathways, then you can aggregate these rare mutations into a more frequently occurring set of pathway alterations. Also, pathways allow you immediately to tell biological stories, to go from observations to mechanisms.
I'm now going to try to define what pathway and network analysis is. Broadly speaking, it is any technique that makes use of a priori biological pathway or molecular network information to gain insight into a tumor or other biological system. It is a very rapidly evolving field. If you ask two different people who work in the field what pathway analysis is, you'll get two slightly different answers, and there are many different approaches. I cannot in the next 40 minutes or so give you a comprehensive assessment of all the things that people are doing. I'm going to try to touch on the main points and point you to a few useful tools that you could apply to your own work.
Now we approach the question of what's a pathway and what's a network. I think we all know what a pathway is because it's what we learn in undergrad biochemistry. It's a series of biological reactions which take some inputs, such as glucose, and create some outputs, such as ATP, with many, many molecular interactions between the one and the other.
Pathways can be applied to small molecule transformations like intermediary metabolism, or to more complex things like transport reactions, assembly of subcellular machinery such as the mitotic spindle apparatus, and signaling. Any of those events in the cell can be turned into a series of molecular interactions and reactions. Pathways are usually represented something like this, where you have a series of inputs, a series of reactions and a series of outputs, and some modulatory factors here. What we're looking at here, for example, is a simplified version of the epidermal growth factor receptor signaling pathway, which starts with EGFR and its ligand EGF; they form a complex. The complex is active, it converts ATP into ADP, and eventually, I can't even read this, a downstream effector ultimately leads to increased cell growth. There are a number of regulatory interactions that inhibit these steps and enable them to be regulated.
A network, on the other hand, is a more loosely defined graph of interactions between molecules where, again, you can draw lines between different molecules, but they don't imply any directionality. It's a series of interactions of some sort. They can be regulatory interactions, they can be binding interactions such as protein-protein binding, they can be modification reactions such as a ubiquitination step. The important thing here is that the network structure is a much more general structure into which you can fit a lot of different meanings. Typically, when we represent pathways as networks, we can start to add lots of information which is less well understood. So for example, we can convert the EGFR pathway into a series of network interactions.
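The conversion from a directed pathway to an undirected network can be sketched in a few lines of Python. The step list below is a hand-written toy loosely following the EGFR example, not an export from Reactome or any other database, and `GENE_X` is a made-up name standing in for a poorly characterized screen hit.

```python
# Toy, hand-written EGFR-like pathway: (source, target, kind of step).
# Direction and step type are what a curated pathway records.
pathway_steps = [
    ("EGF", "EGFR", "binds"),
    ("EGFR", "GRB2", "recruits"),
    ("GRB2", "SOS1", "recruits"),
    ("SOS1", "KRAS", "activates"),
    ("CBL", "EGFR", "inhibits"),
]


def to_network(steps):
    """Flatten directed, typed steps into undirected, untyped interactions."""
    return {frozenset((source, target)) for source, target, _kind in steps}


def add_screen_hits(network, hits):
    """Add interactions whose mechanism is unknown, e.g. from a binding screen."""
    return network | {frozenset(pair) for pair in hits}


network = to_network(pathway_steps)
# GENE_X is hypothetical: we only know it binds EGFR, not how or why.
network = add_screen_hits(network, [("EGFR", "GENE_X")])

for edge in sorted(tuple(sorted(e)) for e in network):
    print(edge)
```

The flattened edges no longer say which partner acts on which, or in what order, but the structure can absorb the extra hit without any mechanistic detail.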
One thing you'll notice is that we've lost the directionality, the time sense of things, and we just have a number of interactions. We maintain some of the regulatory steps and some of the binding steps, but we're also able to add a number of additional proteins to the network where we don't know what the precise interaction is but we know that it affects the pathway in some way. So for example, if you do a yeast two-hybrid protein interaction screen with baits from the EGFR pathway, you'll get a bunch of things that bind to components of that pathway. You know they bind, you know they may be biologically important in some way, but you don't know exactly what the nature of that interaction is. Or you can do a synthetic lethal screen or a suppressor/enhancer screen and find genetic interactions between known members of the EGFR pathway and novel or more poorly characterized genes. You can put them into the network, but you don't know exactly what they're doing. So one uses networks when one wants to get a more comprehensive but shallower view of the genome. I'll go more into that in a second.
So regardless of whether you're doing analysis on pathways or on networks, you need two essential ingredients. One is your experimental list of altered genes, proteins or RNAs from the tumor you're studying, and the other is a database from which you can get pathways or networks. So we're going to talk about pathway databases first and then we'll talk about network databases.
Pathway databases are typically human-curated collections of pathway information that come out of the literature. They offer a biochemical view of biological processes, they typically capture cause and effect and the directionality of the biological processes, and they give you human-interpretable visualizations which are intuitively obvious.
The disadvantages of pathway databases are that they typically only cover the part of the genome that's really well understood, so they're only capturing, I don't know, a quarter of biology or less; we don't really know. And because making pathways is a subjective process, different databases disagree on where you put different pathways. So you can have one pathway database that will talk about EGFR signaling, to use my previous example, and another will take EGFR and put it into a lung cancer pathway, because EGFR mutations are frequently seen in lung cancer. So is it a lung cancer pathway, or is it an EGF pathway, or is it a pathway of ligand signaling? It depends on who you are and what your interest is.
Here is a very familiar example of a pathway database that most of us have used at one point or another, KEGG, the Kyoto Encyclopedia of Genes and Genomes, and this is a representation of, if I can read it, the pentose phosphate metabolic pathway. Here is another database that I work on, along with Robin Haw, who will be around sooner or later in the next hour, called Reactome. And here we are looking at, which pathway is this? Okay, this is the NCAM1 signaling pathway. Again, you can see that each of the ovals here is a molecule, and it's showing a series of reactions which is directional, from extracellular signaling events to cytoplasmic and then nuclear events which activate the downstream genes of NCAM.
Reactome, as an example of a pathway database, is hand curated by a team of curators at OICR, New York University and the European Bioinformatics Institute. It uses rigorous curation standards, so every time a curator puts a reaction down on the map, it's traceable back to the primary literature. It's usually a wet lab experiment that was performed that confirms the correctness of that reaction.
It's human oriented, but we do an automatic projection of pathways onto non-human species using orthology information, which has its problems. And it's currently up to a little over 1,500 human pathways and 7,327 human proteins at the last release. Reactome gives you a Google Maps-style reaction diagram, so you can get one of those diagrams, scroll around on it, and make it larger or smaller. You can overlay other information on top of the pathways. You can do a gene overrepresentation analysis online in exactly the same way that Veronique showed you: you upload a list of genes and it will tell you what pathways are overrepresented among your gene set, and will then give you a colorized picture of the pathways with your genes highlighted. And you can do some cross-species analysis: if I'm interested in a pathway in human, I can see what the corresponding pathway is in C. elegans. That's also open access.
Okay, networks, in contrast. In addition to capturing the well-understood portion of biology, which pathways do, networks can cover all sorts of relationships: genetic interactions, physical interactions, co-expression, shared Gene Ontology terms, pathway adjacency. If I have a catalyst of a biological reaction, then I know that that catalyst molecule is interacting in some way with the substrate, so I can call that an interaction. What you're seeing here is one of the early yeast two-hybrid studies of yeast. Every node is a protein, and every arc, every line connecting them, is a yeast two-hybrid-defined interaction.
So network databases can be built via curation. More typically, they're built automatically by aggregating lots of information from big experiments. They provide more extensive coverage of biological systems; some of the larger ones cover more than half of known human proteins. But the underlying evidence is often more tentative, because it comes from high-throughput screens. So here are some popular curated networks that you can get.
So BioGRID does automatic text mining from the literature and then human curation on that. It covers 529,000 genes, not all from human obviously, but from many different species, and 167,000 interactions. IntAct is a primarily curated database of 60,000 genes and 203,000 interactions, and MINT has 31,000 genes and 83,000 interactions. I have the URLs of these things at the end. In addition to these three, which are some of the larger ones, there are about 180 other network databases. Many of them have been collected together and put into this resource called Pathway Commons, which I think Veronique talked about already, or maybe not. Pathway Commons, www.pathwaycommons.org, has a definitive collection of many different network and pathway databases.
Yeah. Yeah, it's a great question. So Cytoscape is a tool for visualization and analysis of networks. It's a network tool. It has apps which connect to these other databases and will import them for you; they'll pull them in. And there is an app that will actually connect to Pathway Commons and download the network of interest. I'll show you some examples of this.
Okay, so that's where you get the data. How do you choose what database to work with? I think it depends on what tool you're using; different tools work best with different databases. This came out of the paper. Can you even read this? It's too small. You can? Okay, good, because on my screen I can't read it at all. There are three different classes of analysis one can do on pathways and networks. The first type is gene set enrichment, which Veronique told you about in the last hour. In gene set enrichment, you basically ignore the interconnections between genes and proteins in the pathway or network and just group them into bags: a bag for mitosis, a bag for cell growth, a bag for Notch signaling, and so on and so forth.
Then you apply your gene list to it and you ask, is there an unexpected enrichment of my list of genes in one or more of those bags? So that's the first and most basic type of pathway and network analysis.
The second type adds topology information to those bags. Within a pathway, it's not just a list of genes; those genes have relationships to each other. There are genes and proteins that directly interact with each other. There are positive and negative regulatory interactions. There are enzymatic reactions: a protease cleaves its target. The second family of techniques uses the topology of the network to try to build subnetworks that are altered in your cancer data set. It uses the topology information to find clusters of genes which are interacting with each other more often than you would expect in your data set, and to draw a little picture that gives you an idea of the relationships among them. So you may, for example, in your cancer data set have an amplification that involves a receptor and then a deletion which involves a negative regulator of that receptor, giving you two hits which both push in the same direction of constitutively activating that receptor. That second type is what, in this figure, we're calling de novo subnetwork construction and clustering.
The third type is more systems biology. It's pathway-based modeling, where you're actually building a kinetic model which captures all the relationships among the genes and attempts to create a quantitative model of how the mutations or alterations you have observed in your data set are leading to altered pathway activities. As we go from one to three, these techniques go from most simple to most complex, and also get harder to use and interpret.
So number one is already covered. We're not going to go over this again, just to say that it's easily the most popular form of pathway and network analysis.
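As a compact reminder of what that first type computes, the "bag" question can be phrased as a one-sided hypergeometric test. This is a minimal sketch of one common formulation, not the specific method of any tool mentioned in the talk; the function name and the example numbers are assumptions chosen for illustration.

```python
from math import comb


def hypergeom_pvalue(universe: int, bag_size: int, list_size: int, overlap: int) -> float:
    """P(overlap or more of my genes land in the bag by chance).

    universe  -- number of genes that could have been picked
    bag_size  -- number of genes in the pathway "bag"
    list_size -- number of genes in my altered-gene list
    overlap   -- how many of my genes are in the bag
    """
    total = comb(universe, list_size)
    upper = min(bag_size, list_size)
    tail = sum(
        comb(bag_size, k) * comb(universe - bag_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes, a 150-gene bag, 300 altered genes, 12 in the bag.
print(hypergeom_pvalue(20000, 150, 300, 12))
```

In practice this is repeated for every bag, so the resulting p-values still need a multiple-testing correction across bags.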
Gene-set enrichment is relatively easy to perform. There are many, many tools that do gene-set enrichment type analyses. The statistical models are very well worked out. The disadvantages are that you can slice and dice the genome in many different ways, and how you do that will affect your output. You typically end up with, as Veronique showed you, multiple enriched pathways which are related to each other in some way. So you have to do another step in which you take the many enriched gene sets that you get and put them back together. And Veronique showed you the enrichment maps, I assume. Yes? The nice enrichment maps. So that's another step you have to do. And it can be difficult to go the next step, from an enriched gene set to understanding the mechanism that explains the tumor phenotype. So that's all I'm going to say about gene-set enrichment.
Now I'm going to spend a little time talking about de novo subnetwork construction and clustering. The way this works is it starts with a network. It's not a pathway, but one of these large networks of interactions. And you apply your list of altered entities, genes, proteins, RNAs, to that network to identify topologically unlikely configurations. So you have a vast network of all of biology, or all known biology, and you discover that your mutated genes occupy a little corner of that interaction map. These techniques attempt to find these unlikely clusters, extract them, and build little models for you to look at and annotate.
Here's an example from the Reactome functional interaction Cytoscape app, which Robin will lead you through. The way this app works is it uses a network that has been previously built for you, consisting of network interactions extracted from curated pathway databases, Reactome plus KEGG and several other pathway databases, plus a large number of uncurated interaction networks. From these we've built for you what we call the Reactome functional interaction network.
It covers a little more than half of the annotated proteins: 11,000 proteins and 270,000 interactions. It's a relatively large network. What the app lets you do is upload a list of your genes. It'll do name translation, ID identification and so forth, and extract from this big network of 11,000 proteins the subnetworks which involve the genes that you uploaded, and then it will do clustering. It will typically turn the hundreds of genes into a small number of highly interrelated disease modules, typically 10 to 30. And then you can annotate these clusters, giving a map like the one shown down here, which shows you the putative function of each of these modules.
It's somewhat like the enrichment map, but done in reverse. The enrichment map starts with enriched pathways and relates them together into clusters of pathways. This instead starts out agnostic to the names of pathways and just looks at the regulatory and physical interactions among the genes, extracts clusters of highly interacting genes that are touched by the altered genes that you're working with in your system, and then annotates those with pathways, to give what ends up being a very similar picture.
So here's an example of this in action. Here is a little corner of the FI network, showing that there's actually a lot of substructure in just the wild-type network. Each of these little clusters that we're seeing is a group of highly interacting genes, things like the ribosome, forming balls like this. And here is a typical data set. This is 52 pancreatic cancer genomes sequenced last year at OICR. We're looking at the genes going this way, along the x-axis, and the frequency of non-synonymous mutations in those genes on the y-axis.
And this is a very typical graph. It shows that there are some genes, like KRAS and P53, which are very frequently mutated, and then there's a very rapid drop-off into a very, very long tail, going way out to the right, of mutations which occur in just a few samples and are not highly recurrent. How do you make sense out of that? Well, using the Reactome functional interaction app, you can generate a picture like this. Basically, it's extracted all the genes that were in the head and the long tail, clustered them together, and then annotated them, showing that there's a large cluster of genes that includes KRAS but also many other genes from the tail which are interacting with KRAS more frequently than you would expect by chance, and which involves the EGFR pathway, FGF receptor, axon guidance, and ErbB signaling. Another module is P53 and proteins that interact with it. Hedgehog is coming up, calcium signaling, the spliceosome is overrepresented in the tail, Wnt and cadherin signaling, and axon guidance, which comes up over and over again in different cancer types. These all kind of just fall out. You can now zoom into this, start to look at where the positive and negative regulatory interactions are, have this overlaid on top of the Reactome pathways, and start to tell stories about what you're seeing in that long tail. So it's a nice visualization system. Yeah?
Oh sure. If you take a random set of genes, you don't get a picture like this one. What the tool lets you do is a permutation test, asking how many times with a random set of genes you would get a module of this size with this annotation. So you can actually put p-values on them. And there are some modules, like this module 7, which is extracellular matrix, that typically are not significant.
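The permutation idea can be sketched as follows. This is a simplified stand-in, not the app's actual procedure: module "size" is taken to be the largest connected group among the chosen genes, the 30-gene network is a toy, and the gene names `G0`..`G29` are placeholders.

```python
import random


def largest_module_size(edges, genes):
    """Size of the largest connected group among `genes` in the network."""
    genes = set(genes)
    neighbours = {g: set() for g in genes}
    for a, b in edges:
        if a in genes and b in genes:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, best = set(), 0
    for start in genes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        best = max(best, size)
    return best


def permutation_pvalue(edges, all_genes, observed_genes, n_perm=1000, seed=0):
    """Fraction of random gene sets whose largest module is at least as big."""
    rng = random.Random(seed)
    pool = sorted(all_genes)
    k = len(set(observed_genes))
    observed = largest_module_size(edges, observed_genes)
    hits = sum(
        largest_module_size(edges, rng.sample(pool, k)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)  # add-one so the estimate is never exactly 0


# Toy network: one chain of five genes plus two unrelated pairs, 30 genes in total.
all_genes = [f"G{i}" for i in range(30)]
edges = [("G0", "G1"), ("G1", "G2"), ("G2", "G3"), ("G3", "G4"),
         ("G10", "G11"), ("G20", "G21")]
mutated = ["G0", "G1", "G2", "G3", "G4"]
print(permutation_pvalue(edges, all_genes, mutated))
```

Sampling genes uniformly ignores gene length and background mutation rate, which is why a module of long, frequently mutated genes can look more or less significant than it should under a simple null like this one.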
That module will come up randomly many times because, actually, it tends to be a group of genes which are longer and have a higher mutation rate. They tend to be false positives in the underlying data. Okay, yes?
Per cluster, do you assign by hand, for example, module 10, spliceosome? Or does it have a cluster representative?
Yeah, so the way this works is that the Cytoscape tool leads you through several steps. The first step is you extract the subnetwork, from the whole network, of the interactions that involve your genes of interest. Then there's a second step of clustering that gives you the cluster map. And there's a third step of annotation, where it does a gene set enrichment for each module and then assigns a series of enriched pathways to each module. And that's what this representation is; this is after that third step. So there's an extract step, a cluster step, and an annotate step. And you can choose to annotate with GO terms, which is what we're seeing here, or with pathway names, or with subcellular localization. Again, instead of annotating individual genes the way you would in an enrichment map, you're making the modules first and then annotating the modules.
Sometimes these modules give you insight into the tumor biology and clinical correlation. So one thing you can do, and this is a tool that's built into the Reactome Cytoscape plugin, is take each of these modules and, if you have patient survival data or response data, automatically check each module to see if it correlates with a clinical parameter. So it'll, for example, draw Kaplan-Meier curves for you and give you a Cox proportional hazards hazard ratio and p-value for each module. Another thing that you could do is to cluster the modules themselves.
So when we take the pancreatic cancer modules and do a hierarchical clustering on them, we're looking here at a map where the modules are going along the x-axis and individual patient samples are going along the y-axis. Each patient has a different subset of mutations in each of the modules. So you color code the modules by the number of mutations that particular patient has in the genes that belong to those modules, and then you cluster those. What this ends up revealing is that in the pancreatic cancer data set, when you look at mutations in a module context, you actually end up with four different pathway-based subtypes of the tumor. Here's one which is KRAS negative, P53 negative, SMAD4 negative, axon guidance positive. Type two is KRAS positive, P53 pathway positive, and, actually I take this back, this is calcium signaling positive, and so on and so forth. This pattern is completely unobserved when you look at individual genes, because it's obscured by the long tail. When you bring in the long tail information, suddenly the pattern appears.
So that's an example of a network clustering algorithm. There are many others. One that I'm very fond of is GeneMANIA from the Bader group, and Jüri has worked on it, I believe. No, never worked on it? Sorry, yours is HyperModules, which is coming up. Yes, I'm sorry, there's a question.
[Audience question, partly inaudible, about how the mutation profile was used.]
Yeah. So we took the mutation profile from each individual patient, projected it onto the modules, and then scored each module for each patient based on the number of mutations that patient had in that module.
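The per-patient module scoring just described amounts to building a small count matrix. Here is a minimal sketch; the module names, their gene memberships, and the three patients are invented for illustration and are not the modules from the pancreatic data set.

```python
# Hypothetical modules (gene memberships are illustrative only).
modules = {
    "M0_KRAS": {"KRAS", "EGFR", "FGFR1"},
    "M1_P53": {"P53", "MDM2"},
    "M2_axon_guidance": {"ROBO1", "SLIT2"},
}

# Hypothetical patients and the genes mutated in each.
patient_mutations = {
    "patient_1": {"KRAS", "EGFR", "P53"},
    "patient_2": {"ROBO1", "SLIT2"},
    "patient_3": {"KRAS", "P53", "MDM2", "TTN"},
}


def score_matrix(patient_mutations, modules):
    """One row per patient, one column per module: mutated genes in that module."""
    names = sorted(modules)
    rows = {
        patient: [len(mutated & modules[name]) for name in names]
        for patient, mutated in patient_mutations.items()
    }
    return names, rows


columns, rows = score_matrix(patient_mutations, modules)
print(columns)
for patient, scores in rows.items():
    print(patient, scores)
```

The rows of this matrix are what would then be handed to a hierarchical clustering routine; genes outside every module, like `TTN` here, simply do not contribute.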
So, let's talk about KRAS: they might have a KRAS mutation, and then they may have had three other mutations affecting three other genes in that module. So that would get a high score. If they had no mutations in the module, it gets a low score. And then once you have scored each module for each patient, you put them into a matrix and you do a hierarchical clustering. And that's where this pattern emerges.
Well, it depends on the strength. We started doing this when there were 20 patients. We now have over 300 patients, and the same four subtypes have continued to come out. So in this case it was very robust, but I can't promise that the next tumor type would be this clear. The other thing I'd say is that, of course, when we first saw this, we said, oh, this is great, now let's see if there's any difference in gemcitabine response or patient survival, disease-free survival. None of these four types seems to have any effect on this, except for type number one, which has a much better survival. And those turn out not to be pancreatic ductal adenocarcinomas at all. They're endocrine carcinomas and ampullary carcinomas which were misdiagnosed. Interesting difference there. But this had no influence on patient survival for the ductal adenocarcinomas.
I have another example that I didn't put in here, from breast cancer, where we made a similar module map for breast cancer and then checked each module for correlation with disease-free survival. And we actually found a module that has a huge effect; it involves the Aurora kinase B signaling pathway and is a very robust biomarker for poor survival in estrogen receptor-positive patients. I didn't put it up here in the interest of brevity. Other questions before I move on? Yeah. Yeah.
So, well, we used a very simple scoring system. It was an RNA-based biomarker in this case. For an individual patient, we averaged the fold change across all the genes in the module, and that became the score for the module. And that turned out to be a very robust biomarker for estrogen receptor-positive survival. In fact, it's kind of interesting, because patients who have high expression of the genes in this Aurora B kinase module have the same prognosis as if they were triple negative. So it's picking up a subset of patients in the estrogen receptor-positive group, who usually have a good prognosis, who have just as poor a prognosis as the triple negative patients. So clinically it might be useful. Yeah. This one? That one, yeah. Okay.
Well, so each line is a functional interaction between two genes. Some lines are derived from pathways, and so we know precisely what that line means: it's a catalytic relationship, it's a binding relationship, it's a phosphorylation or a ubiquitination event. And others are based on uncurated high-throughput data like these two-hybrid experiments, or proteomics pull-downs, or literature co-occurrence. Actually, the detailed information is available inside Cytoscape: you click on the line and it tells you what the evidence for it is. And the distances don't mean anything. They're just laid out to be attractive. Yeah.
It would be a very cool feature to overlay co-occurrence and mutual exclusion data on top of this, so that you could see, for example, that in a single patient you always have to have modules zero and one mutated. But currently you would have to do that on your own; you'd have to do it manually. Sure, Gaye?
Yeah, each line has a probability associated with it, with the curated ones having a probability of one, out of hubris. And yeah, sure. You want me to go back to here? Okay. Sure.
How is the uncurated interaction evidence constructed?
So I took out some slides that describe how we made this, but basically we took multiple sources of evidence, consisting of a series of high-throughput proteomics experiments; messenger RNA co-expression data from GEO, all of GEO went into this; literature co-occurrence; experimental models in yeast, C. elegans and Drosophila, so suppressor/enhancer screens; and some ENCODE data for transcription factors and their cis-regulatory binding sites. We built a big interaction network, which contains a lot of false positives because this data is unreliable. And then we built a machine learning system, trained on curated interactions, to weight each interaction by the amount of evidence it has. So in the end, we call an interaction likely if it's supported by multiple sources of evidence: it shares GO terms, there's a yeast two-hybrid experiment that supports it, and the gene products co-occur in the same cellular compartment, for example, according to what the machine learning system tells us. And we set the threshold so that we have a low rate of false positives. So when we apply it to an evaluation data set, we have fewer than 5% false positives against known curated interactions. Okay, I'm going to move on.
Okay, so clustering algorithms: GeneMANIA. It's a website that's put together by Quaid Morris and Gary Bader, both of the University of Toronto. It is a repository of something like 100,000 networks that have been published.
And it is designed for you to upload members of an interesting set of genes, and then it will make subnetworks for you and find other genes which seem to belong in your set. So you give it genes A, B, and C, and it'll find genes D, E, and F, which are more related to them than you would expect by chance. It's very useful for trying to assign a function to a group of genes.
HotNet, from Ben Raphael's lab, uses a much more sophisticated clustering algorithm than Reactome uses. The default clustering algorithm in the Reactome app is very, very fast, so it can be used interactively. But it gets hung up on genes like P53, which have lots and lots of interactions because they've been studied so heavily. And so you tend to find the P53 module no matter what you look at. HotNet uses a model in which it represents the network as a metal mesh. The genes that are mutated are hot spots in the mesh, and then it allows the heat to diffuse out. What this does is that genes that are heavily connected, like P53, tend to lose a lot of heat because they're so connected, so it downweights them. And so it does a better job at clustering. In fact, the Reactome Cytoscape plug-in now offers HotNet as an alternative clustering algorithm. It will take a little longer to run, but will give you a better set of modules.
Jüri's thesis project includes HyperModules, which is an algorithm that searches for network clusters involving your mutated genes, your genes of interest, that correlate with clinical characteristics. So if you're studying a series of patients, you give it tables of your patients, clinical characteristics such as disease-free survival, and the genes that are mutated in those patients. And it specifically searches for clusters which co-vary with those clinical characteristics.
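The metal-mesh picture described for HotNet can be illustrated with a toy diffusion. This is a minimal sketch of one simple diffusion scheme (each node keeps a fraction of its starting heat and splits what it passes on equally among its neighbours); it is an assumption made for illustration, not HotNet's published algorithm, and the node names and the `retain` parameter are invented.

```python
def diffuse_heat(edges, initial_heat, retain=0.5, n_iter=100):
    """Iterate a simple diffusion until it has effectively settled.

    Each round, every node is re-seeded with `retain` times its starting heat,
    and the remaining share of its current heat is split equally among its
    neighbours. Nodes that appear in no edge are ignored.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    heat = {node: initial_heat.get(node, 0.0) for node in neighbours}
    for _ in range(n_iter):
        new_heat = {node: retain * initial_heat.get(node, 0.0) for node in neighbours}
        for node, nbrs in neighbours.items():
            share = (1 - retain) * heat[node] / len(nbrs)
            for nb in nbrs:
                new_heat[nb] += share
        heat = new_heat
    return heat


# Toy mesh: a heavily connected hub with five neighbours, and an isolated pair A-B.
edges = [("HUB", f"L{i}") for i in range(1, 6)] + [("A", "B")]
# Both HUB and A are "mutated" and start with one unit of heat.
heat = diffuse_heat(edges, {"HUB": 1.0, "A": 1.0})
for node in sorted(heat):
    print(node, round(heat[node], 3))
```

In this toy, the single neighbour of `A` ends up much warmer than any one neighbour of `HUB`, because the hub's heat is spread thinly across many connections, which is the downweighting effect described above.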
And then Reactome, that I discussed before, has both our own clustering algorithms in there, as well as ones that we've imported and re-implemented that were developed by other people.
So now I will talk about pathway-based modeling. The deficiency of network clustering is that it throws out a lot of the temporal and mechanistic information that is there in pathway databases by adopting this much simpler model. So pathway-based modeling attempts to preserve those detailed biological relationships and produce quantitative models representing the effects of your list of altered genes, proteins and RNAs. And it's shading into systems biology at this point. So this is much more cutting edge, and much harder for people to run if they are not computer scientists and computational biologists themselves. And these tools also tend to be much more specialized for particular problems.
So the oldest and most well-developed class of pathway-based models are partial differential equation and Boolean models, such as an application called CellNetAnalyzer. These were developed for biochemical systems, metabolomics, and are great at predicting, for example, the rate of production of a certain small molecule in a yeast fermentation reaction. They really only scale to small numbers of genes, like a dozen at most, and require inputs of reaction constants and binding constants, Kms and the like, and are probably not very useful to anyone doing cancer analysis. There are network flow models such as NetPhorest and NetworKIN, which were developed specifically for certain types of signaling cascades, such as kinase cascades and phosphorylation cascades.
And so if you're working in that particular domain, and you have the proteomics data for phosphorylation and dephosphorylation reactions and their byproducts, then these models are for you. If you're working with a large set of RNA expression arrays or RNA-seq data, and you have a number of perturbations, such as a cell line that's been treated with a series of small molecule drugs or a series of shRNAs, all in the same cell line, and you've captured a microarray from each one, you can use tools such as ARACNE from Andrea Califano's lab at Columbia to create a model of the transcriptional regulatory network that is affected by the perturbations, and find things like master regulators, that is, transcriptional regulatory switches that control large numbers of target genes. That's very useful.
And then the most general form, and the most recent one, are probabilistic graphical models such as PARADIGM from Josh Stuart's group at the University of California, Santa Cruz. I'm going to talk about this in more detail, but it's a very general form of pathway modeling that was developed specifically for cancer genome analysis. And that's what I'm going to tell you about.
So the way PARADIGM works is it takes a curated pathway. Here's a very, very simple version of apoptosis which only has two genes in it, p53 and MDM2, an inhibitor. It then turns this into a graph which captures each of the steps in the central dogma. So MDM2 gets expanded into the MDM2 gene in DNA. It's transcribed into an RNA. The RNA is translated into the MDM2 protein. And then the MDM2 protein can become activated and give rise to activity. Same thing for p53. And then there is an interaction between the active MDM2 protein and p53. It's actually a ubiquitination reaction which inactivates p53. So there's an interaction here. And then each of these transitions is assigned a weight, a probability that it occurs.
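The expansion just described, from a two-gene pathway to a graph of central-dogma steps, can be sketched as a small data structure. This is only an illustration under stated assumptions, not PARADIGM's input format or code: the level names, the function name `expand_pathway`, and the placeholder weight of 0.9 on every edge are hypothetical, whereas the real weights are learned from data.

```python
# Central-dogma levels each gene is expanded into (names are assumptions).
LEVELS = ["DNA", "RNA", "protein", "active"]


def expand_pathway(genes, interactions, default_weight=0.9):
    """Expand a curated pathway into weighted central-dogma edges.

    Each edge is (source_node, target_node, sign, weight), where a node
    is a (gene, level) pair, sign is +1 for activation or -1 for
    inhibition, and weight is a placeholder probability.
    """
    edges = []
    for gene in genes:
        # DNA -> RNA -> protein -> active protein, within one gene.
        for upstream, downstream in zip(LEVELS, LEVELS[1:]):
            edges.append(((gene, upstream), (gene, downstream), +1, default_weight))
    for source, target, sign in interactions:
        # Pathway interactions connect the active forms of two genes.
        edges.append(((source, "active"), (target, "active"), sign, default_weight))
    return edges


# The two-gene apoptosis example from the talk: active MDM2 inhibits p53.
edges = expand_pathway(["MDM2", "TP53"], [("MDM2", "TP53", -1)])
for source, target, sign, weight in edges:
    relation = "activates" if sign > 0 else "inhibits"
    print(source, relation, target, weight)
```

Observations such as copy number, RNA level, or protein abundance would then attach to the matching (gene, level) node.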
And then to this you can apply your cancer alteration data. Let's say you have a mutation that affects the gene and an observed change in the RNA, a copy number change in the p53 gene, maybe some mass spec data showing changes in the quantity of unmodified p53 protein. If you give PARADIGM a series of such observations, it attempts to fit them to a model. It establishes parameters and weights associated with this model which attempt to explain how changes in the MDM2 gene will affect the activity of the apoptosis pathway. And then once it's been trained on a large number of cancer cases, you can run it in inference mode and it'll tell you, for any particular patient with a specific set of mutations and other alterations, what the integrated effect of all the mutations you've seen is. So you might have a mutation here, a copy number change there, a fusion protein here, and it'll attempt to tell you whether apoptosis is up, down or unchanged.
And surprisingly enough, this actually works extremely well given the amount of uncertainty here. Here is a copy of Josh Stuart's Bioinformatics paper from 2010. And what this is showing is the TCGA glioblastoma multiforme data set, which you've been working with, in which each patient has been run through PARADIGM for the GATA pathway, E2F, EGFR, HIF-1-alpha and a series of others. We're looking at pathways going down and patients, our donors, going across, and the heat map here shows the change in the activity of each of those pathways. And what we're seeing here are actually very dramatic differences from one patient group to another when expressed as pathway fold changes.
So there's good and bad news about PARADIGM. The bad news is it's actually very hard to use. It's distributed in source code form, and it's very hard to compile. They don't actually give you any pre-formatted pathway models to run your data through.
Documentation is poor, and it takes a long, long time to run. The good news is that, because I'm enthusiastic about the algorithm, Guanming Wu in my group has incorporated it into the Reactome Cytoscape app, so you can run it within Cytoscape. It's not quite ready for prime time yet, so Robin will not be showing you how to use it, but it will be ready within a couple of months. And even better, Hussein Radfar, a postdoc in my group, has reimplemented the PARADIGM algorithm so it runs 50 times faster than the original version. So instead of taking weeks to get a result, you can get the result in minutes, which makes it possible to actually use it in an interactive way. And so I'd hoped that it would be ready to show to you today, but it's not quite ready. We've also pre-populated it with a series of pathway models from Reactome, so look for that soon.
That's the end of the talk here, and then I have a series of URLs, and then, yeah, right, that's not true actually, and then there are more detailed references in the pre-print that's up on the wiki.
So I'm trying to understand, when would you use a pathway-based model versus Reactome or a network-based model? What question would you ask?
Yeah. Well, so I think ultimately, if you want to get a general idea of what pathways and processes are altered in the system that you're working on, so you have a hundred different examples of a sarcoma, and you want to understand what pathways are driving sarcoma, then any of these methods is going to give you that list of driver pathways. You can use the easiest and fastest one, gene set enrichment, for that.
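For reference, one common form of gene set enrichment is an over-representation test: given a list of altered genes, ask whether more of them fall in a pathway than chance would predict. The sketch below uses a one-sided hypergeometric test in plain Python. The talk does not say which statistic any particular tool uses, so the choice of test, the function name `enrichment_p_value`, and the example numbers are all assumptions.

```python
from math import comb


def enrichment_p_value(overlap, n_altered, pathway_size, n_background):
    """One-sided hypergeometric p-value for pathway over-representation.

    Probability of seeing at least `overlap` pathway genes when
    `n_altered` genes are drawn at random, without replacement, from
    `n_background` genes of which `pathway_size` belong to the pathway.
    """
    max_overlap = min(n_altered, pathway_size)
    total = comb(n_background, n_altered)
    tail = sum(
        comb(pathway_size, i) * comb(n_background - pathway_size, n_altered - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 200 altered genes out of 20,000, of which
# 8 fall in a 50-gene pathway.
p = enrichment_p_value(overlap=8, n_altered=200, pathway_size=50, n_background=20000)
print(p)
```

In practice this would be repeated for every pathway in the database, and the resulting p-values would need a multiple-testing correction.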
If you want to find subtypes of tumors, then gene set enrichment may not work so well, and you would want to go to the next step and use the network extraction and clustering systems, because those are actually very good at finding subtypes. If you want to go beyond that and start asking precision medicine questions, such as what pathways are affected in a particular patient, how they are affected, and what drug you might give that patient to reverse those changes, or to search for synthetic lethals, then you need to do modeling, and that's when you would want to go to the most sophisticated pathway-based modeling approaches. Other questions? Yeah.
These pathways are created using biology knowledge. Can we further enrich our biology knowledge from these pathways? I mean, I feel like they're restricted to what we know so far. Can we, based on what we observe in these pathways, enrich our biology?
Well, the short answer is yes. What pathway and network analysis does is put your experimental findings into the context of previously known biology, which otherwise you'd have to extract from the literature, and you may not be familiar with obscure parts of the literature which are relevant. So, for example, axon guidance. Two years ago, no one would have thought of axon guidance signaling pathways in the context of cancer. By putting cancer data into the pathway context, axon guidance, which had been the domain of neuroscientists, suddenly comes up as being a pathway involved in tumors and metastasis. So you've now added some biological knowledge to the literature, which ultimately will get reincorporated into the pathway databases.
So you're really doing the same thing everybody in science does. You build on top of the previous knowledge, you add new knowledge to it, and then it becomes part of the established corpus of information. Okay, yeah.
Okay, so the question is, yeah, I don't think I understood the question. So you have tumor and normal? Yeah. Okay, yeah. So the question is, if you have tumor and normal, how do you do network analysis on it, right? So typically all these techniques require you to provide a gene list. So you have to go through the tumor and normal and decide which genes have changed relative to normal. For mutation data, it's typically a yes or no: it has a functionally significant mutation or not. As you've heard, there's a whole other series of techniques for deciding when a mutation is a likely driver mutation. So you have to run those things first in order to come up with a list. If you're looking at RNA expression levels, then you can provide some, but not all, of these network systems with a quantitative fold change value. Same thing with copy number changes, which are quantitative.
So we can put that in Reactome, the fold change?
Yeah, actually you can. You can put fold changes in and, you know, one of the handy things is it'll give you a pathway diagram with the fold change represented as a color change.
With your genes?
Your genes. Yeah, you say genes A, B, and C are all increased and gene D is decreased. And it'll give you a pathway diagram in which there are three red boxes for your increased genes and one blue box for the decreased gene.
What's the limit of how many genes can be increased? Practical limit.
Robin, what's the practical limit of how many genes you can upload into Reactome for a colorization analysis?
It's a good question. I don't know if we've really tested the limits. I mean, the new version can take very large files, up to 5 to 10 megabytes, so.
Well, that sounds like hundreds of thousands. All right, any other questions? I'll turn it over to Robin, and I will hang around here too, and maybe more questions will come up as you work through the exercises.