Sure, I am actually the Director of Informatics and Biocomputing here at OICR. I work on community databases and knowledge bases, such as WormBase, and what I'll be talking about today is the Reactome database of human biological pathways. So you've talked about enrichment of gene lists, and now we're going to go one step further and discover novel relationships among the genes in your gene set using pathway network analysis. The main motivation for looking at biological pathways from the computational and statistical point of view is a dramatic reduction in data size. Instead of having thousands of individual genes that may come out of your omics experiment, and thousands of hypotheses about which ones may or may not be biologically significant, pathway network analysis allows you to reduce that to dozens of pathways and reduce the number of hypotheses, so that you can find biological processes which are enriched in your set of interest without the penalty of the multiple testing correction. This also allows you to look at sets of rare events, such as the long tail of rare cancer mutations. Those of you who work in cancer know that if you look at mutations, there are only a few genes in any given tumor type which are recurrently mutated, and then there are hundreds of genes which are mutated in just a few percent of the population. You don't know whether those are passenger mutations which have no significance, or driver mutations which in rare instances are contributing to the tumor phenotype. Pathway network analysis can help you separate the ones which are significant from the ones which aren't, and discover the processes in which they participate. 
A similar problem pops up in rare germline mutations, such as in autism spectrum disorder, which has a strong hereditary component, but there are few if any genes which are commonly mutated in the disease; instead it seems to be driven by many different rare germline variants. And again, pathway network analysis has been used very successfully in this case to identify the underlying biological processes from this long tail. Similarly, you can tell biological stories: because you're working on well annotated, well understood biological pathways, there's literature backing up the patterns that you're seeing, so you can identify hidden patterns in long lists of anonymous-sounding genes. You can extract the pathways and build on them to create mechanistic or quantitative models to explain your experimental observations. You can predict the function of unannotated genes by identifying novel or poorly annotated genes which interact with well known genes, and start to make guesses about what those are doing. And you can identify molecular signatures using pathway network analysis by finding sets of patients or cell lines or organisms, whatever you're working on, which differ from one another at the pathway level, and relate that to their phenotype. So I've been talking about pathway network analysis a lot, but I haven't actually said what it is. It's a very broad term; I define it broadly as any analytic technique that uses pathway or molecular interaction information to gain insights into a biological system. It's a very rapidly evolving field with a lot of different approaches, and anything I say today will probably be out of date in a year. So let's talk about the difference between pathways and networks. They are basically two ways of looking at the same data. 
The pathway-oriented view is the traditional view that you learned in Biochemistry 101 from Lehninger, where you have a linear or branching set of biological reactions involving proteins or lipids or small molecules or metabolites, and they're organized end to end in a way that has some directionality: there's an upstream portion and a downstream portion, and it relates to causality. It's a great way of understanding what the system is doing in a teleological way. It probably doesn't actually relate to the way cells see what's going on inside them, which is probably much more complex and branching and looping and interacting, but it's a good way for us to organize the knowledge. So, in the pathway-oriented view of the epidermal growth factor receptor — let me see if the mouse here is going to work for me. No, this mouse doesn't work. This one works. Is my cursor visible to you? Yeah, you can see it. We have a reaction center here. This is where the reaction occurs, and we have two inputs into the reaction: the EGFR receptor and the EGFR ligand. They interact with each other to form the EGFR complex, and this reaction is negatively regulated, in this case, by a protein called LRIG1 — and in fact this is a very simplified view; there are many different regulators of this reaction. After EGF and EGFR bind, two copies of this associate with each other to form a dimer. So the input here is the receptor-ligand monomer; two of them come together to form the dimer. This then catalyzes a reaction in which ATP is converted into ADP, and the output from that is the phosphorylated version of EGFR. You shouldn't have to take notes on this, this is just an illustration. And it's positively regulated by the protein product of the SHC1 gene. So this pathway continues and continues and continues until you finally get stimulation of mitosis and cell growth. 
If we look at this in a network fashion, however, we've reduced each of these genes and protein products and small molecules to a series of interacting entities. In this type of view, there are nodes, which are the circles, and there are edges, which are the various arrows and lines that indicate interaction. So in this view, EGF activates EGFR. These two are inhibited by LRIG1. And I believe this is drawn incorrectly, but EGFR is shown being activated by SHC1 — this arrow should probably be pointing the other way. I just noticed that for the first time. The nice thing about a network is that you can start to put in less well understood interactions. So if we have, say, proteomics co-immunoprecipitation data that says that KRT17 and EGF co-immunoprecipitate, we can put in interaction edges which indicate that there is some sort of physical or genetic interaction going on here. We don't know exactly what, but it's useful to have that bit of information in there. So the advantage of the pathway-oriented view is that it's easy for people to comprehend. The advantage of the network-oriented view is that it's a simplified data model: it's machine readable, and you can do mathematical modeling on this type of network. So now we're going to get down to the practicalities of working with pathways and networks. To do any type of pathway network analysis, you need two things. You need a list of genes, proteins, or RNAs from your experimental system — these can be variants in a genetic screen that you've identified or associated with a phenotype, they can be up- or down-regulated genes in an expression data set, they can be microRNAs in a small RNA sequencing experiment, et cetera. And you need a database of pathways or networks. And then you need a tool that will combine the two and tell you how they're related. So I'm going to talk about the databases first. 
So there are basically two types of sources for network and pathway information. One is pathway databases, which are oriented towards the human view of biological networks. The other is network databases, which give you the machine readable view. The primary examples of the former — labeled here, a bit confusingly, as reaction network databases, because that's the formal description — are the Reactome and KEGG databases. These databases describe biological processes as a series of biochemical reactions, in the way that I showed you in the EGFR example, and they're able to represent many, if not most, of the events and states found in biology. Here is an example of the Reactome data model, which shows the fundamentals of the model centered around a reaction. A reaction can be modulated by a regulatory gene or protein or small molecule. There can be a catalyst activity associated with a reaction. And a reaction converts a set of inputs into a set of outputs. The inputs can be small molecules, such as sugar molecules in intermediary metabolism; they can be two proteins which are dimerizing; they can be a protease and its substrate. And the inputs then produce a series of outputs. So if you're talking about a proteolytic cleavage event, there would be one input, the uncleaved protein; the catalyst would be the protease involved; and the outputs would be the two cleaved products of the original protein. If you're talking about a transmembrane signaling event, the input is the unactivated version of the protein, the output is the activated version, and so forth. The other big database for pathway information is KEGG. It's a curated database compiled from published material. It includes information on all sorts of small and large molecules across many, many different organisms, and it provides a map of how each of these are organized together. 
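To make the reaction-centered data model concrete, here is a minimal sketch of it as a Python data structure. This is purely illustrative and is not Reactome's actual schema; all class and field names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Reaction:
    """One biochemical reaction: inputs are converted to outputs,
    optionally driven by a catalyst and modulated by regulators."""
    inputs: list
    outputs: list
    catalyst: Optional[str] = None
    regulators: dict = field(default_factory=dict)  # name -> "+" or "-"

# A proteolytic cleavage event: one input (the uncleaved protein),
# the protease as catalyst, and the two cleavage products as outputs.
cleavage = Reaction(
    inputs=["pro-protein"],
    outputs=["N-terminal fragment", "C-terminal fragment"],
    catalyst="protease",
)

# The EGF/EGFR binding step from the talk: two inputs, one complex out,
# negatively regulated by LRIG1.
binding = Reaction(
    inputs=["EGF", "EGFR"],
    outputs=["EGF:EGFR complex"],
    regulators={"LRIG1": "-"},
)

print(len(cleavage.outputs))        # 2
print(binding.regulators["LRIG1"])  # -
```

The point is just that a single reaction record, with input, output, catalyst, and regulator slots, is expressive enough to capture cleavage, dimerization, and signaling events alike.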
So here is a typical pathway diagram from KEGG showing the cell cycle. It's showing the various components of the cell cycle organized as a series of reactions, in the way I showed you before. In this case, they use little black or open dots instead of squares to show the reactions, but it's essentially the same thing I showed you before. Reactome, similar to KEGG, is a pathway database. There are a couple of main differences between KEGG and Reactome. One is that Reactome focuses on very deeply curated human pathways, and other organisms come along for the ride but are not directly curated, whereas KEGG attempts to do all organisms, so it's much broader but shallower. The other difference is that KEGG has licensing restrictions — it's not free for commercial use — while Reactome is completely open source and open access. Reactome covers pathways for metabolism, signaling, and other biological processes. Every pathway and every reaction is traceable to one or more references in the primary literature, and there's an editorial process which ensures that there are multiple references for key observations. Each pathway gets peer reviewed at regular intervals, so we have a pretty good sense that what goes into the database is correct. Reactome is also interlinked with many other online databases and provides built-in data analysis and visualization tools. Here is an example of one of the visualization and analysis tools in Reactome — Robin will be showing you something like this during the workshop that follows. This is a Google Maps-style view of a double-strand break repair pathway. What you're seeing on the left side is a hierarchical list of pathways, sub-pathways, and sub-sub-pathways. On the right is a scrollable and pannable window that you can drag around and zoom in and out of, which shows you details of the pathway. 
And we've projected a gene set onto this pathway to show where the up-regulated and down-regulated genes in your experimental set are. So here is an up-regulated gene in the gene set; here's a down-regulated one. I believe this was an mRNA set. And here is a complex of multiple genes that has some up-regulated and some down-regulated components in the gene set. And the statistics from the enrichment analysis are below. So in the very basic usage of this, you have a gene set, like the ones you've been using yesterday. You upload it into Reactome, it does an enrichment analysis for you, it tells you which pathways are enriched in your gene set, and then you can zoom in and see how the altered genes are related to each other within each of the enriched pathways. In addition to Reactome and KEGG, there are about 1,900 other sources for pathway information — some of them very boutique, focusing in great depth on certain pathways, others more broad. There's a great resource called Pathway Commons, in which multiple databases submit their pathways in a uniform format called BioPAX, and this resource allows you to search through them. So if you're interested in a particular pathway, you can find the best curated and deepest one by doing some searches on Pathway Commons. It also provides some tools for comparing one group's view of a pathway to another's, and for doing simple enrichment analysis. OK, so those are pathways. Now we're going to talk about networks. Pathways are great for people, but they're really lousy for doing computation over. Generally, the data model has to be simplified in order to do statistics and make models that a computer can deal with. Typically, the networks that we work with in biology are interaction networks. These are a collection of nodes — the nodes being genes, proteins, lipids, RNAs, et cetera — connected by edges. 
And the edges describe the nature of the interaction between the nodes. There can be undirected edges, where there's a mutual, reciprocal interaction — two proteins binding to each other, with no particular directionality; in network diagrams that would be indicated as a line without an arrow. There can be positive or negative regulatory relationships, indicated by an arrow. A typical example would be a transcription factor and a gene whose expression it activates: the transcription factor would be one node, the RNA would be the second node, and the transcription factor activates the RNA by increasing its expression. A catalyst would have a positive regulatory edge, whereas, say, a ubiquitination reaction would have a negative regulatory effect, and that would be shown as a line with a little bar on the end, OK? So as you can tell from the examples I gave you, edges can be physical interactions, they can be functional relationships, they can be more abstract things like co-expression relationships — you know there's a genetic relationship between two genes because they go up or down in synchrony with each other. Basically, any sort of relation can be expressed with this simple model. And there are a bunch of different types of common interaction networks that you'll run into. Transcriptional regulatory networks are a good example. Metabolic networks, where you have a series of enzymatic activities and their relationships to the small molecules whose reactions they catalyze. Protein-protein interaction networks are a major one; you'll see a lot of them. And then there are higher-order networks, such as disease networks, where each of the nodes is a disease and they're connected to each other on the basis of, for example, the number of genes that they share in common among genetic diseases. Here's another example, a virus-host network. 
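These edge types can be written down very simply in code. Here is a minimal sketch, with invented node names, of a network as a list of typed edges — undirected "binds" edges for physical interactions and directed "activates"/"inhibits" edges for regulation — plus a small helper to pull out the regulators of a node.

```python
# Each edge is (source, target, kind). "binds" is undirected;
# "activates" and "inhibits" are directed regulatory edges.
edges = [
    ("EGF", "EGFR", "binds"),                 # physical interaction, no arrow
    ("TF1", "GENE_A_mRNA", "activates"),      # transcription factor -> RNA
    ("E3_ligase", "TargetProt", "inhibits"),  # e.g. a ubiquitination event
]

def regulators_of(node, edges):
    """Return (regulator, kind) pairs for directed edges into `node`."""
    return [(s, k) for s, t, k in edges
            if t == node and k in ("activates", "inhibits")]

print(regulators_of("GENE_A_mRNA", edges))  # [('TF1', 'activates')]
print(regulators_of("EGFR", edges))         # [] -- "binds" is not regulatory
```

A triple like this is close to what the simple interaction file formats used by network tools actually store: one interaction per line, with an interaction type in the middle.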
I have a virus that can infect multiple hosts, or a host that can be infected by multiple different viruses, and you can show those relationships. Yeah, of course. Anytime. Want me to go back? Yeah. From that reference, can we get pointed to, for example — I would be interested in virus-host networks — can we find out more in that reference? Yeah, the question is, does the Barabási review talk about virus-host networks? I believe this figure was stolen out of that review, and it is one of the examples. It's actually a great review of how you build networks and what they can be used for. Marc Vidal is the first author — he's sort of the father of high-throughput physical interaction screens — and Barabási is a physicist who uses network analysis in his work on cancer, among other things. So, you need to get network data from a database, just like you do pathway information. Network databases can be built automatically or via curation. If they're curated, you have a team of curators that goes into the literature and pulls out papers that report physical or genetic interactions. If they're built automatically, the data is usually coming from high-throughput experiments: two-hybrid experiments, proteomic immunoprecipitation experiments, co-expression data. And many network databases use both sources of information. Typically, a network database has more extensive coverage of biological systems. The largest pathway databases cover no more than half the genome; network databases can cover much more, but a lot of the associations that they report are false positives from the high-throughput experiments, so you need to keep that in mind. And the relationships and the underlying evidence are typically more tentative than in the pathway databases, because the nature of the interactions is less well understood. There are, again, like the pathway databases, multiple sources for curated networks. 
Each of the network databases shares with, or steals from, the others, so you have to take some of this with a grain of salt — they're not completely uniformly curated. The three which are recommended here are BioGRID, IntAct, and MINT. Each of them has different curation standards. BioGRID covers multiple organisms but is relatively shallow: it has 529,000 genes and 167,000 interactions. IntAct is much deeper and focuses on human: it has 60,000 genes and 203,000 interactions — you can see there's a dramatic difference in the gene-to-interaction ratio. And MINT is smaller but deeper still: 31,000 genes and 83,000 interactions. A typical interaction database will give you an interface like this one. You search for a gene of interest, in this case P53, or TP53, and it comes up with 7,708 interactions for P53 that have been mined out of the literature. But notice that most of these interactions actually come from published high-throughput experiments — these are a series of co-immunoprecipitation experiments, and this actually came out of supplementary data published in the paper. OK, so often your choice of the source of the data is going to be dictated by the tool that you use. Some tools have been tuned to work with particular data sets; others are more general and will give you the option of populating them with an interaction or pathway data set that you've downloaded, or even one that you created yourself, and will give you some flexibility. OK, so I'm going to talk now about the tools that one uses to implement pathway network analysis. This is a figure from a Nature Methods paper that I helped write with a bunch of other authors from the International Cancer Genome Consortium. It just got accepted over the weekend, so I'm happy about that, after about a year of trying to get it accepted at Nature Methods. 
We broke pathway network analysis down into three different categories. The first is the one that you did yesterday: enrichment of fixed gene sets. In this type of analysis, you start out with a series of predefined, fixed gene sets, such as GO subcellular location terms or GO biological process groups, and then you do an enrichment analysis. You can supplement this with pathway database colorization — so in the example I showed earlier, in which we uploaded a gene list into Reactome, it found overrepresented pathways and then showed you a colorized picture. That's basically what we're showing here. We're not going to talk about fixed gene set enrichment any more, because you're already familiar with it from yesterday. The second type, which is what you're going to focus on in the workshop, is de novo subnetwork construction and clustering. In this style of analysis, you start out with a large interaction network — this is typically done on networks — and present the network with a list of genes that came out of your experiment. The algorithm will attempt to identify non-random groupings of the genes in your gene set, pull them out of the network, and try to build a subnetwork around them which shows how they relate to each other. If your genes or proteins or other molecules are related to each other, it will find non-random clustering and pull it out. If they're just a random gene set, they'll be scattered all around the interaction network and you won't get anything out that makes much sense. What this allows you to do is identify new relationships hidden inside your data set that do not necessarily correspond to classic pathways, and identify subtypes in your data set which relate to biology. 
The last and most sophisticated type of analysis is pathway-based modeling, where the algorithm is actually built around a computational model of how each entity in the network is related to the others. It preserves the catalytic and positive and negative regulatory relationships, and attempts to predict, given a series of experimental observations such as mutated genes or RNA expression changes, what the downstream effects of that combination of alterations will be. These methods attempt to predict the integrated effect of multiple alterations in a quantitative fashion. So what are these three methods good for? Well, enrichment of fixed gene sets should probably always be the first thing you turn to. A typical question it's good for is asking what biological processes are altered in whatever you're studying — in my cancer, or in my perturbed cell line, or in my cases versus controls — and it usually sets you off in the right direction. For de novo subnetwork construction and clustering, you can ask whether new, previously undescribed pathways are altered in this cancer or data set, and whether there are clinically relevant subtypes within the data set: can we distinguish one patient population from another based on which subnetworks are activated or inactivated? And with pathway-based modeling, you can go down to the personal genome level — all my examples are in cancer, sorry. In a particular patient who has a series of mutations and hypermethylation of a gene, resulting in a series of RNA expression and protein changes, what is the downstream integrated effect of all those changes? Can I make predictions about what drugs I can give the patient to take advantage of these changes, or reverse these alterations? So, enrichment of gene sets — this is covered in module three. 
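For reference, the statistic behind fixed gene-set enrichment is usually a hypergeometric test (equivalently, a one-sided Fisher's exact test) on the overlap between your gene list and each gene set. Here is a minimal sketch with made-up numbers; real tools also apply a multiple testing correction across all the gene sets tested.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) where X ~ Hypergeometric(N, K, n):
    N genes in the genome, K of them in the pathway, n in your gene
    list, and k of your list falling inside the pathway."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Toy numbers: a 20,000-gene genome, a 100-gene pathway, and a 50-gene
# hit list with 5 hits in the pathway (expected overlap by chance: 0.25).
p = hypergeom_pvalue(20000, 100, 50, 5)
print(p < 0.001)  # True -- strongly enriched relative to chance
```

With thousands of genes this is exactly the "penalty" calculation mentioned earlier: testing dozens of pathways instead of thousands of genes keeps the corrected significance threshold manageable.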
This was taken from — oh, this is interesting, I didn't fix this since last week when I gave the cancer genome talk. It's the most popular form of pathway network analysis: there are great end-user tools, and the statistical model is very well worked out. The disadvantages are that there are many possible ways of slicing and dicing gene sets, so you have to choose which ones to use, or look at multiple ones. The gene sets are typically heavily overlapping, and when you get an enrichment result, you'll get a series of hits in gene sets which may actually be related to each other, so you have to do another round of inspection or analysis to merge things that are related. You may get, for example, cell cycle and cell cycle checkpoints coming up as two hits, and you have to realize that those probably reflect the same underlying changes. And finally, when you look at gene sets as a series of bags, there may be regulatory relationships within those bags that you're not able to see, such as an increase in an upstream activator of a pathway coupled with a decrease in an inhibitor — they'll both appear in the same bag, but you may not be able to see that without further inspection. So now we're going to talk, in a little more detail, about de novo subnetwork construction and clustering. Basically, you're applying a list of altered biological entities to a biological network and finding topologically unlikely configurations — typically measured by finding clusters of altered genes that are closer to each other in the network than you would expect by chance. There are various ways of measuring this; the main one is the average shortest path among them, which means counting up the number of hops between each pair of genes and then averaging those shortest paths. 
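The average-shortest-path statistic just described can be computed with a plain breadth-first search. Here is a minimal sketch on a toy network with invented node names; real tools compare this value against the same statistic for randomly sampled gene sets to get a significance estimate.

```python
from collections import deque
from itertools import combinations

def shortest_path_len(graph, a, b):
    """BFS hop count between nodes a and b in an undirected graph
    given as {node: set(neighbors)}; None if unreachable."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None

def avg_shortest_path(graph, genes):
    """Average pairwise hop distance among a gene set."""
    dists = [shortest_path_len(graph, a, b)
             for a, b in combinations(genes, 2)]
    return sum(dists) / len(dists)

# Toy network: A-B-C form a tight cluster; Z hangs off the far end.
graph = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"},
         "D": {"B", "Z"}, "Z": {"D"}}
print(avg_shortest_path(graph, ["A", "B", "C"]))  # 1.0 -- tightly clustered
print(avg_shortest_path(graph, ["A", "C", "Z"]))  # larger -- spread out
```

A gene set that really participates in a shared process will score like the first case; a random set will look like the second.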
You can then extract the clusters of these unlikely configurations to make subnetworks, and then annotate them using gene set enrichment to identify which biological processes correspond to the clusters you've found. Clustering is the process of grouping the biological entities into small communities, such that the members of these clusters or communities are more connected with each other than they are with entities outside the cluster. As you would expect, if the interaction network is any good, highly connected proteins share similar properties — typically they'll be members of the same pathway, or members of the same molecular complex, or members of the same set of genetic regulatory interactions. There are a large number of ways of clustering, and this is actually the major difference between different network analysis algorithms. The oldest and most widely used clustering algorithm is one that was developed by Girvan and Newman over a decade ago, originally for finding communities of related users in large social and web networks. It's very accurate, but a little slow. Many algorithms now use a faster but less accurate algorithm called Markov clustering. Both Girvan–Newman and Markov clustering have a problem with ascertainment bias: if you have a gene which has a lot of interactions going into it or coming out of it, like p53, then p53 and other highly connected genes will always form a cluster, and it will come out as a false positive in your data set. And the only reason for this is that p53 has been so heavily studied that there are a lot of interactions in the literature. So there have been a number of corrections applied to this. 
The one that works quite well in practice is an algorithm called HotNet, written by Ben Raphael at Brown University, which models the network like a metallic lattice and then applies heat to different points of the network to indicate gene activity. So if you have done a microarray experiment and you find that a gene is up-regulated, in the model you make the up-regulated gene hot and the down-regulated genes cold, and then the algorithm uses heat diffusion equations to diffuse this activity out across the network. The way it helps with highly connected genes like p53 is that p53 has many, many wires coming out of it, so the heat diffuses away more rapidly — it corrects, to some extent, for the ascertainment bias. It's kind of mysterious why it should work, but it actually produces biologically intuitive results, so people like it. I like it. These three algorithms and others have been incorporated into a large and ever-growing set of clustering tools. The two that I want to point out to you are both implemented as Cytoscape applications, and since you used Cytoscape yesterday, you can start using them now. One is called HyperModules, written by Jüri Reimand, a postdoc in Gary Bader's lab. It identifies network clusters that correlate with clinical or other phenotypic characteristics. So if you are trying to find a cluster of genes which is associated with response to a drug in a patient or a cell line or a mouse, this will weight the clusters towards those that are associated with that phenotype. Very useful. And then the Reactome Functional Interaction Network Cytoscape app, also known as ReactomeFIViz, offers multiple clustering and correlation algorithms, including HotNet, Markov clustering, and Girvan–Newman, designed to work specifically on the Reactome FI network. It's kind of a Swiss Army knife for doing this type of de novo extraction and analysis, and you'll be using it later this morning. 
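The diffusion intuition behind HotNet can be sketched in a few lines. This is emphatically not Ben Raphael's actual implementation (HotNet uses an influence matrix derived from a diffusion kernel, plus a statistical test on the resulting subnetworks); it is just a toy iteration showing why a hub's heat gets split across its many wires.

```python
def diffuse(graph, heat, alpha=0.5, steps=50):
    """Toy lazy heat diffusion on an undirected graph {node: set(neighbors)}.
    Each step, a node keeps (1 - alpha) of its heat and receives alpha
    times the degree-normalized heat flowing in from its neighbors."""
    for _ in range(steps):
        new = {}
        for node in graph:
            incoming = sum(heat[nb] / len(graph[nb]) for nb in graph[node])
            new[node] = (1 - alpha) * heat[node] + alpha * incoming
        heat = new
    return heat

# A hub ("P53") with four neighbors versus an isolated two-node chain.
# One unit of heat is placed on the hub and one on X.
graph = {"P53": {"A", "B", "C", "D"}, "A": {"P53"}, "B": {"P53"},
         "C": {"P53"}, "D": {"P53"}, "X": {"Y"}, "Y": {"X"}}
heat = {n: 0.0 for n in graph}
heat["P53"] = 1.0
heat["X"] = 1.0
out = diffuse(graph, heat)
# The hub's heat is split four ways, so each of its neighbors ends up
# with far less heat than Y, the chain's single neighbor.
print(out["A"] < out["Y"])  # True
```

So a well-studied hub like p53 does not automatically light up its whole neighborhood: the more wires, the thinner the heat spreads, which is the ascertainment-bias correction at work.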
You'll be using it in your exercise, and Robin will be demoing it. So here's a typical output of a network clustering algorithm. We started out with a much, much bigger network; the algorithm extracted the members of the gene list that you gave it, and may also extract other genes which are highly connected with the ones you gave it but aren't actually in your data set. Then each cluster gets marked in a different color — you can draw polygons around them — and annotated according to which biological processes are enriched in it. I highly recommend a particular interaction network, the Reactome Functional Interaction (FI) Network. This is basically a repeat of things that I've said, but it is a good compromise between curated and uncurated gene sets. We're going to skip through this and just talk about how the FI network was constructed. We started out with a series of pathway databases and converted each pathway into a network, based on an extraction procedure that's shown here but that I won't go through, to create a set of annotated functional interactions covering about a third of the genome. Then we added to this a set of high-throughput data sets taken from a number of network interaction databases, the literature, and other high-throughput sources — things like the transcription factor map from ENCODE, a bunch of protein-protein interactions in various species, and the sharing of GO biological process annotations — and used a machine learning technique, a naive Bayes classifier, to remove most of the false positives from the set. Then we took the curated functional interactions and the predicted functional interactions and combined them into a single, very large functional interaction network spanning about half of the human genome. We update this every year; the last release was at the end of 2014. 
There were a total of 11,780 proteins in this data set — about half from pathways and half predicted from high-throughput experiments — covering roughly 58% of the genome, and 336,000 functional interactions, so it is quite a large network. The way you do de novo subnetwork extraction and clustering is: you start out with the functional interaction network, which the tools download for you behind the scenes (this is showing just a little corner of the network; it's much larger than this). You present it with your genes — up- and down-regulated genes from a microarray expression experiment, for example. The algorithm finds where they are in the network and connects them. Then, optionally, if you wish it to do so, it will identify linker genes which interconnect the genes in your gene set, form a network with those, and extract a subnetwork which relates your genes and the linker genes to each other. And you proceed from there. Here's an example of this at work. This is from a couple of years ago, when we started sequencing pancreatic cancers at OICR — these are the first 52 genomes that we did. This is a typical look at somatically mutated genes. There are a couple of genes which are very frequently mutated: in this data set, KRAS is mutated in 90% of patients, P53 in about 60%. And then there's this long, long tail of things that are only mutated once or twice, which goes on for a long way to the right. Looking at this, or doing gene set enrichment analysis, doesn't tell you much. But you can do a subnetwork extraction and clustering to get a picture like this one, and what it's showing pretty clearly is that there are a series of highly interacting modules, each centered around one of those high-frequency driver genes. Here's KRAS, and there are mutations in a whole series of other genes which interact with KRAS. 
And they're interacting with KRAS more frequently than you would expect, and with each other more frequently than you would expect, by chance. This module also includes members of the ERBB, FGF, and EGFR pathways, as well as some unexpected things such as axon guidance. The second-largest module involves a number of infrequently mutated genes, including participants from the Hedgehog and TGF-beta signaling pathways, and so on and so forth. The interesting thing about this is that we're now up to about 450 patients, having done more sequencing and combined our data set with other groups', and we get the same module map that we did when we had 50. So there's a lot of information embedded in this that you can make use of, even at very small numbers. We can then ask, if we map these modules onto patients, are there differences from one donor to another? So what we did is, for each of our donors, we looked at the mutations in that particular donor, mapped them onto the modules, and scored each module according to how many genes in that module were mutated in that patient. And you get this really kind of cool thing out of it. What we're showing is patients going down and modules going across; the color indicates, for each patient, how many genes in that module were mutated in that patient, with some adjustment for the total number of mutations. And we get a very clear pattern of four different tumor types: one which is negative for modules 1, 2, and 10; one which is positive for 1, 2, and 10; one that's positive for 2 and 10 only; and so on and so forth. We've actually looked at these; these are the KRAS-negative patients, and there are a bunch of interesting findings in here. You can then look among these tumor types to see if there's a difference in phenotype. In some cases you find it; in others, you don't.
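The per-donor module scoring just described can be sketched in a few lines: for each patient, count how many genes in each module are mutated, normalized here by the donor's total mutation count as a simple stand-in for the adjustment mentioned above. Modules, genes, and donors are invented for illustration.

```python
def score_modules(patient_mutations, modules):
    """Return {patient: {module: fraction of donor mutations hitting module}}."""
    scores = {}
    for patient, muts in patient_mutations.items():
        muts = set(muts)
        total = len(muts) or 1  # guard against donors with no calls
        scores[patient] = {
            name: len(muts & set(genes)) / total
            for name, genes in modules.items()
        }
    return scores

# Hypothetical modules and somatic mutation calls.
modules = {
    "M1_KRAS": ["KRAS", "RAF1", "EGFR"],
    "M2_TGFB": ["TGFBR1", "SMAD4"],
}
patient_mutations = {
    "donor_A": ["KRAS", "SMAD4", "TTN"],
    "donor_B": ["TGFBR1", "SMAD4"],
}
scores = score_modules(patient_mutations, modules)
```

Arranging these scores as a patients-by-modules matrix and clustering the rows is what produces the heat map of tumor subtypes described in the talk.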
If you look in pancreatic cancer, the main finding is that patients with tumor type 1 have a much longer overall survival, but it's a very rare phenotype, as you can see here. It actually turns out that these are the patients who have mismatch repair deficiencies. But in other cases, such as this example in breast cancer... and I see my picture has disappeared; it was here last week when I gave this talk, which is odd. Anyway, there's supposed to be a picture of the same experiment done on estrogen receptor positive breast cancer, where we're looking at expression levels of genes. ER-positive breast cancer patients typically have a very good prognosis: long survival and a good response to estrogen receptor inhibitors. But a few patients don't do so well. We did an analysis comparing across a series of 500 tumor microarrays and identified a module that's highly variable among the groups. I don't have the picture to show you, but it involves Aurora B kinase plus mitosis-related pathways. There are about 13 genes in the module, and if they have high expression, the patients have much worse survival than those who have low expression of the members of this pathway. In fact, the difference is such that this group of patients has the same prognosis as patients with triple-negative disease, which is the worst subtype. So we found a very strong biomarker of prognosis based on a network analysis; at the time this was published, it was the strongest such biomarker that had been found, and it was not obvious looking at one gene at a time. OK, the last topic we're going to talk about is pathway-based modeling, where, again, you apply a list of altered genes, proteins, or RNAs to biological pathways. These algorithms attempt to keep the detailed regulatory information that's present in the pathway database.
And unlike any of the other techniques, these modeling techniques allow you to take different types of molecular alterations and look at their effects together. So, for example, you could profile a patient population in which you have genomics data, expression data, copy number changes, methylation changes, and small RNA changes, put them all into the model, and have it predict how these changes together are going to affect pathway activities. This is where pathway modeling starts to shade into systems biology and cell modeling. This field has a long heritage. It goes back 40 years, to differential equation models of bacteria and yeast growing in fermentation chambers, where people used enzyme kinetics, Kms and Kis, to model the rate of consumption and production of metabolites. There's an online system that you can use called CellNetAnalyzer that works quite well, but it's really designed for biochemical systems, not for higher-order signaling or cell regulatory pathways. Also, these differential equation models don't work very well when you get more than a couple of dozen genes or proteins involved; they're no longer computationally tractable. The second class of pathway-based models are network flow models, which model information flow through a pathway or pathways. These are best developed for the analysis of kinase cascades: NetPhorest and NetworKIN are both Denmark-based web services designed for analyzing protein kinase and phosphorylation data sets. Then there's a large class of network-based reconstruction methods aimed at identifying the transcriptional regulatory hierarchy from sets of expression arrays. These enable you to find master regulators: for example, in a system of perturbations, you can identify the top-level transcription factor which regulates all the changes below.
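As a minimal example of the kinetic-modeling style mentioned above, here is substrate consumption by a single enzyme under Michaelis-Menten kinetics, integrated with forward Euler. The parameter values are arbitrary illustrations, not taken from any real system.

```python
def simulate(s0, vmax, km, dt=0.01, t_end=20.0):
    """Integrate dS/dt = -Vmax * S / (Km + S); return final substrate level."""
    s = s0
    steps = int(t_end / dt)
    for _ in range(steps):
        s += dt * (-vmax * s / (km + s))
        s = max(s, 0.0)  # concentrations cannot go negative
    return s

# Start with substrate at 10 units; with Vmax = 1 the enzyme works near
# saturation until S approaches Km, so the substrate is essentially gone
# by t = 20.
s_final = simulate(s0=10.0, vmax=1.0, km=0.5)
```

Each gene or metabolite adds another coupled equation of this kind, which is why, as noted above, these models stop being tractable beyond a couple of dozen species.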
And the exemplar of this is ARACNE, from Andrea Califano's group at Columbia University. It's specifically designed for the case where you have RNA-seq or array data across the same system with at least 100 perturbations: for example, B cells that have been treated with 100 different drugs, with expression profiling done on each. You can present that to ARACNE, and it will identify the transcription factors which are driving all the variation across those perturbations. And lastly, the most recent type of models are probabilistic graphical models, which attempt to build up what is called a Bayesian network, where you have a series of nodes, molecules which influence each other, and the method attempts to propagate the influences from the top down. The most advanced version of this is a system called Paradigm, developed by Josh Stuart at the University of California, Santa Cruz, and widely used for cancer analysis at this time. We're going to talk about Paradigm in a little more detail, because it's been built into a Cytoscape app. PGMs are designed to allow you to integrate multiple omics data types onto the same pathway. So you can take copy number changes, RNA expression changes, mutations, and proteomics data, and you can explicitly look at activated (phosphorylated) versus inactivated (non-phosphorylated) versions of proteins. It will integrate these data together, make predictions about which pathways are changed, and tell you whether they're up-regulated, down-regulated, or not changed. Then you can take those pathway activities and relate them to the clinical or experimental phenotypes of interest. Here's a simplified view of how Paradigm works. It starts out with a traditional pathway: here's p53, and it has a negative regulator, MDM2. Downstream of p53, there are multiple other steps, and finally it leads to apoptosis. Paradigm explicitly models the central dogma.
So for MDM2, there's the MDM2 gene. It drives production of the MDM2 RNA, which drives production of the MDM2 protein, which then is activated to become the active protein, which then negatively regulates p53, which again goes from gene to RNA to protein to active protein, which eventually leads to apoptosis. Each of these steps has a weight associated with it: essentially, the probability that a change in this entity will result in a corresponding change in the downstream product. Using this model, you can model a whole bunch of things. You could have a mutation in the MDM2 gene, which will reduce the weight. You can have a change in the amount of the RNA, which will propagate down the chain. You can have a copy number change in TP53; it could be deleted. You can have some mass spec information that indicates that the protein has changed. And by using the model, it will integrate them together and tell you: when you have all these changes, is apoptosis increased, decreased, or not changed very much? And this actually works quite well. Here's an example from a cancer analysis of glioblastoma multiforme. Similar to the pancreatic map that I showed you before, these are different patients, and each row here is the predicted activity of one of the genes in the model. And you can see clear differences from one patient to another. Here's a group of patients who have a decrease in GATA/interleukin activity. If you were to look at the individual mutations or RNA expression changes, you'd actually find a lot of heterogeneity in here, but by a variety of mechanisms, these changes all lead to the same decrease in GATA/IL-2 activity. Similarly, they have an increase in EGFR. So it's a very nice general technique, and quite powerful. There's good and bad news about Paradigm. The bad news is that it's distributed in source code form, it's hard to compile, and it doesn't provide you with any pre-formatted pathway models.
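The central-dogma chain just described can be caricatured in a few lines. This is a deliberately simplified, hypothetical sketch: real Paradigm performs inference on a factor graph, whereas this toy just multiplies a signed activity value (think log-ratio) through weighted gene → RNA → protein → active-protein steps, with the MDM2 active protein repressing p53. All weights and input values are invented.

```python
def propagate_chain(gene_level, weights):
    """Push an activity value through weighted central-dogma steps."""
    level = gene_level
    for w in weights:  # gene->RNA, RNA->protein, protein->active protein
        level *= w
    return level

def apoptosis_activity(tp53_gene, mdm2_gene, weights):
    """MDM2 active protein represses p53; active p53 drives apoptosis."""
    mdm2_active = propagate_chain(mdm2_gene, weights)
    tp53_active = propagate_chain(tp53_gene, weights) - mdm2_active  # repression
    return tp53_active

weights = [0.9, 0.9, 0.9]  # arbitrary per-step attenuation

# TP53 deleted (negative copy-number signal) and MDM2 amplified (positive):
# both alterations push the predicted apoptosis activity down.
activity = apoptosis_activity(tp53_gene=-1.0, mdm2_gene=+1.0, weights=weights)
```

The point of the sketch is the integration: two different data types (a TP53 copy-number loss and an MDM2 amplification) enter at different nodes, and the model combines them into a single prediction about the downstream output.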
You have to figure out how to put them in yourself, documentation is almost non-existent, and it takes a long time to run. The good news is that over the last year, the Reactome team has put together a Cytoscape app that incorporates Paradigm. It's in beta testing now, and it's still under active development, but it automatically downloads and uses Reactome pathway-based models, and we've improved the performance of the original algorithm so you can now use it interactively. So the tool that Robin will be leading you through today, ReactomeFIViz, in addition to the subnetwork extraction and clustering you'll be doing in your workshop exercise, also has experimental Paradigm support. So that's the end of the lecture. I've given you a whole bunch of URLs here, and look for a review article covering this in Nature Methods coming up sometime in the next few weeks. I'm happy to answer any questions. Yes, Stephen? Yeah, maybe it's totally irrelevant, but with the big network that you put up before, I noticed that axon guidance came up, and it made me wonder, because I've seen axon guidance come up in a couple of different experiments. It makes me wonder if axon guidance is actually relevant there, or if there are a whole lot of genes that have been documented as related to axon guidance that actually have a whole lot of other functions we don't know about yet. I'm just wondering how much influence that has on this type of analysis. Yeah, so just to repeat for the recording, the question is: what is the relevance and meaning of axon guidance as module five for pancreatic cancer? Well, this first came up quite a few years ago, and I was quite surprised at axon guidance; I actually argued with my co-authors that we just couldn't call this thing axon guidance, because the same genes are also involved in other biological processes, including angiogenesis.
But it's driven really by Robo/Slit interactions, which are key for chemotaxis. And it actually turns out that there's a correlation between mutations in this pathway and metastasis and tumor cell motility. You can actually show that you can cause pancreatic cell lines to become more motile by knocking out certain components of this pathway. So I think that in this case, the main problem is that the name of the pathway is too specific. It's really not axon guidance but something more like cell chemotaxis or organ organization; something along those lines would probably be a better way of describing it. Axon guidance genes have also come up in quite a few other tumors in recent years, so it's a real observation. Yeah, Roger? So we have a lot of tools that were discussed in one lecture. For ordinary users, not expert practitioners, would you recommend that we try a bunch of them? We'll probably get different answers, so how do you handle this? Do you just look at what's common between your findings? So the question is, what do you do when there are multiple tools and they're giving you different answers? In my lab, our standard operating procedure is that we try several tools, and then, first of all, we try to validate the results before we put any confidence in them. And if we seem to be getting equally valid partial answers from each one, we then take the union or the intersection, depending on the circumstance. It's not very systematic. The systematic way is to benchmark everything against a gold standard set, but particularly in this field, network analysis, it's usually not possible to do that. OK, thank you very much. I'll be hanging around for about half an hour.