Hi, everybody. I'm Lincoln Stein. I'm the director of informatics and biocomputing at the Ontario Institute for Cancer Research, up the way, in the same place that Michelle is from. And I'm going to be talking about network and pathway analysis, building on the material that Quaid and Wyeth have given you earlier. After this lecture, Robin Haw, who is right here, is going to lead you through a tutorial on using some of the tools I'll talk about. And I will be here for a little while to answer questions, and then I have to sneak out because of a combination of academic and child care responsibilities. So, the obligatory Creative Commons, open access, open source, et cetera. So do I have a laser pointer? I do. This place is really well equipped. So you've previously talked about various types of functional pathway analysis. You've had lectures on overrepresentation analysis. This is what the DAVID tool runs on. It uses the Fisher exact test, in which you take a set of differentially expressed genes, partition them into the up-regulated and down-regulated groups, and then compare them to a set of reference genes drawn from pathways to see if there's an overrepresentation of, say, your down-regulated genes in those pathways. Another way of doing something similar is called functional class scoring. You'll have heard of this as GSEA, where you don't need to pre-select the cut-off point, but instead you rank your genes and it identifies the background for you. The third class, which we're going to touch on, is pathway topology tests. Each of the first two systems is based on the idea of taking the whole gene space and creating arbitrary bins. So one gene is in DNA repair, another gene is in apoptosis, a third gene is in mitotic regulation. You view your data in that context. The problem with that is that it's a very naive way of thinking about the genome.
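As a rough illustration of the overrepresentation test mentioned above, here is a minimal sketch of a one-sided Fisher exact (hypergeometric) p-value. The function name and all the gene counts are made up for illustration; this is not how DAVID itself is implemented, only the underlying arithmetic.

```python
from math import comb


def overrepresentation_p(n_background, n_in_pathway, n_selected, n_overlap):
    """One-sided Fisher exact (hypergeometric) p-value.

    Probability of seeing n_overlap or more pathway genes when n_selected
    genes are drawn at random from n_background genes, of which
    n_in_pathway belong to the pathway.
    """
    max_overlap = min(n_in_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20 reference genes, 5 of them in the pathway,
# 4 down-regulated genes selected, 3 of which fall in the pathway.
p = overrepresentation_p(20, 5, 4, 3)
print(f"p = {p:.4f}")
```

In practice the reference set is the full list of measured genes, and the p-values need correcting for the number of pathways tested.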
In fact, every gene has multiple functions and can sit in several of these bins at the same time. And there are relationships between those bins. One up-regulates the other, another is an inhibitor, another is a catalyst for an essential step. And so if you have a map of the interactions among genes and the nature of the interactions, you can get additional information. So DAVID, GSEA, and then we'll be talking about the Reactome Functional Interaction Network and a tool called PARADIGM. So this is basically what I've already said. The advantage of pathway analysis is that you can draw on these large curated knowledge bases of biological pathways. Pathway databases are usually curated. They provide a biochemical view of biological processes, similar to intro biochemistry, with detailed diagrams of each step along the way from one substrate to another. They capture the cause and effect, the detailed biological information, the mechanisms, and they can provide you with human-interpretable visualizations of your data set in a pathway context. The disadvantage of pathway databases is that they provide sparse coverage of the genome. In general, curated pathway databases don't cover more than about a third of the gene space. And because these databases are created by people, and people never agree with each other, the different databases disagree on where one pathway starts and another pathway ends. But you're very familiar with some of these databases, I'm sure. How many people here have not seen or used KEGG? Excellent. So as you all know, KEGG is a heavily curated and very rich database of biological pathways in human and many, many other species; it actually started out as a prokaryotic database. It focuses on the enzymatic aspects of biology. So in this diagram, for example, which is part of sugar metabolism, it's showing all the enzymes as EC numbers.
But it also does cover higher-level pathways such as cell division or tyrosine kinase cascades and signaling. Reactome is the database that Robin and I work on. It's about 10 years old. It has a slightly different focus from KEGG. It starts out with human high-level pathways, such as signaling pathways, and it stays at the human level. We don't cover other organisms. Reactome uses a common formalism for drawing its pathways called SBGN, Systems Biology Graphical Notation, in which every reaction is represented as a node in the diagram. And then there is a series of inputs into that reaction and one or more outputs. It's used to represent things as diverse as an unphosphorylated sugar molecule and ATP coming into a reaction catalyzed by an enzyme, with the phospho form of the sugar coming out. Or it can be used to describe the movement of an extracellular molecule from the outside of the cell to the inside of the cell, or post-translational modifications. It's a pretty general way of describing biological processes. You probably can't read it at this resolution, but you can just follow along the steps in a series of biochemical reactions in a way that's very typical of reading textbooks. So, Reactome is hand-curated, and it focuses on human. It's very rigorous, which is why it's also a very slow and painstaking process. Every reaction in Reactome is traceable to the primary literature. So you can go from our website and step right into the PubMed entry that describes the experiment or experiments which prove this reaction occurs in human cells and what the characteristics of that reaction are. We do not try to annotate pathways in non-human species, but we do an automatic projection via orthology onto other species, so you can actually download pathways in non-human species, although they don't have the same accuracy as the human pathways do.
And we provide a series of user interface tools. We give you Google Maps-style reaction diagrams that you can zoom in and out of. You can overlay your own data on top of these diagrams. You can also overlay other people's data, such as interactions with small molecules from drug databases or protein-protein interactions from proteomics databases. You can upload a gene list and find pathways that contain genes from your list. You can do a simple gene overrepresentation analysis among Reactome pathways, although there are many better tools for doing that outside Reactome. And you can give it a pathway in human and find corresponding pathways in other species. The most important thing about it is that it's open access, meaning that we encourage people to download the entire data set, incorporate it into their own tools, build on top of it, and republish it. And many people have used Reactome as a source for more sophisticated tools. One example of this that I want to draw your attention to is the Pathway Commons project at Memorial Sloan Kettering, in Chris Sander's group. This is a resource that was developed in response to the problem of proliferating pathway databases which don't agree with each other and use different representations and data models. What Pathway Commons does is enforce a common data model on each of the pathway databases using a data language called BioPAX. Thank you. I was saying biopsych. What is it? BioPAX. And so on a regular basis, each contributing database, there are nine of them at the current time, dumps all its data into BioPAX format. It gets taken up by Pathway Commons and then you can search through it. So you can put a gene name in here and it'll give you all pathways from all pathway databases that contain your gene, and you can compare them directly and decide which pathway is most suitable for the type of research you're doing.
So what kind of analysis can you do on top of pathways? Now, the ultimate type of analysis is systems biology. So if you have the binding constants for each of the proteins and small molecules that are interacting, and if you have the kinetic constants, you can take a pathway diagram and build a kinetic model consisting of a set of partial differential equations, or use a methodology called Boolean modeling, and predict exactly what will happen when you increase the expression level of a gene, or knock out a gene with a mutation, or inhibit part of the pathway with a small molecule inhibitor or a drug. And that's the ultimate kind of modeling one can do. The problem is that in the vast, vast majority of cases, there's not enough information to do systems biology with pathways because we don't know the binding kinetics. We don't know the rate constants. We don't even know simple things like the concentration of salt in the cellular compartment in which the reaction is taking place. And systems biology is getting there, but it doesn't do a good job yet at modeling the movement of molecules from one part of the cell to another, which is a major feature of most interesting reactions. So we're left with a set of much simpler and more naive tools. And the simplest and most naive is pathway colorization. And you can guess what it is. You upload a gene list. The database finds all the pathways that contain genes from your list. And then it gives you a diagram in which it has colored the parts of the diagram where your genes are. You can see where they are in relationship to each other and maybe make a guess at what common mechanism links the members of your gene set. There's simple colorization, where it just marks your genes in red if they're contained in the list. Or most databases offer a heat map scale, so that you can give each gene and its expression value, and it will give you a graded picture.
And if you have a time course, some will give you a little movie of how the pathway changes. So here's an example of pathway colorization from Reactome. We have here a list of genes which are mutated in glioblastoma multiforme. It's from a TCGA publication a number of years ago. It's in a standard format. We upload it. And then Reactome does an overrepresentation analysis and gives you a list of all the pathways that are overrepresented in that gene set, along with a p-value. And then you can step into this, click on it. So we can see that signaling by PDGF has the smallest p-value. So we click on that. And that shows you a pathway diagram in which it has colorized each of the genes that was on the list according to some color scale that you give it. In this case, it was the severity of the mutation, and it shows you how they're related to each other. The ugly black squares are actually complexes which contain multiple genes. If you mouse over them, it shows you the color scores for the members of that protein complex. So this is fun and can give you useful information. It can give you figures that you can use in talks and papers. But it's very limited. And one of the big reasons it's limited is that most of the genome isn't in the pathway databases. So if you have a list of 500 genes that you're interested in, you'll be lucky if 150 of them are actually in any of the pathway databases. So the rest of them are terra incognita. And that's because pathway databases only capture the well-understood portion of biology that's been published in peer-reviewed journals and has been confirmed. So in order to apply pathway analysis to the rest of the genome, you need to start reaching into less well-understood relationships.
And you get into the domain of high-throughput data: genetic screens, phenotypic screens, co-immunoprecipitation, physical interactions like yeast two-hybrid, as well as conceptual links like sharing of GO terms or adjacency in pathways mentioned in the literature. So we're going to talk about networks. Networks are very different from pathways. In a pathway, we have the precise topology of the relationships. We know the nature of each interaction and what it does. In networks, we have much looser relationships between two genes or proteins. So we're going to spend a little while talking about networks. For example, a typical network that you might work with is a network of protein-protein interactions. And tomorrow, you'll get a great talk from Quaid on GeneMANIA, which is the uber-database of gene and protein interactions. A protein or gene network typically consists of a series of vertices, also known as nodes, and a series of edges, which connect the nodes by some relationship. That's very different from the reaction diagram that I showed you before, where the node is actually a reaction, and proteins and other molecules feed into and come out of the reaction. So these are always bimolecular. There's only one edge between every two nodes. One thing that you run into in networks is cycles. A cycle is a closed path through two or more nodes. And you can have different kinds of edges. You can have undirected edges, such as this one. So gene A interacts with gene B, but they're peers of each other. Or you can have directed edges. So for example, if I'm talking about proteases, I can say that gene A is a protease that cleaves gene B. Or gene A is an activator of B. There's directionality there. And I can also have weighted edges. We can talk about confidence scores. I'm very confident that this is a real interaction because it has a score of 10. I'm not so confident of that one because it has a score of 7.
Of course, the meaning of any of these weights or directions depends on the context. So to map biology to a network, you need to choose what the nodes mean. You need to choose what the edges mean. And you need to know what the directionality, if any, means and what the weighting, if any, means. And you can use this as a very general way to represent protein-protein interactions, regulatory relationships, genetic interactions (so gene A is a suppressor of gene B), protein sequence similarity, pretty much anything. You'll see lots of networks thrown around, and it's always critical to understand how a given network maps the biology. So I'll show you a few examples here. Here is a proteomics experiment published about a decade ago in which a group did pull-downs of protein complexes in Saccharomyces cerevisiae. And they did mass spec to dissect the components of each complex. What you're seeing here is a network of 500 or so genes. Each node is a protein in this case. And each edge connecting them indicates that those two proteins were co-immunoprecipitated. So they're in the same complex. Here is something that is similar, but it represents sequence similarity. So this is all of UniProt, in which they did a BLASTP between each protein pair. And they connect them together if they have a high protein similarity score. And you start to see protein families coming out. They're color coded in different ways. It's a representation which is very similar to the protein-protein interaction network, or the protein complex network, but it means something quite different. So here we're looking over evolutionary space. And here we're looking at something that exists in a current cell. So here are a few more network concepts that are useful to know. Going back to our abstract picture here, the degree of a node is the number of edges that connect to it. So this node is degree 4 because it has four edges coming out of it.
This one's degree 2 because it has two. The shortest path is very simply defined: if you choose any two nodes in the network, what is the shortest path that will get you from one to the other? So the shortest path for this pair happens to be two hops. In a social network, it's between five and six hops. You've heard of six degrees of separation. In the internet age, it's about five. I can send a letter to anybody else in the world through five intermediate people. You just have to know who those people are. Then there's this interesting concept of betweenness, which is harder to understand. Betweenness is a measure of how important a node is for the connectedness of the entire network. And the way to explain it is that every two nodes have a shortest path that connects them. The betweenness of a node is the number of shortest paths that go through that node. So you take every pair of nodes in the network, find the shortest path between them, and every time that shortest path crosses a node, you bump up that node's count by one. So this fellow here, who has degree four, has betweenness six because six shortest paths pass through it. Here's another one that has degree four, but he's not as popular. He only has a betweenness of five. And the interesting thing about high-betweenness nodes in a biological context is that these tend to be the bottlenecks in information flow. These are the regulatory molecules. They interact with a lot of other genes or proteins. And a lot of the information processing passes through them. So they tend to be highly enriched for lethal genes. And they also tend to be interesting genes that pop up in cancer screens and a lot of other disease processes. They're also targets for drug interventions. So the final concept here is the idea of the scale-free nature of biological networks. How many people have heard about this? Yeah, OK. So there are many different ways that you can link up a network.
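The three measures just described (degree, shortest path, and betweenness) can be sketched on a toy undirected network. The node names and edges below are invented, not the network on the slide. One assumption goes beyond the talk: when several shortest paths tie between a pair of nodes, this sketch splits the credit evenly among them, which is a common convention.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Return (dist, n_paths): hop distance from source to each reachable
    node, and the number of distinct shortest paths to it."""
    dist = {source: 0}
    n_paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                n_paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                n_paths[v] += n_paths[u]
    return dist, n_paths


def betweenness(adj):
    """For each node, count the shortest paths between other node pairs
    that pass through it (tied paths share the credit)."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):
        dist_s, paths_s = info[s]
        if t not in dist_s:
            continue  # s and t are not connected
        dist_t, paths_t = info[t]
        for v in adj:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            if dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += paths_s[v] * paths_t[v] / paths_s[t]
    return score


# Toy network: A is a hub, and F hangs off E.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("E", "F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

degree = {node: len(neighbors) for node, neighbors in adj.items()}
dist_from_b, _ = bfs_counts(adj, "B")
scores = betweenness(adj)
print(degree["A"], dist_from_b["F"], scores["A"], scores["E"])
```

In this toy network the hub A has both the highest degree and the highest betweenness, while E has low degree but still carries every path that reaches F.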
You could start with 10 nodes and just add edges randomly between pairs. And that would give you a network like this. Or you can use a rule in which the chance of adding a new edge to a node is proportional in some way to its degree, so that more connected nodes get even more connected. And that gives you something which is much more bunchy, like this. It's called a scale-free network. Or you can have a rule where you choose one node, which is kind of the super node, the controller. It has the highest degree. And then there's another tier outside that where there are only a tenth as many connections. And this continues. And that gives you what's called a hierarchical network. Now, these have different properties. The one that's easiest to explain is the relationship between the degree and the probability of nodes of that degree occurring. In a random network, you get a bell-shaped curve around the mean degree. So k here is the degree, and P(k) is the probability of a node of that degree occurring, basically the number of such nodes. And so you get a nice peak. In a scale-free network, you get something quite different. You have an exponential relationship between the degree of a node and the probability of that node occurring, so that you get lots and lots and lots of nodes that have a low degree and very, very, very few nodes that have a high degree. So here we've actually plotted the degree on a log scale. You can see a linear relationship. A hierarchical network will have the same properties, although there will be this bunchiness in the relationship because of the rule for how we hook up the different tiers of nodes. So you can distinguish between a random network and either of the other two by a graph of degree versus probability. How do you distinguish between a scale-free network and a hierarchical network?
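To make the first two wiring rules concrete, here is a small sketch that builds a random network and a "rich get richer" network and tabulates how many nodes have each degree. The second rule is often called preferential attachment. The network sizes, the one-edge-per-new-node growth scheme, and the function names are assumptions chosen for illustration; the hierarchical rule is not covered.

```python
import random
from collections import Counter


def random_network(n_nodes, n_edges, rng):
    """Add edges between uniformly chosen distinct node pairs."""
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n_nodes), 2)
        edges.add((min(u, v), max(u, v)))
    return edges


def preferential_network(n_nodes, rng):
    """Grow a network one node at a time. Each new node attaches to one
    existing node chosen with probability proportional to its degree."""
    edges = {(0, 1)}
    endpoints = [0, 1]  # every node appears once per edge it touches
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.add((target, new))
        endpoints.extend([target, new])
    return edges


def degree_distribution(edges):
    """Map each degree k to the number of nodes that have that degree.
    Nodes with no edges at all are not counted."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


rng = random.Random(0)
random_dist = degree_distribution(random_network(200, 199, rng))
scale_free_dist = degree_distribution(preferential_network(200, rng))

for name, dist in (("random", random_dist), ("preferential", scale_free_dist)):
    print(name, sorted(dist.items()))
```

Plotting the two tables with the degree on a log axis is one way to reproduce the kind of comparison described above; no particular output is claimed here because it depends on the random seed.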
Well, it turns out that you can cluster the nodes using community clustering algorithms that have been worked out for the World Wide Web and other places, where you basically find little communities of nodes which are interacting with each other. And if you graph the degree of a node versus the clustering around that node, in both a random network and a scale-free network you actually get a linear relationship. Your neighborhood remains pretty much the same no matter what your degree is. In a hierarchical network, however, you get an inverse relationship: as the degree increases, its clustering gets tighter and tighter. And you can actually see that visually here. The reason that this discussion is important is that biological networks turn out, by these criteria, to be scale-free networks. There are a very small number of genes that have a large number of connections. And these are choke points in the biological network. You remove them, and the network falls apart. A large number of genes have a very small number of connections. They tend to be leaves. You remove them, and nothing much happens to the network. The genes cluster quite a lot because of this property. And when you look at what's happening biologically, these clusters tend to be functional pathways. They tend to correspond to pathways that are known and pathways that are not known. And the clusters themselves are scale-free, meaning that there are lots and lots of small clusters of just a few genes that are working together with each other. And then there are a few large mega-clusters, like the one around p53 that encompasses hundreds or thousands of genes. Before I move on, any questions about the network jargon? More properties, or any of that? OK, great. So we had pathway databases. Now we're going to talk about network databases. Somebody has to put these together. So there are more network databases than there are pathway databases.
And part of that is due to the fact that you can build them automatically using computation. There are also curated network databases. Popular sources of curated networks include BioGRID, with about 600,000 genes in it, IntAct, with 60,000 genes, and MINT, with a smaller number of genes but a lot of interactions relative to the size of the database. You can see this really reflects curation priorities: whether you're going to go broad like BioGRID, with lots and lots of genes but relatively few interactions for the number of genes, versus IntAct, which has fewer genes but more interactions than BioGRID does. The nice thing about bimolecular databases is that it's very easy to combine them. You can take all of these and put them together, and you'll have one big network that has information from them all. But you have to be careful, as we'll see later. Then there are ways of building network databases without curation. So you can take text mining approaches. One very popular thing to do in the early 2000s was to take all the abstracts from PubMed, or the whole corpus of article bodies from PubMed, identify the gene names, and find an association score between them. If they occur in the same paragraph or the same sentence, then they're probably related in some way. And you can build up a big network of related genes. It's much faster than hand curation, but it's not perfect. For example, if you see Hedgehog mentioned in a paper, are you talking about the gene Hedgehog, or the protein family Hedgehog, or are you talking about the hedgehogs that live in hedgerows and eat ants? To resolve this problem, you have to do natural language processing, but this is difficult. Google can probably do it, but most academics can't. Popular resources built on top of these approaches include iHOP, which is a fantastic resource that lets you hop from one literature reference to another, following the gene links, and PubGene, which is a part of NCBI.
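The sentence-level co-occurrence idea can be sketched in a few lines. The abstracts below are invented sentences, the gene list is hypothetical, and real systems such as iHOP do far more than this. The sketch simply counts how often two names share a sentence, and it deliberately reproduces the hedgehog ambiguity described above.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_network(abstracts, gene_names):
    """Count, for each gene pair, the sentences in which both names appear.
    Matching is a naive case-insensitive word lookup with no disambiguation."""
    names = {g.lower(): g for g in gene_names}
    counts = Counter()
    for text in abstracts:
        for sentence in re.split(r"[.!?]+", text):
            words = set(re.findall(r"[A-Za-z0-9]+", sentence.lower()))
            found = sorted(names[w] for w in words if w in names)
            for pair in combinations(found, 2):
                counts[pair] += 1
    return counts


# Made-up text; the last sentence is about the animal, not the gene.
abstracts = [
    "TP53 interacts with MDM2. MDM2 degrades TP53 in unstressed cells.",
    "KRAS signals through BRAF. The hedgehog in the garden ignored the KRAS poster.",
]
genes = ["TP53", "MDM2", "KRAS", "BRAF", "HEDGEHOG"]

counts = cooccurrence_network(abstracts, genes)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The spurious HEDGEHOG/KRAS edge shows why naive word matching needs the natural language processing step the talk mentions.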
Other ways you can make a network database of interactions are high-throughput experimental screens. Yeast two-hybrid protein interactions are a very popular source. Protein complex pull-downs with mass spec, as you saw in the yeast example, are another one. Genetic screens, such as synthetic lethals and enhancer-suppressor screens, are also a popular thing to do. Again, none of these sources of information is perfect. In particular, yeast two-hybrid interactions have taken the proteins out of the human cell, if that's what you're interested in, and put them in a different context in yeast. And so you may get interactions which occur in yeast that don't occur in the human cell, or vice versa. And even if you do see a physical interaction, it doesn't necessarily mean that it's a biologically important functional interaction. For example, actin comes up as a major interactor in all these screens because it happens to be a very sticky protein. And probably not all those interactions are real. The same thing happens with protein complex pull-downs: actin comes up all the time. And genetic screens are highly sensitive to changes in the genetic background. If a gene is a suppressor in one Drosophila strain or yeast strain, it may not be a suppressor in another because of other interacting genes in the genetic background that you don't know about. So what does one do to reduce the error rate in big network databases so that we can actually get useful information out of them? You'll hear tomorrow about Quaid and Gary's approach with GeneMANIA. And I'm going to give you some simple examples of doing this and then talk about my own group's research on the problem. So basically, the way to reduce the noise in these big networks is to use multiple sources of evidence: to upweight interactions that have several different independent sources of evidence pointing to them and downweight ones which only came up in one screen or another. Or to add external information to the problem.
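The simplest version of "use multiple sources of evidence" is to count how many independent sources report each pair and keep only the well-supported ones. The sketch below assumes undirected pairs and an arbitrary threshold of two sources. The source names and protein labels are placeholders, and this is not the GeneMANIA method.

```python
from collections import defaultdict


def score_by_support(sources):
    """Map each unordered protein pair to the set of sources reporting it."""
    support = defaultdict(set)
    for source_name, pairs in sources.items():
        for a, b in pairs:
            support[frozenset((a, b))].add(source_name)
    return support


def confident_pairs(support, min_sources=2):
    """Keep pairs reported by at least min_sources independent sources."""
    return {pair for pair, names in support.items() if len(names) >= min_sources}


# Placeholder screens: only the A-B pair is seen more than once.
sources = {
    "two_hybrid": [("A", "B"), ("A", "C")],
    "pull_down": [("B", "A"), ("C", "D")],
    "text_mining": [("A", "B")],
}

support = score_by_support(sources)
kept = confident_pairs(support)
print(sorted(tuple(sorted(pair)) for pair in kept))
```

A plain count treats every source as equally reliable, which is the limitation that weighted schemes are meant to address.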
So a simple example that worked very well for filtering these yeast two-hybrid interactions is the concept of a party hub, which Marc Vidal's lab came up with at Dana-Farber in Boston. They noted that their initial yeast two-hybrid screens had probably about a 40% false positive rate. So 40% of the interactions they called were not real interactions: when they tried to validate them, they did not validate. They came up with this idea of party hubs, in which you look for partners that are expressed at the same time and in the same place in the cell. So if they're both expressed in muscle cells in the Golgi apparatus, then they're more likely to interact than if one gene is expressed in muscle and the other gene is expressed in liver, so they're never co-expressed, or if one gene product lives in the Golgi and the other lives in the nucleus. Those are likely to be false positives. And that very simple filtering will reduce the number of false positives from 40% down to about 10%. A more complex way to do it is to take many different sources of curated and uncurated evidence and combine them in some intelligent way. And that's what the Reactome FI network is. When we started building this network, it was really in response to my group's difficulty in using Reactome to analyze cancer data sets, because there wasn't enough coverage of the genome. When we started building this, we were at version 35, and there were only 5,000 proteins in about 1,000 pathways in Reactome. It's a lot, but it's only covering 25% of the genome. So three-quarters of our cancer screens were coming up with nothing. So we wanted to expand Reactome's coverage. And so what we did first was take all the curated pathway databases, or a subset of the better ones, and combine them together. So we took Reactome, we took the National Cancer Institute's Protein Interaction Database, the PANTHER system, KEGG of course, and several others, and from these we extracted all the interactions.
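The same-time, same-place filter described earlier in this passage can be sketched as a simple set intersection. The protein names, tissue and compartment annotations, and the rule "share at least one tissue and at least one compartment" are assumptions made for illustration, not the Vidal lab's actual procedure.

```python
def filter_interactions(pairs, tissue, compartment):
    """Keep pairs whose two proteins share at least one tissue of expression
    and at least one subcellular compartment."""
    kept = []
    for a, b in pairs:
        same_tissue = tissue.get(a, set()) & tissue.get(b, set())
        same_place = compartment.get(a, set()) & compartment.get(b, set())
        if same_tissue and same_place:
            kept.append((a, b))
    return kept


# Hypothetical annotations for four proteins.
tissue = {
    "P1": {"muscle"},
    "P2": {"muscle", "liver"},
    "P3": {"liver"},
    "P4": {"liver"},
}
compartment = {
    "P1": {"Golgi"},
    "P2": {"Golgi"},
    "P3": {"Golgi"},
    "P4": {"nucleus"},
}
candidate_pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]

plausible = filter_interactions(candidate_pairs, tissue, compartment)
print(plausible)
```

Here P1-P3 is dropped because the two are never expressed in the same tissue, and P2-P4 is dropped because they sit in different compartments.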
So we took these complex pathway diagrams and turned them into a big network of bimolecular interactions, but each edge is curated and it has a direction. Usually we know what the role of the protein is: one is activating another or inhibiting the other. Okay, and so that got us up to about 30% of the genome. And then we took a large number of uncurated high-throughput screening sets. We took human protein-protein interactions from several of the protein interaction databases that I talked about before. We also took protein-protein interactions from fly, worm, and yeast and translated them via orthology into human protein-protein interactions. We took text-mined protein interactions from a resource called GeneWays. We used gene co-expression in order to capture the temporal co-expression of the genes. We took GO annotations and we took Pfam interactions. And for each protein pair in the genome, we gave it a weight indicating how many of these independent sources of evidence were supporting that interaction. In order to do this in an intelligent way, we used machine learning. We used a naive Bayes classifier, which is the very simplest kind of machine learning system, in which we learned how to weight each piece of evidence by training the classifier on a set of curated interactions taken out of this. And we trained and tested and trained and tested until we had tuned the thing to have a false positive rate of about 1%. So we wanted a very accurate functional interaction network. And this gave us a network of about 11,000 proteins corresponding to 9,500 genes and 210,000 functional interactions. And we increased the coverage of the genome to 50%. I wish we could get higher than that, but there just isn't enough information anywhere to allow us to annotate the other half. Here I'm showing 5% of the network. And you can see the scale-free properties showing up pretty well here. Here are those clusters.
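To show what a naive Bayes classifier over evidence sources looks like in principle, here is a minimal sketch with binary features (evidence present or absent for a protein pair). The three feature columns, the tiny training set, and the add-one smoothing are all assumptions for illustration; the actual Reactome FI classifier, its features, and its training data are not reproduced here.

```python
from math import exp, log


def train_naive_bayes(examples, labels, n_features):
    """Estimate P(label) and P(feature=1 | label) with add-one smoothing."""
    model = {}
    for label in (0, 1):
        rows = [x for x, y in zip(examples, labels) if y == label]
        prior = (len(rows) + 1) / (len(examples) + 2)
        p_feature = [
            (sum(row[j] for row in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
        model[label] = (prior, p_feature)
    return model


def interaction_probability(model, x):
    """Posterior probability that evidence vector x is a true interaction."""
    log_score = {}
    for label, (prior, p_feature) in model.items():
        total = log(prior)
        for value, p in zip(x, p_feature):
            total += log(p if value else 1.0 - p)
        log_score[label] = total
    top = max(log_score.values())
    weights = {label: exp(s - top) for label, s in log_score.items()}
    return weights[1] / (weights[0] + weights[1])


# Hypothetical features per pair: [physical interaction, co-expression, shared GO term]
examples = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],  # curated interactions
    [0, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0],  # assumed non-interactions
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = train_naive_bayes(examples, labels, n_features=3)
p_all = interaction_probability(model, [1, 1, 1])
p_none = interaction_probability(model, [0, 0, 0])
print(round(p_all, 3), round(p_none, 3))
```

Raising the probability cutoff used to call an interaction trades false positives for false negatives, which is the kind of tuning the talk describes.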
And you can see a lot of little clusters and a few really, really big clusters like that one. Because of the way we tuned our classifier, our false negative rate is about 80%. A lot of that is due to just not having genes in the network. Okay, so how would you go about using this? Well, the methodology we've developed and implemented for using the functional interaction network to make sense of gene sets involves active network extraction, clustering, and annotation. The steps are simple. We take the functional interaction network. This network is a melange of many different cell types at many different developmental stages. It doesn't say anything about any particular cell type or developmental stage or disease state. To get at that, you take a gene set that you've derived experimentally, say all the up-regulated genes or all the genes that contain copy number changes, and extract from the melange network a subnetwork consisting of the altered genes. And then you apply clustering algorithms to create a map of interacting clusters. And then you can use the overrepresentation analysis that you've heard about over the last two days to annotate those clusters with overrepresented pathways. And so what we've actually done is discover our gene sets from the network and then annotate them using the overrepresentation analysis you've heard of. Typically, you can start with hundreds of genes that are altered, and this methodology will reduce them to a manageable set, usually about a dozen or two dozen disease modules, which you can then step into and look at the relationships among the genes in order to develop hypotheses. So I'm going to give you some examples of this in action. The TCGA project published a set of 900 genes which have recurrent apparent driver mutations in them in ductal adenocarcinoma of the breast. They published this back in September.
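The extract, cluster, and annotate steps described above can be sketched end to end on a toy network. Two stand-ins are assumed here because the talk does not name the algorithms: connected components are used in place of a real network clustering algorithm, and a hypergeometric tail is used for the annotation step. Gene names, pathway contents, and the background size of 20 are invented.

```python
from math import comb


def extract_subnetwork(edges, gene_set):
    """Keep only edges whose two endpoints are both in the altered gene set."""
    return [(a, b) for a, b in edges if a in gene_set and b in gene_set]


def connected_components(edges):
    """Stand-in for clustering: group nodes that are linked by any path."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, modules = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, module = [start], set()
        while stack:
            node = stack.pop()
            if node in module:
                continue
            module.add(node)
            stack.extend(adj[node] - module)
        seen |= module
        modules.append(module)
    return modules


def best_pathway(module, pathways, n_background):
    """Return (pathway name, p-value) with the smallest hypergeometric tail."""
    def tail(n_path, overlap):
        top = min(n_path, len(module))
        hits = sum(
            comb(n_path, k) * comb(n_background - n_path, len(module) - k)
            for k in range(overlap, top + 1)
        )
        return hits / comb(n_background, len(module))

    scored = [
        (tail(len(genes), len(module & genes)), name)
        for name, genes in pathways.items()
    ]
    p, name = min(scored)
    return name, p


# Toy interaction network and a hypothetical set of altered genes.
network = [("G1", "G2"), ("G2", "G3"), ("G4", "G5"), ("G5", "G6"),
           ("G3", "G7"), ("G8", "G9")]
altered = {"G1", "G2", "G3", "G4", "G5", "G6", "G10"}
pathways = {
    "DNA repair": {"G1", "G2", "G3", "G7"},
    "Wnt signaling": {"G4", "G5", "G6", "G8"},
}

modules = connected_components(extract_subnetwork(network, altered))
for module in modules:
    print(sorted(module), best_pathway(module, pathways, n_background=20))
```

Note that G10 drops out because it has no edges in the network, which mirrors the point above about altered genes that are simply missing from the network.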
When you take that list of 900 genes, extract it from the interaction network, and cluster and annotate it, you get this map. You get a large network corresponding to focal adhesion and interactions between the extracellular matrix and the cell. You get tyrosine kinase signaling. You get cadherin signaling. You get Notch and Wnt signaling. You get DNA repair. These are all pathways which are known to be altered in breast cancer. In addition, you get some other things which were not previously seen. Axon guidance, which was recently published as an altered pathway in pancreatic cancer, shows up as a very strong signal, as well as ubiquitin-mediated proteolysis and neuroactive ligand-receptor interaction. There are also things which are probably false positives, such as the mucin cluster, which shows up in lots of cancer screens. This gives you a nice snapshot of what's going on. It's easier in many ways than going through the long lists of often contradictory gene set enrichment analysis results. Here, oh, I've lost it somehow. Okay, I have gone on to the... It's crashed, has it? Okay, let's go into the good part, too. Yes, of course, that's a good time to do it. That's correct. So, the procedure, and this is part of the tutorial that Robin will lead you through, is that you discover the modules in your data set. And then you annotate those modules from the databases of your choice. So you can annotate them with KEGG pathways or with Gene Ontology terms or with any of the other tools that you've learned about this morning. But the advantage is you're labeling 12 different modules. You're not labeling a big list of 100 significantly enriched gene sets. There can be contradictions between the annotations. If you label, for example, with KEGG pathways versus Reactome pathways, the KEGG annotation may say something like pathways in cancer, which is not very helpful.
It often does that whenever you're looking at a signaling pathway. And Reactome might say Rho GTPases, which would be more helpful. Or it might be reversed and Reactome may give you something unhelpful and KEGG might give you something good. So there's still some amount of inspection you need to do. Does that make sense? Yeah. Yeah, they may appear in multiple places. And they often actually do. And it can be real. So to give a recent example, we're looking at pancreatic cancer. We know that pancreatic cancer carries mutations which are enriched in axon guidance. It's actually a signaling pathway. Okay. Yeah, I can go. All right. But in the module map, actually, there are two different modules that are labeled with axon guidance. This is an older picture from when we had in Suicones this morning. But here's module nine, which is axon guidance. In later data sets, we actually have a module 10, which is right here, which is also axon guidance. It turns out that there are different subpathways in axon guidance. Okay, so here's pancreatic cancer mutations. This is from OICR's sequencing work. And the interesting thing is we see some of the same modules. We see Wnt and cadherin here and here again, as they were in breast cancer. We see extracellular matrix, which was there in breast cancer, and axon guidance, as I said. So these are sort of common to different cancers. And then you see things which are unique to pancreatic cancer, such as MAP kinase and Hedgehog signaling pathways. You see a big signal from the B-cell receptor and EGFR signaling. KRAS here is huge, and that's a major driver of pancreatic cancer. And so you could actually use... Oh, actually, this one does have it. Here, we have module seven, which is axon guidance, and module nine is axon guidance as well. Good. You could actually use this as a signature of different cancer types. And then here's another way in which it's useful.
We're obviously very interested in using the molecular information in pancreatic cancer to find subtypes of the disease which might correlate with different clinical outcomes. So here we've taken 45 different patient samples and put them on a map in which we've listed each of the genes that is mutated at least once. We've marked the genes if they're mutated in that sample and tried to do hierarchical clustering on it, and you get a whole bunch of nothing. We have no clustering whatsoever of the patient samples, except for possibly this cluster of two patients who are KRAS negative. Yes. On the previous slide. Sure. You said that a figure from further analysis of axon guidance were basically just like subtypes. That's correct. Is there anything to be made from the fact that it doesn't look like there's a lot of connection between the two? There actually are a lot of connections here. It's not projecting very well because we reduced the intensity to make the picture look nicer. But I have not explored this enough to be able to answer your question intelligently. Actually, I would just be hand waving. Yeah. I suppose you don't have to answer that. I guess you would have to know enough about the biology of axon guidance to know why the two subnetworks would be differentiable from each other in this particular analysis. Why aren't they just sort of along together? One of them is Robo/Slit signaling, which is a signaling pathway in its own right and is part of axon guidance. The other is ephrin signaling, which is also part of angiogenesis. My feeling is that these really are two separate pathways, which for historical reasons both contribute to axon target finding and were put together as axon guidance in the literature, but are not very strongly connected. That's how I interpret that. What its relevance to cancer is, which I think is what you're asking, that I can't even really speculate on. Yeah. That's right. Then why not name them?
Well, that's why we call them Module 7 and Module 9, which then makes the wet lab people annoyed. What's Module 7? I forget what that is. That's axon guidance. I thought you said Module 9 was axon guidance. Well, that was the same thing. That was one, too. Well, we try to organize the data to reflect our best understanding of biology, but we cannot, without really annoying people, do wholesale renaming of pathways just because the data says different. You can see lots of examples here: KRAS is involved in B-cell receptor pathways, or bFGF and EGFR signaling. It's a key point. The entire module is named after all four of those pathways, and we can't say that it's one or the other. It's a little bit of all of them. This is the disease-specific pathway. Any more questions before I move on? Getting back to this, this is disappointing. There's no clustering of the patients. However, if you cluster patients on the basis of the modules, you get a very distinct pattern. Here we've scored each module according to the number of mutations that occurred in that module, also weighted by the frequency of that mutation. Again, here we have the patient samples. Here we have the modules. Now we actually get very good clustering. We've got three patients who are KRAS negative, axon guidance positive. We get about a dozen patients who are KRAS, p53, and I forget which one this is, and axon guidance negative. You can count one, two, three, roughly four different patient populations. If you look at their survival, this set has a very poor survival. The others have better long-term survival. We're now following up on that to see if it's reproducible in independent data sets. From what looked like a very disappointing result, we can actually distinguish apparent subclasses of pancreatic adenocarcinoma. We're not the only people who do this. I want to draw your attention to other pieces of software that do similar work.
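To make the module-level scoring concrete, here is a minimal Python sketch. The talk only says that each module is scored by the number of mutations in it, weighted by mutation frequency; the exact weighting is not given, so the formula below (sum of cohort-wide gene mutation frequencies over a patient's mutated genes in the module) is an assumption, and the patient IDs, gene labels, and module names are hypothetical.

```python
def gene_frequencies(patients):
    """Fraction of patients in the cohort carrying a mutation in each gene."""
    counts = {}
    for genes in patients.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: n / len(patients) for gene, n in counts.items()}


def module_scores(patients, modules):
    """Score every module for every patient.

    Assumed weighting: each mutated gene in the module contributes its
    cohort-wide mutation frequency to the module's score.
    """
    freq = gene_frequencies(patients)
    names = sorted(modules)
    return {
        patient: [sum(freq[g] for g in genes & modules[m]) for m in names]
        for patient, genes in patients.items()
    }


# Hypothetical cohort: P1 and P2 share only g5 at the gene level,
# P3 and P4 share no mutated gene at all.
patients = {
    "P1": {"g1", "g5"},
    "P2": {"g2", "g5"},
    "P3": {"g3"},
    "P4": {"g4"},
}
modules = {"M1": {"g1", "g2"}, "M2": {"g3", "g4"}, "M3": {"g5"}}

scores = module_scores(patients, modules)
for patient in sorted(scores):
    print(patient, scores[patient])
```

In this toy cohort the gene-level profiles barely overlap, but the module-level profiles of P1 and P2 are identical, as are those of P3 and P4, which is the effect that lets a clustering algorithm find patient groups at the module level.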
There's HotNet from Ben Raphael's group at Brown University, which works on either expression data or single nucleotide variation analysis, and offers you a Cytoscape visualization. It basically does something very similar, but uses a different model in which it models diffusion of heat across a lattice. That's why it's called HotNet. And WGCNA, which is an R package specialized for expression analysis, will also give you disease-specific modules from expression data and an underlying network. I'm going to end with another example of using network analysis to discover biomarkers. One of the problems in cancer is that you have a large population of patients, all of whom get treated with the same chemotherapy. Some of them respond to the chemotherapy. Others don't respond to the chemotherapy, and you don't know why. Ideally, you'd like to have some sort of marker which would allow you to distinguish the patients who are likely to respond from the patients who are likely not to respond, so that you can treat the people who are going to respond with the drug and avoid the unnecessary toxicity of a treatment that isn't going to work for those who don't respond. And you could also use this to stratify risk. If you know that you have a population of patients, some of whom have an aggressive disease that's going to kill them in a year and others of whom have an indolent disease that you could follow if you knew who they were and only treat if it starts to behave badly, you could stratify them into a high-risk group that you treat and a low-risk group that you don't treat, thereby avoiding the complications of treating everybody with an aggressive therapy when they don't need it. So in the cancer domain, there is a lot of work to discover biomarkers based on single nucleotide mutations or expression changes or copy number changes, which will differentiate one group of patients from another.
The problem is that when you're looking at 22,000 genes, it's very easy to over-train your selection. You can take any random set of genes and probably distinguish between the responders and non-responders just on the basis of chance. And there's actually a paper published recently on breast cancer which showed that selecting genes randomly can find biomarkers very easily in that dataset. And that's because there are just so many different genes, you can find them by chance. As soon as you take a biomarker found in one clinical cohort and try to replicate it in an independent cohort, a lot of them fail. The field is littered with these unreplicated biomarkers. Another problem is that the diseases are heterogeneous. If you have lots of subtypes of disease and you don't know how they're distributed among the patients, then you need really large cohorts of patients. And another problem is that the tumor itself can be heterogeneous. You can have a single primary tumor which has two subclones, one of which has a mutation which makes it sensitive to the drug and the other of which doesn't. You treat the whole tumor, you get a 50% response, and then the resistant subclone grows out. So here's an example of using the disease module maps from the functional interaction network to discover a biomarker. The general principle is you start with a disease module map from the previous step. You then take an expression analysis set or a CNV set or a single nucleotide variation set from multiple patients, and you can perform a principal component analysis on the modules and then select clusters of modules which correlate with clinical parameters. Or in the easiest case, you just test each module against the patients to see if the presence of a module correlates with differential survival or response. So here's an example from breast cancer again. This was work done by Guanming Wu in my laboratory. He built a network.
He extracted the active breast cancer network from a New England Journal of Medicine article published in 2002. That was 295 patients, 12,000 genes analyzed. And then he validated with an independent data set from 2006 of roughly the same size. And he found a module, named Module 2, whose expression level correlates with poorer survival. In particular, in estrogen receptor positive breast cancer, which is usually considered a more treatable subclass with a better prognosis, if you are unfortunate enough to have high gene expression in Module 2, then your survival is dramatically reduced relative to those who have lower expression. This is a Kaplan-Meier curve. How many people are unfamiliar with this? We haven't seen it. What we're seeing is these are years here, and at year zero 100% of the patients are alive, and then as you go further on the number of patients who are still alive decreases because they've died of their disease. And you remove patients who have died of other causes or are lost to follow-up. And so it's got a good p-value of 10 to the minus fifth here. You have to ask, does it replicate? Yes, in the independent data set it replicates very nicely. And then we went on and replicated it in multiple other data sets, and it seems to be a good marker. Now what is Module 2? Because Module 2 is derived from the functional interaction network, we know how to annotate it. We know what it does. It corresponds to kinetochore maintenance pathways and Aurora B kinase signaling, both of which are involved in mitosis. And so we're showing that if a module that's involved in mitosis is up-regulated, it leads to decreased survival, and this makes sense because of the higher mitotic rate of a more aggressive tumor. So that's a nice story there. Now I'm going to stop my examples here and move on to the tutorial.
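For anyone who has not met a Kaplan-Meier curve before, the description above (everyone alive at time zero, the curve steps down at each death, and patients lost to follow-up are removed from the at-risk pool without counting as deaths) can be written as a short Python sketch. The follow-up times and event flags below are made-up illustrative numbers, not data from the breast cancer study.

```python
def kaplan_meier(times, died):
    """Kaplan-Meier survival estimate.

    times[i] is the follow-up time of patient i; died[i] is True if the
    patient died of the disease at that time and False if the patient was
    censored (lost to follow-up or died of another cause).
    Returns a list of (time, fraction surviving) steps.
    """
    records = sorted(zip(times, died))
    at_risk = len(records)
    survival = 1.0
    curve = [(0, 1.0)]
    i = 0
    while i < len(records):
        t = records[i][0]
        same_time = [d for tt, d in records if tt == t]
        deaths = sum(same_time)
        if deaths:
            # Step down by the fraction of at-risk patients who died at time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the at-risk pool.
        at_risk -= len(same_time)
        i += len(same_time)
    return curve


# Hypothetical follow-up in years; False marks a censored patient.
years = [1, 2, 2, 3, 4, 5]
died = [True, True, False, True, False, True]

curve = kaplan_meier(years, died)
for t, s in curve:
    print(t, round(s, 3))
```

Comparing two such curves, for example high versus low Module 2 expression, is typically done with a log-rank test, which is where a p-value like the one on the slide would come from; that step is not shown here.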
I'm just going to end by saying that there's much more work that needs to be done in this type of network analysis. The main thing that the Reactome functional interaction network doesn't do is allow you to look at multiple types of molecular lesion simultaneously. You can look at expression data or you can look at mutations or you can look at copy number changes or you can look at methylation or whatever, but you can't integrate them together in an intelligent way. And furthermore, the analysis that I've shown you doesn't take advantage of the directionality of the edges, the directed edges. Although we know that a particular mutation is activating of KRAS, we're not then following the arrows downstream of KRAS to see what the overall effect on the activity of the pathway is. And that's what a new generation of algorithms, exemplified by Paradigm from Josh Stuart's lab at UCSC, attempts to do. Given a network diagram representing a pathway, Paradigm will take several different types of molecular data and integrate them in a way that follows the arrows, so that an increase in activity of MDM2 will decrease the activity of TP53, and then it propagates this effect so that at the end of the day, given the gene expression and mutation changes, it tells you whether the pathway's activity has gone up, has gone down, or hasn't changed. And it will give you a heat map like this, where we're looking at samples versus pathway activities. What this is trying to show is that in all samples, this pathway, whatever it is, is increased. This pathway is decreased in some patients and increased in others. And so here's a nice example from a paper they published about a year ago on how this works, showing that in glioblastoma multiforme they're able to capture four known subclasses of glioblastoma multiforme very nicely, and they've annotated the pathways that have changed here on the right.
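The idea of following the arrows can be illustrated with a toy signed propagation in Python. To be clear, this is not Paradigm's actual model, which is a probabilistic one and considerably more involved; it is only a deliberately simple sketch of what directed, signed edges buy you. The MDM2 to TP53 inhibition comes from the talk; the downstream "apoptosis" node, the function name, and the -1/0/+1 encoding are assumptions made for illustration.

```python
def propagate(edges, observed, order):
    """Push activity changes along signed, directed edges.

    edges: (source, target, sign) with sign +1 for activation, -1 for inhibition.
    observed: node -> change in activity (-1 down, 0 unchanged, +1 up).
    order: nodes listed so that every source comes before its targets.
    Unobserved nodes take the sign of the summed signed input from upstream.
    """
    state = dict(observed)
    for node in order:
        if node in state:
            continue
        total = sum(sign * state.get(src, 0) for src, tgt, sign in edges if tgt == node)
        state[node] = (total > 0) - (total < 0)
    return state


# Toy pathway: MDM2 inhibits TP53; TP53 promotes a hypothetical apoptosis readout.
edges = [("MDM2", "TP53", -1), ("TP53", "apoptosis", +1)]
order = ["MDM2", "TP53", "apoptosis"]

state = propagate(edges, {"MDM2": +1}, order)
print(state)
```

With MDM2 observed as up, the sketch infers TP53 down and the downstream readout down, which is the kind of conclusion an undirected network cannot give you.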
They also have a nice display which allows you to see all the raw data on top of the pathway activity data. Here's the inferred activity of the AKT1 pathway, and you're looking at copy number changes here, expression changes here, mutations, and so on. So the bad news about Paradigm is that you can only get it in source code form. You have to compile it. When I tried to compile it, I couldn't do it. I had to ask Guanming to do it for me. It's not documented. They don't give you repositories of formatted pathway data. They don't give you any examples of how to convert experimental data into the input files. And basically, right now, it's a package that only its authors can run. The good news is that at Reactome, we're working on a web service, an implementation of Paradigm that will run on top of the Reactome functional interaction network, and you'll be able to run Paradigm from within Cytoscape probably about a year from now. Okay, so take-home messages. First, pathway and network analysis gives context to altered gene sets. It goes beyond what you can do with gene set enrichment analysis. The types of analysis you can do differ greatly in complexity, power, and usability, from simple things like pathway diagram colorization to very complex things for which I can't give you a ready-made tool to use yet. This type of analysis is a work in progress, but I think that all signs point to it being a very powerful tool in the future. And I've given you some URLs for you to look at afterward. So, time for questions.