All right. Well, I'm Director of Informatics and Biocomputing at the Ontario Institute for Cancer Research, and I'm one of the PIs on the Reactome project, which is a knowledge base of human pathways, so appropriately enough I'll be lecturing on the use of pathway databases to help in cancer genome analysis. And here are the usual obligatory copyright statements. What you've heard about this morning are uses of overrepresentation analysis — various ways of taking lists of genes and using their functional relationships to infer the mechanism, the meaning, of a gene list. You heard about DAVID this morning, I believe; that is a class of algorithms called overrepresentation analysis, where genes have been sorted into bags by their functional annotation — what do they do — and you look for biases in the representation of your cancer gene set among those bags. A related class of gene set analysis is called functional class scoring, which is basically the same thing: the genome has been partitioned into gene sets, each corresponding to a different functional annotation, and you use a ranking statistic to look for unexpected patterns in the representation of your cancer genes among those bags. The last type, and what we're going to talk about here, is pathway topology measures. The first two types of pathway analysis are strict partitions of genes into one bag or another. This gene participates in DNA repair or it participates in cell cycle regulation; it's not allowed to do both — you have to make a choice about what the gene does. Furthermore, once a gene is in a bag, all the relationships among, say, the DNA repair genes are lost. We might know that one is an inhibitor, another is a cofactor, another participates in forming a multimer that binds to chromatin to turn on a gene.
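The overrepresentation idea just described boils down to a hypergeometric test: given a bag of annotated genes, how surprising is its overlap with your cancer gene list? Here's a minimal sketch in Python — the gene counts are made up for illustration, and real tools such as DAVID add multiple-testing corrections on top:

```python
from scipy.stats import hypergeom

def ora_pvalue(n_genome, n_in_set, n_hits, n_overlap):
    """P(overlap >= n_overlap) when drawing n_hits genes from a genome
    of n_genome genes, n_in_set of which carry the annotation."""
    # survival function at n_overlap - 1 gives P(X >= n_overlap)
    return hypergeom.sf(n_overlap - 1, n_genome, n_in_set, n_hits)

# Hypothetical numbers: a 20,000-gene genome, a 200-gene "DNA repair" bag,
# and a 100-gene cancer list, 10 of which fall into the bag.
p = ora_pvalue(20000, 200, 100, 10)
```

With an expected overlap of only one gene, an observed overlap of ten gives a vanishingly small p-value, which is exactly the kind of bias these methods look for.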
All that information is lost in these partition-based classification systems. We don't want to throw that information away; it's very useful. So what we'll be talking about now are techniques for analyzing the mechanisms that connect a set of genes altered in cancer through the prism of the pathway, and the relationships among the genes in that pathway, so that you get mechanistic information out. The tools I'll talk about are the Reactome Functional Interaction network and PARADIGM, a very new system that is coming online now. Gene set enrichment analysis has limitations. One is that there are many different ways of slicing and dicing the genome — you can do it by disease, or by molecular function, or by biological process — and it's not always obvious which is the right dimension along which to create your gene sets. Second, when you do get a set of enriched gene sets, you frequently get a lot of overlapping hits which seem to be related, but you're not quite sure how. For example, a KEGG-derived gene set might tell you that genes involved in chromatin maintenance are enriched, and another set, related to pancreatic cancer, is also enriched. Do these have something to do with each other, or are they independent? You don't know without sorting through them. And, as I said before, the bags of genes obscure the fact that there are actually complicated relationships among the genes in those bags. Pathway databases are the first step in disentangling this and understanding the relationships among affected genes in a cancer genome set. The advantages of pathway databases are that they're usually highly curated, highly accurate knowledge bases derived from the experimental literature, and they use a biochemical view of biological processes similar to what we learned in freshman biochemistry.
They capture cause, effect, and mechanism, and they give you human-interpretable visualizations, usually as a Lehninger-style pathway diagram. The disadvantage of pathway databases is that, because they're curated, they're very labor-intensive to create. They don't cover the whole genome; they typically cover only a little corner of it. Also, because curation is a manual human process and people can disagree on where the subjective boundaries of pathways fall, different pathway databases will disagree on pathway boundaries. One database may put DNA repair and cell cycle checkpoints into one big pathway because they're highly related; another may split them up. The pathway database most of you are probably familiar with is KEGG. How many people have not heard of KEGG? OK, that's great. KEGG is the Kyoto Encyclopedia of Genes and Genomes, a curated database of intermediary metabolism in human and several hundred other organisms, many of them prokaryotes. It also has sections that deal with higher processes such as cell cycle regulation and signaling — things which are usually more applicable to cancer. The core of KEGG is these diagrams, which show — I don't know how well this is projecting; it looks fuzzy to me even up here — but these are all small molecules, sugars of various sorts, and it's showing the enzymatic steps which transform the sugars in sugar metabolism. I think down here it's creating some amino acids, but I can't read this very well myself. Reactome, the database that I work on with Robin Haw, who will be your lab instructor this afternoon, is similar in many respects to KEGG, but it focuses almost exclusively on human pathways, and so we provide a lot of coverage of higher-order regulatory processes.
Here we're looking at the NCAM1 signal transduction step, and it's showing some of the detail in which the NCAM1 receptor forms a dimer in response to a ligand-binding event, followed by the downstream signaling events. Because it's a curated database, you see a hand-drawn diagram here, you see a bit of text written by a PhD-level curator or in fact by a principal-investigator guest author, there are citations, and down here there's much more information about each of the steps you're seeing. So Reactome is hand-curated: every reaction, every molecule is traceable to some reference in the primary literature. It's primarily human, but we do project our pathways onto non-human species for the sake of completeness, using orthology information to make our best guess at what the pathway looks like in other species. We have a Google Maps-style reaction diagram that you can overlay information on — I'll be showing you this. You can find pathways containing your gene list, you can calculate overrepresentation of your gene set in pathways, and if you have a human pathway, you can find the related genes and pathways in other species. A big thing that distinguishes Reactome from KEGG is that Reactome is open access, while KEGG uses a licensed model. The main thing you can do with this is take your gene set and upload it into Reactome — in fact, many of the pathway databases will do something similar — and it will show you a series of diagrams in which the genes you uploaded are highlighted. Then you can see whether they cluster in a way that looks suspicious to you, and if so, try to form hypotheses about the effects of the mutations or expression changes you're seeing.
Now, I said that a major problem with pathway databases is that they're subjective: there are several of them, and they disagree on what the pathways are and how they overlap. That's being addressed by a resource at Sloan Kettering called Pathway Commons, in which about nine different databases participate, including Reactome — and at one point including KEGG, though it may no longer be in there. Do you know if KEGG is still part of Pathway Commons, or has it been removed? It's been removed, yeah. Each of these pathway databases has agreed to export its data in a common format called BioPAX, and those pathways are imported into Pathway Commons, one big database that is the union of them all. Now you can do things like search for a pathway or a molecule, and it will bring back everybody's pathway that contains that molecule or is related to that pathway. It's a nice resource when you feel you're not getting the full picture and want everybody's view in one convenient spot. So here is the major thing that one does with pathway databases — a very primitive operation, but it can be effective — which is pathway colorization. You upload a gene list; the database calculates an enrichment score for each pathway, using GSEA or overrepresentation analysis, and displays a ranked list of those pathways. Then you click on a pathway and it gives you a colorized picture of that pathway diagram with the genes you uploaded highlighted. You can download it as a picture and put it into a publication, if you would like.
This is an example from Reactome. We have selected in our file browser a list of genes which are mutated in glioblastoma multiforme and uploaded it, and it's giving us a list of overrepresented pathways. The one at the top is signaling by platelet-derived growth factor, with a p-value of 3 × 10⁻¹¹, and it lists all the genes in your list contained in that pathway. As we go down, there are increasingly larger — that is, less significant — p-values for enrichment in signaling by nerve growth factor, hemostasis, insulin receptor, and so on. If you browse into that pathway you get the diagram — again, I'm sorry, this looks out of focus — and it shows you the PDGF pathway with the genes you uploaded highlighted, using a colorization score to show the statistical significance of the hit. Black things in this representation correspond to multimers: complexes in which one member contained your gene and the others did not. If you mouse over one, it shows a little view of all the components in that complex. So that's colorization. Robin, is there going to be an example of that — are they doing that this afternoon? No, you're just doing the network stuff — okay. It's relatively straightforward in any case, so I'll spend the rest of the time talking about network analysis. The problem with pathway databases is that they really only capture the well-understood portion of biology; there's nothing you can get out of a pathway database that you couldn't get from assiduous literature searching, and in fact a pathway database may just be a convenient way of getting an entry into the literature.
To get at the novel portion of biology, you have to look beyond what's covered in the current literature and go to high-throughput experiments. Here we're talking about relationships among genes which are hypothetical — suggested by high-throughput experiments, though we don't understand the exact relationship. They can be things like genetic interactions, where you have epistasis in yeast; or genes which are co-regulated in humans, so that across a series of expression arrays, whenever one is up, the other is always up, or always down. There are physical interactions: if you do mass spec on an immunoprecipitation, you find that the protein products are associated with each other, but you don't know exactly how. Or genes share GO terms, or they're close together in pathways. These are cases where we think the genes are interacting with each other, that they're related, but we don't know exactly how. If you look at this level of information, you can reach out and touch genes which are not well annotated in the literature but which probably have something to do with the biological process you're looking at. So I'm going to talk about a variety of networks, and first I'll introduce some terminology. Biological networks consist of a series of vertices, also called nodes, and edges, which connect one node to another. Typically, when we're talking about biological networks, the nodes are proteins, or genes, or RNAs, and the edges might be physical interactions between them, or regulatory relationships, or something more abstract, such as frequency of co-mention in PubMed abstracts. A cycle is a loop among three or more nodes, and there are two types of edges that we'll talk about. There are undirected edges.
These are relationships which don't have directionality, such as frequency of co-mention in the same paper, or a physical interaction; whereas directed edges imply a directed relationship between one protein and another, such as "this protein is an enzyme that cleaves that protein" or "this is a regulator which upregulates that gene." A network is a very facile data model; you can represent many different things with it. A simple way of mapping biology to a network is protein-protein interactions, where each node is a protein and each edge is an interaction between two proteins, but edges can capture other relationships, such as a kinase activating a target, an epistatic relationship in a genetic study, or similarities such as protein sequence similarity. It's critical to understand what the network is representing before you start working with it, for obvious reasons. Here's an example of an early network from about ten years ago. These are a series of protein complexes that were pulled down from baker's yeast and analyzed by mass spec. Each node is a protein identified by mass spec, and each edge indicates that those two proteins co-precipitated — they were complexed with each other in some way. So that's a protein interaction network. This is a very pretty representation of the protein sequence similarity network among a large number of organisms, representing gene families and the phylogenetic tree. The two networks are very similar looking, but they mean quite different things. Here are some more network concepts that I'll be referring to. Each node in a network has at least one edge connecting it: some have one edge, some have two, some have three, some have more — you can have hundreds of edges coming out of a node.
The count of edges going into or coming out of a node is called its degree: the higher the degree, the more edges the node has. The shortest path is a property of two nodes, and it indicates how many hops it takes to get from node A to node B. In this case, the path length is two — that's the shortest way to get there. You could also take a more roundabout route; here's one that involves three hops. But typically we measure the shortest path, and it gives us an idea of the degree of connectedness in the network. Finally, there's the concept of betweenness. For every node, some number of shortest paths go through that node, and that number indicates how popular the node is. There are some nodes out on the fringes — the social outcasts of the network world — others with a few relationships, and then there is this one wildly popular node which not only has a lot of edges coming out of it (it has degree four), but is also in the middle of the network, central to it, so most of the shortest paths between arbitrary pairs of nodes go through it. The key point is that these central nodes are the ones holding the network together. If you map this onto a biological regulatory network, they tend to be master regulators — things like p53 that everybody talks to — and they tend to be more likely to be involved in disease than the genes on the outside of the network. The last concept, before we actually get into the meat of it, is the scale-free property of biological networks.
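These three concepts — degree, shortest path, and betweenness — are easy to compute on a toy graph. This is a minimal sketch using the networkx library; the network and node names are invented for illustration:

```python
import networkx as nx

# A toy undirected network: "hub" is central, the others hang off it.
G = nx.Graph()
G.add_edges_from([
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"),
    ("a", "b"),          # a small cycle in the hub's neighborhood
    ("d", "fringe"),     # a peripheral node, far from most others
])

degree_of_hub = G.degree("hub")            # number of edges at the node
path = nx.shortest_path(G, "fringe", "a")  # fewest hops from fringe to a
centrality = nx.betweenness_centrality(G)  # share of shortest paths through each node
most_central = max(centrality, key=centrality.get)
```

As the lecture predicts, the wildly popular node ends up with both the highest degree and the highest betweenness.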
You can construct networks in various ways, and one of the simplest is random: you take a bunch of nodes and randomly generate a set of edges — you consider all pairs, and sometimes you make an edge between them and sometimes you don't, at random — and you end up with something like this. If you graph the relationship between the degree k of a node and the number of nodes of that degree, you get a bell curve like this: most nodes have a certain typical degree, which depends on how densely you generated edges, and the counts tail off in either direction. Another type of network is called a scale-free network, and it has a very different degree distribution. The vast majority of nodes have a low degree — degree one. Nodes of degree two, like this one here or that one there, are some constant factor less frequent than the ones below them, and as the degree increases, the number of nodes having that degree keeps dropping off by the same factor — so if nodes of degree two are tenfold less frequent than nodes of degree one, nodes of degree three are roughly tenfold less frequent than nodes of degree two. If you graph that on a log-log scale, because it's a power law, you see a linear drop-off in the probability of a node with increasing degree. Then finally you have hierarchical networks, which have the same kind of degree distribution — the probability of a node having a particular degree drops off as a power law — but with considerable structure: there is in fact only one node of high degree, then roughly ten times as many nodes of lower degree, and so on, in a very well-defined structure.
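You can see the two degree distributions directly by generating both kinds of network. A sketch using networkx's standard generators — the sizes and parameters here are arbitrary choices for illustration:

```python
import networkx as nx
from collections import Counter

# Random (Erdos-Renyi) graph: degrees cluster in a bell shape around the mean.
random_g = nx.gnp_random_graph(2000, 0.005, seed=1)   # mean degree ~10
# Preferential-attachment (Barabasi-Albert) graph: scale-free, heavy tail.
scale_free_g = nx.barabasi_albert_graph(2000, 2, seed=1)

def degree_counts(g):
    """Map degree k -> number of nodes with that degree."""
    return Counter(dict(g.degree()).values())

rand_counts = degree_counts(random_g)
sf_counts = degree_counts(scale_free_g)

# Low-degree nodes dominate the scale-free graph, and its rare hubs
# reach far higher degree than anything in the random graph.
max_sf_degree = max(sf_counts)
max_rand_degree = max(rand_counts)
```

Plotting `sf_counts` on log-log axes would show the straight-line power-law drop-off described above, versus the bell curve of `rand_counts`.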
When people started looking at biological networks, there was a lot of disagreement about what their network properties are. It turns out that they're scale-free: they follow this power-law degree distribution. To distinguish a scale-free network from a hierarchical one, you can look at the clustering coefficient, which is a measure of the connectedness of neighborhoods. In a scale-free network, as the degree increases, the clustering coefficient remains constant; in a hierarchical network, it drops off. Essentially every biological network that's been looked at — gene networks, but also other things, such as the branching of bronchioles in the lung — turns out to have scale-free properties. The implications are that a small number of genes have a disproportionately large number of connections: they have high centrality, they're choke points, and they tend to be the disease genes. A large number of genes have a small number of connections — those are the leaves. And genes cluster, so that if you take this network and analyze the neighborhoods, you find a high degree of clustering around the highly connected nodes. For various reasons, the cluster sizes are also scale-free: you find lots of small clusters and a few large clusters, with the sizes related to each other by a power law. So I'll stop here and just ask whether I have confused you or bored you at this point. Yes, question here. Well, there are two ways of measuring connectivity. You can measure a node's degree, which is its immediate number of neighbors, or you can measure its centrality, also known as betweenness.
Centrality measures not just your connection to your immediate neighbors but your neighbors' connectedness to other genes. If two nodes both have degree four, but one of them is connected to nodes which themselves have higher degree, then that one ends up with higher centrality — meaning that if one of the genes in its neighborhood wants to talk to another, the shortest path will go through that central gene more frequently than through any of the other genes in the set. Okay? Yes? Yeah. The way betweenness is calculated is by choosing every pair of nodes in the network, computing the shortest path between them, and then, for each gene on that shortest path, recording the fact that it was on one shortest path. You continue doing that for all pairs, and at the end you tally up the number of times each gene was on a shortest path between two other genes. Some genes will never be on a shortest path — they're off on the periphery and nobody cares about them. Other genes sit in the middle: they're highly connected to their neighbors, and their neighbors are highly connected to others. Okay? Okay. So now we get to network databases. You can build biological network databases automatically or via curation, and this is a very popular thing to do. There's a Canadian-led initiative called BioGRID that has collected 529,000 genes from the literature, spanning 167,000 interactions. Obviously, these are not all human genes; they come from human and many other species. Then there's the IntAct database, whose numbers are kind of interesting — 60,000 genes and 203,000 interactions — meaning that they're looking deeply at each gene and collecting more interactions per gene from the literature, whereas BioGRID is looking more broadly, with more genes and fewer interactions.
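Backing up a moment, the betweenness tallying procedure just described can be written out directly. This is a simplified sketch that follows one shortest path per pair (true betweenness centrality averages over all shortest paths between a pair, so library implementations differ in detail); the toy network is invented:

```python
from collections import deque
from itertools import combinations

def shortest_path(adj, src, dst):
    """BFS shortest path from src to dst in an adjacency dict."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nbr in adj[node]:
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    return None

def tally_betweenness(adj):
    """For every pair, find one shortest path and tally interior nodes."""
    tally = {n: 0 for n in adj}
    for a, b in combinations(adj, 2):
        path = shortest_path(adj, a, b)
        if path:
            for n in path[1:-1]:   # endpoints don't count
                tally[n] += 1
    return tally

# Toy star-plus-tail network: "hub" should collect the most shortest paths.
adj = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"], "b": ["hub"], "c": ["hub"],
    "d": ["hub", "tail"], "tail": ["d"],
}
tally = tally_betweenness(adj)
```

The peripheral `tail` node never sits on anyone's shortest path; the hub sits on nearly all of them.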
The MINT database, an older one, has 31,000 genes and 83,000 interactions. Each of these network databases represents a different slice of biology. Fortunately, there are efforts such as GeneMANIA — did you hear about GeneMANIA this morning? Yes? No? Sort of? Okay. GeneMANIA is a local effort from Gary Bader's and Quaid Morris's labs here to bring in interaction networks from around the world and put them in one convenient spot. In addition to the curated sources, there are uncurated sources of interactions. One very popular approach is to take text dumps from scientific literature databases such as PubMed and calculate the frequency with which two genes are co-mentioned; if they're co-mentioned frequently, they probably interact in some way. Obviously this is much faster than hand curation, and just as obviously it's not perfect. It runs into all the problems of natural language recognition: if there's a mention of hedgehog in a paper, are they talking about the gene or the animal? You have to use contextual clues to figure out what's meant, and that kind of language processing is difficult. However, there are some very popular resources built on top of this approach. One is iHOP, which is a great resource to play with: you enter a gene name and it tells you every other gene that's been co-mentioned in the literature along with that gene, and then you can hop from one to another. There's a similar resource called PubGene, which I have not used all that much. And then there are some more uncurated interaction sources.
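The co-mention counting itself is conceptually simple; the hard part, as noted, is recognizing the gene names in the first place. A toy sketch with invented abstracts and a fixed gene lexicon, sidestepping the disambiguation problem entirely:

```python
from itertools import combinations
from collections import Counter

# Hypothetical abstracts; a real pipeline would pull millions from PubMed
# and need named-entity recognition to spot gene mentions.
abstracts = [
    "TP53 phosphorylation by ATM stabilizes TP53 after DNA damage",
    "ATM and BRCA1 cooperate in the DNA damage response",
    "BRCA1 expression in mammary tissue",
]
lexicon = {"TP53", "ATM", "BRCA1"}

comentions = Counter()
for text in abstracts:
    # unique gene symbols mentioned in this abstract
    genes = sorted({tok for tok in text.split() if tok in lexicon})
    for pair in combinations(genes, 2):
        comentions[pair] += 1   # one co-mention per abstract per pair
```

Pairs with high counts across many abstracts become candidate edges in a co-mention network.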
There are experimental techniques such as yeast two-hybrid protein interaction studies, where pairs of genes are put into yeast and, if the two proteins interact, they rescue the cell so that it can synthesize an essential nutrient and survive. You can work your way through the entire matrix of all protein pairs over a matter of months to years using this technique. I've already talked about mass spec analysis of protein complex pulldowns. Then there are genetic screens — synthetic lethal and enhancer/suppressor screens — in which you knock down one gene and then look for epistasis by knocking down other genes, to see whether knocking down the two genes together produces a lethal phenotype. None of these are perfect techniques. In particular, yeast two-hybrid interactions take proteins out of their natural contexts and put them into the yeast cell, and even if they physically interact for real, that's not the same as a biological interaction. Protein complex pulldowns are plagued by sticky proteins — actin, for example, turns out to "interact" with everything just because it's sticky. And genetic screens are sensitive to the genetic background, ironically because of network effects: a synthetic lethal in one strain of yeast may not be a synthetic lethal in another, because there are other interacting factors that you don't know about and that are not being controlled for in that study. The way to work through this very noisy data is to use integrative approaches. A single source of evidence, such as a yeast two-hybrid interaction, is not sufficient to call a true interaction. But if you have other sources of information — a mass spec pulldown, co-mention in the literature, co-expression in a microarray study — then combining those sources of positive evidence points pretty strongly to the conclusion that the genes are in fact interacting.
There are some simple examples of doing this. For instance, yeast two-hybrid interactions are known to have a high false positive rate — as much as 40% in the first screens that were published. But if you do a simple filter and keep only those pairs which are expressed at the same time and in the same subcellular location, you filter out many, if not most, of the false positives. A more complex example is to take multiple sources of curated and uncurated evidence and use machine learning to call the true positives. Here's an example from Reactome, which you will be using in your laboratory. Version 35 of the Reactome database, from about a year ago, contained roughly 5,000 reactions covering about 4,200 proteins — well curated, high quality, but covering only about 25% of the genome. If you looked at a cancer data set with this, 75% of your mutated, overexpressed, or hypermethylated genes would not even be in the database, which is less than perfect. So we wanted to expand Reactome's coverage. What we did is start with curated pathways from Reactome, then use an algorithm to turn the pathways into a series of pairwise molecular interactions while preserving regulatory relationships such as phosphorylation, inhibition, and activation. We combined that with similar data extracted from the NCI Pathway Interaction Database, Panther, KEGG, CellMap, and TRED, and that gave us a network of curated pairwise interactions from pathway databases. We then took a large number of networks from uncurated high-throughput experiments, including protein-protein interactions from yeast two-hybrid studies and pulldowns, interactions in other organisms — fly, worm, and yeast — text mining data, gene co-expression, and domain-domain interactions from Pfam.
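The simple co-expression/co-localization filter mentioned above can be sketched in a few lines. All of the annotations and protein names here are invented; a real pipeline would draw them from expression atlases and localization databases:

```python
# Hypothetical annotations for four proteins.
expression_stage = {"A": "embryo", "B": "embryo", "C": "adult", "D": "embryo"}
localization = {"A": "nucleus", "B": "nucleus", "C": "nucleus", "D": "cytoplasm"}

y2h_hits = [("A", "B"), ("A", "C"), ("A", "D")]  # raw two-hybrid pairs

def plausible(p, q):
    """Keep a pair only if both proteins are co-expressed and co-localized."""
    return (expression_stage[p] == expression_stage[q]
            and localization[p] == localization[q])

filtered = [pair for pair in y2h_hits if plausible(*pair)]
```

Here A-C fails the co-expression test and A-D fails co-localization, so only A-B survives — proteins that are never in the same place at the same time can hardly interact biologically.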
All of these are bits of evidence that two genes interact, but none of them is definitive. We then used a machine learning technique called a naive Bayes classifier to create a set of predicted functional interactions. We took all these pieces of evidence and trained the classifier on curated interactions extracted from the pathway databases, which we believed to be true. This derives a classifier in which each piece of evidence contributes a weight to the probability of a true interaction, and when you apply it to unknown relationships, it gives you predicted functional interactions. This gave us a network of almost 11,000 proteins — about 9,500 genes; a gene can, of course, make multiple proteins through RNA isoforms — and 210,000 functional interactions. That increases the coverage from 25% to 50% of the genome: not all the way there, but a lot further. And by a series of measures, using curation and comparison to other people's predictions, we estimate a false positive rate of less than 1% — with a high false negative rate, however; we're still missing a lot of interactions. This is what it looks like. This is 5% of the network, zoomed in, and you can see its scale-free properties: there are a few genes which are highly connected, and they form these very well-defined clusters. If you go in and start annotating them, you see things coming out like the ribosome here and DNA repair there. Yes? Are we going to run out of time? No, absolutely not. Well, the false negative rate is derived from being able to re-predict curated interactions: we withhold the curated information, ask the classifier whether the two genes interact, and 80% of the time something that we know from the literature to interact is not predicted by the classifier. That ends up being an 80% false negative rate.
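The evidence-weighting setup described above might look like the following sketch, using scikit-learn's BernoulliNB on binary evidence features. The evidence matrix and labels are entirely made up; the real Reactome FI classifier was trained on far more data and feature types:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Each row is one candidate gene pair; columns are binary evidence sources:
# [Y2H hit, mass-spec pulldown, literature co-mention, co-expression].
# Labels: 1 = known functional interaction from curated pathways, 0 = not.
X_train = np.array([
    [1, 1, 1, 1], [1, 1, 0, 1], [0, 1, 1, 1], [1, 0, 1, 0],
    [0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0],
])
y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0])

clf = BernoulliNB().fit(X_train, y_train)

# Score an uncharacterized pair supported by three evidence sources:
# the classifier combines the per-feature weights into one probability.
prob_interact = clf.predict_proba([[1, 1, 1, 0]])[0, 1]
```

Shifting the decision threshold on `prob_interact` is how you trade false positives against false negatives — the 1% versus 80% tuning discussed in a moment.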
Now, if we do the same test with known non-interactions — pairs for which we have negative evidence, so we know they don't interact — and ask how often the classifier predicts something we know to be false, it's less than 1%. We tuned it that way, because we wanted to enhance the coverage without contributing a lot of noise; it's an arbitrary choice. Okay, yes? So these are pooled interactions, right — not necessarily context-specific? That's correct, and now we're going to get to the tissue-specific, context-specific part. That's a very good point, actually: this network is all of human cell biology, without any reference to the fact that some genes are expressed only in some cells, may be developmentally regulated, and may be expressed in the early embryo but not in the adult. So this is a scaffold from which you have to infer the network that's active in your particular tissue type and context. That's called active network extraction. What we now routinely do with cancer data sets and other disease sets — after creating the functional interaction network, which has now been published and on which we've built tools — is take the genes which are altered in your data set: somatic mutations in a cancer data set, genes which are hypermethylated or under copy number alteration or which have changed expression patterns. You extract from the overall network just the genes which are altered, which gives a subnetwork of usually a few hundred to a few thousand genes altered in the disease of interest. Then we use community clustering algorithms to identify the genes in that set which interact with each other more frequently than we would expect by chance, and annotate them. That usually gives us a series of disease modules, on the order of a dozen to two dozen.
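The extract-then-cluster step can be sketched with networkx: induce the subnetwork of altered genes, then run a community-clustering algorithm on it. The scaffold, the gene names, and the "altered" set here are all hypothetical — real FI networks have tens of thousands of nodes:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# A toy functional-interaction scaffold.
scaffold = nx.Graph()
scaffold.add_edges_from([
    # a DNA-repair-like cluster
    ("BRCA1", "BRCA2"), ("BRCA1", "RAD51"), ("BRCA2", "RAD51"),
    # a signaling-like cluster
    ("KRAS", "BRAF"), ("BRAF", "MEK1"), ("KRAS", "MEK1"),
    # a weak bridge between the two clusters
    ("RAD51", "KRAS"),
    # genes not altered in this hypothetical cohort
    ("ACTB", "BRCA1"), ("GAPDH", "KRAS"),
])

altered = {"BRCA1", "BRCA2", "RAD51", "KRAS", "BRAF", "MEK1"}

# Step 1: pull out the subnetwork induced by the altered genes.
subnet = scaffold.subgraph(altered)

# Step 2: community clustering to find candidate disease modules.
modules = [set(c) for c in greedy_modularity_communities(subnet)]
```

The clustering splits the subnetwork at the weak bridge, recovering the two densely connected groups as candidate disease modules, which you would then annotate.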
This is actually a very simple technique; it runs very quickly, and it works quite well for taking a very complicated data set with a lot of annotations you don't understand and turning it into a very simple overview of what's going on. Here, for example, is a list of 900 genes with somatic mutations in breast cancer, taken from the recent TCGA publication. We extracted those genes from the network, clustered them, and then automatically annotated the clusters. What comes up is signaling by tyrosine kinase receptors, focal adhesion, extracellular matrix, Notch and Wnt signaling, cell adhesion molecules, axon guidance — which is interesting; more and more cancer genomes are coming out with mutations in these pathways — and DNA repair. Looking at that, it kind of makes sense, and then you can zoom in and see the actual relationships among the genes involved, and use pathway colorization to go back to the pathway database and build hypotheses about what those mutations mean. Here's the same thing for pancreatic cancer. Again, you get some modules which are the same — axon guidance, cell adhesion, extracellular matrix, focal adhesion — and then there are ones which are different: MHC class II comes up, for example, and ERBB, EGFR, and KRAS come up here, which are not characteristic of breast cancer. What we find when we look at different cancers is that some modules are the same, and other modules are cancer-type specific. Now, this is a little bit better than just making pretty pictures: you can start to discover substructure in the patient population. In the same pancreatic cancer project, we're very interested in seeing whether there are subtypes of pancreatic cancer distinguished by different mutations. But if you just try to do this on the basis of individual genes, you don't get good clustering of samples.
So here we're looking just at single nucleotide variations. If there is no mutation it's blue, if there is a mutation it's red. The patient samples go across, the genes go down the columns, and we've attempted a hierarchical clustering, and you see basically no clustering of the patient samples at all; in fact, it looks like a mess. So, kind of disappointing. However, if we go to the module map and, instead of clustering the individual genes, we score each module on the basis of whether a patient had a mutation in one or more of the genes of that module and do the same thing, you get this. Now the patient samples are again in rows, modules one through 12 are the columns, and we actually see a very strong population substructure. We have one group of patients here, another group here, another group here, another group here, and then a very interesting group up here which is characterized by being negative for the KRAS module. The obvious next thing to do is to connect this to histological stage and clinical characteristics, to see whether this makes a difference to patient outcome, which we're doing, and I'll give you an example of that in a little bit. Before I move on: in addition to the Reactome FI network, there are other algorithms that do similar things. So HotNet, which was written in Ben Raphael's lab at Brown, can be used for expression or SNV analysis. You run it using a local installation of Python and MATLAB and you visualize the results inside Cytoscape, whereas, as you'll see in your lab, for the Reactome FI network everything is done on the server side.
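The module-scoring trick described above, collapsing a sparse patient-by-gene mutation matrix into a dense patient-by-module matrix before clustering, can be sketched as follows. Everything here is synthetic toy data: the patient counts, mutation rates, and module assignments are invented for illustration.

```python
# Toy sketch: collapse patient-by-gene mutations to patient-by-module
# scores, then hierarchically cluster the patients on the module scores.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

n_patients, n_genes = 8, 20
# Binary mutation matrix: rows = patients, columns = genes.
mutations = (rng.random((n_patients, n_genes)) < 0.1).astype(int)

# Map genes to modules (here: 4 toy modules of 5 genes each).
gene_modules = {m: list(range(m * 5, m * 5 + 5)) for m in range(4)}

# Module score: 1 if the patient has a mutation in any gene of the module.
module_scores = np.column_stack([
    mutations[:, genes].any(axis=1).astype(int)
    for genes in gene_modules.values()
])

# Hierarchical clustering of patients on module scores.
Z = linkage(module_scores, method="average", metric="hamming")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The point is dimensionality: clustering on a handful of module columns, rather than thousands of sparse gene columns, is what lets the population substructure emerge.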
Once you have Cytoscape installed, you can just run it from inside Cytoscape. And then there's an R package called WGCNA, used for expression analysis, which again uses network information to find clusters of genes which are co-expressed; you install it as a package within the R statistical language. So I'm going to close with a discussion of using networks to discover predictive and prognostic biomarkers. The idea of a biomarker is that you have a population of patients with a disease, and you can use a test, either a molecular test or histology or proteomics, to distinguish different classes of patients within that group that you would not see just by looking at the clinical characteristics of the patients. So for example, if you have a disease in which some people, the orange ones, have an aggressive disease and progress quickly, and others have an indolent disease whom you could perhaps afford just to watch and not give aggressive chemotherapy with its accompanying morbidity, you could test them to classify them into two groups: the high-risk group you immediately treat with surgery, radiation therapy and chemotherapy, and the low-risk group you give a more conservative approach, where they just get surgery and then you follow them. The challenges in discovering this type of biomarker are three-fold. The first is over-training: there are 22,000 genes, in cancer you typically see alterations in hundreds or thousands of genes, and the typical patient cohorts you have to work with are in the hundreds. So it's very easy to find a set of genes that nicely predicts survival in that one cohort of 100 patients, but as soon as you apply it to an independent patient cohort, it turns out that the biomarker doesn't work at all, and the field is littered, unfortunately, with papers on biomarkers that didn't replicate in independent studies.
Another problem is disease heterogeneity. We like to think that there's only one kind of ductal adenocarcinoma of the breast, but there are at least four and probably more, and if there are many subtypes of a disease, you need even larger cohorts than you think you need. Lastly, there's the problem of tumor heterogeneity: a single primary tumor can have subclones in it, and maybe one is the high-risk subclone and another the low-risk subclone, but we tend to analyze the tumor as one big piece of tissue, we mash it up and don't take its subclonal structure into account. I'm only going to talk about the over-training problem; networks can help with that problem, but not with the other two. The way that network analysis helps with over-training is that instead of training a classifier on 22,000 genes, you're training on 10 modules, and that reduces the number of ways that you can find an association by chance. The work I'm going to talk about was done by Guanming Wu, a research associate in my lab, and published in Genome Biology last year. This is work he did on breast cancer using a microarray expression set from 2002 published in the New England Journal of Medicine. This was 295 patients with 12,000 genes profiled by microarray; alongside the expression profiling, they measured how long each patient survived from initial diagnosis until death. The result was then evaluated with a completely independent data set from 2006, with roughly the same number of samples and roughly the same number of genes, where in this case they measured either recurrence or death.
So I'll show you the module map in a second, but the test was very simple. Guanming made the module map, got about 15 breast cancer modules, and then tested each module for association with survival. He found one, module number two, which is actually a very good predictor of survival in the estrogen receptor positive subclass of patients. ER-positive patients are usually considered to have a relatively good prognosis; they have less aggressive disease and they're responsive to tamoxifen and other estrogen receptor antagonists. But it turns out there's a subpopulation of patients with high expression of module two who do much more poorly, who survive for a much shorter length of time, than those with low expression of the genes in module two. You have all seen Kaplan-Meier graphs at this point, right? Good. Great. And the p-value is good, three times ten to the negative fifth. In an independent data set it replicates, and then we went on and replicated it in a bunch of other sets as well. So it's a good module, it's a good biomarker, and it's actually potentially clinically useful: estrogen receptor positive patients usually receive less aggressive therapy, so if you know that a patient is going to have more aggressive disease, you might want to give them more aggressive treatment. An advantage of the network analysis is that we can look at the relationships among these genes and try to explain why the biomarker works, and in fact when we look at what module two is, it's a module that involves Aurora B kinase signaling and kinetochore maintenance, both pointing towards a role in mitosis. So this is a marker of proliferation, which stands to reason: if the cells are proliferating more quickly, it's more likely to be an aggressive disease. So this is where the Reactome Functional Interaction Network stands.
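The per-module survival test just described can be caricatured as follows. Note the simplifications: all numbers are synthetic, and for brevity the comparison uses a rank-sum test on raw survival times, whereas the real analysis uses Kaplan-Meier estimates and a log-rank test that properly handles censored patients.

```python
# Toy sketch: score each patient's module-2 expression, split at the
# median, and ask whether the high group survives for a shorter time.
# Synthetic data; a real analysis must account for censoring (log-rank).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
n = 100
# Mean expression of module-2 genes per patient (synthetic).
module2_score = rng.normal(size=n)
# Synthetic survival: high module-2 expression -> shorter survival.
survival_months = rng.exponential(scale=60 / (1 + np.exp(module2_score)),
                                  size=n)

high = module2_score > np.median(module2_score)
stat, p = mannwhitneyu(survival_months[high], survival_months[~high],
                       alternative="less")
print(f"rank-sum p-value: {p:.2g}")
```

Running one such test per module, with multiple-testing correction, is how a single prognostic module like module two falls out of the map.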
This is the lab example that you'll do. The one problem with this type of analysis is that it only allows you to look at one type of alteration at a time. In these examples we've looked at single nucleotide mutations in the pancreatic data set, and in the breast cancer set we've looked at expression levels. You can look at CNVs, you can look at methylation, but you can't look at more than one alteration type simultaneously. Another deficiency is that we have all these little directed edges in here. We know that KIF20A is an activator of CDCA8 and INCENP and Aurora kinase B, but we're not actually using that information to take the expression data and predict the overall effect on pathway activity. So the integrative techniques, which are really just coming online now and are still an area of active research, promise to let you take various types of data, expression data, exome or genome sequencing data, copy numbers, microRNA and small RNA profiling, short hairpin RNA knockdown screens, et cetera, and, using the functional relationships within the network and the pathway, integrate them together to get out pathway activities, which you can then mine for relationships to clinical characteristics. The technique that, in my opinion, is showing the most promise is from Josh Stuart's lab at the University of California, Santa Cruz, and it's called PARADIGM. You feed PARADIGM two things: a directed network diagram, in this case a little example from P53, and multiple variation data sets from several samples. It builds a model of the effect of those variations on each individual pathway and gives you a map like this, with samples going up. I took this out of their paper, so it's rotated from what I showed you before, and each row here is a pathway.
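The integration idea, combining several data types per gene and propagating them through signed, directed edges to a pathway-level score, can be caricatured in a few lines. To be clear, this is not the PARADIGM algorithm, which uses a probabilistic factor-graph model; it is only a hand-made illustration of the concept, with invented genes, numbers, and edge signs.

```python
# Toy illustration of integrative pathway scoring: merge per-gene
# evidence (expression z-score + copy-number call) and combine it along
# signed edges into one pathway activity score. Not the PARADIGM model.
expression_z = {"EGFR": 2.1, "KRAS": 1.5, "TP53": -1.8}
copy_number = {"EGFR": 1, "KRAS": 0, "TP53": -1}   # gain/neutral/loss

# Signed edges into the pathway output node: +1 activator, -1 inhibitor.
edges = {"EGFR": +1, "KRAS": +1, "TP53": -1}

def gene_evidence(gene):
    # Naive integration: average the two data types per gene.
    return 0.5 * (expression_z[gene] + copy_number[gene])

activity = sum(sign * gene_evidence(g) for g, sign in edges.items())
print(f"pathway activity score: {activity:.2f}")
```

Here overexpressed, amplified activators and an underexpressed, deleted inhibitor all push the score the same way, which is exactly the kind of cross-data-type agreement a single-alteration analysis cannot see.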
So it's showing that in some samples these pathways are increased and in other samples this pathway is decreased. Here's another figure from their paper showing how you can use this to cluster, in this case glioblastoma multiforme, into four different groups based on the integrated effects on pathways. They've also developed a very nice circular visualization in which each spoke is a different patient and each ring is a different type of molecular alteration, and the pathway relationships among them are shown with directed arrows, so it's a great way of getting a sense of what's going on in the entire patient set. The bad news about PARADIGM is that you can only get it in source code form. It requires a bunch of third-party math and graphics libraries; they're open source, but it's very difficult to compile. I couldn't do it myself when I tried; Guanming was able to do it. There's scant documentation on how to use it: you have to format the pathway data in a particular way, but they don't actually give you any pathway data, and they don't have any examples of how to use it, so it's basically used by that group and nobody else. The good news is that because it is open source, the Reactome team is working on a web service implementation that we hope to roll out later this year or maybe early next year, with a Cytoscape plug-in to go along with it, so that you can run the PARADIGM analysis and it will use Reactome as its pathway database. So, to finally reach the take-home messages: I hope I've shown you that pathway network analysis can take complicated gene sets, reduce the complexity, and give you useful information from which you can make hypotheses and correlations to disease state. The analyses differ greatly in complexity, power, and usability.
The simplest types of analysis are the diagram colorization systems. More moderate is the active network extraction analysis that I showed you from Reactome, and the most complex are integrative programs such as PARADIGM. This is very much a work in progress; I wish there were more off-the-shelf tools that you could start using right now, but over the next few years I think this approach is really going to start to dominate the field of cancer genome analysis. And I've left you some URLs to take away. Thanks very much.