So, I am actually interim scientific director for OICR, and I am director of the Informatics and Biocomputing Department here at OICR. I'm also principal investigator for the Reactome Knowledgebase of biological pathways, and it's in that role that I will be talking today. We're going to be talking in more depth about how one can use information on biological pathways and networks to improve our ability to understand and interpret empirical genomic data. So, the main reason that people are interested in pathway and network analysis is the statistics. Genomics is filled with rare events: cancer mutations, rare polymorphisms in the germline genome associated with diseases. And typically, we are always running up against relatively small numbers of patients or samples or variants versus very, very large numbers of genes. So, when one runs, for example, a microarray on an experimental sample versus a control sample, you may see thousands of genes that have significant alterations in their expression, and that becomes a statistical hypothesis testing problem. You have to apply multiple testing correction, and typically you may not have sufficient power to distinguish the pattern you're seeing from random variation. So pathway and network analysis, this family of techniques, allows you to reduce thousands of genes to a smallish number of networks and pathways, typically in the tens.
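As a rough illustration of the statistical point above, here is a minimal Python sketch of how the multiple testing burden shrinks when thousands of gene-level tests become tens of pathway-level tests. The counts (20,000 genes, 50 pathways) and the 0.05 level are illustrative assumptions, not figures from the talk, and the Benjamini-Hochberg function is a generic textbook version rather than any particular tool's implementation.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of hypotheses rejected at false discovery rate alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    # Find the largest rank k whose p-value is below alpha * k / m.
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Assumed, illustrative test counts: one test per gene vs one test per pathway.
gene_level_cutoff = bonferroni_threshold(0.05, 20000)
pathway_level_cutoff = bonferroni_threshold(0.05, 50)

print(f"per-gene cutoff:    {gene_level_cutoff:.2e}")
print(f"per-pathway cutoff: {pathway_level_cutoff:.2e}")
```

Under these assumptions a single pathway-level test needs far less extreme evidence to survive correction than a single gene-level test, which is the gain in power the speaker is describing.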
And that allows you to reduce multiple hypotheses and to find meaning in long tails of, here it says cancer mutations because this is a cancer institute, but it can be any rare event, a rare copy number variation in the germline, for example. It allows you to tell biological stories: once you have a list of altered pathways in your hand, you have an entree into testable hypotheses about the mechanics of the association between the disease and the variant or whatever you're looking at. It also allows you to do things like predicting the function of unannotated genes, and it gives you a framework in which you can do quantitative modeling of biological processes. So, to be more specific, pathway and network analysis is a very broad term. It's any analytic technique that allows you to make use of biological pathway or molecular network information to gain insights into an experimental system. It's very rapidly evolving, with hundreds of papers being published every year, many different approaches, a broad consensus on a few techniques, and then a wide variety of other techniques which are more experimental. So in the examples I'm going to give you coming up, and in the laboratory that Robin Haw is going to give after this lecture, we are going to use a data set from 2013, which is the TCGA's pan-cancer analysis of 12 major cancer types. These were between 2,000 and 3,000 patients who had exome sequencing done, and in that set of cancer genomes, 127 statistically significant cancer driver genes came up. It's a long list, and we're going to look at various ways of making this set of 127 genes make sense. So having introduced that, we're going to step back, and now I'm going to talk about what the difference is between pathways and networks. I would like to have a laser pointer. Do I have one? There it is, okay, good, excellent, okay. So what's the difference between a pathway and a network?
Well, a pathway is the biochemical description of a biological process, similar to what you learned in high school and college biochemistry, where there is a series of reactions with reactants going in and products coming out. The output of one reaction becomes the input to the next. And so here is a depiction of the EGF pathway, if I can get my laser pointer to work. All right. In this depiction each of the squares is a reaction, and there are inputs to the reaction and outputs. So at the very top of this we have the EGF ligand and the EGF receptor, and there's a reaction in which they bind together to produce an EGF-EGFR dimer. This is inhibited by LRIG1. There's an intermediate product in which EGF is bound to EGFR, and then another reaction in which a dimer is produced, and this consumes a phosphate from ATP, and then there are further downstream reactions leading to a phosphorylated EGFR. Again, it's inhibited by SRC1. So we can continue; this is part of a much larger pathway. This captures all the details of stoichiometry and reaction and is the way we're used to looking at these things. However, there's another, simpler way of looking at it, which discards all the intermediate information, the detailed information that we don't really need to know about, like hydrolysis of ATP, and instead focuses on the core logic of the main reactants. EGF is stimulating EGFR, shown as an arrow here, and that is stimulating the production of the phosphorylated product. It's inhibited by LRIG1, and you can add into this other proteins that are known to be involved but where we don't know exactly the role that they're playing; they're interacting somehow. This gives a simpler model, which in many ways is easier to compute over, but it also gives you less detail about what's actually happening in the cell. So we're now going to talk about where information on pathways and reactions comes from.
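To make the contrast between the two views concrete, here is a small Python sketch. It stores a pathway as reactions with inputs, outputs and inhibitors, then collapses it into the simpler signed network by dropping small-molecule bookkeeping such as ATP. The class, the function, the placeholder name `INHIBITOR_A` and the simplified two-step stoichiometry are all assumptions made for illustration; this is not how Reactome or any other database stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class Reaction:
    """One reaction in a pathway-style (reaction-centred) model."""
    name: str
    inputs: list
    outputs: list
    inhibitors: list = field(default_factory=list)


def to_network(reactions, ignore=("ATP", "ADP")):
    """Collapse reactions into signed edges (source, target, relationship)."""
    edges = set()
    for rxn in reactions:
        # Drop the small molecules that the simplified view does not care about.
        sources = [m for m in rxn.inputs if m not in ignore]
        targets = [m for m in rxn.outputs if m not in ignore]
        for target in targets:
            for source in sources:
                edges.add((source, target, "activates"))
            for inhibitor in rxn.inhibitors:
                edges.add((inhibitor, target, "inhibits"))
    return edges


# Hypothetical, heavily simplified two-step version of the EGF example.
pathway = [
    Reaction("ligand binding", ["EGF", "EGFR"], ["EGF:EGFR"],
             inhibitors=["INHIBITOR_A"]),
    Reaction("phosphorylation", ["EGF:EGFR", "ATP"], ["p-EGFR", "ADP"]),
]
network = to_network(pathway)

for edge in sorted(network):
    print(edge)
```

The reaction list keeps the detail (what is consumed, what is produced), while the edge set keeps only who acts on whom, which is why the second form is easier to compute over and less informative.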
So pathway databases, formally called reaction network databases: the two main examples currently in the public database world are Reactome and KEGG. They both use this biochemical nomenclature to describe biological processes in great detail. The basic model is focused on the reaction: there are inputs that go in, outputs that go out, and then there are regulatory steps that can inhibit or activate that reaction, speed it up or slow it down. This is a very generic model; it can be used to describe many things. It can be used to describe intermediary metabolism, where for example you have glucose going in and glucose 6-phosphate coming out, the first step of glycolysis. Or it can represent the cleavage of preproinsulin to form insulin. It can describe the transport of a protein or a small molecule from the inside of the cell to the outside of the cell: the form that's inside the cell is the input, the form that's outside is the output. It can also be used to some extent to describe interactions between cells. So you can describe proteins, molecules, complexes, non-coding RNAs; all sorts of things can be fit into this model. KEGG, which is the oldest of the pathway databases still existing, is a very large and well-curated collection of biological information compiled from published material. Because it comes from published material, it means that there's a staff of curators who are taking all the knowledge from the papers and turning it into a data model from which they derive diagrams, a curated database. KEGG has information on genes, on proteins, metabolic pathways and molecular interactions, and it focuses on multiple organisms. It has vertebrates, it has reactions in invertebrates, and it has a very, very large section on prokaryotic pathways. And it organizes all these pathways into a series of diagrams. Here's a typical KEGG pathway diagram. How many people have not seen KEGG? Oh, a couple of people. Okay.
I recommend it. It's a great thing. So here is their depiction of the cell cycle. You probably can't read any of it here. Each of these green boxes is a gene. When you have two boxes together, it's a complex between two, sorry, not genes, but two proteins. And the arrows indicate the reactions here. They're using little circles to indicate the reaction centers. Reactome is newer, and it's distinguished from KEGG by its very sharp focus on human biology. So it typically has a lot of detail on human and less information on other model systems. In contrast to KEGG, for which you have to pay a licensing fee to download it, Reactome is completely open source and open access. It focuses, as I said, on human pathways. It has metabolism, a lot of intermediary metabolism, a lot of signaling pathways, and many other biological processes, including many developmental processes. Every pathway is traced by curators to the primary literature, and it's cross-referenced to other databases, including KEGG, and provides data analysis and visualization tools, some of which we'll be looking at. You've all had a module on gene set enrichment yesterday, right? So gene set enrichment is one of the core pathway and network analysis techniques that everybody agrees works and is a good thing to do. And so Reactome, as well as KEGG, offers an interactive pathway enrichment tool. This is looking at what happens when you upload the 127 TCGA pan-cancer driver genes into the Reactome enrichment mapper, and it shows you which pathways are enriched. So there are three panels here that I'll point out. You'll be seeing this in more detail during the exercise. The left side here is what's called the event hierarchy. It's kind of a top-down table of contents of all pathways.
It opens up to show the enriched reactions, and we have a bunch of signal transduction pathways, signaling by EGFR, by FGF, by insulin receptor, by NGF, all of which were statistically significantly enriched in that set. ERBB2 is particularly enriched here, and here I've opened it up to show some of the sub-pathways and reactions which are enriched. Here is the pathway map, which is a little scrolling and zooming panel that you can expand and pan through. And it's showing a graphical representation of what's here, with the green indicating significantly enriched complexes which contain the genes which are mutated. So you can zoom out here to get an overall bird's-eye look at what's enriched, or you can zoom in to see specifically where those mutations are and how they interact with other mutated genes. And then down here are the significance values, the p-value-sorted list of enriched pathways, and everything is linked together. As you click on these enriched pathways, they'll be highlighted here, and they'll be highlighted in the hierarchy browser as well. All right? Questions? Okay, we'll go on. So, what's the difference? Yeah, sure. When you say those are the enriched ones, the rest is the whole pathway? Yeah, so actually in this case the entire pathway is enriched. Okay, so it's a statistically significantly enriched pathway. But the question is, where are the mutated genes which contributed to the enrichment? Okay, so most of the protein products of these genes aren't floating around in the cell by themselves. They're involved in complexes. So what the green here is showing is the proportion of that complex that contains a gene that is mutated. And you can click on this and zoom in to see exactly which genes were mutated from your original list. Okay, so that gives you a sense of how the mutations contribute to the enrichment of the pathway.
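For readers who want to see what sits behind a pathway enrichment p-value, here is a minimal Python sketch of a one-sided hypergeometric over-representation test, which is one common way to score a gene list against a pathway. Whether a given tool uses exactly this statistic is not stated in the talk, and every count other than the 127-gene list size is a made-up illustration.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_query, n_overlap):
    """One-sided p-value: chance of seeing at least n_overlap pathway genes
    when n_query genes are drawn at random from a universe of n_universe
    genes, n_pathway of which belong to the pathway."""
    upper = min(n_pathway, n_query)
    total = comb(n_universe, n_query)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers: a 127-gene list, an assumed 20,000-gene universe,
# and an assumed 100-gene pathway that contains 8 of the listed genes.
p = hypergeom_enrichment_p(n_universe=20000, n_pathway=100,
                           n_query=127, n_overlap=8)
print(f"enrichment p-value: {p:.3g}")
```

A real analysis would compute this for every pathway and then correct the resulting p-values for multiple testing before sorting them.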
So one of the vulnerabilities, or one of the problems, of pathway databases is that different curators will choose to represent pathways differently. They'll emphasize different things. And so if you start digging into Reactome and KEGG, and you look at a pathway that they have both curated extensively, such as the intrinsic apoptosis pathway, here is a little bit of how Reactome represents the activation of caspase 8. And here is how KEGG represents it. In this case, KEGG has broken out all the members of the caspase family, whereas Reactome has grouped the ones that are equivalent. And you end up with pathways that look quite different from each other. You could go slightly crazy trying to reconcile results that you get in Reactome with results you get in KEGG. I think they're both legitimate ways of doing it, but they reflect value judgments, and it is confusing. And there are other databases, such as the NCI PID or the PANTHER database, that have also done the caspase pathway, and they have different diagrams as well. Fortunately, there's this very nice effort out of Memorial Sloan Kettering called Pathway Commons, in which about a dozen pathway databases have agreed to export their data using a common representation and submit them to Pathway Commons, where they are merged and integrated, and you can get consensus views of those pathways using a uniform representation through Pathway Commons. You can also download the pathways and use them as a basis for analysis. Okay, so I'm going to move now from pathways to networks. Yes? No, it's currently being updated. It was updated back in January. It was updated back in February, yeah. Okay. I actually hadn't realized that the old site was still accessible. They should just redirect. Oh, really? Okay. Yeah.
So, you know, I think it depends on whether you want a single style of doing it. If you want a single standard of curation, with the advantage that you'll get uniform decisions across each pathway on how to represent them, then I would strongly recommend either Reactome or KEGG. If, however, you're concerned about coverage and you want to maximize your chances of getting a hit in a pathway which may not have been curated by Reactome but was curated by KEGG, or vice versa, then the tools that Pathway Commons offers for gene set enrichment and for visualization are the better way to go. Or if you're hedging your bets, you do both. Which is what most people would do. Yeah. Yeah. So Reactome contains data from KEGG? It does not directly contain data from KEGG. It contains data that was independently curated from the literature. Pathway Commons contains information from KEGG, from Reactome, and from 10 other pathway databases. Okay. So now we're going to switch from talking about pathway databases to talking about network databases. And I'm going to give you just a quick intro on interaction networks, which should be familiar to you from yesterday. I'll just go through this quite quickly. So an interaction network is a very simple, straightforward data model in which there are just two components. There are nodes, which are genes or proteins or lipids or RNAs. And there are edges, which connect them. An edge means that there's a relationship between them, and what that relationship is depends very much on what that network database was set up to describe. The edges can be directed. They can have an arrowhead or a line indicating a directionality of the interaction, usually activation or inhibition. They can be weighted. They can be heavier edges to indicate a stronger relationship.
Or they can be just very simple, undirected, unweighted edges. I've already said this. And edges really can be any sort of relationship. I'll show you a few examples of different types of interaction networks. This is an example of a human transcriptional regulatory network from the ENCODE project, where each of the nodes is a gene and the edges indicate a positive or negative regulatory relationship between them. So the inner circle here is transcription factors and the outer circle is the genes that they regulate. Or you can have a network indicating interactions between the proteins of an infective virus and the host. That's what we're showing here. You can have a metabolic network showing relationships between small molecules identified by HPLC. You can have a protein-protein interaction database generated by proteomics or by yeast two-hybrid. Or you can have something such as a disease network. In the disease network, the nodes are actually diseases, and they're connected to each other by a metric that indicates how many altered genes they have in common. All right, which can actually be quite revealing. Okay, so when we're talking about a network database, you have to talk about specifically what the network database was designed to show, and decide which network database or which set of network databases is appropriate for the questions you're asking. So network databases can be built automatically from high-throughput omics data, or, in the same way the pathway databases are built, using curation from the primary literature. By and large, network databases have more extensive coverage of biological systems. So Reactome, which is one of the largest human-oriented databases, it is the largest that's open access, covers about, what is it, about 9,000, 8,000-something genes at the current time? Robin? Yeah, okay.
I should know the exact number. But that's still less than half of the human genome, whereas a typical network database will be in the 15,000 to 20,000 gene range, so it will have greater coverage. In most network databases, the evidence is more tentative than in a pathway database. Typically, it captures information about things which are related because they're, say, co-expressed in microarrays. But you don't exactly know what the cause of that relationship is, what the mechanism is. And there are quite a few good curated network databases. There's BioGRID, there's IntAct, and there's MINT. Each one has several hundred thousand interactions, typically covering on the order of 15,000 to 30,000 genes or proteins, and they have different standards for their curation, but their curation process is well described. A couple of examples here from IntAct. Because of the simple data model, you can ask simple, straightforward questions such as what interacts with p53. And in this case, it found 9,058 proteins that are described in the literature as interacting with p53. And then you have to use your judgment on how many of those relationships you trust. Yes, you actually can. You can filter it by the source, whether it was from the literature or whether it was a high-throughput experiment that was taken out of a table. You can filter, in some cases, by the weight of the evidence. p53 is an outlier because it's been so heavily studied, and it seems to interact with almost anything. I'm going to skip this one because it's not actually contributing to the flow of the talk much. So that's what a network database is. Any questions before we go into analysis? Okay, so pathway databases: curated, very rich descriptions of mechanics, but typically lower coverage of the genome. Network databases: shallower, simpler data model, easier to compute over, as you'll see, high coverage, but very shallow. You have a question, sir?
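The kind of query described above (what interacts with a given protein, filtered by where the evidence came from and how strong it is) can be sketched in a few lines of Python. The record layout, the source labels, the scores and the handful of example interactions below are invented for illustration; they are not taken from IntAct, BioGRID or MINT and do not reflect those databases' actual formats or APIs.

```python
from collections import namedtuple

# Hypothetical record layout: two interactors, an evidence source, a score.
Interaction = namedtuple("Interaction", "a b source score")


def partners(interactions, gene, sources=None, min_score=0.0):
    """Return the partners of `gene`, optionally filtered by evidence."""
    result = set()
    for it in interactions:
        if gene not in (it.a, it.b):
            continue
        if sources is not None and it.source not in sources:
            continue
        if it.score < min_score:
            continue
        result.add(it.b if it.a == gene else it.a)
    return result


# Made-up toy records; the scores are arbitrary.
toy = [
    Interaction("TP53", "MDM2", "literature", 0.9),
    Interaction("TP53", "EP300", "literature", 0.7),
    Interaction("GENE_X", "TP53", "high_throughput", 0.3),
    Interaction("MDM2", "MDM4", "literature", 0.8),
]

everything = partners(toy, "TP53")
trusted = partners(toy, "TP53", sources={"literature"}, min_score=0.8)
print(sorted(everything), sorted(trusted))
```

The unfiltered query returns every recorded partner, while the filtered one keeps only the interactions that pass the chosen evidence criteria, which is the judgment call the speaker says the user has to make.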
Reactome? It's a pathway database. So of the ones I've talked about, Reactome and KEGG are pathway databases, and IntAct, MINT, and BioGRID are network databases. Now, it's a little less clean-cut than I've described because, in fact, many of the pathway databases also have a section on network interactions. And Reactome, in fact, takes interactions from the IntAct database and brings them in and puts them up as an extra layer of information to try to give you the best of both worlds. Yeah, I'm trying to figure out, with the pathway versus the network, you don't know the binding to its inhibitor. Yeah. In IntAct, you'll just get a big list of 9,000 interactions of various sorts, and you wouldn't necessarily know what the order of events is or which way it's leading or what they're doing. So in Reactome, if you put in p53? Well, you'll get a handful of pathways in which p53 participates, and you'll get a browsable diagram of p53 in context with its pathway. And you also get a text description that a curator or an outside author contributed, a little mini-review, describing the role of p53 in apoptosis or cell cycle checkpoints or whatever you're looking at. So you know that this pathway, checkpoints, is related to p53, and you know how it's related? Yeah, that's correct. And in network analysis, you just get the list of genes. That's correct. Yeah, yeah. And you can actually do things with Reactome like generating a PDF, which will give you a somewhat large document that you can read through like a review. Not pleasant reading, I have to say, because it's generated by machine, but it's got all the information there. Like yesterday when you were reading? Yeah, so pathway databases are only going to give you information about pathways which are well described, that are in the literature.
They will be several years behind the cutting edge, because typically curators like to look at reviews and mini-reviews to get the context, and they like to see experimental results replicated before they go into the database. So a pathway database won't tell you anything about new bi... Well, won't tell you much about new biology that hasn't been studied, whereas an interaction database could bring in, you know, proteins that are poorly studied, that are just annotated proteins. They have really no information. And one or more of those might be something really important. Yeah, you'll have to ask Jüri if and when he comes back. I don't know. Oh, okay, yes. I know what you're talking about now. Jüri talked about using g:Profiler to look at old pathway databases like the ones in DAVID. What he was actually doing is not looking at the databases so much as the analysis tools. Because what analysis tools typically do is, in 2010, for example, the people who wrote DAVID went and took the pathway information out of KEGG and Reactome as they were at that time, put them into the special searchable format they use to do the analysis, and then opened it up to the world. So everybody who uses the DAVID tool uses the 2010 version of Reactome. And since then, Reactome has added 2,000 new genes, and so it's out of date. And in the paper that Jüri talked about yesterday, he wasn't limiting it to pathway databases. It was both pathway and network databases that he examined. But he didn't do a comprehensive look at all of them, because if you include all the major and minor pathway and network databases, there are easily several hundred of them. And they're always being updated. But the point of that paper was to show that you should look at the currency date of the tool that you're using and make sure it's been updated recently. Okay, now I'm going to move on. I'm going to talk about what you do.
Once you have the information, the pathway and network information, what do you do with it? This is from a Nature Methods Perspective that I worked on with a group of people about a year ago. The reference to it is at the end of the handout. After surveying the literature, what we did is we divided the types of analysis that one performs into three different categories. One is gene set enrichment, which you've heard about, and typical tools are GSEA and g:Profiler, which you've heard about, GoMiner, which you've probably heard about, and the various embedded enrichment tools. For example, the Reactome colorizer that I showed you; KEGG has a similar colorizer. And what each of these does is take something that is very connected. In the interaction network for the human cell, almost every single gene is connected to another gene. And you kind of arbitrarily slice it into a series of gene sets. If you're using a pathway database, it's usually pretty easy, because you take whatever the division is that the curators say. I have signaling pathways, I have NGF signaling, I have EGFR signaling. Even though, in fact, those pathways all interact a lot, they get broken into discrete sets. And then once you have your bags, you do a gene set enrichment and you look for statistical over-representation of your experimental gene set in those bags. And it gives you a series of enriched networks or depleted networks from the gene set you're looking at. This is kind of the standard practice. Everybody does this. The second and more sophisticated way of doing this is what we're calling de novo subnetwork construction and clustering. And so here, you take the pathway or the network. It's typically done on networks.
And you do not arbitrarily cut it into bags. Instead you take the list of genes that you have, the list of genes that are up-regulated on the microarray, or the long tail of rare cancer mutations, and you project them onto the network, and you see where they are, and you look for topologically unlikely groupings in the network. Is everyone following me? So you're looking for clusters of altered genes which are closer together in the network than you would predict by chance. And this is discovering, in theory, biologically significant relationships among those altered genes, which are telling you about how they interact with each other to generate the disease or other biological process that you're looking at. Yes, sir? Oh, no, no. What? Oh, you're just stretching. Okay, I've been trained to look for that. Yes? For the de novo subnetwork, what do you use as the background? Yeah, well, the devil is in the details. What do you use as the background? Each of the techniques has different ways of doing this. I'm just going to name a few: STRING, GeneMANIA, HotNet, GeneGo. There's a functional interaction plug-in for Reactome called ReactomeFIViz that Robin will be showing you. And many of them require you to choose a background. So if you're comparing cancer driver genes, you might choose all genes or all assayed genes as your background. If you're looking at the distribution of copy number variants in a cohort of patients with mental illness, you might choose the CNVs in the patients with mental illness against their sibs who don't have it. Some of the techniques, such as GeneMANIA, don't give you the option of providing a background; they just assume a uniform distribution across the genes. And if I actually listed all the tools here, there are probably about 100 tools that all try to do these similar things. You were saying whether they cluster together or not? Yes, topologically in the reaction map.
So if I choose a random set of genes, project them onto the network, and just sort of think of it visually, I've got a big hairball and I'm highlighting those genes. The null hypothesis is that they are going to be kind of scattered at random throughout the hairball. And the non-null hypothesis is that they're going to be clustered in some way. So if I'm looking at a biological process that involves over-activation of cell signaling pathways, then I'll get a large cluster in the portion of the network where the various GPCR receptors are, for example. So the difference from the first one is that in the first one you take the individual networks separately? In the first one I decided in advance where I'm going to draw the lines. Here you take the whole view, and you let the data tell you whether it's randomly distributed or not, and if it's not randomly distributed, you extract clusters which you make the claim are functionally related to each other and are related to the disease you study. Okay, yeah. Now you have a question. Well, they will give you non-identical results. Hopefully they will give you results consistent with each other. But consider that in gene set enrichment analysis, I may arbitrarily have taken, or the curators may arbitrarily have taken, a pathway which is related to the disease and, just because of history, they've split it in two. They said, well, this one has to do with EGF signaling, and this is the K-RAS pathway, the RAS pathway. They're actually two parts of the same thing, but we've split them just because of history, the way it was studied. It's possible that the gene set enrichment analysis, because you split that pathway in two halves, won't achieve statistical significance in either half. Or that it's going to give you a bunch of pathways which are all statistically enriched, but because of the historical way that they were given names, you don't actually see that, oh yeah, these actually do belong together.
And that's why you end up with these things like the concept maps. Did Gary talk about the concept maps yesterday? It's a technique that he and his lab developed to take gene set enrichment data and put them back together again based on sharing of genes. And it's trying to correct, in a post hoc way, the splitting that you get when you're doing a Gene Ontology-based enrichment test. So does it mean that you can use Reactome in both one and two? Yep, absolutely. So Reactome attempts to provide all three. I showed you that one screenshot from one, which is just the colorization of the map. Two and three Robin is going to show you; it's in a tool called ReactomeFIViz. Yes, that is often the way. For gene set enrichment, depending on the tool, they will often put genes into a single pathway and not allow them to appear in multiple pathways. If you allow them to be in multiple pathways, then you have to do corrections. Some of the more sophisticated ones, such as GoMiner, allow you to have the same gene in multiple GO pathways and correct for double counting. Okay, so the output from de novo clustering is going to be a series of subnetworks in which the algorithm has discovered a statistically unlikely clustering of the genes of interest. So it will give you a little subnetwork here, it may give you several others, and you then have to go and figure out what the significance of those networks is. And oddly enough, people frequently turn to gene set enrichment tools to do that. They'll take this whole subnetwork, which includes both the red genes, which are in your query, and some white ones, which are not in your data set but are highly interacting with the ones that you searched, and they'll put that whole thing into a gene set enrichment analysis to find out what this subnetwork does. And so the advantage of this is that it may allow you to discover novel biology which you would not get from fixed, predetermined sets.
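The core statistical idea behind de novo subnetwork discovery, asking whether the genes of interest sit closer together in the network than randomly chosen genes would, can be sketched as a simple permutation test. This is a generic illustration in plain Python, not the algorithm used by any of the tools named in the talk; the toy graph, the choice of mean pairwise shortest-path distance as the statistic, and the uniform background are all assumptions.

```python
import random
from collections import deque
from itertools import combinations


def bfs_distances(adj, start):
    """Shortest-path distance (in edges) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def mean_pairwise_distance(adj, genes):
    """Average shortest-path distance over all pairs in `genes`."""
    unreachable = len(adj)  # penalty distance for disconnected pairs
    total, pairs = 0, 0
    for a, b in combinations(genes, 2):
        total += bfs_distances(adj, a).get(b, unreachable)
        pairs += 1
    return total / pairs


def clustering_p_value(adj, genes, background, n_perm=500, seed=0):
    """Fraction of random gene sets at least as tightly grouped as `genes`."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(adj, genes)
    hits = sum(
        mean_pairwise_distance(adj, rng.sample(background, len(genes))) <= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


# Toy network: a 20-node chain with one extra edge forming a triangle 0-1-2.
adj = {i: set() for i in range(20)}
for i in range(19):
    adj[i].add(i + 1)
    adj[i + 1].add(i)
adj[0].add(2)
adj[2].add(0)

p_clustered = clustering_p_value(adj, [0, 1, 2], background=sorted(adj))
print(f"permutation p-value: {p_clustered:.4f}")
```

Swapping the `background` list for a restricted set (for example, only the genes that were actually assayed) is where the choice of background discussed in the talk would enter.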
And then the next step is, well, I've got this little enriched subnetwork, but what exactly does that mean for the cell? Is it driving the cell towards apoptosis? Is it driving it towards increased mitosis? And that's where the third category comes in. This is pathway-based modeling. Essentially you're building a computational model of what happens when you increase the activity of three genes in a network and decrease the activity of a fourth. What is the integrated effect of doing that, given the regulatory relationships among them? So it takes a series of mutations or other alterations, puts them on the computational model, and makes a prediction about what the cell will do. It'll grow faster or it'll die faster. And again, there are lots of tools here. I'm going to talk about one called PARADIGM, but the output in each case is a set of gene activities after you've integrated all the variations. And in theory this gives you the most explanatory power, because it can allow you to ask what-if questions. You can do things in principle like, well, if I've got these two mutations in a cancer pathway, what happens if I add a drug to that pathway that'll knock out a third gene? Can I use this to restore the pathway activity to the normal level, or can I use it to actually completely abolish the pathway activity and kill the cell? Can you repeat what it says? Yeah, I'm sorry. What does it say at the top, pathway? Yeah, it's a little hard for me to read it too. You're asking what it says on three here? Yeah, yeah, yeah. It says evaluation of potential network rules. Oh, it says pathway-based modeling. Yeah, okay. So here's a summary of this. For enrichment of fixed gene sets, the basic question you can ask is, what biological processes are altered in this cell? I said cancer here because it was a review of cancer-based methods. For number two, it's: are new pathways altered in this system, and are there clinically relevant tumor subtypes?
Again, cancer focused. And then the third is: how are the pathways altered in a particular patient who has multiple alterations? Are there perhaps targetable pathways that can be used to kill the altered cell or to restore it to a normal phenotype? So if I have a targeted panel for a set of genes, then ideally I should go with the second method first, identify the red dots and get the white ones here, and then go to the first step to get the enrichment? That would be a reasonable strategy. Okay, I'm going to move on. I don't want to run out of time here. So for pathway-based analysis, there are many problems in using pathway databases as is to do biological analysis. One is that the pathways are typically organized hierarchically, and those hierarchies are arbitrary. That will give large pathways greater weight in many analyses than small pathways, but worse is just the arbitrariness. And so one approach that people typically take is to take those beautifully curated hierarchies and just flatten them down into a system-wide network. And then we have the problem we talked about before, where the boundaries that have been drawn between curated pathways can arbitrarily break a single pathway into two smaller ones just for historical reasons, and you end up with this problem where genes and proteins which contribute to multiple pathways are only listed once. So again, what people do is flatten the pathways down into a network and then use the relationships to show the cross-talk between them. Another issue in pathway-based data analysis is that, if you're doing patient-based analysis, you typically don't have just one type of omics data, you have multiple types. You have copy numbers, you have gene expression, you have epigenetic data, somatic mutations, and you need to figure out how to model all those changes at the same time.
And I'm going to skip that. Okay, I'm just going to skip through things here a little bit to catch up. So for de novo subnetwork construction and clustering, this is type 2, it's just a review of what I said before. You take your list of altered genes, proteins, RNAs and you find topologically unlikely configurations: genes that are closer to each other on the network than you would expect by chance, either using a random distribution or a background of your choice. You then extract clusters of the unlikely configurations and you annotate them. Usually you annotate them with GO. To do network clustering you use the same techniques that have been used in analysis of the World Wide Web or LinkedIn or Facebook to identify communities of people who are interacting with each other more frequently than you would expect by chance. So exactly the same algorithms used for social networking are used here. And there is a suite of techniques, some of which are more accurate but slower, others of which balance it the other way around. Here are a few of the network clustering algorithms that are used: the Girvan-Newman method, Markov clustering. Girvan-Newman was for a long time the gold standard, highly accurate but slow. Markov clustering is less accurate, more stochastic, but it's a lot faster and it's scalable to very large sets. Then there's a method written by Ben Raphael at Brown University called HotNet, which models the network as a metallic lattice, heats up the nodes which are altered, and then traces out where the heat goes. And this is actually good because it avoids a problem with genes like P53, which are highly annotated and have lots and lots of edges leading out of them. It actually downweights P53 because it has so many connections.
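To make the heat idea concrete, here is a minimal sketch in Python. It is not the HotNet implementation. It is a simplified random-walk-with-restart style diffusion, and the function name `diffuse_heat`, the `retain` parameter, and the toy graph (a hub labeled TP53 with six neighbors, and a short chain A-B-C) are all assumptions made up for illustration. The point it shows is that a hub splits its heat across many edges, so each neighbor of the hub ends up cooler than the neighbor of a low-degree altered gene.

```python
def diffuse_heat(neighbors, initial_heat, retain=0.4, steps=100):
    """Toy heat diffusion on an undirected graph.

    At every step each node is re-seeded with `retain` times its starting
    heat and passes the fraction (1 - retain) of its current heat to its
    neighbors, split evenly among them.
    Assumes every node has at least one neighbor.
    """
    heat = {node: initial_heat.get(node, 0.0) for node in neighbors}
    for _ in range(steps):
        updated = {node: retain * initial_heat.get(node, 0.0) for node in neighbors}
        for node, adjacent in neighbors.items():
            share = (1.0 - retain) * heat[node] / len(adjacent)
            for other in adjacent:
                updated[other] += share
        heat = updated
    return heat


# Hypothetical toy network: one hub with six leaves, plus a chain A-B-C.
leaves = [f"n{i}" for i in range(1, 7)]
neighbors = {"TP53": leaves, "A": ["B"], "B": ["A", "C"], "C": ["B"]}
for leaf in leaves:
    neighbors[leaf] = ["TP53"]

# Both TP53 and A are "altered" and start with the same amount of heat.
heat = diffuse_heat(neighbors, {"TP53": 1.0, "A": 1.0})

# B (next to the low-degree gene A) ends up hotter than n1 (next to the hub).
print(heat["B"] > heat["n1"])
```

In a real analysis the retained fraction, the number of iterations, and the significance test on the resulting hot subnetworks all matter. This sketch only illustrates the degree effect.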
There's a method called HyperModules, which is a Cytoscape app that was designed to find clusters that correlate with clinical characteristics in patient sets. And then there's the ReactomeFI network, which gives you multiple clustering methods and some of the type 3 computational modeling techniques as well. So here's what a typical network clustering algorithm will give you. This is just a cartoon, but it's showing that there was a much larger network and from that large network six different clusters were extracted, some that are small, some that are large. Each cluster is more highly interconnected within itself than with the others, but you can still see that there's a lot of crosstalk between them. And your challenge then is figuring out what each of these clusters means. In the case of the ReactomeFIViz app, which Robin will show you, it gives you displays like this one, where we have a network cluster in which the size of the nodes indicates the frequency with which a gene is altered or mutated. It allows you to draw pictures, annotate them, and dig in to understand the details. It can help you generate hypotheses on how genes are related to the disease phenotype, and it also allows you to look at individual patients and see how they differ in terms of which network modules are affected. So in order to take a pathway database like Reactome and do computational modeling on it, you have to flatten out the data model in the way I described earlier. Essentially what one does is take a reaction, which has inputs, outputs, activators, inhibitors, catalysts and so forth, and turn it into a series of functional interactions. So for example, one functional interaction here will be that input 1 and input 2 interact with each other. They have to do that in order to participate in the reaction. The catalyst and input 1 interact. The catalyst and input 2 interact.
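The flattening rule being described here can be sketched in a few lines of Python. This is an illustration only, not the ReactomeFI code. The function name `reaction_to_interactions` and the dictionary layout of a reaction are assumptions, and the activator and output rules are the ones stated in the next few sentences of the talk.

```python
from itertools import combinations


def reaction_to_interactions(reaction):
    """Turn one reaction into a set of unordered pairwise interactions.

    `reaction` is a hypothetical dict with optional keys
    'inputs', 'outputs', 'catalysts' and 'activators'.
    """
    inputs = reaction.get("inputs", [])
    outputs = reaction.get("outputs", [])
    pairs = set()

    # Inputs must come together to take part in the reaction.
    for a, b in combinations(inputs, 2):
        pairs.add(frozenset((a, b)))

    # Each catalyst and each activator interacts with every input.
    for role in ("catalysts", "activators"):
        for regulator in reaction.get(role, []):
            for item in inputs:
                if regulator != item:
                    pairs.add(frozenset((regulator, item)))

    # Outputs interact with each other, as members of a complex.
    for a, b in combinations(outputs, 2):
        pairs.add(frozenset((a, b)))

    return pairs


example = {
    "inputs": ["input1", "input2"],
    "catalysts": ["catalyst"],
    "activators": ["activator"],
    "outputs": ["output1", "output2"],
}
interactions = reaction_to_interactions(example)
for pair in sorted(tuple(sorted(p)) for p in interactions):
    print(pair)
```

Using unordered pairs in a set means an interaction that shows up in several reactions is stored once, which is the behavior you want when many reactions are merged into one network.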
The activator interacts with input 1 and input 2, and all the outputs interact with each other in the context of a complex. And by doing this you've taken the pathway and flattened it out into a large number of interactions which can then go into these modeling tools. Also, once you have flattened a pathway database out into this type of thing, you can add yeast two-hybrid data from BioGRID and IntAct and bring in other interactions which weren't originally curated. You have to be careful when you do this, because you can easily create a mess of a lot of false positive interactions. When we did this for Reactome, and this is the work of Guanming Wu, who was originally a postdoc in my lab and now runs his own lab at OHSU as a faculty member, in order to construct the functional interaction network well, Guanming used machine learning to sort out high-confidence interactions from low-confidence interactions, to create a high-quality functional interaction network which has about 300,000 interactions and 12,000 gene products involved. This is growing slowly; it'll probably be about 13,000 by the end of the year. And this network was tuned to have very few false positives, so it's not as large as some other networks, but this is the basis for the pathway analysis tools that Robin will show you. Protein-protein interaction. I'm sorry, I'm skipping over this so that we don't run into the coffee break. This is just a little bit of that functional interaction network. It looks like this. This is 5% of a much larger hairball, and you see that the nodes are not randomly distributed. There's an intrinsic clustering; a lot of this is complexes, physical complexes. Okay, and then here, yeah, this is a little animation, I forgot this was here, showing how you can project altered genes and gene products onto this, find unlikely clusters,
bring in linkers that tie them together, and then that gets turned into a series of subnetworks that are connected to each other. So what happens when you apply this to the 127 cancer driver genes from pan-cancer? You end up with a much smaller set of interconnected subnetworks, and they all make a lot of sense. So here's a signaling subnetwork, here's a cell cycle subnetwork, here's a p53 subnetwork, and here are the non-GPCR signaling pathways. You can further break these up into sub-subnetworks and study those, but it gives you a very sensible picture. If you look at other disease processes, infectious disease or mental disease, you get a quite different set of modules. Sometimes you can use this directly for translational results. So for example, this is again work that Guanming did a couple of years ago. If you take this and look just at frequently mutated genes in estrogen receptor positive breast cancer, you end up with a series of 13 modules similar to the ones that I showed you for pan-cancer, but there's this one, which involves the cell cycle M phase and Aurora B kinase signaling, which is highly variant at the RNA level from one ER-positive breast cancer patient to another. And it turns out that patients who have low levels of expression in this cell cycle and Aurora B kinase submodule have a much better prognosis than those who have high expression. This is a Kaplan-Meier curve showing the proportion of patients surviving at 0, 50, 120 months, time going forward here. And this is actually such a strong prognostic factor that patients who have high expression in this subnetwork have as bad a prognosis as patients who are triple negative or estrogen receptor negative. So we were able to find, based just on expression levels of genes in this newly discovered network, a subset of patients who have an exceptionally poor prognosis, who otherwise
would have been thought to have a good prognosis, and they might be candidates for more aggressive therapy. So that's an example of discovering new biology just from doing a very simple type of analysis. I'm going to end on pathway-based modeling. In pathway-based modeling you actually preserve the functional relationships between the genes in the network, so you preserve the positive and negative regulatory relationships, and you also preserve the identity of what the nodes are: RNAs are treated differently from genes, which are treated differently from protein products. So you preserve the biological relationships, you create a computational model of this, and then this allows you to take multiple molecular alterations affecting different types of macromolecules and transform them into predicted altered pathway activities. And this is really where pathway modeling becomes systems biology. There are various approaches for this. The oldest approach is to use partial differential equations or Boolean models. These are techniques that were developed really for understanding metabolism and applied successfully to modeling fermentation in yeast and prokaryotes. They are mostly suitable for biochemical systems, and they become increasingly intractable as the number of macromolecules or nodes in your network exceeds about 10, so they're very good for very small systems. Then there are network flow models such as NetPhorest and NetworKIN, which were designed for signaling cascades, typically kinase cascades, and work well in that limited field. There are transcriptional regulatory network-based reconstruction methods such as ARACNE, which are designed specifically for transcription factors and their targets. And then finally there's a more general set of tools based on probabilistic graphical models, or PGMs, which scale well to very large networks and can be used for more general biological processes: they can be used for transcriptional analysis or they
can be used for kinase cascades or they can be used for other types of regulation. So the way the PGMs work, there should be a picture here but I think there isn't, is you develop a network model in which there are directed, weighted arrows indicating inhibitory or activating relationships. Each of the arrows has a weight associated with it, which attempts to model what happens if you double the activity of gene A: what does it do to the activity of gene B? It can be set to 1, in which case doubling A will double B, or it can be set to negative 1, in which case doubling A will decrease B by half, or it can be set to anything in between. And then the modeling system propagates those changes, typically using Bayesian reasoning, from the top of the network down to the bottom, and if there are regulatory cycles it attempts to account for the loops as well, to give a prediction of what the integrated activity changes are. The nice thing about this is that you can apply different types of omics data to these models. You can apply copy number variations, under the hypothesis that deleting a gene will reduce its activity to zero or amplifying it 100-fold will increase its activity. You can apply it to mRNAs, you can apply it to mutations and proteins. You have to be very careful, though, to model mutations correctly. So for example, if a mutation is an activating mutation, it's got to be modeled as activating the gene; if it's a loss of function, it's got to be modeled as a loss of function. Often you don't know, and so that's one of the limitations of this type of analysis. You can identify significantly impacted disease pathways and you can link those activities to patient phenotypes. So this is an example from PARADIGM, which is the first widely used of these PGM tools. PARADIGM takes a set of biological pathways, in this case a very simplified version of the apoptosis pathway where MDM2 is inhibiting P53, and then it expands this to account for what the
genes, RNAs and proteins are doing. So here we're showing that the MDM2 gene makes the MDM2 RNA, which makes the MDM2 protein, which then makes the MDM2 active protein. The same thing is happening here for P53: the gene is making the RNA, which is making the protein. And then there's a coming together here, where the active MDM2 protein and the P53 protein contribute to a P53 active protein, which if activated leads to apoptosis. Each of these is associated with a direction and a weight. The weights can be positive or negative; typically they're positive here, and this one, W7, is negative. And now you can put mutations on top. I think that's coming up at the end. Yes, yay, I remembered. We can say: well, what happens if we mutate the MDM2 gene? What happens if we have a copy number deletion of P53? What happens if our microarray or RNA-seq shows a big change in the levels of MDM2 RNA? What happens if mass spec shows that P53 is not folding correctly and is being degraded? You can put all those in on a patient-by-patient basis and ask what happens to the downstream phenotype that we care about. And this, surprisingly enough, seems to work pretty well. So this is a figure from the first PARADIGM article, published in 2010. PARADIGM is from a collaboration between David Haussler and Joshua Stuart at UCSC, and they're looking at glioblastoma multiforme from the TCGA project. They took a series of about a hundred different pathways and modeled the activity changes in each of those patients, and they found that different patients have different reproducible alterations in the activities of some pathways: the GATA pathway, E2F, the EGFR pathway. Those distinguish different clusters of patients. If you were looking at the level of individual genes there wouldn't be clear clusters, but when you look at the integrated pathway activities there are clear relationships between patients who are deficient in GATA interleukin signaling, patients who
have activation of EGFR, and so on. And then you can compare these four clusters of patients to see their clinical characteristics, and in fact some of them have different prognoses: some of them have more aggressive disease than others, and some of them are also distinguished by different histopathology. Okay, here's another example, from ReactomeFIViz: an ovarian cancer patient network. We're looking here at two different ovarian cancer patients from TCGA, and at the pathway activities around P53. In one case, this patient has very high P53 activity and low CREB-binding protein activity. In the other case, this patient has high CREBBP and low P53 activity. So even though they have similar mutation profiles, they're different at the pathway activity level. So there's good and bad news about PARADIGM. The bad news is that PARADIGM itself is very difficult for people outside of UCSC to use, because although it's open source, it's hard to get it running, and they don't provide any pathway models, so you have to make them yourselves. Documentation is scant, it takes a long time to run, and it requires a cluster to run. We were very excited when we read the PARADIGM papers a few years ago, so Guanming Wu and others in my lab took PARADIGM and incorporated it into ReactomeFIViz, and we published Reactome-based pathway models and have improved the performance, so you can now run it interactively, and Robin will show you how to do this. And then I'm just going to end with references and links for you, and I'm happy to take questions, and then we go into a coffee break. The lab starts at 10:30, so maybe we could take a couple of questions, then people can go to coffee break, and I'll stay and answer your questions during the break. Okay, yes. No, and I think that is the challenge of doing the modeling correctly. It depends on the context. So if you have a deletion
of the gene, that's an easy case: you directly model that as a loss of function. If you have an amplification, you actually don't know. The RNA levels often but not always increase as you increase the copy number; sometimes you increase the copy number 100-fold and the RNA level doesn't change. RNA levels are pretty good, because if the RNA level goes up you can think the activity probably goes up, and if it goes down the activity goes down. Mutations are very tricky. Many mutations are loss-of-function mutations, but they have to be distinguished from activating mutations, and they have to be distinguished from silent mutations which aren't doing anything, and how to do that is a whole other kettle of fish: mutation significance prediction. There are various families of techniques for trying to make that guess, ranging from very simple techniques looking at clustering of the mutations. If all the mutations are always hitting the same residue in multiple patients, it's likely to be an activating mutation, because there are relatively few sites that you can mutate and have a gain of function, whereas if they're scattered all over the gene it's likely to be a loss of function, because there are many places where you can do that. But it's not a hard and fast rule. More sophisticated techniques use 3D modeling to predict what the effect on the active site is. Or you can use genetics, and you can say: well, in cancer patients I always see two hits here, this is a recessive mutation, because I have to have a deletion and a nonsense mutation, so that's got to be a loss of function. Whereas if I see a dominant pattern, where the mutation typically affects one copy of the gene and not the other, then that's probably activating. That's using dominant-recessive relationships. Or you do them all and try to guess. I'm more interested in the mutation. Yes, you have to. If you want to model activity, you
have to have a good guess at what the effect of the mutation is before you try to model it, otherwise it's going to be crap. Any other questions before we go to break? Okay, one more. So when possible, if we have, say we have great indications, we can get a model... No, I think the idea of tuning, if you're saying tuning the weights in order to get the outcome that you want, no, that's not a misunderstanding. So like if you have a mutation in the gene, how that mutation will affect a patient or somebody, and then you can relate it back, so it's like the training of the model. Sure, and in fact, if you have an experimental system where you are systematically introducing the mutation of interest by CRISPR, or you are using shRNAs to up- and down-regulate the genes, then you can actually train the model, learn the weights, and then have a much more predictive model than one that came out of curation, which may not capture all the regulatory relationships. It can be quite difficult to generate the requisite data sets to train