So, hello everybody. I'm going to give you an overview of network and pathway analysis, leading up to a practical that my associate Robin Haw, known to you all, will lead you through after the lecture. Okay. I think you may have heard a lot of this from Jüri yesterday, so just shuffle your feet or wave at me or something if I'm going over well-trodden ground. The main reason people are interested in doing pathway analysis in cancer is the dramatic size reduction of the data set. Cancer is notorious for having a long tail of rare mutations. These are mutations which occur relatively rarely, in one or a couple of cancers per 100 donors. They don't pass statistical tests for recurrence, so you can't document that they're drivers. Yet they have a functional impact. They may affect genes that you think might be important, but you can't document that. And the idea of pathway analysis is that even if the mutation in a particular gene doesn't occur frequently enough to give you the statistical power to document its driverness, the pathway, the molecular mechanism or molecular process that the gene participates in, may be a driver pathway. By aggregating a set of genes that are rarely mutated into a set of pathways that are more commonly mutated, you achieve the statistical power to make a statement of statistical significance. At the same time, it allows you to tell a coherent biological story about why these genes are contributing to the cancer phenotype. I'm going to talk about pathway and network analysis kind of interchangeably. They really are two aspects of the same thing, and it depends on whether you're coming at the task from a more biological, biochemical orientation or from a more computer science, network, mathematical model orientation.
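The aggregation argument can be made concrete with a small sketch. The donors, gene names, and pathway membership below are invented for illustration and do not come from any real cohort or database; the point is only that genes mutated in 10% of donors each can add up to a pathway hit in 30% of donors.

```python
# Hypothetical toy cohort: donor ID -> set of genes mutated in that donor.
donor_mutations = {
    "D01": {"GENE_A"},
    "D02": {"GENE_B"},
    "D03": {"GENE_C"},
    "D04": {"GENE_D"},
    "D05": set(),
    "D06": set(),
    "D07": set(),
    "D08": set(),
    "D09": set(),
    "D10": set(),
}

# Hypothetical pathway membership (not taken from any real database).
pathway_genes = {"GENE_A", "GENE_B", "GENE_C"}


def gene_frequency(donors, gene):
    """Fraction of donors carrying a mutation in one gene."""
    hits = sum(1 for genes in donors.values() if gene in genes)
    return hits / len(donors)


def pathway_frequency(donors, pathway):
    """Fraction of donors with a mutation in at least one pathway gene."""
    hits = sum(1 for genes in donors.values() if genes & pathway)
    return hits / len(donors)


for gene in sorted(pathway_genes):
    print(gene, gene_frequency(donor_mutations, gene))
print("pathway", pathway_frequency(donor_mutations, pathway_genes))
```

A recurrence test would then be run on the pathway-level frequency rather than on each gene separately; that test is not shown here.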
So in general, this type of analysis is any analytic technique that makes use of biological pathway or molecular network information to gain insights into a cancer or other disease or biological system. It's a very rapidly evolving field. There are literally thousands of papers published on different network techniques every year, with many different approaches. I want to start by distinguishing between the pathway and the network views. I'm showing here a very simplified version of the same biological pathway, the EGF receptor and its immediate downstream effects. And you can look at the same pathway in two different ways. You can use a traditional pathway-oriented view. By the way, do I have a laser pointer? Oh, yeah, I'll just use the mouse. Okay, great. So here is the Boehringer Mannheim biochemical, Lehninger view of EGF. The EGF ligand binds to the EGF receptor. There's a negative regulator here, LRIG1. It creates a conjugate between EGF and EGFR. That leads to a dimerization reaction. There's a hydrolysis of ATP leading to a phosphorylated form that then leads to later events, which aren't shown here. And there's another inhibitor at this step, SHC1. So this is a traditional way of looking at it, but it's actually mathematically somewhat difficult to deal with because you have many different types of interactions. You have a binding interaction, you have a dimerization, you have a hydrolysis, you have topological effects here, and modeling each of these is quite difficult. So what computer scientists prefer to do is to turn it into a network, where they don't care about the precise nature of the events that occur or their precise order, but they do care about their directionality. And what we're showing here is now EGF shown as activating EGFR, which is then activating (that's probably a mistake, actually) SHC1. LRIG1 is inhibiting EGF, and SHC1 here should be shown as inhibiting EGFR.
Interestingly, I've shown this a zillion times and only noticed the mistake this time. What you can also do with a network is add in nodes and interactions which are not that well characterized. So for example, if you have a proteomics experiment that shows that additional genes such as KRT17 and SOS1 are interacting with these various proteins, or you have a genetic suppressor-enhancer screen that shows a genetic interaction, you can layer them onto the network and make some inferences about them even if you don't know precisely what they're doing or how they're interacting at the molecular level. The other thing you should know is that you can convert from the pathway-oriented version to the network-oriented version in a straightforward fashion. Unfortunately, you can't go back the other way, because there's information loss. So in all network and pathway analysis, you start with two main ingredients. The first ingredient, which should be familiar to you from yesterday, is a list of the altered genes, proteins, RNAs, whatever it is you're studying in your biological system. A typical experiment is a cell line treated with a drug versus not treated with a drug: you do an RNA-seq or expression array, and you identify up- and down-regulated genes. Second, you need a source of pathways or networks to do the analysis on. Pretty simple. So let's talk about what these ingredients are. In the previous five days, you've talked about generating lists of genes, proteins, RNAs, and so on. We won't go through that. But where do you get pathways and networks? Well, they come from two sources. The main source is pathway databases. These are curated, usually very labor-intensive databases which capture biological processes in a biochemical style. They capture cause and effect very well, and exact biological interactions.
Sometimes they're detailed down to the particular amino acids which are altered within a protein by a ubiquitination or phosphorylation step, and they're very amenable to human-interpretable visualizations. You can draw a picture of the pathway, you can show where the alterations are, and usually, because we're trained to do this, it makes sense to us. The disadvantage of the pathway databases is that their coverage of the genome is sparse. Because we don't know everything there is to know about biological pathways, we're only able to say something intelligent about maybe three-quarters of the genes in the genome, and extracting that information from literature and high-throughput experiments is manually intensive, so pathway databases tend to be incomplete. Also, because pathways are to a large extent subjective, and depend on what part of the thread of the big tangled ball of yarn a particular researcher started pulling on, different databases and different curators disagree on the boundaries of pathways. So, for example, all the signaling pathways, EGFR, PIK3CA, TGF-beta, have many interactions within the cell. Yeah, do you have a question? To me, pathways are also cell-dependent; the pathways in stomach cells can be different from those in other cells. So, when they build these databases, what's their reference? How do they say this pathway is specific to X cell type? Yeah, very good point. Oh yeah, right, we're on tape. The question is, how do pathway databases deal with the fact that different pathways are active in different cells? And in fact, the same pathways can be routinely rewired and reused in various ways. To various extents, the pathway databases are aware of the problem and attempt to deal with it as well as possible, usually by presenting consensus pathways, which are kind of a unification of all the possible pathways that might be going on in any cell type.
And then when there's a strong cell-specific dependency, they show a cell-specific pathway. So, in the Reactome database, for example, we know that the liver is very different from the muscle, and so when there are overlapping pathways between the two that use different isoforms of enzymes, we specifically annotate what cell type it occurs in. And the idea of a consensus pathway is that you can take the expression pattern from a particular cell type and filter the pathway, removing genes that are not actually being expressed in that cell type, and thereby prune the pathway. But that's another of the problems. It's also a problem with network analysis. So, here's a typical pathway diagram. This is from the Kyoto Encyclopedia of Genes and Genomes, KEGG. And this is actually a fascinating union pathway because it includes both eukaryotic and prokaryotic genes. That's another detail: some pathway databases do species-specific annotation and others attempt to capture all of possible biology and put it in one place. KEGG does this by reducing genes, proteins, and RNAs to enzymatic activities. So, it's very focused on intermediary metabolism. Reactome, which is the project that Robin and I work on, along with researchers at NYU under Peter D'Eustachio and the European Bioinformatics Institute under Henning Hermjakob, is explicitly human-centric. It focuses on curation of just documented human pathways and has a focus on disease-related pathways. So, we do a lot on cancer, a lot on signaling pathways in cancer, a lot on regulation of cell growth and migration. So, I'll talk about, oh, wow, this is out of date. So, Reactome is hand curated. It follows rigorous curation standards. Every reaction is traceable to the primary literature.
When there's been an experiment that was performed on a model system, such as a mouse, we curate it in mouse and then explicitly project it onto human proteins and provide reasons why we think that it's going on in the human. Actually, this slide was put together several years ago. We're now at over 10,000 human genes. Robin, what's the current number in the last release? 10,200, something like that? Okay. Over 10,000 genes and over 1,500 human pathways. It features a very nice scrollable, zoomable reaction diagram that actually adapts to the level of zoom to show more detail as you get closer. It has some online analytics, including finding pathways contained in your gene list and calculating gene overrepresentation in the pathways. And it allows you to project human pathways onto other species. And the nicest feature about it is that it's completely open access. It's actually the largest open access pathway database currently in existence. So that's pathways. Now, where do networks come from? Well, networks are much broader and they will cover less well understood relationships. You can take anything that relates entities in the genome, whether they're proteins or genes or RNAs or lipids, in any type of interaction, genetic ones, physical ones, co-expression, or more abstract things like sharing Gene Ontology terms or being close to each other in pathways, and build a network out of it. And these networks by and large are useful. Network databases can be built automatically or via curation. They typically have more extensive coverage of biology. But the trade-off is that the relationships and the underlying evidence are more tentative and more noisy. So here are a few sources of curated networks. You can just sort of see the order of magnitude of the numbers. BioGRID, at the time I made this slide a couple of years ago, had over 500,000 genes and 167,000 interactions across multiple species.
And IntAct had 60,000, two or three thousand interactions. MINT had fewer. It depends on the size of their staff and their standards for curation. There are several hundred pathway and network databases. It's very hard to keep track of them. A resource that I recommend is called Pathway Commons. It's maintained at Memorial Sloan Kettering and it has two features. One is that it has a big index of all the pathway and network databases, kept relatively up to date. And second, for the major pathway and network databases, there is a pathway interchange format called BioPAX in which all these source databases upload their pathways into the Commons for people to search and compare. So it's a union of databases. Okay. Typically you won't be interacting directly with these pathway or network databases, except maybe to do visualization. You'll mostly be using people's software tools such as Jüri's g:Profiler or the Broad Institute's GSEA. They draw on these pathway databases and create their own data sets that are used as the substrate for the analysis. So let's talk about the types of analysis you can do. There are basically three different ways that you can analyze pathways and networks. The simplest one, and that's the one that you learned about yesterday, is enrichment of gene sets. Here you have a set of buckets, based on shared pathways or shared Gene Ontology terms or shared participation in biological processes. And you look for statistical overrepresentation of the genes in your gene list in one or more of these buckets, to identify pathways which are enriched or depleted in your gene list. Okay. That's the first type. The second type is where you don't have a preconceived notion of what the pathways and networks are or what the buckets are. Instead you let your data discover those networks.
Typically you take a big network of all the possible interactions that people know about, or all the possible pathways that people know about, and filter them through your gene list to identify subnetworks in which your genes are participating more frequently than you would expect by chance. That allows you to discover de novo networks or de novo clusters. And there are quite a few popular analytic algorithms in this class. The final one is pathway-based modeling, where you have built a causal model of some pathway of interest, such as PIK3CA signaling. And then you put your data set, genes that are mutated in your cancer, on top of this model of the pathway to see what its integrated effects are. This would allow you, for example, to test out hypotheses for synthetic lethals: what happens to the pathway when two different genes are inactivated simultaneously. Each of these three types of analysis has different advantages and disadvantages. Type one is a great way of finding out quickly what biological processes are altered in the cancer or the system that you're looking at. So it'll give you a quick read that, well, I'm looking at regulation of the immune system, or I'm looking at changes in cell motility, or I'm looking at something affecting the cell cycle. The second one allows you to find new pathways altered in the cancer and also allows you to find clinically relevant tumor subtypes. So, for example, this subtype of medulloblastoma affects the hedgehog pathway. And this one affects a pathway that seems to be a combination of Wnt signaling and ARID1A. And the final one, pathway-based modeling, allows you to dig into those pathways in detail to predict, and later test, how the pathway activities are altered in a particular patient who may have a distinct cluster of mutations, to identify targetable pathways in the patient, and to do in silico drug screening style experiments.
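The first type, overrepresentation of a gene list in fixed buckets, is commonly computed as a one-sided hypergeometric test. The sketch below is one minimal way to do that using only the standard library; it is not how g:Profiler or GSEA work internally, the gene names and sets are invented, and it omits the multiple-testing correction that any real analysis over many gene sets would need.

```python
from math import comb


def overrepresentation_p_value(universe_size, set_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from universe_size genes, set_size of which belong to the gene set."""
    total = comb(universe_size, list_size)
    upper = min(set_size, list_size)
    tail = sum(
        comb(set_size, i) * comb(universe_size - set_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def enrich(gene_list, gene_sets, universe):
    """Return {set_name: (overlap, p_value)} for every gene set."""
    genes = set(gene_list) & universe
    results = {}
    for name, members in gene_sets.items():
        members = set(members) & universe
        overlap = len(genes & members)
        p = overrepresentation_p_value(
            len(universe), len(members), len(genes), overlap
        )
        results[name] = (overlap, p)
    return results


# Hypothetical universe of 20 genes and two hypothetical gene sets.
universe = {f"G{i}" for i in range(1, 21)}
gene_sets = {
    "set_1": {"G1", "G2", "G3", "G4"},
    "set_2": {"G10", "G11", "G12", "G13"},
}
altered = ["G1", "G2", "G3", "G15"]

results = enrich(altered, gene_sets, universe)
for name in sorted(results):
    overlap, p = results[name]
    print(name, overlap, round(p, 4))
```

The choice of `universe` matters: it should be the set of genes that could have appeared in the list (for example, all genes assayed), not every gene in the genome by default.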
So we're going to go through them quickly. The enrichment of fixed gene sets I think you've already seen pretty well. The advantage of these methods is that they're very easy to perform. They're good end-user tools and the statistical models have been well worked out. These are stable pieces of software that will give you results. The disadvantages are, first, that you have to pick your gene sets, because there are many, many different dimensions of gene sets to look at and you'll get different answers depending on which gene sets you check. The gene sets are very heavily overlapping, and so typically you will get many similar-sounding pathways or processes, and you have to go through a second step of aggregating them together using enrichment maps, which you talked about yesterday, right? You did enrichment maps? Okay. That's a technique, which I'm not going to cover, for taking pathway hits, grouping them together according to shared genes, and simplifying the results. And finally, because most of the fixed gene set analytic tools treat the gene sets as just bags of genes without any relationship among them, you're left to sort out the relationships among the genes inside those bags. So, de novo subnetwork construction and clustering, which we will talk about, is where you take a biological network and apply a list of altered biological entities, genes, proteins, and RNAs, to that network to identify topologically unlikely configurations among your list of altered entities. That is, you're looking for a subset of the genes or proteins in your test data which are closer to each other in the network than you would expect by chance. That is, they're talking to each other in some way. They're not randomly distributed across the whole network. You can then extract clusters from these unlikely configurations and then annotate them by adding biological meaning to them. So, here's an example of doing this in Reactome.
This is using a tool called Reactome FIViz, which is a Cytoscape plug-in. You start out with a network of genes or proteins. In this case, it's using a Reactome-generated network called the Reactome Functional Interaction Network. It's actually a union between curated pathways that have been converted into networks and uncurated interaction evidence from a variety of sources. Then you apply your gene list to this big hairball and you extract a smaller set of interacting clusters, each of which contains a statistically unlikely number of the genes which are altered in your gene set. To cluster them, it actually uses the same algorithms that are used to analyze social networks such as Facebook. Then, the tool allows you to annotate each of these clusters. Here is, for example, a big cluster of cell cycle checkpoints. Here's a TRAIL signaling area. Here's a Rho GTPase signaling area. Here's DNA repair and p53. Just by looking at this, you can tell that it's a cancer. In fact, this is a breast cancer cluster. Here's a little closer look at the Reactome FI network. We're seeing 5% of it. It's fairly accurate. It has few false positives by a variety of measures. Here's how it can be applied to a human cancer. This is a typical cancer. This is whole genome sequencing of 50 pancreatic cancers. There are more than 200 recurrently mutated genes among them. Number one is KRAS, with almost 95% of the tumors covered, followed by p53, followed by SMAD4. And then here is the long, long, long tail that stretches way out, of genes which are mutated two or more times in that data set. You can only show statistical significance for the first four of these in this small data set. After extracting from the network and clustering them, you actually get a much cleaner set of pathway clusters. There are roughly 10 of them here. They typically make sense. We know that axon guidance is one of the driver pathways in pancreatic cancer.
In fact, axon guidance comes out here in a cluster that also includes EGFR and ERBB. Here's p53. Here's another axon guidance cluster, and hedgehog. You can then make these clusters the unit of analysis. We can ask, is the distribution among clusters different among different patients with pancreatic cancer? And in this case, that actually works quite well. We have the various modules here. We're scoring them according to whether a patient has a mutation in one or more of the genes within those modules. Here are those 50 patients, and they actually cluster very nicely into a large cluster of tumors here, which have both modules one and two frequently mutated. Type three, which is only module two. Type two, which is a combination of the three modules one, two, and ten. And then a rare type up here, which is negative for those, but is positive for module seven. So you can discover tumor subtypes in this way. Sometimes these subtypes will tell you about clinically relevant features. Here is a case in which we used the same technique in breast cancer. We actually took a series of expression experiments across 200 breast cancers and found that one of the modules, one involving Aurora B signaling and cell cycle M phase, is really an excellent predictor of survival. These are all estrogen receptor positive breast cancer patients. They generally have a good prognosis. However, patients with increased expression of any of the genes in this data set have a much worse prognosis. In fact, their survival curve is as bad as that of the triple negative patients. So you could propose developing a biomarker based on this that would identify estrogen receptor positive patients who might have a more aggressive form of the tumor and might need more conservative therapy. So here are some popular network clustering algorithms that you can use.
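Before the list of tools, the module-scoring step just described (a patient is positive for a module if any gene in it is mutated) can be sketched as follows. The module names, genes, and patients are invented, and grouping patients by identical module profile is a deliberate simplification standing in for whatever clustering method produced the subtypes on the slide.

```python
from collections import defaultdict

# Hypothetical modules: module name -> genes assigned to that network cluster.
modules = {
    "module_1": {"GENE_A", "GENE_B"},
    "module_2": {"GENE_C", "GENE_D"},
    "module_7": {"GENE_E"},
}

# Hypothetical patients: patient ID -> genes mutated in that tumor.
patient_mutations = {
    "P1": {"GENE_A", "GENE_C"},
    "P2": {"GENE_B", "GENE_D"},
    "P3": {"GENE_C"},
    "P4": {"GENE_E"},
    "P5": {"GENE_D"},
}


def module_profile(mutated, modules, names):
    """1 if the patient has a mutation in any gene of the module, else 0."""
    return tuple(int(bool(mutated & modules[name])) for name in names)


def group_by_profile(patients, modules):
    """Group patient IDs that share the same module hit pattern."""
    names = sorted(modules)
    groups = defaultdict(list)
    for patient_id in sorted(patients):
        profile = module_profile(patients[patient_id], modules, names)
        groups[profile].append(patient_id)
    return dict(groups)


groups = group_by_profile(patient_mutations, modules)
for profile, members in sorted(groups.items(), reverse=True):
    print(profile, members)
```

With real data the binary patient-by-module matrix would normally be passed to a clustering routine rather than grouped by exact match, since few patients share an identical pattern across ten or more modules.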
So GeneMANIA, which was developed here in Toronto, at the University of Toronto, by Gary Bader and Quaid Morris, follows a birds-of-a-feather principle. It has a beautiful web-based interface. You upload your gene set and it finds clusters of genes which are related to, or interact with, those in your list. It's extremely easy to use and very powerful. A somewhat more difficult piece of software to use, which needs to be installed on your desktop, is HotNet from Ben Raphael at Brown University. It uses a more sophisticated algorithm that, given a network, models the network as a metallic lattice. You insert your gene list with values. That is, if you know a gene is activated, you put it in as a hot value. If you know it's inactivated, like a p53 loss-of-function mutation, you put it in as a cold value. And then it propagates the hot and cold genes across the network to find clusters which are predicted to be increased or decreased. That will give you regions of the network, pathways, that are predicted to be activated or inactivated. And its big advantage concerns unusually well annotated genes like p53, which have thousands of literature references and seem to interact with everything in the cell. Those are actually down-weighted. There's normally a strong bias to pull those up as significant, and this corrects for that to some extent. p53 will not come up unless it actually is involved. Then there are two applications for Cytoscape. How many people know Cytoscape? You've all seen Cytoscape. Great. So there are two nice apps that you can get on the Cytoscape store. One is HyperModules, written by Jüri. Did he talk about it yesterday? Okay. Well, I recommend it. It's good. It is designed to find network clusters that correlate with clinical characteristics. So you put in lists of patients and the genes that are changed for patients who have some clinical characteristic that you care about, such as response to a drug versus not.
And it will combine those two forms of data and attempt to find network clusters that correlate with the clinical characteristic. And then there is the Reactome FI network Cytoscape app, called Reactome FIViz. It takes a whole bunch of the clustering and correlation algorithms out there, including HotNet and PARADIGM, which I'll talk about later, and it gives you a nice interactive way of converting pathways into networks, doing the clustering, annotating the clusters, and then correlating them with clinical characteristics such as survival. Okay. The last type of analysis I'll talk about is computational pathway-based modeling. Here, again, you apply a list of altered entities that you care about, genes, proteins, RNAs, to biological pathways, using models which preserve the causal nature of that pathway. They preserve the detailed biological relationships and the order of events. What these attempt to do is to integrate two or more molecular alterations together to turn them into lists of altered pathway activities. It's sort of a gateway drug into systems biology. So, the types of pathway-based modeling. This is the cutting edge of network analysis. It's here where there are lots of pieces of software that are in development at various stages of maturity, and there's little consensus in the community about what works best. So I'm just going to give you a little laundry list here. The oldest and most mature types of pathway-based modeling are partial differential equations and Boolean models, which have been developed since the 1960s and are based on reaction kinetics: Michaelis-Menten, equilibria, and so forth. Ka's and Km's, things that we learned about in biochemistry a long time ago. They're mostly suited for biochemical systems such as metabolomics.
Their primary use case is yeast in fermenters, studying the flow of metabolites, and they can handle models of up to about a dozen interacting proteins, enzymes, and genes. They're usually not suitable for cancer analysis. Then there are network flow models, which were designed specifically for kinase cascades. The two that are most widely used are NetPhorest and NetworKIN. If you have a system in which you're studying perturbations of a pathway that lead to phosphorylation products, these are designed specifically for modeling and making predictions from kinase cascades. Then there's a series of network-based reconstruction methods designed specifically for expression arrays. One of the best of these is ARACNE, developed by Andrea Califano's group at Columbia University a number of years ago, and designed to pull out transcriptional regulatory pathways from expression arrays and from RNA-seq data, and to identify what are called master regulators. These are the transcriptional regulatory factors that sit at the top of a transcriptional pathway, and it will identify them if any exist in your data set. And then there's software that implements probabilistic graphical models, PGMs. The most widely used of these is PARADIGM. This is actually widely used for cancer analysis. What PGMs do is create a model of the positive and negative interactions in a biological pathway. So it explicitly converts a biological pathway into a network, preserving the directions of influence. Then from a data set it can learn the weights assigned to each of those interactions. And once you've trained it on your system, you can use it to make predictions, such as putting a cancer genome on it and seeing which pathways have been altered and in what direction they're altered. So here's an example of how PARADIGM works, a very simple one in which I have MDM2 inhibiting TP53, and then this is leading to apoptosis.
And apoptosis is the activity that I'm interested in modeling. Even in this simple two-step pathway, there are many different ways to change the pathway activity. You can have alterations at the gene level, the RNA level, the protein level, and other levels as well. In each of these cases, there are different ways of changing the gene or the RNA or the protein. For example, you can knock the gene out with a deletion. You can increase its activity by duplicating it. You can change its activity by a fusion that brings a promoter in. All these various types of changes are modeled by PARADIGM, and the influence of the gene on the RNA and of the RNA on the protein is captured. Okay? Each of these steps has a different weight. You apply a data set to it, such as a large number of cancers, and it will learn those weights. Now the best way to work with it is if you have a TCGA-style data set where you've got both genome and RNA, and if you have a proteome, that's even better, because then it can learn the model quite well and understand the relationships between perturbations in each of these. If you just have DNA sequence, it won't work very well. If you just have RNA, it will actually work pretty well. And if you have a choice, I would recommend a transcriptome plus a copy number array or a low-pass genome to at least capture amplifications and deletions, and that'll give you two axes. Okay, so, oh yeah. Here are the various types of data that you can put into it: mutations from sequencing, CNVs, mass spec proteomics, mRNA measurements. When you've got all this data, it actually works really, really well.
Here is a figure from the first publication on PARADIGM, which was in 2012. It shows a large GBM data set for which they had all the data, including proteomics. They built models of about 30 different major pathways and then used PARADIGM to predict changes in pathway activity, and they found four major subtypes of GBM, each of which was defined by differing alterations in pathway activity. Unlike the example I showed you with clustering, it actually gives you the direction of the change. With clustering, it says, well, this cluster of genes changed somehow, but it doesn't tell you how it changed. This is showing that in type 3 GBM, there is strong inhibition, downregulation, of the GATA/interleukin pathway. EGFR is increased in types 2 and 3, and so on and so forth. There's a lot of detail down here. So this not only gives you clusters, but it also tells you how, at the activity level, one differs from the other. Now for the good and the bad news about PARADIGM. The bad news is that, to this date, it's still distributed in source code form. It's actually hard to get installed. You have to build your own pathway models. The documentation is scant, and it takes weeks to run. The good news is that as of several years ago, we ported it into Reactome FIViz as a Cytoscape app. It's now very easy to install and run. We include access to Reactome-based pathway models, and we've improved performance, so it now takes overnight to run rather than weeks. So the rest of the lecture is references for you. They're in your printed notes.