This morning's session is about pathway and network analysis. As is often the case when analyzing huge amounts of data, the challenge is to extract meaningful information and use it to answer some very fundamental biological questions. One thing that pathway or network analysis obviously helps with is reducing the data from thousands, potentially tens of thousands or more, of data points down to a dozen or so pathways of interest, or I should say enriched pathways. It also increases statistical power by reducing the number of hypotheses being tested. In cancer data sets in particular, people tend to focus on the driver mutations and on understanding the mechanistic roles they play, but a lot of people forget about the long tail of rarer cancer mutations. Networks, together with pathways, can help you understand a little more of the mechanisms and the relationships between the drivers and these rare cancer mutations. They allow us to tell a variety of different biological stories. A common feature, of course, is identifying hidden patterns within your gene list. Pathways and networks are also great ways to visualize mechanistic models of the experimental observations you're making in your lab or your clinic. They are very useful for predicting the functions of unidentified genes. They are also a framework for quantitative modeling in systems biology: think of the pathway or the network as a reaction graph, which you can use to build very nice molecular models. Obviously, you require additional kinetic or other quantitative data to help you with that.
Then finally, and hopefully we'll demonstrate this a little later in the lab, they assist in developing and identifying molecular, potentially prognostic, signatures from clinical data sets. What is pathway or network analysis? Essentially, it's an analytical technique that makes use of biological pathway or molecular network information to gain insights into a biological system. It's a rapidly evolving field, and there are many, many approaches. Unfortunately, I'm limited for time, so I can't go through every one. But here's an example. The most common reason for using pathway analysis is to analyze a gene list. Here, several groups from The Cancer Genome Atlas published a paper, and you actually analyzed part of this data the other day in Yuri's lab, in which they identified 127 cancer genes. They classified them as driver genes based on their mutation frequency. Now, looking at this list, you really don't know what's going on. You may recognize a few genes that you're aware of. The question is, why do these mutations cause cancer? Pathway databases allow us to map these genes onto biological pathways and understand their roles in those pathways. I'm using this slide just to illustrate the difference between pathways and networks. We're looking at the first reactions, where EGF binds its receptor, EGFR. On the left, we're seeing the pathway representation of those first few reactions, and on the right, the network view. A biological pathway is typically a series of actions among molecules in the cell. If you're thinking of a signaling event, you're usually starting from events just outside the cell, the interactions at the cell membrane, and the signal is transduced all the way down to the nucleus, so there are typically some transcriptional events.
But there are also metabolic pathways, which make possible the chemical reactions that occur in our bodies, like glucose metabolism. And then there are gene regulatory pathways, where genes are turned on and off. Now, if we flip over and look at the network view, you can see that there's probably a little less information. The thing is, most pathways don't necessarily start at point A and end at point B, or start at point A and end at point G, or something like that. In fact, they typically don't have real boundaries. So the network approach is probably a better way of exploring the multiple relationships among the molecules within a pathway. And you can use both approaches to help analyze clinical data sets. So let's talk a little bit about some of the pathway databases out there. Yes?
Audience question: On the left-hand side, or on the right-hand side, are all those genes among those 127?
No, this is just a separate example. This is a different pathway. On the left, it's just biochemistry. Typically, a pathway runs one way. We'll talk about the specific types of pathway databases that are out there, but yes, it's following a typical biochemical reaction.
Audience question: And on the right, is it more the level of expression?
No, this is just showing you the interactions between the same molecules that you see in the pathway view. Now we're looking at a protein-protein interaction view of the same thing; these are just two ways of looking at the same data. There's a pathway view on one side, the first reaction of the EGFR signaling pathway, and on the right it's the same entities, just a different view. So there are a lot of pathway databases out there. Some are based on manual curation. Some are based on automated data curation.
I'm going to focus on reaction network databases, because these are the ones that follow a typical data model. The unit is the reaction itself, which lets you describe many different events and states found in biology. The model explicitly describes all of these different biological events and processes as a series of biochemical reactions. So here we're seeing a series of inputs flowing into the reaction and a series of outputs. These inputs and outputs can be a variety of different molecules, and different pathway databases will emphasize and focus on different molecules. Reactome, for example, focuses on proteins, small molecules, complexes, non-coding RNAs, disease variants and therapeutics, whereas KEGG will probably capture the majority of the same things but may not necessarily focus on disease variants. Another resource some of you may have heard of is WikiPathways, which is more of a community curation effort. In fact, it actually absorbs some of the Reactome data as well, so it will capture a lot of these events. These are actively maintained, curated databases. Yes?
Audience question: I'm a bit confused about what Reactome actually is. Is it in the same family as g:Profiler and DAVID, or is it different?
It's different, because we're really providing the source information. What DAVID and g:Profiler are doing is consuming our data in some form. Pathway databases and network databases provide a lot of data in different formats. Essentially, we'll have a file which says protein A is involved in process A and process B, protein B is involved in process B and process Z, or something like that. We provide these files, and then DAVID and g:Profiler use that information to provide the gene sets for doing the enrichment analysis.
Audience question: And KEGG is the same?
KEGG is basically the same thing, really.
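To make the "protein A is involved in process A, process B" file concrete, here is a minimal Python sketch of how an enrichment tool might turn such a file into gene sets. The two-column, tab-separated layout and the names `ANNOTATION_LINES` and `build_gene_sets` are assumptions for illustration; they are not the actual export format of Reactome, KEGG or any other database.

```python
from collections import defaultdict

# Hypothetical annotation lines: protein<TAB>process (not a real export format).
ANNOTATION_LINES = [
    "proteinA\tprocessA",
    "proteinA\tprocessB",
    "proteinB\tprocessB",
    "proteinB\tprocessZ",
]


def build_gene_sets(lines):
    """Invert protein -> process annotations into process -> set of proteins."""
    gene_sets = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        protein, process = line.split("\t")
        gene_sets[process].add(protein)
    return dict(gene_sets)


gene_sets = build_gene_sets(ANNOTATION_LINES)
for process in sorted(gene_sets):
    print(process, sorted(gene_sets[process]))
```

Each resulting set is what an enrichment tool would treat as one gene set; the same protein can appear in several of them.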
The other thing we can do is use Gene Ontology terms to help describe different elements of the reaction. We can use a biological process term to describe the reaction. If the reaction is itself a metabolic event with a catalytic activity, we can use a molecular function term. It's important to always have evidence there, so we use a PubMed citation to relate back to the source information that was used to create the reaction. Reactions are then joined together, kind of like a jigsaw puzzle, to form a pathway, and once you've collected all that information, you have a nice big picture. Now, KEGG is probably one of the oldest established pathway databases out there. Unfortunately, it's not open access; you actually have to pay to get access to the data. But essentially you have a collection of biological information. It's not just about pathways; it has a lot of other relevant annotations and a lot of chemical annotations as well, plus information about proteins, genes, metabolites, interactions and reactions across multiple different organisms. And they provide these relationship maps that I showed you yesterday. The green boxes represent proteins, and some of the white boxes represent genes and processes. There are some encapsulated pathways; these are just links between other pathway events and the pathway we're looking at here. And the simple lines just represent the reactions. Reactome, on the other hand, is an open source, open access data resource. We focus primarily on the curation of human pathways. We have experts helping us to author these pathways, and we also go through a peer review system. Every pathway, as I said, is traceable back to the literature.
We cross-reference to a variety of other bioinformatics resources, and we provide data analysis tools and visualization tools. Here we're looking at our pathway browser tool. On the left we have the hierarchy, which lets you interact with all the different biological events captured in Reactome, so that's both pathways and reactions. As you click on an element in the left panel, you see a diagram on the right. The elements in this diagram are colored white, and the colored vertical bars that you see tell me that components of the molecular complexes in this diagram are part of the 127 cancer gene list that I presented earlier. So this is a way to visualize your enrichment analysis results, if you're interested in seeing which pathways are enriched within your data set. Sorry, the touchpad has gone very sensitive and I'm just trying to get my cursor back. I'll just point, with apologies to those looking at the right-hand screen. At the bottom here, you see the list of enriched pathways, and as you click on those events, the diagram updates so you can see additional information. Now, I'm biased because I work for Reactome, but I've used KEGG as well, and I want to compare the annotations between KEGG and Reactome and tell you why Reactome might sometimes be slightly better than KEGG. Here KEGG is showing a particular process in apoptosis: caspase-8 or caspase-10 activates BID.
Now, as it turns out, the actual mechanism relies mostly on active caspase-8. Reactome here, and this is not a very good pointer, is showing a more mechanistic view, where active caspase-8 interacts directly with BID, whereas in KEGG all you're seeing is caspase-8 sitting there, and it looks like maybe caspase-8 or caspase-10 is involved in the activation of BID. So when you look at a KEGG diagram, the mechanistic relationships between the entities within the reactions are sometimes not clear.
Audience question: There's a difference between the BIDs?
That tells me, and this is another thing that KEGG doesn't necessarily tell me, that it's a truncated protein, so it has been cleaved. There's BID, residues 1 to 195, and then the fragment from 62 is involved in the reaction as well. So there are key elements of the reaction that are missing in a KEGG diagram, but you'll see them in Reactome.
Audience question: In the previous diagram, the boxes have different amounts of shading?
Yes, that's the result of the enrichment analysis.
Audience question: So that means that some number of the proteins among the 127 are in the...
That's true. Now, a good source of aggregated pathway and network information, if you're interested in building network models, is Pathway Commons. It's a very useful resource. You can always go to the source database, but Pathway Commons has a collection of a lot of different information from a lot of different pathway databases. You can search, visualize and download pathway and network information from this site. So I'm just going to switch over to networks. A network is a collection of nodes and edges. The nodes are sometimes referred to as vertices, and the edges are sometimes referred to as lines. Nodes can represent many different things.
Yesterday, in the enrichment map, they represented biological processes. In the Reactome FI lab today, the nodes are going to represent genes, but they could also represent metabolites or complexes, any sort of object. The edges themselves typically represent either a physical or a functional interaction, but an edge could also be an activation event, a regulation, almost any sort of relationship between node one and node two. On this slide we're showing a variety of different types of interaction networks that are available. By far the most common is the protein-protein interaction network, but there are others. On the left-hand side, we're seeing a transcriptional regulatory network, where the circular nodes represent the transcription factors, the diamond nodes are the putative DNA regulatory elements, and the edges represent the physical binding between the two. In the disease network at the far right, sorry, we're jumping across the screen here, the nodes represent diseases and the edges represent the gene mutations associated with those diseases. In the virus-host network, on the left, the square nodes represent the viral proteins, the round nodes are the host proteins, and the edges represent the interactions between the two. And finally there's the metabolic network in the middle, where the nodes represent the enzymes and the edges represent the metabolites. These network depictions seem somewhat dense, shall we say, but they represent a small proportion of the interactome network maps, which themselves constitute only a few percent of the complete interactomes within the cell.
So they don't capture everything about all the interactions in the cell; they're limited to the knowledge that's available. Now, there are a variety of different network databases out there. They're built either by manual curation or by extensive automation. They typically have slightly better coverage than pathway databases, basically because there's more content. It's much easier to identify the interactions than it is to build the reactions that constitute part of a pathway, so it's much easier to curate physical interactions. But I want to point out that some of the relationships, and the underlying evidence behind some of those interactions, are tentative. I used to manually curate physical interactions, and I apologize to any researchers in the room, but I read some papers that were awful. You looked at some of the interactions and you couldn't tell whether A was actually binding to B, or whether A was actually binding to C and B was there as some additional cofactor. Yes?
Audience question: Could you briefly explain the process of manual curation?
Manual curation, yes, sure, absolutely. I would say the majority of curators are PhD-level individuals. I certainly had a PhD when I started curating, and I had a background in developing tools for identifying interactions, and genetic interactions as well. So a lot of the curators have a strong background in biology and in particular areas of research. You don't have to be a computer scientist to be a curator. In fact, that may sometimes actually be better.
The process usually starts with identifying an area of research that you want to cover, whether that's "I want to curate yeast interactions" or "I'm interested in the EGFR signaling pathway" or whatever. Then you read a lot of papers. You do a PubMed search most of the time, and you typically start with a review article to get a basis for where the boundaries are in the pathway or the series of interactions you want to curate. Sometimes you want to involve an author, an expert who helps point you in the right direction. You read a lot of papers, you identify all the molecules, you try to identify the right annotations for those molecules, and then you start entering that information into a curation tool, which is different for different resources. That information goes into a database, and associated with that database are tools for visualization for the users. Some data resources also have a review process, which could be other curators reviewing the material, or other curators together with external experts. These resources are actively maintained, so you could get weekly, quarterly or yearly updates of the data. That's curation in a nutshell. The popular sources of human-curated networks are BioGRID, IntAct and MINT. They differ in their content, in the number of interactions they have curated, and perhaps in the level of detail they hold on a particular interaction. As an example, here we're showing a screenshot of a search for p53 in IntAct. The typical table of results is here.
You see that molecule A is typically the bait, here p53, and then the other molecules are listed. You're only seeing one here just now, which is, I actually need my reading glasses, MDM2. In the other columns we're seeing a little snapshot of information about how that interaction was demonstrated, whether there is a corresponding interaction in another interaction database, and the source of that interaction data. If you click on that record, you get additional information about the interaction and its source and who curated it, and you can link out to other data resources, or go back to, say, the PubMed citation and read more about the paper. Now, to complement a lot of these databases, and this leads back to a question from earlier, we provide data in a variety of different formats that allow people to download the information from a pathway or network database so that they can start building their own models or a network for visualization, or reuse the annotation in some other way. So there are different types of data exchange; sometimes these are just flat files, sometimes these are database files. SBGN is a way of visually representing pathways and networks. BioPAX is a language that aims to let you integrate, exchange, visualize and analyze biological pathways; I'll show you in the next couple of slides how that's relevant. PSI-MI is a standard for exchanging molecular interaction data, and the IntAct database uses it. And SBML is a format for helping you build your biological models. Essentially, most of the time these files are either tab-delimited files or XML files that you import into a particular tool. You were obviously introduced to Cytoscape yesterday.
There are other tools like NAViGaTOR and Osprey, and these tools differ in the types of interaction data that you can visualize or the type of analysis that you want to perform. There is a variety of other tools as well, some of them shown here. I've repeated Cytoscape just to show a network from Reactome being loaded into Cytoscape using the BioPAX file. If you download the SBML file, you can load that into CellDesigner, which allows you to start building molecular models. And the SBGN file that I was talking about earlier is here taking an amyloid pathway from Reactome and letting you re-visualize it in VANTED, which is a graphical editor, so you can start annotating your own pathway there. Now, let's talk a little more specifically about pathway and network analysis. This review was published a few years ago, but this diagram really nicely summarizes a lot of the relevant types of pathway and so-called network analyses. Basically, you have an input, which is your list of genes, small molecules or proteins, and you have a pathway database, which could also be a network database. These are the databases representing the gene sets. They flow into a variety of different branches. By far the most popular is over-representation analysis, where you identify a list of differentially expressed genes and relate it to a set of significant pathways. The next level is functional class scoring (FCS). We'll learn a little more about this in the types of analysis; this is using tools like the Reactome FI network. The other tool that uses a gene-level or gene-set-level statistic is the gene set enrichment analysis (GSEA) tool that some of you might be familiar with.
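As a rough illustration of over-representation analysis, the sketch below scores each gene set with a one-sided hypergeometric test using only the Python standard library. The gene identifiers, the two toy pathways and the function names are made up for illustration, and no multiple-testing correction is applied; real tools such as g:Profiler or the Reactome analysis service use their own statistics and corrections.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(overlap >= k) when n genes are drawn from N, of which K are in the pathway."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def over_representation(gene_list, gene_sets, background):
    """Return (pathway, overlap size, p-value) tuples sorted by p-value."""
    background = set(background)
    genes = set(gene_list) & background
    results = []
    for name, members in gene_sets.items():
        members = set(members) & background
        overlap = genes & members
        p = hypergeom_upper_tail(len(overlap), len(background), len(members), len(genes))
        results.append((name, len(overlap), p))
    return sorted(results, key=lambda r: r[2])


# Toy data: 20 background genes and two hypothetical pathways.
background = [f"g{i}" for i in range(20)]
toy_gene_sets = {
    "pathway X": {f"g{i}" for i in range(5)},       # g0..g4
    "pathway Y": {f"g{i}" for i in range(10, 20)},  # g10..g19
}
gene_list = ["g0", "g1", "g2", "g10"]

results = over_representation(gene_list, toy_gene_sets, background)
for name, overlap, p in results:
    print(f"{name}: overlap={overlap}, p={p:.4f}")
```

In this toy case three of the four input genes fall in the five-gene "pathway X", so it ranks first; with thousands of pathways you would also adjust the p-values for multiple testing.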
GSEA falls into this FCS category. Pathway topology is where you're looking at additional information about the pathway or the network: the types and numbers of reactions, the relative position of the gene within the network, maybe the functional impact of having a variant in a pathway or how that affects downstream events. That's tools like SPIA and PARADIGM, and I'll talk a little more about this later. Essentially, what all these tools are doing is analyzing your list of genes and presenting you with a set of enriched pathways, whether that is a list of pathways or a graphical representation of your results. That's just how these tools differ. They're all complementary approaches. You can take your lists and apply them to a lot of different gene sets, and I would encourage people to do more than one analysis to identify the relevant pathways in a gene list. Here's another way of looking at these three different approaches, and it nicely lists some of the tools. First, the enrichment of gene sets: we've got g:Profiler and GSEA, and the Reactome website would be considered one of these tools. Here we're trying to identify which biological processes are altered in cancer. The second level is the de novo subnetwork approaches, which use tools like Cytoscape and the plugins that you learned about yesterday, so that's the ReactomeFI plugin we're going to learn about today. We didn't talk about GeneMANIA, but there's another workshop that does use GeneMANIA. Here you're asking things like: are there new pathways altered in cancer, or are there clinically relevant tumor subtypes within my data? And finally there's pathway-based modeling.
This is asking questions like: are pathway activities altered in a particular patient, or are there targetable pathways in this patient? Now, there are some challenges to pathway-based analysis, which is why we may well prefer to use a network approach. One of them, which I alluded to very quickly, is that some pathway databases have a hierarchical organization of events. You'll have a top-level pathway, which could be called something like signal transduction. Underneath that you're going to have a series of sub-pathways, such as signaling by EGFR and signaling by FGFR. And under each of these pathways you're going to have individual ones like the GAB1 signalosome or MAP kinase signaling, and so on down the hierarchy. Sometimes it's easier to flatten this down into a system-wide approach and just be aware of the hierarchy; in the network approach you have the hierarchy there as an additional annotation. I'll show this later with a better example when we're looking at the Reactome FI network. The other thing is how to handle pathway crosstalk. When you're visualizing a pathway diagram, sometimes you're looking at EGFR signaling, but obviously that links to other pathways. Actually, let's take another example, Notch signaling. Notch will be connected, through a variety of different components within that pathway, to TGF-beta signaling, also to Wnt signaling and to Hippo. I think there might be a fourth one. But anyway, the point is that it comes back to the boundaries of the pathways and the genes shared between different pathways. In a network, a gene is only listed once; in pathways, it could be listed several times. And the interactions causing crosstalk are displayed in the same network. That's the whole point.
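The idea of flattening the hierarchy while keeping it as an annotation can be sketched in a few lines of Python. The parent links follow the pathway names mentioned above, but the gene memberships are illustrative assumptions only, not Reactome content, and `with_ancestors` and `flatten` are hypothetical helper names.

```python
# Child -> parent links for a small slice of a pathway hierarchy.
PARENT = {
    "Signaling by EGFR": "Signal transduction",
    "Signaling by FGFR": "Signal transduction",
    "GAB1 signalosome": "Signaling by EGFR",
}

# Illustrative gene annotations (assumed for this example, not database content).
PATHWAY_GENES = {
    "GAB1 signalosome": {"EGFR", "GAB1"},
    "Signaling by EGFR": {"EGFR", "EGF", "GRB2"},
    "Signaling by FGFR": {"FGFR1", "GRB2"},
}


def with_ancestors(pathway, parent):
    """Return the pathway plus every pathway above it in the hierarchy."""
    lineage = [pathway]
    while lineage[-1] in parent:  # assumes the hierarchy is a tree (no cycles)
        lineage.append(parent[lineage[-1]])
    return lineage


def flatten(pathway_genes, parent):
    """Map each gene once to the set of all pathways it belongs to, at any level."""
    flat = {}
    for pathway, genes in pathway_genes.items():
        lineage = with_ancestors(pathway, parent)
        for gene in genes:
            flat.setdefault(gene, set()).update(lineage)
    return flat


flat = flatten(PATHWAY_GENES, PARENT)
for gene in sorted(flat):
    print(gene, sorted(flat[gene]))
```

Each gene is now a single key, much like a single node in a network, and a gene such as GRB2 that sits in two sibling pathways shows up as one entry carrying both annotations, which is one simple way to spot potential crosstalk.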
The other challenge with pathway databases is the way in which you upload data into the website tools. They're not necessarily conducive to a lot of today's cancer omics data sets, where you've got data about a single patient, or you have really complicated data types like copy number variation data, gene expression data, methylation data and somatic mutations. Take the Reactome website, for example. We can handle gene lists and gene expression data, and if you format the somatic mutation data correctly, you could potentially visualize that data through our website. But Cytoscape, and tools like the Reactome FI network, could be better suited to analyzing these different omics data sets. And particularly when we move to network simulations, it's much easier to use network-based data. Network-based analyses are typically system-wide. They have larger coverage of human genes. They're typically protein-protein interaction networks, though there could be gene-gene interactions based on genetic interactions as well. The goal is to identify modules of genes that are tightly connected to one another, and then you can annotate those clusters using enrichment analysis. You can use the network approach for some form of gene signature or biomarker discovery; we'll do that a little in the lab. And, as I said earlier, it helps you to classify cancer drivers and build disease models. We call this type of analysis de novo subnetwork construction and clustering. This is where you take your list of altered genes, proteins, potentially RNAs as well, and apply it to a particular biological network. You can identify topologically unlikely configurations.
Basically, that's a subset of genes that are more closely connected to one another in the network than you'd expect by chance. You can then extract the clusters of these unlikely configurations and annotate them. Network clustering is defined as the process of grouping objects together into clusters, communities or modules; I sometimes use the word modules. Yes, the question.
Audience question: What do you mean when you say topologically unlikely, that the genes are closer in the network than you would expect? What is your expectation? What do you call closeness?
How to explain this: basically, you're looking for genes that are more highly connected with one another. When you run these clustering algorithms, you identify genes that are very tightly connected, so you get these hubs in the network, and between the hubs you have sparser connections. The idea is that you're trying to organize these genes into modules, and it's based on things like the shortest path between two molecules.
Audience question: So what's unlikely is that, from the knowledge available, these would be totally unrelated?
Right. There's a variety of different algorithms that can be used for network clustering, and I've tried to summarize some of them here. I'm going to take a moment to try to explain them, because I understand some of the principles. Again, it's like the statistics: they do work, and there are some assumptions. With the Girvan-Newman algorithm, you start by removing the edges with the highest betweenness first, that is, the edges that the largest number of shortest paths run through. And then you continue to break down that graph into individual nodes.
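The edge-removal idea behind Girvan-Newman can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in any particular tool: it handles only unweighted, undirected graphs without self-loops, stops at the first split instead of building the full hierarchy, and the six-node `GRAPH` is a made-up example of two triangles joined by one bridge.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness (Brandes-style accumulation, unweighted graph)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:  # breadth-first search from s, counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):  # push path counts back toward the source
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    return score  # every pair is counted from both ends; the ranking is unaffected


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph falls into more pieces."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start_count = len(components(adj))
    while len(components(adj)) == start_count:
        scores = edge_betweenness(adj)
        if not scores:
            break  # no edges left to remove
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical network: two triangles (A-B-C and D-E-F) joined by the bridge C-D.
GRAPH = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
clusters = girvan_newman_split(GRAPH)
print(sorted(sorted(c) for c in clusters))
```

The bridge C-D carries every shortest path between the two triangles, so it has the highest betweenness and is removed first, leaving the two tightly connected groups. Repeating the split on each piece would give the full hierarchy of clusters.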
Now, that doesn't sound all that exciting, but at the end of the day, when you start breaking up the graph, you break it down into pieces, and as you break off those pieces you start to identify genes that are very tightly connected with one another. Again, you're going to get sparse connections between these clusters. This is actually the one that you'll be using with the Reactome FI network. It's a very useful algorithm; it was initially developed for looking at social networks and the interactions between people. The Markov clustering algorithm, and here my knowledge is more limited so I may get this wrong, is the MCL algorithm. It simulates flow within a graph and promotes flow in highly connected regions. So in the more tightly connected regions there's going to be an increased flow of information, and you identify the natural groups within the graph. If you take a random walk, you will identify a dense cluster within the network, and you'll have the edges that have been visited as you do the random walk. That one is a little more challenging to explain; it might have been better to show a graph. HotNet is kind of like the way it sounds. Think of a metal lattice in a grill: you take your good old Bunsen burner, or you turn on your barbecue, and you start heating up part of that grill first. That part is obviously going to get hotter, and that's where you're going to get these nicely connected nodes. So if you imagine the grill is made up of a lattice of genes, as you heat up your genes in a local region of the network, you're identifying the tightly connected set of gene-gene linkages.
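The grill analogy can be turned into a small diffusion sketch. To be clear about the assumptions: this is a generic random-walk-with-restart style diffusion written for illustration, not HotNet's actual procedure or statistics, and the graph, the `restart` value and the function name `diffuse_heat` are all made up.

```python
def diffuse_heat(adj, hot_nodes, restart=0.3, steps=50):
    """Spread heat from the hot nodes along edges, re-injecting some at the sources."""
    heat0 = {v: (1.0 if v in hot_nodes else 0.0) for v in adj}
    heat = dict(heat0)
    for _ in range(steps):
        new_heat = {}
        for v in adj:
            # Each neighbor u shares its heat equally among its own neighbors.
            incoming = sum(heat[u] / len(adj[u]) for u in adj[v])
            new_heat[v] = restart * heat0[v] + (1.0 - restart) * incoming
        heat = new_heat
    return heat


# Hypothetical network: two triangles (A-B-C and D-E-F) joined by the bridge C-D.
GRAPH = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}

heat = diffuse_heat(GRAPH, hot_nodes={"A"})
for node in sorted(heat, key=heat.get, reverse=True):
    print(node, round(heat[node], 3))
```

Heating only node A leaves most of the heat inside A's own triangle, because little of it leaks across the single bridge. That is the intuition behind using diffusion to pick out a tightly connected neighborhood around altered genes.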
And then as you heat it up more, they're probably going to become more relevant in the analysis. That's kind of HotNet. What I'm not doing here is explaining the statistics and the actual equations behind this, which I would obviously get lost in. HyperModules basically identifies clusters of cancer mutations within the network, and you can use it to classify patients based on tumor subtypes when you have the relevant clinical information available. And then we're going to use the ReactomeFI network, which implements a lot of these different algorithms: Girvan-Newman, MCL, HotNet. And there's another one called PARADIGM, which I will talk a little bit about later. Essentially what you get at the end of the day from your network clustering are these nodes that are tightly connected to one another. They form the clusters. You can kind of draw lines around the clusters, and then you see these sparser connections that connect the actual modules themselves. So the premise, the reason we're going to talk now about using the ReactomeFI network and the FIViz app, is that we're taking a network approach here to analyzing cancer gene sets, because no single mutated gene is necessary and sufficient to cause cancer. Typically you'll have one or more common mutations like p53, EGFR, and so on, and then you have this long tail of rare mutations. Analyzing mutated genes in a network context allows you to reveal the relationships between the genes. You can elucidate the mechanism of action by relating the network information back to other pathway sources. And it facilitates some form of hypothesis generation on the role of these genes in the disease phenotype. Essentially you're reducing thousands, if not tens of thousands, of data points down to a handful or a dozen or so mutated pathways. So what is a functional interaction? Basically it's an interaction in which two proteins are involved in the same reaction.
It could be as an input, a catalyst, an activator, an inhibitor, or as components of a complex. So on the left here we have a reaction, and on the right you have the corresponding binary interactions that relate to these different reaction events. Essentially the way that we create the ReactomeFI network is to extract a lot of binary interactions from pathway databases. We use not just Reactome, but also KEGG, PANTHER, BioCarta, and NCI. We also use transcription factor data, so interactions between transcription factors and their targets. This is what we call the annotated FI group. We then use that as a training set, and we use a naive Bayes classifier to create a second data set called the predicted FIs. The predicted FIs are based on human protein-protein interactions, modeled interactions from a variety of different organisms, gene co-expression data, protein-protein domain interactions, and genes that share GO biological processes. So ultimately we have two data sets, the annotated FIs and the predicted FIs. When you combine them, you have this huge hairball of a network with 328,000 interactions, which encompasses over 12,000 proteins. Now, just to show you visually how the tool works. I'm showing you a subset of the interactions here. If I was to show you the full network, it would be like a ball of string. You really wouldn't see anything. So imagine this is part of the network. You basically project your genes onto the network. Red could be up-regulated and blue down-regulated, or red could be mutated and purple not mutated. Those genes become nodes within our network. Obviously there are going to be edges where they bind or form interactions, so you get these smaller sub-networks appearing. Now, it's still sparse at this point.
So what you can sometimes do is add linker genes. These are just genes that provide a greater degree of connectivity between elements in the network. You join them up, and then you take away everything else in the network that is not part of your data set. And now you're left with a sub-network based on your data set. So that's one way of creating the network. The other way that you could look at creating networks is to download one of the BioPAX files, import that into Cytoscape, and start building the network from there. Here we're going from the top down, whereas sometimes you go from the bottom up. So let's take the 127 cancer genes I was talking about earlier as an example. "What's the role of the linker genes?" The linker genes are not part of your data set, but they provide a certain degree of connectivity between the genes within your list. Forgive me, but I don't recall the actual algorithm that's used to define the linker genes. But it's a minimal set of genes that can be used to connect your genes together. "Does that mean you should go back into your biological data and look in more depth at whether they're really not affected, I mean, not affected and not changing?" Well, we'll take an example. What if the linker gene is a transcription factor, but it's not part of your data set? Say you've got genes that are being upregulated. What you might find when you try to cluster those genes together in the network is that they form these clusters around this linker gene, which is actually a transcription factor. So the transcription factor might not necessarily be part of your data set, but the genes that are part of that module are part of your data set. So it's things like that where the linkers can actually be helpful.
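The projection-plus-linkers step can be illustrated with a small Python sketch. To be clear about assumptions: the talk does not say which algorithm selects linker genes, so the rule used here (keep any gene outside the list that touches at least two genes from the list) is a simple stand-in chosen for illustration, and the function name, the `min_hits` parameter and all gene names in the toy network are hypothetical.

```python
def subnetwork_with_linkers(fi_edges, gene_list, min_hits=2):
    """Project a gene list onto an interaction network and add simple linkers.

    Stand-in rule (an assumption, not the rule used by any specific tool):
    a linker is a gene outside the list with at least `min_hits` neighbours
    that are in the list.
    """
    genes = set(gene_list)
    neighbours = {}
    for u, v in fi_edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    linkers = {n for n, nbrs in neighbours.items()
               if n not in genes and len(nbrs & genes) >= min_hits}

    # Keep only edges whose two ends are both list genes or linkers.
    keep = genes | linkers
    edges = {tuple(sorted((u, v))) for u, v in fi_edges
             if u in keep and v in keep}
    return edges, linkers


# Toy network (hypothetical names): TF1 is not in the gene list but
# interacts with three genes that are, like the transcription factor example.
toy_network = [("TF1", "G1"), ("TF1", "G2"), ("TF1", "G3"),
               ("G4", "G5"), ("X1", "G1"), ("X1", "X2")]
my_genes = ["G1", "G2", "G3", "G4", "G5"]
sub_edges, linker_genes = subnetwork_with_linkers(toy_network, my_genes)
```

In this toy case G1, G2 and G3 have no direct edges to one another, so without the linker they would appear as isolated nodes; with TF1 added they form a single module around it, while X1 and X2 are dropped.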
The thing that you'll discover when we do the analysis is that we don't necessarily include the linkers in the enrichment analysis, because they're not part of your data set, but they can sometimes be informative in helping you understand more about the data that you're seeing in the network. So this is the network view that you're going to see when you create the FI sub-network from these 127 genes. There's about 100 genes, I think, that are actually part of this network. You can see that when you do the clustering, you get four discrete modules. And then, by guilt-by-association enrichment analysis, the genes within these modules could well be involved in the same biological processes. In fact, when you annotate them, you can see the top left module is involved in signaling by EGFR, FGFR, SCF-KIT. In the module on the right, you've got NOTCH, Wnt, TGF-beta. You've got a TP53 module at the bottom here, and then a module involved in cell cycle. Now, there could be other relevant annotations associated with particular pathways, but it's the top-level events discovered in the enrichment analysis that are being overlaid onto this diagram. So there's a little bit of arbitrary decision-making there about whether you believe it's more EGFR than SCF-KIT, but it's a hypothesis-generating tool. It's not going to give you all the answers. Or it could potentially give you the answers if you have some a priori knowledge: you've already come up with the hypothesis in the lab and you're using this network tool to confirm it or not. Now, you can combine the FI network with gene expression data, and it's possible to search for network modules that are related to overall patient survival. So you basically upload expression data into the network. You use the expression data to actually create the subnetwork.
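For the module enrichment step mentioned above, one common way to score whether a module overlaps a pathway more than expected by chance is a one-sided hypergeometric test. The sketch below shows that calculation in plain Python. The talk does not state which test the tool uses, so treat the choice of test as an assumption; the function name and the numbers in the example are made up.

```python
from math import comb


def enrichment_p(universe, pathway, module, overlap):
    """One-sided hypergeometric p-value.

    Probability of seeing at least `overlap` pathway genes when `module`
    genes are drawn at random from `universe` genes, of which `pathway`
    genes are annotated to the pathway.
    """
    upper = min(pathway, module)
    hits = sum(comb(pathway, k) * comb(universe - pathway, module - k)
               for k in range(overlap, upper + 1))
    return hits / comb(universe, module)


# Made-up numbers: 12,000 genes in the network, 150 annotated to a pathway,
# a module of 30 genes, 8 of which are in the pathway.
p_value = enrichment_p(12000, 150, 30, 8)
```

In practice one p-value is computed per pathway and per module, so a multiple-testing correction is normally applied before reading anything into the smallest values.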
You use the MCL algorithm to perform the clustering analysis, and I'm just focusing here on one module that was identified in the analysis. It just so happens it's involved in cell cycle, M-phase annotations. The reason that we've created two sub-elements of this module is that different pathway databases are giving me slightly different annotations, but they're complementary in a sense, and they're relevant events. What we're seeing on the right here is what you can do, once you've identified that module, if you have the relevant clinical data. In the case of this data set, there was information about the clinical subtypes of the cancer, whether the patient was alive or dead after treatment, and for how long they were still alive following treatment. So you can use that data to develop either Cox proportional hazards or Kaplan-Meier models to do the survival analysis, and from there you can ask whether the modules that you've identified could relate to a prognostic signature, based on these survival analysis plots. I'm probably not explaining it very well, but the bottom line is that expression of the 31 genes that you see on the left here is significantly related to breast cancer patient survival in at least five independent samples. Patients with lower expression of the module genes are actually surviving better, that's the red line, whereas high expression of those genes in this particular model implies a more deleterious outcome. So finally, I'm gonna talk a little bit about pathway-based modelling.
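Before moving on, the Kaplan-Meier estimate behind survival plots like the one just described can be written out in a few lines. This is a bare-bones sketch of the standard product-limit estimator only, with no confidence intervals or log-rank test; the function name and the follow-up times in the example are made up, and a real analysis would normally use an established survival package.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times[i]  : follow-up time for patient i
    events[i] : True if the patient died at times[i], False if censored
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, died in data if tt == t and died)
        leaving = sum(1 for tt, _ in data if tt == t)
        if deaths:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
        i += leaving
    return curve


# Made-up follow-up data: four patients, one censored at time 2.
curve = kaplan_meier([1, 2, 3, 4], [True, False, True, True])
```

To compare groups as in the talk, patients would be split into low and high module expression (for example at the median of the module's average expression, which is an assumption here) and one curve computed per group.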
This is basically to infer how pathway states are disrupted in disease, particularly where maybe there's a variant that impacts a particular pathway. It uses both quantitative and qualitative measurements to infer the activities of various components within cancer biology, and as such, these methods relate the activity of some components to the influences and consequences of that activity on other components. The different types that are out there are based on different software tools. CellNetAnalyzer is a MATLAB tool that provides a graphical user interface and a variety of different algorithms for exploring the structural and functional properties of metabolic and signaling networks, and some of those algorithms are there to do computational strain design or metabolic engineering. So they may not necessarily be relevant to cancer data analysis, but they could be related to other types of modelling. NetPhorest, which has now been superseded by NetworKIN, provides tools to deconvolute the underlying intracellular networks within large-scale phosphoproteomic data sets. So basically you can elucidate the phosphorylation events associated with a given phenotype or disease condition. ARACNE is actually one of the older algorithms out there. It uses microarray expression data and is designed to scale up to the complexity of regulatory networks in a mammalian cell, but it's general enough to address a wider range of network problems. It eliminates the vast majority of the indirect interactions that would otherwise be inferred by pairwise analysis. Then finally, PARADIGM, and I'll spend a little bit more time talking about this, basically allows you to integrate a variety of different omics data types.
So it's more relevant to cancer data sets where you've got copy number variation, gene expression data, variant data, and also mass spec data about the protein states. Essentially the goal here is to find significantly impacted pathways for a given disease. You can potentially link pathway activities to patient phenotypes, and you can analyze individual patient samples in this data. So on the left here, we're seeing just a typical event within apoptosis, MDM2 inhibiting p53. This is a simple graph view of it. For PARADIGM to work, you have to describe this simple molecular event in a variety of different states. Each state reflects information you have about the gene. The gene has obviously been transcribed into RNA, the RNA is translated into protein, and the protein has an active state in the cell. So where we basically had two molecules before, we've now got eight. Associated with each state is a different data type. So with the gene, we have mutation or variant information, and we could have copy number variation. We're going to have gene expression data with the RNA, and with the protein, we could have mass spec data. Essentially what you can do is combine different data sets in the analysis and ask the question: if I have a mutation in the MDM2 gene and there's a higher level of expression of p53, how does that impact the apoptosis pathway? Now, I'm skipping over a few different things, because there's a lot that can be seen in this analysis. The typical output could well be something like this, where you visualize the pathway activities as a clustered heat map. So in this example of PARADIGM, we've analyzed the GBM data sets.
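The expansion of a simple interaction into per-gene states, described above for MDM2 and p53, can be made concrete with a short sketch. This is only a schematic of the bookkeeping, not PARADIGM's actual input format or inference: the state names, the choice to attach the regulatory edge between the two active-protein nodes, the function name and the data-type table are assumptions made for illustration, with the data types taken from what was said in the talk.

```python
# One chain of states per gene (names are illustrative, not PARADIGM's own).
STATES = ("genome", "mRNA", "protein", "active")

# Data types that could be attached to each state, as described in the talk.
DATA_FOR_STATE = {
    "genome": ["copy number variation", "mutation"],
    "mRNA": ["gene expression"],
    "protein": ["mass spec"],
}


def expand(interactions):
    """Turn (source, sign, target) interactions into a state-level graph."""
    entities = sorted({e for src, _, dst in interactions for e in (src, dst)})
    nodes, edges = [], []
    for entity in entities:
        chain = [f"{entity}:{state}" for state in STATES]
        nodes.extend(chain)
        # Each state feeds the next: genome -> mRNA -> protein -> active.
        edges.extend((a, b, "+") for a, b in zip(chain, chain[1:]))
    for src, sign, dst in interactions:
        # Assumption: regulation acts from active protein to active protein.
        edges.append((f"{src}:active", f"{dst}:active", sign))
    return nodes, edges


# The example from the talk: MDM2 inhibits p53 (gene symbol TP53).
nodes, edges = expand([("MDM2", "-", "TP53")])
```

Two molecules become eight nodes, and each observed data set is then attached to the state it measures, which is what lets different omics layers be combined on the same pathway.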
So basically, grouping the GBM patients based on their significant pathway perturbations, you can divide them into four clinically relevant subgroups with significantly different survival outcomes. And just to highlight something: if you look at the fourth subtype, it's clearly distinct from the three others. You have clear down-regulation of the HIF-1-alpha transcription factor network, as well as over-expression of the E2F transcription network. And then in two of the first three clusters, you have this elevated EGFR signaling, highlighted here in red. So the clusters appear to be honing in on different biologically meaningful themes that can potentially stratify patients. Another way of looking at this, with a different data set: this is the analysis of ovarian cancer data sets, here using the ReactomeFI network. Based on PGM analysis and ovarian cancer mutation data, we found network modules built around TP53 signaling which could be used to distinguish between different ovarian cancer patients. So there are different driver mutations in these patients that could be contributing to different outcomes within the same pathway. So the good news is, well, let's start with the bad news first. PARADIGM is rather difficult to actually use. The source code is a little hard to compile. You also have to convert all of the pathway events into these formatted modules. There's little documentation, and it takes some time to run. The good news is that we've incorporated some of this into the Reactome Cytoscape application. We won't necessarily be doing PARADIGM analysis today, but I would encourage you to look at the website and look at the analysis. We've pre-computed a lot of the pathway modules for you, so you can actually do this kind of analysis. And we're working to improve the performance.
And I hope that at some point we'll actually have this as an integral part of this workshop. So just in summary, I've listed here some of the database URLs I've been talking about that you can go to. Here's some more about the network analysis and clustering tools, and the pathway modeling tools. Now again, it says here we're on a coffee break.