My name is Robin Haw. I work here at OICR. I'm the project manager and outreach coordinator for the Reactome project. You'll hear a little bit more about that during today's talk. I have quite a checkered history. It's a good checkered history. I started off as a microbiologist and got into genetics. I then went to Japan and did a postdoc there, then came back to Canada and got into genomics, proteomics, and finally bioinformatics. So I've done quite a lot. Now, what's not on my CV is how I know Francis Ouellette. Okay, I can't be here without this. Both Francis and I started off in yeast labs, so there's one degree of connection there. It demonstrates the idea that we really are connected by fewer than six degrees of separation. It's the awesome power of yeast. The awesome power of yeast, exactly. It is one of the best model organism systems. You can do pretty much anything in it. Anyway, the other thing is that I know Francis because, in the lab he used to work in, there was a graduate student who ultimately became my supervisor when I first came to Canada. After that, I ended up working on a project where Francis was a co-PI. And then I started here at OICR and was doing CBW with Francis again. So I still know Francis. I'm trying to think of the next thing. Anyway, let's get going. We're going to talk about variants and how we relate them to networks. I'd like to acknowledge up front that the slides I'm presenting were created by myself and by others, including Veronique Voisin, Gary Bader, and Lincoln Stein. I've also used some slides from the EBI training resources. So here are the learning objectives for today. The goal really is to understand the principles of network theory and analysis. We'll talk quite a lot about the network data that's out there.
And those analytical approaches to network data analysis, visualization, and a little bit of data integration. And then finally, we'll give an overview of the Reactome FI network and the Reactome FI Cytoscape application, and we'll actually use this as part of the lab. Yes, Francis? Functional interaction. Sorry, thank you. So what is network analysis? I used to talk about pathway analysis a lot, and basically it's the same definition. I just scrubbed out the word and changed it to network. It's essentially any type of technique that makes good use of biological molecular network information to gain some insight into a biological system. I always say it's rapidly evolving, and that's because there are still new techniques being developed, new approaches, new data sets. Particularly in the multiomics world now, it's really important to have new algorithms and new approaches, because we're no longer talking about small data sets. We're talking about much larger data sets, a lot of clinical variants. And how does that information all come together and get displayed in a network graph? So as I said, there are many approaches here, and unfortunately, within an hour, I can only talk about a few of them. So why do we do network analysis? I think, again, it's the same. You can scrub out the word network and say pathway analysis here as well. It's a very intuitive display of information for scientists. You can visualize multiple data types within the network. And certainly when you get to large data analysis, there are computational methods to automate that approach. But the important thing about network analysis is that it typically satisfies a number of key use cases in biological research. I think by far the most common use case is analyzing a gene list, finding the hidden patterns within your gene list, or maybe a protein list.
It's also a great way to visualize emergent models that you're developing within the lab through different experimental observations. Networks are useful in predicting the function of unannotated genes. It's a great framework for establishing quantitative modeling or systems biology. And as we'll learn a little bit later, there are ways in which we can use graphs, networks, and pathway enrichment analysis to potentially identify molecular signatures within your data set. So, as I said, one of the most cited reasons for using a network database or network analysis is to analyze gene lists. Just as an example here, we have a gene list that was derived from The Cancer Genome Atlas. The informaticians there identified 127 genes, which they classified as cancer driver genes based on their mutation frequency. Does anybody recognize any genes in that list? Right, nodding your head. That's good. But we don't really know what these 127 genes are doing and why these mutations cause cancer. So basically, pathway and network databases allow us to map these genes onto biological pathways and networks to potentially understand their roles within them. So, network analysis in biology. We typically try to represent a lot of biological systems as networks because there are a lot of complex binary interactions or relationships between the different types of entities that occur within a cell, or potentially within a system. Every biological entity, whether it's a small molecule, a protein, a gene, or an RNA molecule, has in some way an interaction with another biological entity. And that can be from the molecular level all the way up to the ecosystem level. There are a lot of different types of interactions that we can track.
And a lot of the biological network analysis that I'll talk about historically originated from the tools and concepts of social network analysis and the application of graph theory to the social sciences. So, what is an interaction network? It's essentially a collection of nodes, or vertices, and the edges that connect these nodes. As I mentioned a moment ago, nodes can represent basically any type of molecule: a protein, gene, small molecule, drug, or transcription factor. A node could even reflect an ontology term. The edges themselves are either physical or functional interactions, or some other type of relationship between two nodes. So the edges can carry a lot of different information about how the nodes link together. But also, depending on the nature of the underlying edge information, different types of analyses can be performed. So it's important to know the types of nodes and edges that you have in the network. For those reasons, it's useful to highlight the main types of edges that can be found within a network. There are undirected edges. This type of edge is found in protein-protein interaction networks. It's a simple connection between two nodes without any additional information, and typically the evidence behind the relationship only tells us that A binds to B. A directed edge is a connection typically found in, say, metabolic, signaling, or gene regulation networks, where there's a clear flow of information between the nodes. Now, both directed and undirected edges can have a weight or quantitative value associated with them. This could depict the reliability of the interaction or the strength of that interaction, or it could reflect a quantitative expression change between these two nodes.
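The edge types just described can be sketched in a few lines of Python. This is only an illustration, not how any particular database or tool stores its data: the `Edge` class, the `neighbors` helper, the node names (A, B, C, TF1, GeneX), and the weights are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    """One edge of an interaction network (hypothetical structure)."""
    source: str
    target: str
    directed: bool = False          # False: "A binds B"; True: "A acts on B"
    weight: Optional[float] = None  # e.g. a confidence score or correlation


def neighbors(edges, node):
    """Nodes reachable from `node` in one step, respecting edge direction."""
    reachable = set()
    for edge in edges:
        if edge.source == node:
            reachable.add(edge.target)
        elif not edge.directed and edge.target == node:
            # Undirected edges can be walked in either direction.
            reachable.add(edge.source)
    return reachable


edges = [
    Edge("A", "B"),                                    # undirected, unweighted
    Edge("B", "C", weight=0.4),                        # undirected, weighted
    Edge("TF1", "GeneX", directed=True, weight=0.8),   # directed, weighted
]

print(neighbors(edges, "B"))      # both A and C
print(neighbors(edges, "GeneX"))  # empty: the regulatory edge points into GeneX
```

Keeping direction and weight as explicit fields makes it obvious, when you later choose an analysis, which kind of edge information the network actually contains.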
For example, a transcription factor could regulate a gene, and there could be a numerical value associated with that. It could also be sequence similarity. Two genes could be homologues of one another, or orthologues. And these edges can also be weighted by other topological parameters that I'll talk about shortly. So, taking a moment to step back and look at the different types of biological networks that are out there. The meaning of the nodes and the edges used within a network representation depends on the type of data being used to build that network, and this needs to be considered when analyzing your data within the context of the network. Different types of data will produce different general network characteristics in terms of the connectivity and the complexity of the structure. We'll talk a little bit more about that as we go through these slides, and in particular focus on protein-protein interaction networks, which are predominantly what you see in this world. Yes, a question. Sorry to jump in with a question. On the previous slide, if you just go back. In the network on the left of the slide, there are, for example, connections between node A1 and A2, and the W represents the weight between those two nodes. Yes. Then how do you calculate that, given that the edge is between two genes? With people that we didn't know. Well, as I said, these weights could be... So if you're looking at a gene expression data set, it could be a correlation coefficient that weights that, because you have a connection between the two nodes and you can apply that correlation to the edge. It could be another kind of score that you get from some other algorithm. Say, for example, you could extract the interaction AB from a variety of different sources of data. And we talk about this in terms of the functional interaction network as an FI score.
And that's based on the fact that anything that has a score of one is an interaction derived from a pathway database, which we think is a really good, high-quality interaction. But as you step down, say, for example, you have an interaction between two proteins that may exist within a model organism system, and the experiment is a yeast two-hybrid. You might not necessarily say that the strength of that interaction is as high. So you could use an algorithm to calculate a weight there based on the type of information that you have available, and that would score the edge. But again, if you're doing sequence similarity it could be an E-value; there's a whole bunch of other numerical values that you can apply to weight the edge. Okay? Yes. So, for instance, if I want to represent that two proteins bind together and activate a gene, could I just represent it as A1, A2, and have both of them bind to the DNA? Theoretically, you could do that. I typically don't look at those types of networks because, I mean, you could look at that in a pathway diagram context where you have two components that form a complex. I would also think of that as a kind of complex. Yes, actually you could. I'm contradicting myself now, but yes: you can download a lot of interaction data from, say, Reactome in different formats, and the BioPAX file is one of these data standards. When you upload that into Cytoscape, it actually does that. So I stand corrected, because I typically think of networks as you see here, where the nodes are typically all the same. Even when you're talking about transcription factors regulating genes, theoretically they're different entities, but you may use the same kind of annotation to get that information into the graph. So realistically, you could...
the terms gene and protein are sometimes interchangeable in the context of getting the information into the graph. But the graph itself, the network, is theoretically showing different information. Good question, though. So these are the common types of networks that we have. There's the metabolic network. Basically, you've got two kinds of node, the enzyme and the substrate. So again, this gets us back to your question. The enzyme is obviously going to be a protein. The substrate could be a metabolite, a small molecule, and so forth. So metabolites and enzymes represent the nodes, and the reactions themselves are represented by the edges. Reactions could be unidirectional or bidirectional. And what's certainly important about this type of network is that the edge itself can also represent the direction of the metabolic flow, or the regulatory effects of a specific reaction. That gets back to your question there as well. A genetic interaction is a deviation from the expected phenotype when combining multiple genetic mutations, when the individual mutations alone do not exhibit the deviation. So this goes back to Francis and me in our yeast days. Budding yeast is classically the model system that demonstrated these types of genetic interactions, and you're measuring a single phenotype, in this case growth rate. So basically, if you knock out gene A and there's no effect on growth, and then you knock out a second gene B individually and there's no effect on growth, but the combined deletion of both genes has an effect on growth, that's what we call a synthetic lethal genetic interaction. So the genes represent the nodes and the edges represent the relationships between those nodes. We have a gene regulatory network.
It's common to represent transcriptional and regulatory networks with nodes being a mix of genes and transcription factors. Go ahead. So there is an example where you showcase the negative. In that case, A and B are not going to result in synthetic lethality. So are B and D not going to result in... Yes. I mean, traditionally it's the same kind of effect; theoretically it could be any kind of phenotype. What's important when you're merging different data types into networks is making sure that you visually distinguish between that information, so that you're not confusing the user as to the source of the data. So you could bring in different genetic interactions and present them in the same network. Absolutely. The others are cell signaling networks. Basically, this is the communication system that controls cellular activities, the transduction of a signal from potentially outside of the cell all the way down to the nucleus. It's typically an ordered sequence of events, and clearly there's a flow of information that's being dictated by these edges within the cell. And the entities in this type of network could potentially be proteins, genes, metabolites, drugs, you name it. So, protein-protein interaction networks, and this is probably where I'm going to focus most of my time now, are basically the graphical representation of the physical contacts between proteins in the cell. Protein-protein interactions are essential to almost every cellular process, so understanding them is crucial to understanding cell physiology, both in the normal cell and in certain disease states. The interactions themselves can represent both transient and stable interactions. Stable interactions are found in protein complexes like the ribosome or hemoglobin.
Transient interactions are the brief interactions that modify or carry a protein, leading to some additional change, like a protein kinase or a nuclear pore importin. These transient interactions constitute the most dynamic part of the interactome. The interactome, I should say, is the totality of all protein interactions that happen within a cell or within a particular biological context. The development of large-scale protein-protein interaction screening techniques, things like the yeast two-hybrid, which I'll talk a little bit more about in a moment, has created a large volume of interaction data out there. And there's a variety of molecular interaction databases, such as IntAct, MINT, and BioGRID, that allow you to download and use this information to help construct your network and perform network analysis. So the first step in performing your protein-protein interaction network analysis is to build a network. There are different sources of protein-protein interaction data. You can obtain this data from, say, your own experimental work. I'm not sure that many people are doing yeast two-hybrid experiments now, but that used to be a way of identifying interactions en masse. And you can choose how that data is represented and stored. The second source, as I've mentioned, is a primary protein-protein interaction database. There's typically a team of curators that extract the protein-protein interactions from the experimental evidence reported in the literature. It's very much a manual process. It typically does provide you with a good source of information, but it is important to understand that there are differences between these databases and the data that they curate. So there's a group called the IMEx Consortium.
This is an international collaboration between a variety of different interaction databases. Their goal is to share the curation effort and to work on standards for curation, so that there is some unity across these different resources when users access and download that information. I would say that it's often necessary to integrate different types of protein-protein interaction data. One source doesn't carry everything that you will need, and it may not have within its data set a full representation of the types of interactions that you're looking for. The other things to consider are avoiding redundancies and inconsistencies. Multiple databases will potentially have the same protein-protein interaction data. There are inconsistencies in the curation, and there can be annotation mistakes. You can find one paper that says protein C does interact with protein D, and a second paper that says protein C does not interact with protein D. So the question is, who's right? Well, we're not sure, and it may be that you need to do other experiments to find out. So there are some caveats to simply downloading the data and generating a network. There is a variety of different IMEx partners here. I'm very familiar with IntAct, which is based at EBI; MINT is in Italy. There is also a variety of different data exchange formats: Systems Biology Markup Language, SBML; SBGN, Systems Biology Graphical Notation; the Proteomics Standards Initiative, or PSI; and then there's BioPAX, which is Biological Pathway Exchange. These are all different ways in which you can download a file of interaction data and upload that data into a visualization tool such as Cytoscape. I'm not going to spend too much time talking about these data formats. They're just a useful way of getting data out of the database and into your analysis pipeline.
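The earlier point about redundancy when integrating several databases can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout `(protein_a, protein_b, source)`, the function name `merge_interactions`, and the database labels `db_one` and `db_two` are all hypothetical, and real downloads would first need their identifiers mapped to a common namespace.

```python
from collections import defaultdict


def merge_interactions(records):
    """Collapse duplicate undirected interactions and remember their sources.

    `records` is an iterable of (protein_a, protein_b, source) tuples.
    The pair is stored as a frozenset so that (P1, P2) and (P2, P1)
    count as the same interaction.
    """
    merged = defaultdict(set)
    for protein_a, protein_b, source in records:
        merged[frozenset((protein_a, protein_b))].add(source)
    return dict(merged)


# Toy records from two hypothetical databases.
records = [
    ("P1", "P2", "db_one"),
    ("P2", "P1", "db_two"),   # same interaction, reported in the other order
    ("P2", "P3", "db_one"),
]

merged = merge_interactions(records)
for pair, sources in merged.items():
    print(sorted(pair), "supported by", sorted(sources))
```

Keeping the set of supporting sources per interaction, rather than just dropping duplicates, leaves room to score an edge later by how much independent evidence it has.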
So it's important to be aware that the interaction data in these databases is derived from different experimental methods, which emphasizes once more the limitations of some of the available protein-protein interaction data. The yeast two-hybrid I mentioned a moment ago is probably one of the most prolific tools, and it has generated a lot of interactions. The question, obviously, is how true that interaction is in the physiological cell. A interacts with B, sure, in the yeast two-hybrid system, but you may want to apply other methods that try to isolate that interaction within the cell. And it's clear that amongst these methods, and within the data in the interaction databases, there are many of what we call false positives, and possibly false negatives as well. So keep that in mind. Our knowledge of the interactome is always a little bit incomplete. We don't understand all the protein-protein interactions that could potentially happen in a human cell across every tissue under every condition. We don't have that information. And as I said, it's noisy as well. There are ways to use text mining to pull out interaction data. It's faster than manual curation, but it's not a perfect science either. One problem is recognizing gene names. Is Hedgehog a gene name or is it a species? But there are ways to deal with that, and there are approaches in natural language processing that have been used to good effect. There are popular resources out there that use text mining to create interaction networks. There's Pathway Studio, PathText 2, and there's a group called BioCreative, a group of researchers who are trying to improve text mining approaches in the context of many different biological questions. So let's talk a little bit more about some of the principles of network topology and how they apply to network analysis.
Because understanding the complexity of that network is key to extracting useful information that you would not otherwise learn by examining genes on an individual basis. Analyzing the topological features of a network is useful in identifying the relevant participants within the network, or ultimately within your gene list, and in looking for what we call substructures in the network that could help in ascertaining the biological significance of the network you've created. You can apply these topological properties to the network as a whole, or to individual nodes, edges, or structures within the network. There are many different strategies that could be used here, and again I'm only going to scratch the surface by discussing some of them. So we're going to introduce some terms, starting with the small world effect. Protein-protein interaction networks show this small world effect. Basically, it's a way of saying that there's great connectivity between proteins. In other words, the maximum number of steps separating any two nodes is small, no matter how large the network is. I'm going to say that most of you are familiar with Facebook. Yes. That clearly is a social networking tool. The principle there, traditionally, was that any two nodes, or any two people, were separated by no more than six steps. I think Facebook has demonstrated that it's a lot less now. My connection to you, well, there's a direct relationship now because we're sitting in the classroom and I'm teaching you. But before that, there could well have been maybe one or two hops between you and anybody else in this room. Clearly I've demonstrated that I've known Francis for a lot longer, and that started off as probably one to two hops, went down to one hop, and then clearly we're a hub.
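The "number of steps separating any two nodes" can be computed with a breadth-first search. The sketch below is a toy illustration, not something from the slides: the five-node network `toy` and the function names `hop_distances` and `diameter` are made up, and the diameter calculation assumes the network is connected.

```python
from collections import deque


def hop_distances(adjacency, start):
    """Breadth-first search: number of hops from `start` to every reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:          # first visit is the shortest route
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def diameter(adjacency):
    """Largest hop distance between any two nodes (assumes a connected network)."""
    return max(max(hop_distances(adjacency, node).values()) for node in adjacency)


# Hypothetical undirected network stored as node -> set of neighbors.
toy = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}

print(hop_distances(toy, "A"))
print(diameter(toy))
```

In a small-world network this maximum stays small even as the number of nodes grows, which is the property being described above.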
And basically this kind of connectivity allows for what I'd call an efficient and quick flow of information or signal within the network. But it does pose the question: if the network is so tightly connected, why don't perturbations, variants, or drug interventions that perturb a single gene or protein within that network have much more dramatic consequences for the network? Biological systems are extremely robust, and they cope with a relatively high amount of perturbation when you're perturbing a single gene or a single protein. Now, in order to understand how this can occur, we have to think about another property, called scale-free networks. Protein-protein interaction networks are scale-free. The number of connections each node makes is called its degree, and the majority of nodes in scale-free networks have only a few connections to other nodes. So basically, there's a large number of low-degree nodes and a small number of high-degree nodes, and these are the nodes with many connections.
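Degree, and the split into many low-degree nodes and a few high-degree hubs, can be counted directly from an edge list. This is a minimal sketch on a made-up seven-node network; the names `degrees`, `degree_distribution`, `toy_edges`, and the hub cutoff of 3 are assumptions chosen only for the example.

```python
from collections import Counter


def degrees(edge_list):
    """Count how many edges touch each node of an undirected network."""
    degree = Counter()
    for node_a, node_b in edge_list:
        degree[node_a] += 1
        degree[node_b] += 1
    return degree


def degree_distribution(degree):
    """Map each degree value to the number of nodes that have it."""
    return Counter(degree.values())


# Hypothetical network: one hub and a short chain hanging off it.
toy_edges = [
    ("HUB", "P1"), ("HUB", "P2"), ("HUB", "P3"), ("HUB", "P4"),
    ("P4", "P5"), ("P5", "P6"),
]

deg = degrees(toy_edges)
distribution = degree_distribution(deg)
hubs = [node for node, d in deg.items() if d >= 3]  # cutoff is arbitrary here

print(dict(distribution))
print(hubs)
```

On a real scale-free network, plotting this distribution on log-log axes gives a roughly straight line, with most of the nodes at the low-degree end.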
And this nature of the network actually promotes stability. When you get random failures within the network, the vast majority of the proteins are unaffected in terms of connectivity. You don't lose that connectedness, so you don't necessarily lose the functionality of those nodes in the network. And basically it's invariant to changes in scale, so as the network gets bigger, the nodes and the edges tend to stay quite stable. I would say that when you start to lose more of the major hubs, where you've got lots of these connections, then potentially your network breaks down and you end up with a set of isolated little graphs, modules that are not connected, and that's when problems occur. That's probably where we have disease states, things like that. So, the question down here at the front. Yeah, I see that for some of the nodes you've got a circle on top of a circle, or you don't seem to have an edge. Sorry, this is just for graphical purposes. This is just an illustration of a scale-free network, and I'm trying to remember where exactly I took this image from. It's not one that I generated. Typically you would see an edge between the nodes. Sorry, that's a mistake. There was another question somewhere. Okay. So another feature to talk about in terms of network topology parameters is distance, or shortest path. The distance between two nodes is defined as the number of edges in the shortest path connecting them. So if we start in the middle here at zero, the nodes with the number one are one hop away, the ones with number two are two hops away, and so on. Basically, when you're doing network analysis you can have a computer analyze all of these different relationships between any given position in the network and any other node, and so these are ways in which we
can actually compute and weight networks, and I'll talk a bit more about some of the algorithms that do this in a moment. Another term I want to introduce you to is centrality. Again, this is a term that was developed for social network analysis. In the case of protein-protein networks, centrality gives an estimation of how important a node, or potentially an edge, is for the connectivity and the information flow within that network. There are different metrics that can be used to calculate centrality. Now, it's clear from this illustration that the green and the red nodes are important. Does anybody have any thoughts as to which one's more important than the other, and why? The red one has more connections. That's right. How about from the green's perspective? Maybe the green is more important because it's the only thing linking the other ones together, whereas the red does not have access to the green but to the other four. It really depends. The answers are both kind of right. Red and green have their different levels of importance here. Green has a lot of power because the blue nodes interact with the green, and that's the way in which those blue nodes connect to the other side of the graph; in fact, they connect through green and red. And red might be more powerful since red can promote the flow of information amongst this close-knit community of grey nodes. So this is basically a... So, a quick question. When we talk about the importance of the nodes, you might borrow the same concept as in neural networks, in which each node is assigned an input and you take the product of the input with the weight, and the total of that shows, for example, the significance of that node. Do you have that concept, meaning you could say that green is more important than red based on the product of all the weights and the inputs? We have that concept. We have that concept.
I mean, that exists, and that's a good question. Again, that would require some experimental parameters. There's got to be some data that supports the node. Usually when you're doing an experiment you're generating some information that you can associate with that node, and if you can associate it with the node, then you can define that across the edge, and then you can do that for the whole global network. Then you apply an algorithm to try to identify, within this network, and we'll talk a little bit about this in terms of modules, not structures, I shouldn't use the word structure, but modules where there's a high degree of connectivity. So yes, exactly. Here's another way of looking at centrality measures. Remember, degree is the local measure: it refers to a node's dependencies on its nearest neighbors. Global centrality measures take into account the entire network. So there's something called betweenness centrality. I have to say I always get my betweenness and closeness mixed up when I try to define this. Betweenness centrality is where a central node lies on the shortest paths between the other nodes, and I'll demonstrate that in a moment with this network. The second global centrality measure, which we talk about a lot, is closeness centrality. This measures the closeness of a central node to the other nodes, and basically it's a way of getting an estimation of the flow of information in the network between one node and another. So in terms of degree, we're talking about the dependencies of other nodes. Lost the cursor there for a second. So this is the connection to here, to here, to here, et cetera. Closeness is basically the closeness of all the nodes to this node here. Oops, sorry. From this node here to this one here is two hops, one, two. That's closeness. And then for betweenness, we're
saying that this node in the middle here is connecting the nodes on the left to the nodes on the right. Okay, that's how we do it. Now, another important characteristic of protein-protein interaction networks is what we call their modularity, and there's a term called transitivity, or clustering coefficient, for a network. High transitivity means that a network contains communities, or groups of nodes, that are densely connected internally. Looking for these communities in the network is a nice strategy for reducing the complexity of the network and extracting functional modules within it, so things like protein complexes, for example, that reflect the biology within the network. There are several terms that are commonly used when talking about these clustering methods, and I'll talk about them now. There are motifs. These are subgraphs that repeat themselves in a specific network. Typically they are statistically significant patterns; they could form something like a negative feedback loop. And on the right there are these network communities, or clusters, and these reflect groups of nodes that are more connected among themselves than with the rest of the network. Okay. And when we describe protein-protein interaction networks, there are essentially two categories, functional modules and protein complexes. Modules are interchangeable functional units in which the nodes don't have to be interacting at the same time or in the same space, but the most important characteristic of the module is that its intrinsic functional properties do not change when it's placed in a different context. In the case of a complex, it's essentially a group of proteins that interact with one another at the same time and in the same space. The one thing to remember is that no assumptions are made about the
internal structure of these communities. We're only looking at high-density regions. There is a variety of different algorithms that are used to identify these community modules, and it is algorithmically challenging. It's an extremely complicated process. Forgive me, I'm trying to find the right words to explain it, because the algorithms have different approaches and different methods. The goal is basically to identify regions within the network where the genes are tightly connected. There's a whole bunch of them: there's the Markov clustering algorithm and Chinese whispers clustering, and a number of them were developed for social networks and have been reapplied to biological networks. The one we'll be using today in the ReactomeFI network is something called the Newman-Girvan fast greedy algorithm. There's no simple way of describing it, but it's like chiseling away at a piece of ice, where you want a beautiful little ice cube at the end of it that represents your favorite mountain. If you chisel off too much of that piece of ice, it fractures and breaks, and you lose that structure. So the skill of the person chiseling away at that ice with the pick is to chisel off just enough. That's the only way I can describe it, because these different approaches have really weird equations. I don't want to present you with statistical equations or algorithms, because we'll all get lost. The important point is these...

I didn't quite understand the difference between motifs and clusters. What I see here, in the motif, is that these four groups are all connected by that central point?

Yes, that's right. You don't necessarily have a central point within the clusters, because typically
with the clusters, you're looking at identifying modules where there is a high degree of connectivity within the module relative to across the modules. That's what you're looking for. And the purpose of that, really, is guilt by association: genes that fall within the same module are potentially involved in the same biological processes.

So the next step here is annotation enrichment analysis. There are different ways in which you can understand the biological context of protein-protein interaction networks, so we use these analysis tools. I should point out that it's not strictly a network analysis tool that we're talking about now, but combining the annotation analysis with the topological network analysis is a way to find not just functional modules within the network, but also to label those functional modules with some form of annotation, which could be Gene Ontology or some other annotation that you have.

I have a quick question about the clustering that you mentioned initially. When you want to run the algorithm, do you have to predetermine how many clusters you want?

No.

Or how does the system decide on the number of proteins and protein interactions? What kind of dimensions does the system need?
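
As a rough illustration of the point raised in this question, the sketch below shows a naive greedy modularity clustering in plain Python. It is not the ReactomeFI implementation, and the function names (`modularity`, `greedy_modules`) and the toy graph are assumptions made up for this example. Each node starts in its own cluster, and the pair of clusters whose merge most increases modularity is joined, until no merge helps. The number of clusters is therefore an outcome of the procedure rather than an input.

```python
from itertools import combinations


def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    label = {node: i for i, comm in enumerate(communities) for node in comm}
    internal = [0] * len(communities)      # edges with both ends in the community
    total_degree = [0] * len(communities)  # summed degree of the community's nodes
    for u, v in edges:
        total_degree[label[u]] += 1
        total_degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(
        internal[i] / m - (total_degree[i] / (2 * m)) ** 2
        for i in range(len(communities))
    )


def greedy_modules(edges):
    """Merge communities greedily while a merge still increases modularity."""
    nodes = sorted({n for edge in edges for n in edge})
    communities = [frozenset([n]) for n in nodes]  # every node starts alone
    best_q = modularity(edges, communities)
    while len(communities) > 1:
        best_merge = None
        for a, b in combinations(communities, 2):
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            q = modularity(edges, merged)
            if q > best_q + 1e-12:  # keep the best strictly improving merge
                best_q, best_merge = q, merged
        if best_merge is None:  # no merge helps: stop "chiseling"
            break
        communities = best_merge
    return communities, best_q


# Hypothetical toy network: two triangles joined by a single edge (C-D).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]
modules, q = greedy_modules(edges)
print(sorted(sorted(m) for m in modules), round(q, 3))
```

On this toy graph the procedure stops at two modules, one per triangle, because merging them across the single bridging edge would lower modularity. Recomputing modularity for every candidate merge is slow and only suitable for small examples.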
The centrality measures that I was talking about. And as I said, getting back to the ice cube, you start off with a large block, you have everything there, and you start chiseling away at different parts of the network. Then you get to a point, and forgive me, it's not straightforward to tell you where that cutoff point is, because for different networks and different data the algorithm is going to handle the network differently. But it basically reduces a lot of information down to a minimum, and we'll see this when we do the ReactomeFI analysis. It really will be clear, because it gives you a list of clusters, and that number will ultimately change depending on your analysis. For example, there's a variety of different filtering approaches: you might only be interested in nodes within the network that have a particular piece of metadata associated with them. When you filter on that information, that's going to reduce the size of the network, and that's going to affect the clustering analysis. So there's a variety of different things we can do to preset the system for doing clustering analysis, and we'll talk a bit more about that shortly.

In terms of network visualization and analysis, you need a piece of software to create the network, and it's a handy visualization tool as well. You're essentially uploading your data into that tool, either in a table format or through one of these specialized data exchange formats. The tool is there to provide you with a way of navigating through the network. You can analyze the network using some additional applications within the tool. The question here is: do you see clusters or network modules? You can then label those clusters with pathway or Gene Ontology annotations, and then you can export that data either as a table or as some kind of image that you can present in your publication. And we'll do this when we're in the
demonstration, working through the ReactomeFIViz app.

So there are different network tools out there. Typically they're standalone applications, and most of them are open source. Some were initially developed for social network analysis, but they apply to biological data as well; there's not a problem there. The difference between these tools is maybe in whether they support small or larger networks. For example, Gephi is really good: it can support networks with thousands of nodes and millions of edges, and it might represent that data slightly better than you would see in Cytoscape. But Cytoscape is probably one of the most popular network analysis tools, because it has a lot of applications for network representation, visualization, integration of other data, and analysis as well, and we're going to demonstrate that later today in the lab. And then finally there's NAViGaTOR, which is another project locally here. It's another network analysis tool that, again, is open source. It can support very large networks composed of many nodes and edges, and there's a small suite of analysis and visualization applications, or plugins, that allow you to analyze your data sets.

So now let's talk about the ReactomeFI network. We'll learn a little bit more about this in the lab, but essentially it's a tool that can be used to analyze large disease gene data sets. Analyzing mutated genes in a network context allows you to understand the relationships between those genes, and potentially elucidate the mechanism of action of the drivers and the interactions between the passenger mutations and these drivers. It facilitates hypothesis generation about the role of these genes within the disease phenotype. The take-home message here is that you're reducing potentially hundreds or thousands of mutated genes down to a dozen or so mutated pathways. So the
functional interaction network is actually based upon reliable biological network information from manually curated pathways, extended by verified interactions from a variety of other protein-protein interaction databases. In order to incorporate pathway knowledge into the functional interaction network, you have to take each reaction and break it up into a set of binary interactions. That's essentially what we've done, and you can then create this large functional interaction network. There are basically two types of information: the interactions based on curated knowledge from pathways, and then others which are predicted using a naive Bayes classifier. We're using features extracted from human protein-protein interaction databases, protein-protein interactions projected from several model organisms, gene co-expression data, and protein domain interactions. You bring all this information together, you apply the naive Bayes classifier, and it scores these less reliable interactions. Then we can combine all of these interactions into a much larger network, and we rebuild this network on a yearly basis.

So just use this schematic as an illustration. Imagine that this white line represents the ReactomeFI network. We can project the genes from your data set onto that network: there are going to be genes in the network that correspond to your data set, and obviously, by default, we know that there are interactions between those nodes within the network. Now, you see there are some sparse connections here. It's not always like this with the analysis, but we can sometimes introduce these little triangles, which we call linkers. These are genes that are not part of your data set. There's a minimal number of these that we can inject into the network to provide some greater degree of
connectivity. So now what you've done is you've gone from a large network, you've projected your genes onto that network, and you've got a smaller sub-network based wholly on your data and potentially some additional genes that provide that connectivity. You subtract away everything else that's not part of your data set, and now you're left with a sub-network based on your experimental data. I think somebody had a question here.

For these linkers, would it be appropriate to go and look at their expression, to see if they would work out?

Well, here's an example of one. If you've got a gene expression data set that you're using to build the network, those linkers could be the transcription factors. They may never show up as expressed, or you may not necessarily detect that change in expression. I've actually done some data analysis with this tool, and for one of the transcription factors, when I did the analysis I got this nice functional module, and right in the middle was the transcription factor that was directing the expression of all these nodes. You could colorize whether the genes were up- or down-regulated in the different tissues. So that's the data. The linker is not part of your data, because you can't necessarily see it, but it's there in a wholly different way. So, the question back there. Yeah, sure. This one here? Oh, this one here, sorry. So this large network, and actually I do apologize, I just realized the slide here says 291,000 interactions. Our network now is over 400,000 interactions. Yes?

You use, for example, a naive Bayes classifier for that. If you used another method of classification, for example a nonlinear classifier, would you keep the same network that you just showed on the slide, or is the result going to be different?

That's a good question. I'm not sure. I would wonder, if you took a different approach, you would probably get a slightly
different network, I think.

But why has that classifier been chosen?

Oh, a very good question. I should talk to the developer about that. I think at the time this was the standard type of classifier that was applicable to this approach. But I should check with them exactly, because I do know that there are other approaches now that you can use to generate these networks.

So it's possible to use different classifiers, but the resulting network you'd see would be different?

I would think so, yeah. So anyway, let's take the 127 genes I introduced at the beginning of the talk, from The Cancer Genome Atlas project. You can take this gene list and create something like this. I have jumped over a few steps here, but you can see these different colored modules. There are four modules here, and the genes are more tightly connected within each of these modules than across the network. We can then, using annotation enrichment analysis, label the genes within these modules with, for example, receptor tyrosine kinase signaling pathways, cell cycle, the p53 pathway, and signaling by different pathways. So these things do make some sense when you take your biological knowledge and actually look at the graphs. But this is a hypothesis-generating tool: you need to go and do some additional experimental validation. Or, alternatively, you've already done the experiment and you want to use this type of analysis to validate your own hypothesis.

On that previous slide you showed, in that cluster in the bottom right, a bunch of those genes are known to be in the pathway. So the additional genes that are not known to be there would be some kind of hypothesis, right?

Yes, I mean, you could say
that there are interactions between these different modules as well, across the sparse connections. Excuse me, sorry, I'm just losing my voice a bit here. Maybe I shouldn't be drinking coffee. Just bear with me one second. There are different approaches here: you could use it as a hypothesis-generating tool, or as a tool to validate an already experimentally derived hypothesis.

Another approach we could take is to combine network analysis with gene expression data to potentially identify network modules that are related to patient survival. Here you calculate the gene expression correlations for the genes involved in the functional interaction network, and you assign those correlations to the FI edges, so you convert an unweighted network into a weighted one. You then use a clustering algorithm to identify the modules. Within the ReactomeFI app we have two applications, Cox proportional hazards and the Kaplan-Meier model, which allow you to do survival analysis. Just on the right here we've got a KM plot, that's a Kaplan-Meier plot. It's drawn as survival versus time elapsed for the different groups of samples, and there's a log-rank test between the two lines to check significance. In this case the samples were divided into two groups: samples that have low expression of genes within the module, that's the red line, and samples having high expression in the module, which is the green line. This particular module is shown here on the left: 31 genes within this module were involved in the mitotic cell apparatus. So a take-home message from this type of data integration analysis is that patients with low expression of module genes fared slightly better than patients with high expression of module genes within this particular module. So this kind of single network module, or potentially a set of
modules, could be used as a way of defining a signature for cancer patient prognosis.

So now, in the final few minutes of this talk, and I'm going to ask if I could have maybe two minutes extra of your time, there are a few more slides. The slides towards the end are just informational, and I don't necessarily need to go through them, because they just list a variety of links to different pathway and network resources out there. So this is a more topological approach to network and pathway analysis, called pathway-based modelling. This approach tries to infer how pathway states are disrupted in disease. It uses quantitative and qualitative measurements to infer the activities of various components within the pathway and relate that to some kind of disease, such as cancer. The goal here is to integrate different types of experimental data, which can be used to define these multiple molecular alterations. It skews a little bit into the area of systems biology, but I hope in the next few slides to explain a little more about how these can be useful in a variety of different use cases.

So this is a tool called CellNetAnalyzer. It's a MATLAB tool that provides algorithms and ways to explore data visually, for the purposes of exploring metabolic and signaling networks. In the case of computational strain design, when you're trying to do some kind of metabolic engineering, trying to alter a strain's ability to produce a protein or something, CellNetAnalyzer takes in a variety of different metabolic, signaling, and gene regulatory data, brings it all together, and tries to predict whether a strain is able to... I'm not explaining this very well, apologies. My brain is thinking about it and I'm trying to think of this visually as well. But basically it tries to explore the structural and functional properties of that
network, and we'll leave it there.

NetPhorest and NetworKIN are basically a way of studying the phosphorylation events that occur with a given phenotype or disease. This is based on looking at large intracellular signaling networks within large phosphoproteomic data sets. There's ARACNE. This is another algorithm; it's been around for a long time. It analyzes microarray data sets and tries to infer interactions based on gene co-expression data. And finally there's this PARADIGM approach, so I'll spend a little more time talking about... No, I think it's mostly gene expression data that it relies upon. There's also a variety of other algorithms that scientists use that allow you to study the topology of the pathway or network, but PGMs, that's probabilistic graphical models, are the approach here. The idea is again to integrate multiple different types of data, things like gene expression data with copy number variation, or sequence data, or proteomics data, to understand how changes within individual entities have an effect on the activity of the pathway. So the questions you can potentially answer are: are there significantly impacted pathways within a disease? Can you link pathway activities to patient phenotypes as well? And can you predict, for example, drug effects? One drug is perturbing a protein within a given pathway, but there could also be off-target effects.

For PARADIGM to work, you have this simplified network view here of a traditional pathway, but in order for there to be integration of multiple different data types, you have to take each individual component of a traditional pathway and break it down into what we call a factor graph. In the factor graph, each entity will have information about the gene, the transcript, the protein, and the protein state, and with each state there's an association with a different experimental data type. And so the goal here, and we're just going to use this
Reactome pathway as an example, is to ask a series of biological questions. We have here CTGF and NPPA, which regulate cell proliferation. They're regulated by these different upstream factors, YAP1, WWTR1, and RUNX2. By converting the Reactome pathway into a PGM, we can answer questions like: if YAP1 copy number is higher, is CTGF expression up-regulated? Or, if NPPA activity is higher, how likely is it that WWTR1 is up-regulated? Or maybe something else: maybe RUNX2 is experimentally down-regulated. In this example we've integrated copy number variation and gene expression data from ovarian samples into this converted factor graph. We perform an inference analysis, and the results are shown with and without copy number variation and gene expression, to see how much the pathway entities were impacted by abnormal gene expression or by copy number variation, and they're coloured proportionally. The left panel here is showing a visualisation of the impact on the pathway entities, and the right panel is showing the observed copy number variation and expression values. In this slide the comparison is between two ovarian cancer samples. The first sample, you can see here, has lower NPPA expression when compared to sample 2, and it's likely because the copy number variation for WWTR1 is higher here than here. I'm just trying to find it on the left here. Yep, here we go. In this sample the value for the copy number variation of WWTR1 is 1, and here, in sample 2, it's actually higher. So this is an approach to use the molecular information that you have, integrate it into a model, and potentially predict the activity of that pathway in a patient sample.

So the last few slides here summarise all of the different resources that I've talked about in terms of pathway and network analysis and pathway modelling, with all the links here as well. And apparently we're
on a coffee break now.
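
As a closing illustration of the three centrality measures described earlier in the talk (degree, closeness, and betweenness), here is a minimal sketch in plain Python. The toy graph, with a middle node M joining two nodes on the left to two on the right, and the function names are assumptions for this example, not anything taken from the slides or from Cytoscape. Closeness is computed here as the number of other reachable nodes divided by the sum of hop distances to them, and betweenness as the unnormalised count of shortest paths passing through a node; other normalisations exist.

```python
from collections import deque


def bfs_paths(graph, source):
    """Return hop distances from source and the number of shortest paths to each node."""
    dist = {source: 0}
    n_paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                n_paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u lies on a shortest path to v
                n_paths[v] += n_paths[u]
    return dist, n_paths


def centralities(graph):
    """Degree (local), closeness and betweenness (global) for every node."""
    nodes = sorted(graph)
    info = {s: bfs_paths(graph, s) for s in nodes}
    degree = {v: len(graph[v]) for v in nodes}
    closeness = {}
    for v in nodes:
        dist = info[v][0]
        total = sum(dist.values())
        closeness[v] = (len(dist) - 1) / total if total else 0.0
    betweenness = {v: 0.0 for v in nodes}
    for i, s in enumerate(nodes):
        dist_s, paths_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t are not connected
            for v in nodes:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, paths_v = info[v]
                # v is on a shortest s-t path when the two legs add up exactly
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    betweenness[v] += paths_s[v] * paths_v[t] / paths_s[t]
    return degree, closeness, betweenness


# Hypothetical toy network: M connects the left pair to the right pair.
edges = [("L1", "L2"), ("L1", "M"), ("L2", "M"),
         ("M", "R1"), ("M", "R2"), ("R1", "R2")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

degree, closeness, betweenness = centralities(graph)
print(degree["M"], closeness["M"], betweenness["M"])
```

In this toy graph M has degree 4, closeness 1.0 (every other node is one hop away), and betweenness 4.0, because all four left-to-right pairs have their only shortest path through M. L1, by contrast, is two hops from R1 and R2 and lies on no shortest path between other nodes.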