So, thank you, Michelle. Good morning, everybody. Hopefully, Saturday won't be too long today, but thank you all for being here. My name is Robin Hall. I work at the Ontario Institute for Cancer Research. I have to disclose that I work on a project called Reactome. I'll be talking a lot about that today, but I will also be talking about some other resources. I have a background in microbiology, genetics, genomics, proteomics, and bioinformatics, so I've done the whole thing. I've known Francis for a long time. In fact, I've known Michelle for a very long time. And if you think a week is hard for a bioinformatics course, when I first started, we did two-week courses. I'm actually an alumnus of the CBW, so it's funny for me to be here talking to you today about pathways and networks. So let's get started. I'm going to skip through the slides here. I've got some nice illustrations of things you're going to see later today. Yuri gave this slide the other day, and I want to start off with it so we can understand a few things. Yuri went through goal one, enrichment analysis of gene sets. Today, we're going to talk about goals two and three. Goals one, two, and three are very much complementary approaches, and they can be run in parallel. I want to make sure you understand that the outputs from goal one don't just automatically feed into goal two. You can, in fact, use the same gene lists that you've been working with over the last few days: the gene list you used in goal one can easily be the input to goal two and could potentially feed into goal three. Goal three, pathway-based modeling, is going to require some qualitative and quantitative data, and that may not necessarily be available to everyone. But really, goals one and two are very complementary approaches.
You can run them in parallel and sometimes get the same results, and sometimes, depending on the algorithms and approaches, you can get dissimilar results. Okay. So with goal two, we're going to talk about de novo subnetwork construction and clustering. This is where I'm going to introduce the Reactome Functional Interaction tool later today. This is a tool that works within Cytoscape and allows you to analyze your gene lists, gene expression data, and somatic mutation information as well. It allows you to answer questions like: are new pathways altered in cancer, and are there clinically relevant tumor subtypes? The reason I talk a lot about cancer, and forgive me, the non-cancer researchers here, is that cancer has the best data sets out there right now in terms of sample numbers, clinical information, and very rich annotations such as gene expression data, copy number variation data, and somatic mutation information. So they are the better data sets to present. And then finally, pathway-based modeling. It's a much younger approach than goals one and two, and it's a little bit more experimental. But I think there's some value there, because it helps to evaluate how the states of pathways and networks are disrupted in disease, and it allows you to analyze more than one data set or data type at a time. You can answer questions such as how pathway activities are altered in a particular patient, and really ask questions like: are there targetable pathways in these patients? So I think that's where I want to start. Let's take a step back and talk a little bit more about pathway and network analysis. One of the challenges with analyzing huge amounts of data is extracting meaningful information and using it to answer some fundamental biological questions.
So pathway and network analysis tries to incorporate prior biological knowledge to analyze genes and proteins in groups, in a biological context. The first goal really is to dramatically reduce the data size. When I first started in research I was looking at one gene, one protein, one pathway. In the early days of microarray analysis you could be looking at hundreds of genes. Now you can look at thousands of genes across multiple samples, so rather than having tens or hundreds of thousands of data points, you could have billions. How do you reduce that complexity and arrive at an answer to the biological question? It's also a good approach for increasing statistical power, because you test fewer hypotheses. I put in this line about finding meaning in the long tail of rare cancer mutations because, with somatic mutation data, there are clearly some driver mutations that are the most interesting and well studied, but there are also lists of hundreds or thousands of genes that are just there, and it's not clear what those genes are actually contributing to a particular phenotype or disease. And then you can tell a whole host of other biological stories with pathways and networks. We've talked about identifying hidden patterns within gene lists. Pathways are a great way to view what you're doing in the lab: essentially they help create mechanistic models to explain your experimental observations. Quaid, who isn't here yet, will talk later about how you can predict the function of unannotated genes with GeneMANIA, so I won't talk about that just now. Reaction graphs, pathway graphs, and network graphs are all very useful resources for building up a framework for quantitative modeling and systems biology. I'll touch a little bit more on that as I go through my talk.
And then, and we'll try to demonstrate this a little bit using the Reactome Functional Interaction Network as well, there's how you can use this to develop molecular signatures, for example prognostic signatures. So pathway and network analysis, to me, is any analytical technique that makes use of biological pathway and molecular interaction or molecular network information to gain insights into a biological system. Having been in the game for at least 15 years, I would say it is still rapidly evolving, and there are many approaches. I can barely scratch the surface of those techniques in the next hour or so. What I've tried to do is focus on those that are probably the most relevant to you, the ones that have the largest community support and are most valuable to your research. I don't know if Yuri already showed this slide the other day, but I wanted to give you my take on what a pathway is and what a network is. To me a biological pathway is a series of events among a set of molecules within a cell that leads to a certain product or a change in cell state. You've got metabolic pathways, which are composed of a lot of chemical reactions. You've got signal transduction, which is traditionally moving a signal from the outside of the cell to the interior, but I like to think that it occurs the other way as well. You also have gene regulation pathways, and you'll hear a lot more about that, I think, from Michael Hoffman tomorrow. Basically you're turning genes on and off. But most pathways don't have a start point A and an end point Z. There are no real boundaries, and sometimes pathways intersect with one another. You get this idea of crosstalk. And when you start having to think about multiple pathways, really you're starting to look at their interactions, and then you start thinking about networks.
And you could argue in some ways that the view we have of pathways is somewhat abstract. It's something that was created way back in, let's say, the mid 20th century. Pathways have been around for many, many years, but they are something that researchers have created. Networks are less abstract, but they lose a lot of the rich information that you find in pathways, like particular protein states and the regulators and activators that modulate some of these reactions. But the bottom line is that both approaches are very complementary, and they help us to learn a lot about human disease. Identifying which genes, proteins, and other molecules are involved in a biological pathway can give us clues about what goes wrong in a disease state. So there are many different pathway databases out there. If you go to a resource called Pathguide, I think it's in the notes, you'll find easily 300 different pathway databases. They differ in how the information was obtained: was it curated, was it automatically derived from text mining, or was it automatically derived from high-throughput experiments? They also differ on species, by which I mean the types of organism that they cover. Some databases focus on human pathways, others on worm, and so forth. The other thing about pathway databases is that, from my point of view, they should provide a curated biochemical view of pathways and processes, where cause and effect are captured in a human-interpretable visualization. So there has to be a diagram, and once you've seen one pathway diagram, if someone shows you another, you can easily understand it. There are a few caveats to pathway databases, and the first is the coverage of the genome.
Because a lot of pathway databases are curated, with people going through the literature to identify molecules and build up these pathways, it takes time. Also, some of the experiments that people perform are not necessarily the best experiments, I should carefully choose my words here, and so that information doesn't necessarily get reflected in a particular pathway. The other thing is that some databases disagree on the boundaries of where a pathway ends. I think there's a core element to most pathways, but when you start looking at activators and inhibitors and different kinds of crosstalk between pathways, that's where the boundaries get a bit vague. That's where networks become more valuable, I think, but that's something to appreciate. So I'm going to talk a little bit more about Reactome, but I also want to mention KEGG, and there are resources like PANTHER as well, and the National Cancer Institute Pathway Interaction Database, or NCI-PID. These are all reaction network databases. In fact, you could scrub the word database; they are often called knowledge bases. A knowledge base, to me, is a resource which has a lot of information about the molecules, the pathways, the graphical visualizations, and the tools for analysis, and which gives you options to download the data and use that information in third-party tools. As I said, I work for Reactome, so I have a little bias here, but I'm very familiar with KEGG and other pathway databases. These are reaction network databases because they explicitly describe biological processes as a series of biochemical reactions, and the flexibility of this data model allows you to represent many different events and states found in biology. So you can describe metabolic pathways, signaling pathways, and gene regulatory pathways using this approach.
Essentially, it really looks like a classical chemical reaction. You have one or more inputs feeding into the reaction, and then you have one or more outputs. The outputs then become the inputs for a following event, and you start locking these pieces together just like a jigsaw puzzle. What's important about these databases is that they should explicitly describe the molecules in the reaction. In the case of Reactome and KEGG, they could be using resources like UniProt to describe proteins and ChEBI to describe small molecules, and they'll include non-coding RNAs, disease variants, and therapeutics as well. In terms of describing the elements of the reaction, I think you were introduced to Gene Ontology the other day, so you can use GO biological process to describe regulation, or molecular function to describe the catalytic activity. These reactions occur in particular cellular compartments, so we can use cellular component to describe that as well. The other thing to point out is that wherever possible these reactions and pathways should be linked back to the primary literature. So if you ever want to question something, you should be able to go back to a PubMed citation. The caveat there is that some of the metabolic pathways you can only find in textbooks, but they're textbooks like Stryer and... what was the other one? You'll find essentially the central dogma of molecular biology and metabolism in these textbooks, and they can be considered reliable sources of information. So this is my view of pathway databases. KEGG is not just about pathways; it's a collection of biological information. It's important to point out that this is compiled from published material, so there's a team of scientists who read the papers, extract the relevant information, and put that into the database.
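To make the reaction-based data model described above a little more concrete, here is a minimal sketch in Python. It is not how Reactome or KEGG actually store their data; the class, the field names, and the toy molecules (A, B, C and so on) are all hypothetical and are only meant to show inputs, outputs, a compartment, and literature references, and how the output of one reaction becomes the input of the next.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A toy reaction record (hypothetical fields, for illustration only)."""
    name: str
    inputs: frozenset          # one or more input molecules
    outputs: frozenset         # one or more output molecules
    compartment: str = "unknown"   # e.g. a cellular component term
    references: tuple = ()         # e.g. literature citations


def downstream(reactions, start):
    """Names of reactions that consume at least one output of `start`."""
    return [r.name for r in reactions
            if r is not start and start.outputs & r.inputs]


# Hypothetical three-step example: A + B -> C, then C -> D; E -> F is unrelated.
r1 = Reaction("step1", frozenset({"A", "B"}), frozenset({"C"}), "cytosol")
r2 = Reaction("step2", frozenset({"C"}), frozenset({"D"}), "cytosol")
r3 = Reaction("step3", frozenset({"E"}), frozenset({"F"}), "nucleus")
pathway = [r1, r2, r3]

# step2 follows step1 because it consumes C, the output of step1.
chain = downstream(pathway, r1)
```

Chaining reactions by shared molecules like this is the jigsaw-puzzle idea: the pathway is never stored as one picture, it falls out of which outputs match which inputs.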
So we call this a curated database. They'll have information about genes, proteins, pathways, interactions, and reactions, and they associate these with specific organisms; there are many hundreds of organisms supported by KEGG. And they provide a map, a diagram, of how these components are organized in a particular cellular structure or pathway. This here is a typical view of a KEGG pathway. It typically fits on one slide. The green elements in the diagram represent the proteins. You can see some genes here in white. You also have these larger oval shapes, which are encapsulated pathways. So this is showing the crosstalk between this particular pathway, the cell cycle pathway, and MAPK signaling, and also ubiquitin-mediated proteolysis and apoptosis. The lines themselves represent the different types of reactions. The arrows typically represent activation. The perpendicular line represents inhibition. And sometimes plain lines represent direct interactions with no additional information. Of course, these are useful diagrams because you can start overlaying experimental data on top of them. In KEGG itself you can only browse the information in the pathway, and you have to use third-party tools to overlay your data onto the diagrams. Reactome, on the other hand... Oh yes, the other thing to point out with KEGG is that its licensing model is such that you can browse the website and use those resources freely, but when you start trying to download data, you have to get a license, whether you're an academic or an industrial user. There are other tools out there, like Ingenuity, that I haven't summarized here, and you have to purchase a license for those. My difficulty with some of those tools is this: when the information is licensed and I don't have access to it, how do I reproduce your experiments?
There's not a lot of transparency in some of those tools either. Resources like Reactome try to make everything open source and open access so that everything we do is 100% transparent. You can use the data freely for whatever you would like to do, and you can use our tools as freely as you want to. Oh gosh, Kyoto Encyclopedia of Genes and Genomes. Or Genomes and Genes. It could be either way. So Reactome is also a curated resource. We focus primarily on the curation of human pathways, and as I said, that covers metabolism, signaling, and other biological processes. Pathways are traceable back to the primary literature. We cross-reference other databases out there so that we can enrich our own annotations with gene, protein, and small molecule information. And then we provide tools for data analysis and visualization. In this screenshot I'm showing you the Reactome pathway browser. It's the main visualization tool to browse the biological pathways and to analyze experimental data, whether that's a gene list, a protein list, or a small molecule list. On the left here we have a pathway hierarchy, which lists all the pathways known to Reactome. There is a particular iconography for pathways, and as you explore the different levels of the hierarchy you can see the sub-pathways and their relationships to the reactions as well. As you interact with this display you'll see pathway diagrams appearing here in this panel on the right. And then at the bottom you have this details panel, which provides more textual and graphical information about molecules, protein structures, and experimental data, whether that's data that you're uploading or data that's already been provided by resources like the Gene Expression Atlas. So in this particular view I've just overlaid gene expression data onto this pathway diagram.
So the different colored entities reflect different expression levels, such as genes that are up-regulated. And then Reactome has this idea of complexes, where one icon represents many molecules, and the horizontal lines that you see reflect the expression values for the different components of the complex. So KEGG and Reactome are great resources to explore pathways and to analyze data. Now if you would like to do some of this yourself and you need a good source of pathway data that isn't necessarily available in Reactome or KEGG, one good source is Pathway Commons. I think Yuri might have introduced you to this yesterday. It's a very useful resource for network biology, and it essentially gives convenient access to biological pathway and network information collected from a variety of public pathway databases. You can search, visualize, and download pathway and network information from this resource. So now that I've talked about pathways I want to focus a little bit more on interactions. I believe Yuri introduced you a little bit to networks yesterday in Cytoscape, and I just want to use this slide to remind you of a few things. The nodes and the edges within interaction networks can be almost anything you want them to be. Typically nodes can be genes, proteins, metabolites, complexes, and things like that: any sort of object. Edges can be physical or functional interactions, or activators, regulators, reactions: any sort of relationship. In fact, in the next slide I'm going to show you that there are many different types of interaction networks out there. Most cellular networks are available for what I call the super model organisms, so things like Drosophila, yeast, C. elegans, and Arabidopsis. There are also human interaction maps available, although I would say the model organisms are far more well studied.
And as such, when you take into consideration genes, proteins, and small molecules, you get these different types of networks. We have the transcriptional regulatory network, where the nodes could represent transcription factors and the putative DNA regulatory elements, and the edges demonstrate the relationship between those two entities. Michael Hoffman will tell you more about this type of network tomorrow. Virus-host networks: the nodes represent viral and human proteins, and the edges represent the relationship between those nodes. Metabolic networks: this is where things change a little bit. The nodes represent the enzymes, and the edges, rather than demonstrating a relationship, are actually the substrates and products of those enzymes. It's a slightly different view. And then you have disease networks, where the nodes are the diseases, the disease terms, and the edges connecting those diseases are in fact the gene mutations they have in common. So there are different attributes you can apply to these networks. By far the biggest group, and the most widely applicable to data analysis, is the protein-protein interaction network. Sometimes it's referred to as a gene-gene interaction network, but essentially it's one and the same, and we're going to talk a lot more about this today. Now, information about the relative importance of the nodes and edges within a network can be obtained by applying a variety of graph measures or algorithms. They were widely developed in other areas like sociology and more recently applied to network biology. I don't want to spend too much time going through these algorithms, but it's important for you to understand, as you look at these networks, how some of these structures come into existence.
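As a small illustration of how the same kind of association data turns into one of these network types, the sketch below builds a toy disease network in Python: diseases are the nodes, and two diseases get an edge when they share at least one mutated gene. The disease and gene names are placeholders, not real annotations, and the function name is made up for this example.

```python
from itertools import combinations

# Hypothetical disease-to-gene associations (placeholder names only).
disease_genes = {
    "disease_1": {"GENE_A", "GENE_B"},
    "disease_2": {"GENE_B", "GENE_C"},
    "disease_3": {"GENE_D"},
}


def disease_network(assoc):
    """Return {(disease, disease): shared genes} for every pair that shares a gene."""
    edges = {}
    for d1, d2 in combinations(sorted(assoc), 2):
        shared = assoc[d1] & assoc[d2]
        if shared:  # only connect diseases with at least one gene in common
            edges[(d1, d2)] = shared
    return edges


edges = disease_network(disease_genes)
```

Here the edge carries the shared genes as its attribute, which is the sense in which the edges of a disease network are the gene mutations themselves.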
So the most popular terms being used are degree, closeness, and betweenness. The degree is the number of edges that a node has. If you look at the green node in the middle there, that is the node with the highest degree of connectivity within that graph. Closeness is a measure of how close a node is to the other nodes in the network. If we look at the two red nodes here, these are the nodes with the highest closeness. And betweenness quantifies the number of shortest paths that pass through a particular node. So if we look at the purple node here, this is the node with the highest betweenness. Probably one of the most robust measures of network topology is the average shortest path, which is defined as the average of all the shortest paths among all pairs of nodes within the network. So if you were to ask, what's the shortest path between you and Mark Zuckerberg? Okay. That is basically the number of people that you need to communicate with in order to contact the owner of Facebook. That's the shortest path. It's the simplest real-world example of it, but these measures have huge applicability in network biology. As I talk more about Cytoscape later today, you'll want to learn more about these different algorithms. There are a number of apps that you can use in Cytoscape to compute these graph measures and calculate a lot of basic network properties. CentiScaPe is one of them. Basically you just install these apps into Cytoscape, create your network, and then apply these different algorithms to the network structure. Now I want to talk a little bit about the network databases that are out there. Again, the same Pathguide resource also lists the network databases, and there are plenty of them out there as well, several hundred of them.
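For anyone who wants to see what these measures look like outside of a Cytoscape app, here is a minimal Python sketch that computes degree, closeness, betweenness, and the average shortest path on a toy network. It assumes a small, connected, undirected, unweighted graph; the five-node network and the function names are made up for illustration, closeness is taken as (n - 1) divided by the sum of distances, and betweenness is left unnormalized. Dedicated tools may use different normalizations.

```python
from collections import deque
from itertools import combinations


def bfs(graph, source):
    """Breadth-first search: distance and number of shortest paths from source."""
    dist, paths = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:               # first time we reach this neighbour
                dist[nb] = dist[node] + 1
                paths[nb] = 0
                queue.append(nb)
            if dist[nb] == dist[node] + 1:   # node lies on a shortest path to nb
                paths[nb] += paths[node]
    return dist, paths


def measures(graph):
    """Degree, closeness, betweenness per node, plus the average shortest path."""
    info = {v: bfs(graph, v) for v in graph}
    n = len(graph)
    degree = {v: len(graph[v]) for v in graph}
    closeness = {v: (n - 1) / sum(info[v][0].values()) for v in graph}
    betweenness = {v: 0.0 for v in graph}
    for s, t in combinations(graph, 2):
        d_st, sigma_st = info[s][0][t], info[s][1][t]
        for v in graph:
            if v in (s, t):
                continue
            # v is on a shortest s-t path if the two legs add up to d(s, t)
            if info[s][0][v] + info[v][0][t] == d_st:
                betweenness[v] += info[s][1][v] * info[v][1][t] / sigma_st
    total = sum(info[s][0][t] for s, t in combinations(graph, 2))
    avg_path = total / (n * (n - 1) / 2)
    return degree, closeness, betweenness, avg_path


# Toy network: b is a hub connected to a, c and d; e hangs off d.
edge_list = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "e")]
graph = {}
for u, v in edge_list:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

degree, closeness, betweenness, avg_path = measures(graph)
```

In this toy network node b comes out on top for all three measures, but in real interaction networks the node with the highest degree is often not the one with the highest betweenness, which is why it helps to look at more than one measure.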
Sometimes people think network databases are in effect pathway databases as well. I think there are differences. I believe that pathway databases carry a higher level of biological knowledge, because network databases differ very much in how their information is generated. People can use text mining to identify interactions. They could just suck in all of the high-throughput data that's out there in publications to create these networks. Or they could physically go into the individual papers where people have done small-scale biology, identify the interactions in the paper, and put those into the database. That latter step is called curation, and by far I think it's the best source of interaction data. Interaction databases do have more extensive coverage of biological systems. I would say a typical pathway database will have about 30% coverage of the genome, whereas for interaction databases it can vary between 60 and 75%. The information about the relationships and the underlying evidence describing the interactions is a little more tentative, I think, for some interactions, particularly those based on high-throughput data analysis. There's a lot of noise in those data sets. But where you have manual curation and multiple papers citing the same interaction between two molecules, there's a high degree of confidence that that interaction will occur, I would say, in cells. Not all cells, because we haven't tested the interactions in all the different types of cells within the tissues and organs that we have in our body. Popular sources for curated network data are BioGRID, IntAct, and MINT. I would say these are the three main ones, though there are others. These resources differ slightly in their content. There are a lot of issues around curation.
They also differ in the scope or coverage of the different species and model organisms that people have studied and that the interactions have been derived from. But I would say IntAct, shown in the next slide, is probably one of the better resources. So just as an example, in the screenshot we're searching for p53, and the results come back in this kind of table format, typically with molecule A being the bait protein and molecule B being the interactor. It's good to see that they're using identifiers to describe these molecules, so that you can link them back to a protein database. They also have information about how the interaction was detected, the experimental approach, and then a variety of other annotations specific to that interaction. IntAct has curated some of these interactions itself. Some of the other interaction information is coming from UniProt and MINT, which demonstrates that IntAct is aggregating interaction data from other resources as well. So it's a rather useful resource. You can of course download the data from these different databases and start using tools to visualize the networks. Now, before I introduce you to some of the different approaches for data analysis, I want to take a moment to introduce you to some of the tools for visualization. Cytoscape, which you heard about yesterday, is by far the most popular tool, and I would say it has the most support from the community in terms of apps, publications, and user guides on how to use it. It's very valuable. There are other tools like NAViGaTOR, a powerful graphing application for 2D and 3D visualization of biological networks. It has a rather rich suite of markup tools, so you can annotate the nodes and edges within the network. It's fast, and I would say it's a bit more scalable than Cytoscape, so it can support very large networks.
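The table of interactions described above (molecule A, molecule B, detection method, and so on) is easy to turn into a network once it has been downloaded as a tab-delimited file. The Python sketch below assumes a simplified, hypothetical four-column layout with made-up identifiers; real exports such as PSI-MITAB carry more columns and their own identifier conventions, so the column names here would need to be adapted.

```python
import csv
import io
from collections import defaultdict

# Hypothetical, simplified tab-delimited export (placeholder identifiers).
SAMPLE = (
    "id_a\tid_b\tdetection_method\tpublication\n"
    "PROT_A\tPROT_B\ttwo hybrid\tPUB_1\n"
    "PROT_A\tPROT_C\tcoimmunoprecipitation\tPUB_2\n"
    "PROT_B\tPROT_A\tpull down\tPUB_3\n"
)


def load_network(handle):
    """Read interaction rows into an undirected network plus per-pair evidence."""
    network = defaultdict(set)     # node -> set of interaction partners
    evidence = defaultdict(list)   # unordered pair -> list of detection methods
    for row in csv.DictReader(handle, delimiter="\t"):
        a, b = row["id_a"], row["id_b"]
        network[a].add(b)
        network[b].add(a)
        evidence[frozenset((a, b))].append(row["detection_method"])
    return network, evidence


network, evidence = load_network(io.StringIO(SAMPLE))

# Pairs reported by more than one experiment are the ones to trust most.
well_supported = [sorted(pair) for pair, methods in evidence.items()
                  if len(methods) > 1]
```

Keeping the evidence per pair, rather than just the edge, makes it possible to filter on how many independent reports support an interaction before the network goes into a visualization tool.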
Although I would say that very large networks are probably uninformative to you. They look like fur balls or a ball of string, and you'll never really understand what's going on in there, which is why you need to use tools to filter that larger network down to a smaller subnetwork from which you can then generate your hypothesis. There's another tool called Osprey, a tool for visualization and manipulation of complex interaction networks. It has a data-rich interaction view, and you can color code nodes according to functional annotations and experimental data. There are other tools out there; as I said, there are probably easily a dozen widely used tools for exploring networks in biology. You'll hear a little bit more about GeneMANIA, I believe this afternoon, from Quaid, so that's another good resource for visualizing networks. Now, in order to use some of these tools you need to be able to move data from a database to the tool, and so there are a variety of different data exchange languages. Let's call them languages to make this simple; essentially all they are is a data file. It can be in a simple tab-delimited format or it could be an XML document, but don't worry too much about the technology that generates the data file or how the contents are represented in the file. Let's just understand that it's an easy way to exchange information between a database and one of these tools. Some of the tools do it automatically, so you never see that information, but on the off chance that you go to one of these resources, or to Pathway Commons, and you're using a tool like Cytoscape, you can download the data yourself. I'm using Reactome as the example: you can download molecular interaction data using the PSI-MITAB format. PSICQUIC is the effort to standardize access to molecular interaction data within these databases, and PSI-MITAB is a tab-delimited format for data exchange, so you just download this file
and you can upload it into Cytoscape. Systems Biology Markup Language, SBML: this is more for people who are interested in building systems biology models of pathways and networks. SBGN, the Systems Biology Graphical Notation: this is a standard representation of the graph itself, so every node has a particular shape, and the edge, or the relationship between two nodes, has a particular style as well, and that information is retained in the notation. And then there's BioPAX, which I'm not actually sure what it stands for. I've only ever known it as BioPAX; it probably means biological pathways and something. It's a standard language that is there to enable integration and exchange of biological pathway data. It sounds like these languages all do the same thing, and yes, they do, but they do it in slightly different ways and they're compatible with different tools. So in the next slide we're taking as an example a pathway in Reactome called the amyloid pathway, which is involved in neurodegenerative disease. You can download the SBML file for this amyloid pathway and upload it directly into a tool called CellDesigner, which is a graph modeling tool, to start building your biological model. You can upload the BioPAX file into Cytoscape, or you can connect directly from Cytoscape to Reactome and do that automatically. And finally, with the SBGN file, you can upload that information into a tool called VANTED and start creating your own graphical representation of the pathways. So for much of the time when you're doing this kind of analysis, you're creating a network using one of these tools and then overlaying attribute data, that's your experimental data, onto particular nodes or edges, depending on the type of information that you have. But another approach I want to start talking about now is where you take your data, which could be just a simple gene list, it could be a list of genes
that you know have somatic mutations, or it could be gene expression data, and you project that gene list onto a much larger pre-constructed network. In the previous slide we had relatively small networks, and when you overlay your data onto such a network, it's one single view. Now we're moving into an area where there's a large network and you want to filter it down to create something more useful. The idea is to identify what they call topologically unlikely configurations, that is, a subset of genes that interact with one another in the network more closely than you'd expect by chance alone. Then you can extract these clusters using different algorithms, and, based on the assumption that genes within a cluster are involved in similar biological processes, you can use the enrichment tools that you learned about yesterday to annotate those modules. In fact, the Reactome Functional Interaction Network app that I'm going to show you later can do a lot of this work for you. So let's take a moment to talk about network clustering. Clustering is defined as the process of grouping objects into a set, and that set is a cluster. I sometimes also call them modules, so I sometimes refer to a network cluster as a network module, and you could also think of it as a community, so network communities as well. The thing is that the nodes within a cluster have something in common. There are a variety of network clustering algorithms that perform this task. I'm going to try my best to explain some of these algorithms without using equations, because they are scary equations to look at, but basically you're looking for sets of nodes, or proteins, that are joined together in tightly knit groups. Cluster detection in large networks is very useful for identifying highly connected proteins that share similar functionality, and
So I'll try and do my best to explain some of these clustering algorithms. Girvan-Newman is the first one. So basically, you start off with your network and you start chiseling away at the edges with the highest betweenness, and then you continue to chisel away at that network and drill it down until you almost break the network up into individual nodes. You kind of get to a point where you stop: tightly knit communities of proteins, and between those tight-knit communities a sparse number of connections. That's ideally what the Girvan-Newman algorithm is trying to do. And in the Reactome Functional Interaction Network app, we'll show you how to do that, and I have a slide coming up which shows you the result of that clustering. The Markov clustering algorithm, now, is a little bit more difficult to explain. My limited understanding is that it basically tries to simulate flow within a graph: it promotes flow within highly connected regions in the network and it demotes the other, sparser connections. So if you can imagine, for example, you're in a network and you're taking a random walk, and you come along and you're visiting a dense cluster, an area with a lot of connections, you're more likely to keep walking around those connections than you are to go off into another area where there are sparser connections. And so that's essentially what the Markov clustering algorithm is trying to do. It works very well with gene expression data, so when you want to weight a network based on gene expression data, the Markov clustering algorithm actually works really well. HotNet. Again, we're getting into another area of... sorry, yes? So they were asking whether a hidden Markov clustering is the same type of approach, and it is, it's just the same name. So, HotNet. This is a kind of crazy way of looking at it. Imagine you convert your gene network into a metal lattice, you
know, like a grill, where you've got these weaves. I was going to say draw it like a rectangle, but it's not straightforward to think of it as a rectangle; maybe that's a way to think about it. Basically, each of the connection points in your lattice is a node and the lines are the relationships. You just lay it out, and then you use the physics of heat diffusion to model the effects of these gene alterations. The thing to point out with HotNet is that it works really well with somatic mutation data, and it takes into consideration the frequency of mutations and the interactions between the different nodes within the network. Anyway, back to this wire mesh. As you heat up a piece of that metal lattice, like you're heating up a gene, it's going to get hot. Now if you have one of those temperature gauges, you're going to see cold areas on that metal lattice and you're going to see a really hot spot where you're heating it up. Imagine you're heating it up with a Bunsen burner or something like that. Now if you start heating up more of those different genes in that lattice and you look at your heat gun, you can see that certain elements of that lattice are going to heat up more than others. That's essentially what you're trying to do: you're going to get these local hot networks, you know, or hot locations. So that's basically a similar kind of approach to Girvan-Newman, but it's just a different algorithm, really. And then HyperModules. It's very applicable to cancer because it tries to identify subnetworks within those cancer mutations, and it uses clinical characteristics to correlate that information, so things like patient survival or tumor sample... excuse me, sorry. The tool is actually very useful in trying to identify tumor subtypes by extracting networks where the mutations are significantly enriched in a particular subtype. It's not a tool that I frequently use,
but I've looked at some papers and it's actually remarkably well published. So that's all I can say about HyperModules. And then the Reactome Functional Interaction Cytoscape application basically tries to use some of these different algorithms, and in the lab we'll actually do this. It's very simple: you'll see how the tool helps you to create these clusters and then to annotate those clusters. So typically when you actually... yes? Yes. So the question is how specific these algorithms are to biological use cases. The answer is they're very widely used. I would say that the algorithms for studying networks were developed because of sociology experiments, people trying to understand the connections between, say, you and this audience. Facebook uses the same algorithms to understand the relationships between people in the Facebook community, and wherever networks are used, whether it's the telecommunications industry, you name it, they're using similar types of approaches to understand the information that they're generating. So I would say that really the three at the top here are probably the most applicable ones; that's why I'm listing them. There are other things, even things like studying the bees within a hive, little bees flying around: you can actually understand the sociology and the interactions between different bees within a community using algorithms, and in fact some of those same algorithms are used in biology, although I wouldn't necessarily use them. And the second question: I tried my best to explain them to you today. If I were to show you the equations, I honestly believe I would lose all of you. If you want to get into them, you can look at the papers to understand how the network algorithms have been deployed. But truly, I think you can, in a sense, accept them for what they are: they've been well tested and well published. If there are any caveats to a particular algorithm with a particular data type, you know, it's worthy to
look at the publications or the user documentation for these different approaches to rule some of those out. Now, okay. So the typical output of this clustering is shown here. Basically we've got a hypothetical subnetwork which is decomposed into six clusters. You can see that most of the clusters have 10 or more genes, and you can see that there is a higher degree of connectivity among the genes within a cluster and sparser connections between clusters. One cluster just to point out here: cluster 6 only has two interactions, so I might ignore that in my further investigations. The reason for that is, when you start annotating these modules, two genes could well have many, many biological functions, and the question is which biological function you are looking for. When you have larger groups of genes, you could still potentially get a lot of biological processes, but the chances are you can narrow that list down to a handful of pathways. The other thing to point out is that the clusters are mutually exclusive, meaning that nodes are not shared between the different clusters, so a particular gene or protein will exist in only one cluster. Okay, so now we're going to talk a little bit about the Reactome Functional Interaction network and the visualization app that's available in Cytoscape. In the lab afterwards, we're going to learn how to use this tool to upload a gene list and somatic mutation data to create a network and create these wonderful little, well, not quite at this publication level at the end, but you will get something where you can organize the network into discrete clusters and find out what the potential biological roles of these genes and these clusters are. So, yes. Again, this is just reiterating some of the points made earlier, and in particular with cancer, no single
mutation is necessarily sufficient to cause cancer. Typically what you have is a handful of common mutations, like TP53, PTEN, EGFR, and then you have this long tail of hundreds, maybe even thousands, of other mutations, which could be putative drivers within that list, but they're in very rare subtypes, or they're just passenger mutations. And the question really is, you know, what is the role of these genes in this disease, in this phenotype? By analyzing these mutated genes in a network context, you reveal the relationships between these genes, and you potentially elucidate the mechanism of action of the drivers and possibly the passenger mutations. It very much facilitates hypothesis generation about the roles of these genes in a particular phenotype. And as I said at the start, it dramatically reduces that gene list from hundreds or thousands of genes down to a handful of mutated pathways, essentially a dozen or so pathways. You can generate your hypothesis and then take that back into the laboratory to verify some of those results. This is discovery; this tool is a discovery tool. You can also use the tool to support some of the hypotheses that you've generated in the lab: you've worked that way, you've done all the wet lab work, and you want to take a bioinformatics approach to try and validate whether this is true, so that when you apply a different scenario, you can then generate a new hypothesis that you can potentially validate in the lab. Question? So the question is, does the tool automatically define the number of clusters, and the answer is yes, most of the algorithms we use will do that. That's the best approach. Sorry, do I have your permission to continue up to 10:30, or we can break and then... is that okay?
I don't know how many slides... I'm getting pretty much towards the end of my talk, and if we finish early there's time for questions, and if not we can break early. So what I want to start off with is explaining what a functional interaction is. It's a little different from a traditional protein-protein interaction network. A functional interaction network is a reliable biological network based on manually curated pathways, which are then extended with verified interactions from other data sources. The first step in creating this functional interaction network is to reduce the complexity of the reactions that you see in pathways down to a series of binary interactions. So you can conceptually believe that input one and input two interact with one another; input one interacts with the catalyst; the inhibitor might interact with the catalyst as well; and the two inputs, in the formation of this complex, could well interact with other members of the reaction. So the output is a series of binary interactions. In order to do this for the functional interaction network, we've taken a whole series of pathway databases, not just Reactome but also PANTHER, KEGG, NCI-Nature, which represents the NCI PID database, and NCI BioCarta. BioCarta is just another pathway database; they're less annotation-rich in terms of gene and protein information, and they are better known for their nice pathway diagrams, but you can still get that gene and protein information. And then TRED is a transcription factor regulation database; I can't remember what the E stands for. But essentially you build this big long list of binary interactions, and this becomes what we call the annotated functional interactions. And then in the second part of the construction of the network, you use a simple machine learning technique to score the protein interactions from all these different pairwise databases.
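As a rough illustration of the first step described above, reducing a reaction to a series of binary interactions, here is a minimal Python sketch. It assumes the simplest possible reading, that every pair of participants in a reaction becomes one interaction; the actual extraction rules used to build the Reactome functional interaction network may be more selective. The dictionary layout and the name `reaction_to_pairs` are hypothetical.

```python
from itertools import combinations

def reaction_to_pairs(reaction):
    """Turn one reaction into a set of unordered participant pairs.

    Assumption for illustration: inputs, catalysts and inhibitors are all
    treated alike, and every pair of them counts as one binary interaction.
    """
    participants = set()
    for role in ("inputs", "catalysts", "inhibitors"):
        participants.update(reaction.get(role, []))
    # Sort each pair so that (A, B) and (B, A) are stored only once.
    return {tuple(sorted(pair)) for pair in combinations(participants, 2)}

# Hypothetical reaction mirroring the example on the slide.
toy_reaction = {
    "inputs": ["Input1", "Input2"],
    "catalysts": ["Catalyst"],
    "inhibitors": ["Inhibitor"],
}

pairs = reaction_to_pairs(toy_reaction)
for a, b in sorted(pairs):
    print(a, "-", b)
```

Running this over every reaction in a set of pathway databases and taking the union of the resulting pairs would give one long list of binary interactions, which is the spirit of the annotated functional interactions described in the talk.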
We're converting some of the fly and yeast protein-protein interactions into human ortholog interactions. There's co-expression data, and there's GO information, which means that proteins that share similar biological process annotations could well interact with one another in some way or other. That interaction could be indirect, that is to say there could be other partners in between, but that's still a way of building up a network. And so after we've scored these interactions, which we call predicted functional interactions, we combine these two data sets and we create this large functional interaction network. Now, when we first did this four versions ago, the network had 270,000 interactions and just over 9,000 proteins. Now the size of the network is just over 36... sorry, 336,000 interactions, and almost 12,000 Swiss-Prot proteins are in the network, so that's coverage of almost 58%. As I said to you before, pathway databases have coverage of about 30% of the known proteins; now we're at almost 60% here. Now, to try and visualize this network would be crazy, because it literally would be a furball, but it's a source into which you can project your data sets. And so just to show you how that works, this is a nice little slide from arena at the back there. Imagine this is the Reactome Functional Interaction network here, and you start projecting your genes into that network. I'll just see how it looks to you, because it looks better for me. There are red circles; these could be genes that are up-regulated, or that have mutations or not, it could be any kind of data, but essentially these are genes or proteins and you're projecting them into the network. Based on that projection, you know that some of these proteins interact with one another, so these are the yellow lines, but there are still some sparse connections within the network. So what we can do is add these things called linker genes, or linker proteins, and these are
basically there to provide a level of connectivity between the elements within the network. And now what you've done is create a subnetwork, and then you basically take away the rest of the network because you're not interested in that. You can still interact with the rest of the network if you choose, but basically now we've got a small subnetwork, and it actually looks a little bit better on the screen here. This is the subnetwork which is hopefully going to help you understand what's going on in your gene list. Yes? Now, just because you're seeing lots of connections here and here, that's just by chance; you're not exploring that in any computational way, there's no algorithm yet. This is just a visual; this is not necessarily a fact. If I tried to show you the whole network, it would just be a furball like that, which you could not understand. I think here, if this is a protocol, and I haven't even thought about any of the basic network characteristics, plus how to identify a region of focus, so we have quite a lot of data that we're moving in that direction. Yes. So basically, how do we decide what these linkers are and how to insert them? They're basically the minimal number of proteins you need to add to this network to provide that connectivity between the different unconnected regions. Yes? So the question is with projecting... what's the question? I don't have a simple answer to that cutoff; I don't remember how it was selected. But obviously there needs to be... if you imagine A in front of you, if A is conserved and B is conserved, there's a very good likelihood that, if that biological process is conserved, then we'll see what happens. That's an assumption that I'm making there; that's how much it's required. Yeah, that's what I thought. If you are at least following both between species, and trying to understand the species there, it may have to think about some process that
can take itself and resolve the rest of the biological process, like metabolism, the same genes as in children, whether it's in these. So I can build this interaction of high-circuit. I'm telling you to think about the biological process, whether it's conservation between organisms, like, it's somewhat similar, but clearly if there's interaction with a protein it exists to be required; that's not going to be a problem. So you think that you've been trained on some of the annotations? Yes. So my question is, does it require... sorry, I'm just... animations. All right. So there are many efforts underway globally to sequence the genomes from patients with many different types of disease. In this example I'm talking about mutations derived from pancreatic cancer, and we can use network analysis here to help interpret these data sets. For example, you get this recurrent mutation in KRAS in pancreatic cancer, but you get these lower frequency mutations of other genes, and the question is how a gene down here in this long tail is contributing to this disease phenotype. These genes themselves could represent rare driver mutations as well, or they could be other somatic variants which have to interact with other genes in this data set to actually cause a phenotype. And so, in order to gain a deeper understanding of gene lists generated from pancreatic cancer data sets, we can take that set of mutated genes and project it into the Reactome Functional Interaction network. And these are just the results here. This is actually a clustered network, so you can see groups of highly connected proteins, and once you've identified these clusters of genes, you can annotate them with the enrichment analysis tools to identify putative biological roles for the genes in these clusters. So you can see TP53 is a highly mutated gene in pancreatic cancer, so you can expect to see TP53 signaling. But here, as much as you might see
things in signaling as well, like calcium signaling, that are known to have roles in pancreatic cancer, you can identify new biological processes that you might not necessarily be thinking about, because these genes are tightly connected and they could be associated with this biological role. In fact, one of them was axon guidance. Axon guidance actually appears twice, because there are slightly different pathways with the axon guidance annotation and they cluster distinctly into two groups. But axon guidance is a process by which neurons grow towards target cells, so how is that related to cancer? As it turned out, it was something that was kind of hypothesized, like, what's going on here? And then they actually started to do some experiments, and they discovered that one of the sub-pathways within axon guidance, Slit-Robo signaling, may in fact enhance metastasis in pancreatic cancer and predispose pancreatic cancer cells to metastasize into neural tissue. So here's where we've discovered something in a network, it's been taken back into the laboratory, and it's actually been shown to have some consequence. One thing to point out is that the node size corresponds to the number of times that gene is mutated. So KRAS is the most frequent mutation, so it's a big node here; TP53 is another highly mutated gene; and the other nodes are smaller because they have a much lower rate of mutation in the patient samples. So we've laid that pretty good back here. Yes, exactly: we're looking at a much smaller number of interactions than in the rest of the functional interaction network. That's all the genes, that's the majority of genes. There are possibly genes in the gene list which do not appear in the interaction network, therefore there's no interaction for that gene. There could be other mutations, there could be other genes that are brought in, well, this doesn't have an interest, so
there's no additional... there could possibly be a way to add more interactions. So there's another question; could you give me a... yes. So that's a very good point. For this we've used the Girvan-Newman algorithm, so it doesn't take into consideration the frequency of the mutation, but if you were to use HotNet as the algorithm, you would actually take that into account. That's because the algorithm is trying to look for a different kind... so in this case, that's a big point and we'll be clear about that. So it's kind of like, let's say you run a pathogenicity prediction tool that can assign a score, so patients in a symmetry better look back at the building. There's a grant that we submitted to try and actually build this, I hope we can, but the idea is to take into consideration the functional impact of the mutation, and it's more than a biology proposal. Absolutely you can use... we're going to talk a little bit about something in a moment which kind of brings it in. I have to say I'm going to lose all of you in the audience when I start talking about this, because it is really complicated to explain, and I've tried to make it as simple as possible. But yeah, I'm okay on time? All right, I've got about 10 minutes, so I hope I don't run over. So one other approach: I was talking to you earlier about the MCL algorithm, I just forgot what... Markov clustering. You can combine gene expression data with the Reactome Functional Interaction network, and based on available clinical information, you could potentially try to identify network modules that could be related to patient outcome or patient survival or some other prognostic signature. So basically the starting point here is gene expression data: you have a whole list of genes, and the columns are essentially all the samples that you have within the data set, so you have expression values for each of the different genes in all the different samples, and it's the same samples. This
is a very applicable approach for samples from cancer patients, and this approach has also been demonstrated with cardiovascular disease and type 2 diabetes; you can use it whenever you have that expression data available. So what you do is you create the network based on the gene list within the expression data, and then once you've identified those clusters of tightly connected genes, you perform something called Cox proportional hazards analysis. The idea there is to screen the individual modules to identify a potentially clinically significant module, and that's typically based upon clinical data. That clinical data could be whether a patient is alive or dead; it's a very straightforward question you're asking, but that can fit into this kind of survival analysis approach. Once you've identified a module that's potentially clinically significant, you can then run Kaplan-Meier survival analysis, and you get this plot at the end, where you're plotting survival probability versus the elapsed time for the different groups of samples. The Kaplan-Meier analysis divides the samples into two groups: samples having low expression within this particular module of genes, and samples with high expression of the genes in this module. This particular module has been annotated with cell cycle and Aurora B signaling. So the idea here is that 31 genes are significantly related to this breast cancer data set, and the hypothesis is that patients with low expression of these module genes have a better outcome than patients whose genes are more highly expressed in this module. So the idea here is that a single network module, or more than one module, could be useful in defining a prognostic signature. We will try to demonstrate this in the lab if time permits, so you can understand this a little better. Yes? You do
the clustering: you put the genes into the network, you use the gene correlations to weight the network, and then you cluster it, then you annotate those modules with pathway annotations. And then the next step is to ask the question: within these modules, are there any genes that could well predict patient survival? And you basically use that analysis to break them into two groups; in this case it's high expression and low expression. It could be gene expression data, or whether a gene is mutated or not mutated. It's part of the survival analysis. Yes, it just creates a plot, and then you do the log-rank test between these two lines and you ask whether it's significant. So the question is how long does it take to run this. One of the times we did this, a couple of years ago, with that weight, we cracked... the problem, the bottleneck, was actually downloading the file, and if I recall it had about 300 samples or something like that, and all the genes that they had gene expression data for. It could take several hundred samples. One of the other things is that you need to be able to load the file, because it's designed to analyze that kind of data size, and a lot of the cancer genome network experiments are generating, like, thousands and thousands of data points like this one. I'm going to push forward here and just wrap up in the last few minutes. This is going to be a tough sell to you. It's a rather interesting area; it's an interesting approach to data analysis. I have to be honest, there's only a handful of tools out there that actually support this approach, and some of those tools may well have been developed some time ago. Some of those tools have been developed with simulated data, that is to say, not truly relevant biological data. I would view these tools as somewhat experimental in approach. And there's also the availability of the data sets: you may
need to have things like copy number variation and gene expression data. Whereas with what we've been discussing in the last two goals, the talk from Yuri yesterday and what I've just discussed, you're typically analyzing one data set at a time, here you're actually trying to integrate more than one data set at a time, maybe three. But the availability of three data sets that you could actually use for some of these tools depends on how those experiments have been performed, whether it's the same samples; it's very difficult to get those data sets. Again, cancer data sets are very applicable data sets and use cases for this, because it's one of the only diseases where people are generating copy number variation, sequence information, gene expression data, and protein state information as well. So this approach basically tries to infer how pathway states are disrupted in disease. You're going to use qualitative and quantitative measurements to infer the activities of various components within the network or the pathway and ask the question, what are the consequences of those actions? So it kind of alludes to what someone was asking about, the functional impact of the mutation. You can potentially look at this in this tool, although you do need to have that mutation information, and you may well need to have gene expression data as well to support that. Some of the tools that are already available include CellNetAnalyzer. It's a MATLAB tool; it's very useful in providing the algorithms and visualization tools for metabolic engineering, so this is more appropriate for looking at biochemical systems or metabolomics data. The next tool is NetPhorest, or NetworKIN; actually, NetPhorest has been superseded by NetworKIN. Basically the idea here is to try and understand the underlying intracellular signaling networks within large-scale phosphoproteomic data sets, so you can elucidate the phosphorylation events associated with a given phenotype or disease. There's
ARACNE, which is a novel algorithm which analyzes microarray data and is specifically designed to scale up to the complexity of regulatory networks in mammalian cells, and the idea there is to identify transcriptionally related network modules. And then there's PARADIGM, which I'm going to talk about now, which is actually the most common tool for doing pathway modeling within cancer data sets. Probabilistic graphical models, PGMs, are widely used techniques in machine learning and statistics for modeling complex dependencies amongst a variety of variables. Basically it's a way of studying a lot of different elements within your network, and it's recently been applied to understanding cancer networks. The goal here is to integrate different types of omics data into these models, so if you have copy number variation data, gene expression data, mutation, or even protein state information, you try to project all of that information into a network to identify significantly impacted pathways, and then to try and link the activities that you're seeing within particular pathways to particular patient phenotypes. Now, in order to do this, this is where it gets a little bit complicated. Traditionally, in a protein-protein interaction network, you think of one protein as one node, and that node interacts with another node. In order to integrate multiple data types simultaneously, you have to think of that one protein node as now becoming several nodes, because the protein is encoded by a gene, there's a transcript, there could be a protein state, and when you start thinking of other layers of information, potentially one node becomes many nodes. At the basis of this approach, you should think of four nodes: gene copy number, expression state, protein level, and protein activity. And this is demonstrated in the figure, in C here. So you have a simple pathway, MDM2 and TP53, as single nodes that are involved in regulating this apoptotic pathway, but
MDM2 is represented by a molecule as DNA, as RNA, as protein, and then there's an active protein, and this interacts within the TP53 network. So basically, rather than just having two elements, or maybe three if you think of apoptosis as an additional element, you now have several molecules, and then you project your data into this network. This is actually called a factor graph here. Once you've created your network, the next step is to infer pathway levels for each of those, sorry, I'll go back, those elements of this network. And this is where it gets a little bit complicated: there's a series of classifiers that you need to train to identify these inferred pathway levels, and the output essentially is the pathway activities for these given molecules. The best way to represent this is in a clustered heat map. So in this example they've analyzed glioblastoma multiforme data. You can see the significant pathway perturbations; these are the pathways here. You can see that they've broken down into four clusters. The fourth cluster is probably the most interesting here, because we see that there's distinct down-regulation of the HIF-1 pathway here, and then in the other three clusters you can see that there's very distinct EGFR signaling; you can actually see there's overexpression of EGFR and also E2F. So these pathways aren't just picked by choice; these are pathways that are known to have altered activities within glioblastoma. And so basically all of these, sorry, I should explain, all these clusters represent a different sample; each row is representing the different components of the pathway. It's a tricky approach to understand, and I do apologize if it's not clear to you. It's one sample from one patient? Oh, that's a good point. So each column corresponds to a single sample, right, and each row is an entity. And I'm actually just trying to think how many patients were in this
study, actually, just to give you that number. The only way that the patients, or the samples, would be in this experiment is if you have data for all four cases, so there's gene expression data and copy number, or at least copy number data. So yes. So for PARADIGM, the bad news, and this is where it gets kind of tricky, is that the source code to run this is very difficult to compile. Creating these factor graphs is actually a very time-consuming effort, and so there are not a lot of pathway models available. It's not a very well documented approach. It takes a long time to create these factor graphs, and then to project all of that data into those graphs takes a tremendous amount of time. The good news is we're trying to develop an application. It is in alpha testing; you can attempt to use it. There are a couple of data sets; they're not available for this workshop, but if you go to our user guide, the Reactome Functional Interaction User Guide, you will actually find this information. And we are trying to find a way to speed up the creation of these different pathway models and then to improve performance. And then, that's fine. So just in summary, I have a list of some of the different pathway databases and network databases we've talked about today. There's a summary of some of the different de novo network construction tools; you'll hear quite a bit about GeneMANIA shortly, and we'll be demonstrating the app in the lab. In terms of pathway modeling, here are some links to some of the resources. And apparently we're on a coffee break now, so I'll take any more questions if anybody has any; otherwise we'll break.