My name is Robin Hall. How far back do you want to go? Like, from the day I was born? I did actually. And then I delivered milk and cream, and then I went on to work at kennels. Then I got interested in cancer research. This was in the days before BRCA1, so I was actually in one of the groups trying to find BRCA1. Obviously, we didn't get the prize. But I really enjoyed molecular genetics at that point, so I went to do a PhD in genetics at Nottingham University. From there, I did a postdoc in Japan studying yeast and expression. I came to Canada and started working with a guy called Charlie Boone on synthetic lethality screens in yeast. I did some proteomics in Chad Greenblatt's lab, also at U of T. And then I moved over to do bioinformatics with a guy called Chris Hogue. That's where I really got into data curation and managing projects involving the curation of pathways for a database at Science Magazine, which is the Database of Cell Signaling. From there, I moved through other bioinformatics roles. I managed a LIMS at one point, setting it up for a cancer research study. And now I'm at the OICR, working with a guy called Lincoln Stein, and I do the outreach and management for Reactome. So I will talk a little bit about Reactome today, and we'll demonstrate it in the lab later on in the morning. But I want to follow along a little from Yuri's talk and some of the material that Veronica introduced you to in Cytoscape, and talk in a little more depth about pathway and network analysis. So your learning objectives, obviously, are to understand more about the approaches to network analysis. I do apologize if I'm repeating some of the things that Yuri said, but I think it helps reinforce some of the approaches that not only I, but also Yuri, have found useful. I want you to understand the different sources of pathway and network information. 
I think there are some important points to raise there. For all the analytical approaches that you apply, you need to understand not only your own data, but also where the data you're trying to integrate it with comes from. That can sometimes influence your results, so you have to be aware of it. I'll talk a little bit about some of the analytical approaches. I won't give you a ton of stats. I'll leave that to Quaid, who's going to be here a bit later, and I think Yuri might have introduced some of the enrichment approaches earlier. And then we'll do the Reactome FI network. We'll use that in the lab. Sorry? Yes, the Reactome FI, which is the Reactome Functional Interaction network. We will talk about that later. So to me, what is pathway or network analysis? Essentially, it's a computational technique that makes use of biological pathway or network information to gain insight into a biological system. It could be a normal cell. It could be a diseased state. It could be a phenotype, whatever you're studying. I say it's a rapidly evolving field, and it is, because there's something called next-generation sequencing, and that has generated huge, complex data sets. With that, there have to be tools that analyze that data, and I think pathway and network analysis are good approaches. There are many different approaches to talk about. I could spend hours talking to you about all the different ones, so I'm going to talk about the ones that are, hopefully, most relevant to the work that you're doing. One of the challenges when we're working with experiments that generate huge amounts of data is to extract meaningful information from it. You can have millions, if not billions, of data points, and you want to answer a fundamental biological question. So pathway analysis basically incorporates a lot of prior knowledge. 
And it helps you to analyze the large gene, protein, and metabolite data sets that you have and place them in a biological context. The method itself increases the statistical power by integrating multiple perturbations for testing across a high-dimensional space. Particularly in cancer research now, with next-generation sequencing, we're identifying a lot of somatic mutations. We're trying to figure out what the roles of the driver mutations are, but you also have this whole long tail of rarer mutations that you find in your data sets. The question is how we can use pathways and networks to help explain the relationships between those genes. So just as an example, the most cited reason for performing pathway analysis is to help analyze a gene list. Here's a gene list: 127 cancer driver genes. This was a data set derived from The Cancer Genome Atlas, in which the researchers classified genes as cancer drivers based on their mutation frequency. Now, when you look at that list of 127 genes, you'll probably identify one or two genes that you know, but you won't know many of them. The question is, what does that list actually represent? What are the mutations doing to cause cancer? What pathway databases allow us to do is map these genes onto biological pathways and understand their roles in the pathway. I'll show some of the analytical techniques that we can use, and I'll apply that same 127-gene list later. I apologize for this slide. I'm not sure if this is a formatting issue, but I wanted to take a moment to talk about what a pathway is and what a network is. For some of you that could be self-explanatory, but there are certain principles to be aware of when talking about this. So a biological pathway is a series of actions among molecules in the cell that leads to a certain product or a change in cellular state. 
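As an aside on the gene-list question raised above, here is a minimal sketch of one common way to ask whether a gene list overlaps a pathway more than chance would predict. It assumes a one-sided hypergeometric (over-representation) test, which the talk does not name; the function names, the toy gene identifiers, and the background size are all hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes without replacement from a
    background of `population` genes, `successes` of which are in the pathway."""
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def over_representation(gene_list, pathways, background_size):
    """Rank pathways by how surprising their overlap with gene_list is.

    Assumes every gene in gene_list and in each pathway belongs to the background.
    """
    genes = set(gene_list)
    results = []
    for name, members in pathways.items():
        member_set = set(members)
        overlap = genes & member_set
        p = hypergeom_tail(len(overlap), background_size, len(member_set), len(genes))
        results.append((name, sorted(overlap), p))
    return sorted(results, key=lambda row: row[2])


# Toy data: gene and pathway names are placeholders, not curated content.
toy_pathways = {
    "pathway_A": ["G1", "G2", "G3"],
    "pathway_B": ["G7", "G8", "G9"],
}
ranked = over_representation(["G1", "G2", "G3"], toy_pathways, background_size=20)
for name, overlap, p in ranked:
    print(name, overlap, p)
```

In practice the raw p-values would still need a correction for testing many pathways at once before being reported as q-values.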
And with that you have different categories of pathways. We still use these terms quite frequently, and some of the tools that I'm going to talk about will only work with certain types of pathways, so you can't apply the tools to every single data set. So there are metabolic pathways, which everybody's familiar with: glycolysis, one of the first pathways you learned at high school. Did everyone actually do biology here? You're quiet. But obviously the most important, I think, are signal transduction pathways, where we have a signal that moves from the exterior of the cell to a transcription factor within the nucleus. From there you can have gene expression changes as well, and that falls into gene regulatory pathways, where you have genes turning on and off. So those are the three categories, and the databases I'm going to talk about curate different levels of information. Now, as for networks: to be honest, pathways, as we draw them, tend to have a start point and an endpoint. On the diagram on the left (I'm using my finger, I should use a pointer) you see the EGFR signaling pathway. You can see there's a start point here and there's an endpoint down here. You don't see that same thing with the network. In fact, many pathways in the cell don't really have boundaries. We just artificially create those boundaries when we curate databases or when we put an illustration up there. Multiple pathways interact with one another, and ultimately they form a network. Researchers are able to learn a lot about human disease from studying a variety of biological pathways and networks: identifying what the genes are, the relationships between these genes and the other molecules involved in the pathway or network, and hopefully finding clues to what goes wrong in disease. 
It used to be that I would talk about a lot of different pathway databases, but the landscape of these data resources has changed, so I'm only going to focus on what I call reaction network databases. This is like Reactome. Who's heard of KEGG? Okay, KEGG. Okay, that's good. PANTHER? Okay. WikiPathways? Okay. So basically, they explicitly describe biological processes as a series of biochemical reactions. The flexibility of that data model allows you to describe almost any type of reaction in the cell, whether it's metabolic, signaling, or gene regulation. You have a series of inputs flowing into the reaction and a series of outputs, then those outputs become the inputs of subsequent reactions, and then you get a pathway. The inputs can be different types of molecules, and you can see in brackets that we're using reference databases to annotate these molecules. So you're explicitly describing those molecules within the database, and that allows you to connect your data to a particular molecule in that database. We also use a variety of Gene Ontology terms to describe, for example, the catalytic activity of an enzyme, or, if there's a regulatory process, we can describe it as a biological process. These reactions... Sorry, your question is about the overlap in terms of content? Good question. I do have to advise you that since I work for Reactome, I do have a slight bias towards Reactome. But I have looked at all the different pathway databases out there. For a time (and I'm looking at this from the human gene perspective) KEGG used to have the largest collection of human annotations linking human proteins or genes to pathways. I would say Reactome is taking the lead now. And Reactome is probably one of the few pathway databases, and I'll talk about this in a minute, that's still actively curating pathways. A lot of data resources have gone stagnant. 
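The reaction-based data model described above (inputs flow into a reaction, and its outputs become the inputs of later reactions) can be sketched in a few lines. This is only an illustration of the idea, not Reactome's actual schema: the `Reaction` class and `chain_reactions` function are hypothetical names, and the example uses the first two steps of glycolysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A reaction as a named step with a set of inputs and a set of outputs."""
    name: str
    inputs: frozenset
    outputs: frozenset


def chain_reactions(reactions):
    """Return (upstream, downstream) pairs where an output of one reaction
    is an input of another. The linked pairs are what turn a bag of
    reactions into a pathway."""
    links = []
    for up in reactions:
        for down in reactions:
            if up is not down and up.outputs & down.inputs:
                links.append((up.name, down.name))
    return links


# First two steps of glycolysis, written out by hand for illustration.
hexokinase = Reaction(
    name="hexokinase",
    inputs=frozenset({"glucose", "ATP"}),
    outputs=frozenset({"glucose-6-phosphate", "ADP"}),
)
isomerase = Reaction(
    name="phosphoglucose isomerase",
    inputs=frozenset({"glucose-6-phosphate"}),
    outputs=frozenset({"fructose-6-phosphate"}),
)

pathway_links = chain_reactions([hexokinase, isomerase])
print(pathway_links)
```

A real database would identify each molecule by a reference-database accession rather than a free-text name, which is what lets you map your own data onto it.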
So in terms of coverage, we cover about half the known human genome, and KEGG's coverage is less than that. And when you actually try to overlap the resources to identify the different pathways in common, there are not that many. I'll talk about that in a moment. But that's the thing with KEGG. Yeah. In fact, I'm moving on to the next slide, so that has really introduced KEGG now. It is a collection of biological information. It's not just about pathways; it's genes and genomes across multiple species. But I would say, as Francis was saying, it focused mostly on the supermodel organisms that you heard about. They have these reference pathways, which may not truly reflect the given pathway within a particular organism; it's just a reference. The pathway is curated from a variety of different data sources to give this reference pathway, and then, based on orthology information, they can project that pathway into many other species. Now Reactome does a similar thing, and I'll talk about that in a moment, but the point is that Reactome doesn't project to as many organisms. We curate human pathways. KEGG curates human material and anything and everything else. It is a useful resource. The only thing I would point out is that it's licensed now. To get into the real heart of the data, you have to purchase a license. You can still access data through their website for free, but really, bioinformaticians want access to the flat files and things like that. So this is just a KEGG diagram here. Green boxes represent proteins. White boxes represent genes. You have these encapsulated pathways, which are pathways within pathways. And then you have a variety of different lines connecting the entities. Yes, it is, but at a lower throughput. It's not that their curation capacity has been reduced. 
So they're still adding new pathways. But I would say the only resources that are actively adding new pathways are Reactome and WikiPathways. And just to point out that WikiPathways actually contains a subset of Reactome data as well. So Reactome: the difference from KEGG is that we're open source and open access. The website and all the data sets we generate are open source, open access. We focus on the curation of human pathways, encompassing all areas: metabolism, signaling, and other biological processes like the cell cycle and so forth. Every pathway is traceable back to the primary literature. You can make the same comment about WikiPathways and KEGG. And like all other databases, we extensively cross-reference to other data resources, so that if you identify a gene in a study and you see it in a pathway, but you don't know much more about it, you can click through to a whole host of NCBI- and EBI-based resources. And then we provide tools for data analysis and visualization. Not all pathway resources do that. Some just provide you with the data, and it's up to you to find the tools. We do a little bit of both. So we can have a situation like this, where this is the Reactome pathway browser. This is a pathway diagram here at the top. On the left here is the pathway hierarchy. You can interact with it and select different pathways. The additional numbers you're seeing here, in this list of significant pathways, are the result of a pathway enrichment analysis. Again, that's taking the 127 cancer genes that I talked about earlier, plugging them in, and overlaying that data on top of the Reactome pathway diagrams. So it's very easy to do. We'll talk a little bit more about file formatting in the lab, but simply it could just be a list of genes. It could be a gene expression data set. 
It could be a list of proteins or even metabolites. So rather than thinking along the lines of gene enrichment analysis, you could do a chemical enrichment analysis. If you had a list of small molecules, you could do that with Reactome. In fact, if you had a list of mixed molecules, you could do that type of analysis as well. The one thing I want to point out about the different data resources is the slight difference in our approach. Sorry, question, yes? So you have... yeah. Yes, you could. So you have a handful of genes, and the components are part of a signaling pathway, but you realize that they're the downstream component of the pathway, and something like EGFR, FGFR, and some of the other receptor tyrosine kinase signaling pathways are going to contain that same subset of genes. Yes, that's always going to be the challenge. We'll learn about this a bit in the lab later. You do have to do a little bit of exploratory analysis on that list to see what those hits really are. And what we do with the tools is that if you can incorporate pathways from different data resources and you're seeing the same kind of hits, it's pushing you a little bit towards this particular subset of pathways. This is what the gene list reflects, and you can tease out the significant pathways from the results. It is feasible. So this was just to show a difference between Reactome and KEGG. At the top here we're showing a reaction from apoptosis, and this is the corresponding KEGG diagram here, and the arrow is there to highlight that you'll see a difference. KEGG has annotated this reaction like this: caspase 8 and 10. So basically caspase 8 and 10 activate BID, but they don't actually provide any mechanism. 
You can see there is obviously a relationship between caspase 10 and BID, but caspase 8 just sits there, and it's not clear what the relationship is between 8 and 10. With Reactome we demonstrate that mechanistically: active caspase 8 is here, and because it's actually a complex, we're explaining that this here is realistically a complex. So the point is that we have different approaches to visualizing data and annotating information, and that can sometimes make it difficult to compare one database with another. But if you break it down simply to the elements, the genes and proteins, it's much simpler to just look at the overall coverage. Say, for example, the KEGG apoptosis pathway has 10 genes in it and the Reactome pathway has the same 10 genes. We just might organize the information slightly differently. A good resource for getting access to not just pathway but also network information is Pathway Commons. All these different pathway databases are out there, and we're all curating, and the interaction databases are all curating different data sets, so a resource like Pathway Commons tries to aggregate all that data together and provide you with tools to visualize and analyze your data. It's a key resource for getting access to a lot of pathway data sets in a unified format. So that's one thing you need to think about when you're doing analysis: where is my source of data coming from? You know where your experimental data is coming from, but the pathway and network information, where is it coming from, and how am I going to integrate the two data sets, my data with what's already known? This is what Pathway Commons tries to do. It gives you that data in a particular format that you can use with a tool like Cytoscape. 
Okay, and I'll talk a bit more about those formats in a minute. So here's a moment just to explain what an interaction network is. Essentially it's a collection of nodes (sometimes people use the term vertices; I prefer nodes) and edges. The edges are the lines that connect two nodes in this example here. There's a whole other series of terms that we use: directed and undirected. Basically that tells you whether the edge has direction. With this arrow, this node is having some influence on this one, so there's some directionality to it. If it was just a straight line, you wouldn't know whether this node is regulating that node; you'd just know that there's some relationship between those two nodes. Nodes can represent many different types of molecules, any type of object. By far the most common are protein-protein interactions or gene-gene interactions, but in the next slide I'll show you some other examples. The edges themselves can be any type of relationship: whether they're physical interactions, whether the nodes participate in the same complex, whether they're regulators, or whether they're part of the same reaction. And in fact, in the next example I can show where things can kind of twist: things that you would expect to be nodes can actually be represented as edges. The experimental methods that we use to look at protein-protein interactions could be things like yeast two-hybrid experiments, mass spec, or individual experiments where people are doing biochemical IPs and purifications. In the good old days you basically had a bucket full of cells, and you'd pass them through tons of affinity columns and hope to pull out some protein, then you'd run it out on a gel, cut the band out, and try to do some mass spec to identify the protein, and maybe that was your complex. Now with high-throughput techniques you can do a lot of things a lot faster. 
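The node and edge vocabulary above, including the difference between directed and undirected edges, can be made concrete with a small adjacency-set sketch. The class name, method names, and the node labels A, B, and C are hypothetical; in practice a graph library or Cytoscape would do this for you.

```python
class InteractionNetwork:
    """Minimal network: each node maps to the set of nodes it points at."""

    def __init__(self):
        self.adjacency = {}

    def add_edge(self, source, target, directed=False):
        """Add an edge. A directed edge records source -> target only;
        an undirected edge records the relationship in both directions."""
        self.adjacency.setdefault(source, set()).add(target)
        self.adjacency.setdefault(target, set())
        if not directed:
            self.adjacency[target].add(source)

    def neighbors(self, node):
        """Nodes reachable from `node` in one step, in sorted order."""
        return sorted(self.adjacency.get(node, set()))


net = InteractionNetwork()
net.add_edge("A", "B", directed=True)  # A acts on B (an arrow)
net.add_edge("B", "C")                 # B and C are related (a plain line)

print(net.neighbors("A"))
print(net.neighbors("B"))
print(net.neighbors("C"))
```

Because the A to B edge is directed, B does not list A as a neighbor, whereas the undirected B and C edge shows up from both sides.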
Now, in terms of identifying gene-gene interactions, it's a little bit more subjective. In yeast you could do synthetic... excuse me, has anybody heard of synthetic lethality? So basically the idea is that in yeast you can knock out one gene and it kills the cell, so that's an essential gene. But you can knock gene A out in a cell and it has no effect, and you can knock another gene, gene B, out and that hasn't got an effect either, yet by knocking both A and B out you kill the cell, and that's called synthetic lethality. So you can make an assumption. You know that there's a relationship between gene A and gene B, and there are probably other processes or physical interactions going on between them that you don't know about, but you can at least make the assertion that gene A and gene B are somehow connected, because knocking them both out kills the cell. So these are the different types of interaction networks. It's an older slide, but it shows them very clearly. I call them the supermodel organisms: essentially Drosophila, yeast, C. elegans, Arabidopsis. These were where people were identifying a variety of different protein-protein interaction networks in the early days, and then once people had cloned all the genes in humans and started putting them into two-hybrid vectors, they could do mass screenings of protein-protein interactions. So there are all these different types of networks. You've got transcriptional regulatory networks, where there are two different types of nodes: the diamonds represent the transcription factors, and their targets are the circles. 
You've got virus-host networks, where you've got nodes representing human host cell proteins and the other nodes are viruses. In metabolic networks, you have enzymes as the nodes, and the lines themselves are the metabolites or the products of those enzymes. And then there are disease networks, where the diseases themselves are in fact the nodes, and the lines are the genes, the mutations, that connect those different diseases. But by far the most common are the protein-protein interaction networks, which we're going to use. Now, how do we get interaction data? Again, there's a variety of different interaction databases out there. I used to work for one called BIND, which is the Biomolecular Interaction Network Database, and actually Francis and I used to work there. At one point it was probably the biggest interaction database out there. Now I would say resources like IntAct, MINT, and BioGRID are probably the other useful resources for getting interaction data. Forgive me, I missed a slide. Sorry about that. So basically these interaction databases can be built in two ways. Because you're dealing with larger quantities of data derived from high-throughput experiments, you can, shall we say, automatically curate that information into a database, but there is also low-throughput manual curation by individual people reading papers. I would say there's more extensive coverage of the biological system, so you're going to find better coverage of the genes or proteins within the cell than you will find in a pathway database, although I would say that some of the relationships and the underlying evidence are a little bit more tentative. With a pathway database there's a lot more detail about the individual reactions. 
So you can understand the mechanistic relationships between the molecules. In the network database there's less information: you know there's a relationship between two nodes, but you don't know as much about it, I think. And then, as I said, the popular resources here are BioGRID and IntAct, and there's MINT as well. They vary based on the types of interactions that they curate, whether they're physical or genetic, and the species they're covering. These are the popular ones for curated human data. So in this slide I was just showing you IntAct with everybody's, well, favorite gene. There was a time when you would search for p53 and get 10 papers in the world; that was it. My wife actually did research on another kind of famous gene called PTEN, and when she started her PhD there were three papers on PTEN, two of which were actually cloning the gene. Nowadays you search PubMed and you can find, in this case, thousands of binary interactions. That's because TP53 has been cloned in many, many different organisms, and also TP53 is not just one gene, it's several genes now, and then there are also other genes with TP53 in the name that are related interactors. So you're going to find a whole host of interactions. Typically what you've got is this kind of table of data, and you're going to see this binary interaction here, and you're going to get supporting data, and then you're going to see cross-references to other interaction databases. You can click on these records and get more information about the interaction and the source of that interaction, and you can potentially visualize that information, although I'm not showing it in this slide. 
So now you know where data comes from, the question is: how do I integrate that into my workflow? There's a variety of different open data exchange formats. These are basically files that are made available so that you can simply download a lot of pathway and network information, in batch or for individual pathways or networks, and then feed that data into one of your tools like Cytoscape or one of your other pipelines. There are four types I'm going to talk a bit more about. First, SBGN: this is the Systems Biology Graphical Notation. Basically it's a standardized format for representing biological processes, pathways, and interactions. The idea with this file format is that you will see a particular pathway diagram, with nodes organized in particular locations, and when you download that file and then upload it into one of the editors or your analysis tool, you retain that same layout. That sounds kind of like, well, wouldn't you expect that? But with a lot of the other files, because they contain so much information, you don't necessarily keep the coordinate information. So SBGN is very good if you want to use the diagram and edit it yourself or add more information. But with things like BioPAX, SBML, and even PSICQUIC, a lot of the time when you upload that data into a network viewer you're going to lose the pathway layout that you see in the Reactome database. BioPAX is a standard language that aims to exchange and integrate data from not just pathway databases but also interaction databases. Pathway Commons uses the BioPAX format, so you can upload that data directly into Cytoscape, which I believe you learned about yesterday. That's a good way of starting to bring data into Cytoscape, so you can build your network and then do the data analysis from there. So, in a sense, what I'm describing to 
you here is one approach to doing network analysis, or pathway analysis, which is to create the network yourself, because you then have the choice of deciding what information you want to include in the network. The other example is PSICQUIC. This is mostly for molecular interaction databases. It's kind of like a tab-delimited file, which again allows you to upload data into Cytoscape if you want. And then SBML: if you're looking at computer models of different pathways and networks, you might use this particular format, just because you can include kinetic information. Certain databases like Reactome don't necessarily curate the kinetic data, but there are other resources like BioModels which have the relevant reaction kinetics that you might want to include in your analysis. So there are different formats. I'm telling you these are the most common ones to use, and they're probably the most relevant, because there's a series of visualization tools that accept these data sets. You were introduced to Cytoscape yesterday, and you're going to use that later today. There are other tools. There's NAViGaTOR, which is quite a powerful visualization tool for two- and three-dimensional visualizations of biological networks. There's VisANT, which works really well with metabolic networks. For SBML you could use something called CellDesigner. The point I'm making here is that a lot of these are open access and open source. Some network tools that people were using in the early days are licensed, so you have to buy them, and that matters if you want to develop a tool for them, because some of these have plugins that allow you to expand the complement of analysis tools or integrate additional data. Again, Cytoscape is a very good example where a lot of that tool development is open, but 
something like CellDesigner has a license which restricts you from integrating certain things into that tool, so all you can really use it for is data visualization and a little bit of analysis according to their framework. With something like Cytoscape, if you're not happy with a particular tool, you can develop it yourself and contribute it back to the analysis field. So it's a useful approach to use tools like some of the ones here, because it's not just you that's working on this; there's a whole community of people out there working on these tools. So if I'm not explaining to you the right way to do your analysis, you can go and read papers, you can go to the Cytoscape website, you can download apps, and you can basically try all the different tools that are out there and see if they're better for your analysis than what I'm saying. I'm not sure if you were shown this, but this is a typical analysis workflow for either pathway or network analysis. Now I know this example is showing the word pathway, but you could cross that out and write network database; the approaches still apply. Typically you have your data set. It could be a gene list, it could be a gene expression profile data set, or it could be a protein list from a mass spec experiment. Traditionally, the approach used most frequently is pathway enrichment analysis. This is the first box here. There's a statistical algorithm applied, and at the end of that you get essentially a list of significant pathways with a p-value or a q-value. That's the most common type of analytical approach. Another approach is called functional class scoring. This is when you use a gene-level statistic, and this was made popular when they 
introduced gene set enrichment analysis, GSEA. So that is a functional class scoring example, and again a widely used resource, and you'll find that individual pathway databases like Reactome are part of the gene sets that you'll find for GSEA. And then finally, which we'll talk a little bit more about later and try to demonstrate, this is where you start taking on board some aspects of the network topology, so the actual structural layout of the pathway. You're going to organize information into a network, and then you find that when you integrate your data and apply a variety of different analytical algorithms, you can identify structures within that network that could tell you something meaningful about your data. So you have to look at things like the structure of the network. These approaches are more about understanding the perturbation that could be going on within a pathway. If you perturb the pathway there's an effect, whether that kills the cell or whether that switches a pathway or activates a pathway. These are things that you can study in the pathway topology section. I'm probably not explaining this very well, but I will show you better examples of this in future slides. I don't know if Yuri already put this slide up. This was a paper that... gosh, apologies again, it gets cut off at the bottom here. But this is another way of looking at... He did? Okay. So I wonder whether he's put the same questions. I was looking at it in a different way. Okay, all right. But this is it: the enrichment analysis can allow you to... and I admit that I work at the Ontario Institute for Cancer Research, and a lot of the data that I've worked on in previous years has been cancer data sets, so I'm a little 
biased: when I'm giving talks I will talk a bit more about cancer. The examples you're going to learn in the lab are cancer data sets. I have to say that's probably the better source of data out there. I know that there are other diseases, and I know that some of you may be working on rarer diseases, and I think that's a valued effort. But you can basically substitute the word cancer for pretty much any other kind of disease or phenotype that you're studying. So the second one can help, and we're going to learn about this in the lab, this whole de novo subnetwork construction. You could ask questions like: are pathways, particularly new pathways that you identify, altered in cancer, and are there any clinically relevant tumor subtypes? The idea is integrating not just your own experimental data to identify significant pathways or components within a network, but also bringing in clinical data to see if you can define differences in those tumor subtypes, where different types of pathways are being switched on or off. And then finally, when you start looking at pathway-based modeling, which I was trying to explain a moment ago, you're looking at how pathway activities, the flow of information through a pathway, change either in an individual patient or across multiple patients. And really the question is, are there any targetable pathways within a patient based on those observed changes? Now, there are some challenges to raise about different types of analysis, which is why some people prefer network-based analysis, and why I'm going to focus a bit more on that in the lab today. And that is, we might use network analysis to deal with a couple of issues in pathway
based analysis. So pathways have a hierarchical organization, kind of like you were explaining the other day about Gene Ontology. At the very top you could have the term pathway, and that covers all pathways. Then below that you're going to have metabolic pathway, and then under metabolic pathway you're going to have metabolism of proteins, metabolism of carbohydrates, and then under metabolism of carbohydrates you're going to have glycolysis, gluconeogenesis and all the other things. So there's a certain way that you organize those pathways, and that's how it has been for years. Getting back to the question earlier, that does cause some issues when it comes down to data analysis. So you want to try and flatten that hierarchy down, so we can convert the pathway information into a kind of systems-wide network, and that makes it much easier to analyze your data, because there you just have one big network. And the other thing to bear in mind, which I alluded to earlier, is that pathways don't necessarily have boundaries. There's crosstalk. I could show a nice diagram of a pathway and say this is the pathway for EGFR signaling, but we know that there are components in there that are shared with other receptor tyrosine kinase pathways, so where does that end? You're going to find that genes exist in multiple pathways, but you're typically only going to see a gene once in the network, because it's one big display, and all the interactions relating to crosstalk are displayed in the same network, whereas you may not necessarily see all those interactions within a pathway diagram. So that's going to affect your data analysis sometimes.
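To make the flattening idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the three toy pathways and their gene lists are made up for the example, and linking every pair of genes that share a pathway is a deliberate simplification (it is not how the Reactome FI network derives its interactions). What it does show is that a gene shared between pathways ends up as a single node, with the crosstalk visible on its edges.

```python
from itertools import combinations

# Hypothetical toy pathways; the names and memberships are illustrative only.
pathways = {
    "EGFR signaling": ["EGFR", "GRB2", "SOS1", "KRAS"],
    "FGFR signaling": ["FGFR1", "GRB2", "SOS1", "KRAS"],
    "Glycolysis": ["HK1", "PFKM"],
}

def flatten(pathways):
    """Collapse pathway gene sets into one network.

    Returns (nodes, edges): nodes is a set in which each gene appears once,
    edges maps a sorted gene pair to the set of pathways supporting it.
    Assumption: co-membership in a pathway is treated as an edge.
    """
    nodes = set()
    edges = {}
    for name, genes in pathways.items():
        nodes.update(genes)
        for a, b in combinations(sorted(set(genes)), 2):
            edges.setdefault((a, b), set()).add(name)
    return nodes, edges

nodes, edges = flatten(pathways)
# GRB2 belongs to two pathways but is a single node in the flattened network.
shared = {pair: names for pair, names in edges.items() if len(names) > 1}
```

In this toy case `shared` holds the GRB2, SOS1 and KRAS pairs, which is the crosstalk between the two receptor pathways that separate pathway diagrams would each show only part of.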
The other challenge, and I think this has always been the case... Actually, Francis and I were talking about this the other day. I worked in a lab in the very early days when, if you were doing gene expression profiling, it was a nitrocellulose membrane with individual genes spotted onto it. Nowadays it's all glass slides and microchips. But in those days the complement of bioinformatics tools and databases you have nowadays didn't exist. So we were joking that the only way you could analyze your data set was to identify the genes that kept popping up in the experiment, and their frequency, and then you would cluster that information using Mike Eisen's, I can't remember the name of the tool, it was Cluster or something like that. And it was basically just presenting a heat map of the data, which was for the longest time quite a common format for displaying your results. Obviously now people like to show network diagrams and pathways; that's a better way to demonstrate it. But the same problem still exists. I've got different types of experiments: I've got gene expression, I've got somatic mutation data, I've got phosphorylation data, I've got mass spec information. How do I bring all that data together? I'm not going to tell you every single thing can be done here. There are still some challenges out there, but you can at the very least integrate some data sets, and we're going to do that a little bit later in the Reactome lab. And then the other thing, and we're still trying to work out some of these questions, is moving into that third category where you're doing pathway-based modeling or simulations. And that is, how can the structures that you identify within your network analysis be used to develop biomarkers, predict
patient outcome and treatments? And then there's when you start perturbing these networks with drugs. Think of people that are undergoing chemotherapy: you can analyze the events that are going on in the cell by gene expression profiling or next-generation sequencing, and you can study how some of these drugs are potentially affecting individual pathways. That's a little bit more challenging, so you have to set some expectations. Sorry? The topologies of networks? I should actually have a better answer for this. Yes, you can, and a colleague of mine, if they were standing here, would probably have a better answer for you. It's not straightforward, but you can do statistical testing to compare. We're going to talk about network modules in a moment, but if you do have two networks, you can compare them, and I think there are some tools in Cytoscape that allow you to do that as well, but I am not sure. I was going to ask Veronique. Sorry, how do you compare two networks? Yeah, it's more custom. It's feasible, but off the top of my head I don't have a simple example I can give you of how best to do that. All right. So we're going to talk a little bit more about network-based data analysis. Obviously now we have better coverage, because there's more information about particular human genes. This is going to be focused mostly on protein-protein interaction networks. We're going to talk about identifying these topological structures, the modules or clusters within the networks. We can talk about annotating those modules to identify unique gene signatures or, in some ways, to try to identify biomarkers. Of course, with the advent of next-generation
sequencing, we're generating large data sets of somatic mutations, so we can look at driver and rare mutations together now. Using network approaches you can understand the relationships between different driver genes and these rare mutations within the cell. And rather than label individual modules with pathway annotations, well, we can obviously label them with pathway annotations, but if we think that within that module there's a series of genes that are being mutated, you could think about labeling that module as a disease module or mutated pathway. We'll talk a bit more about that while we're doing the Reactome lab. So basically, this is the idea of de novo subnetwork construction and clustering. We have a list of genes, proteins or transcripts, and we're going to apply that to the network and identify topologically unlikely configurations, so basically a subset of genes that interact more closely with one another in the network than you'd expect by chance. You can extract these unlikely configurations, and then you can annotate these clusters. So network clustering is basically this process of grouping objects together. You can use the words clusters, communities and modules. I'm probably going to use modules and clusters as I continue the talk. Essentially they consist of elements that are similar in some way. There are a variety of different network clustering algorithms. They were mainly developed in the social sciences and engineering, by people that are studying social interactions. There are large networks within companies, and people want to collectively organize some of that information in the network, and then they want to try and filter out information. That's basically what these algorithms are doing, and they're there to look for these sets of
nodes that are tightly connected to one another. The idea here, particularly as we know now, is that in large networks it's very useful to identify highly connected genes, because they potentially share not just the connectivity but similar functionality. So this here is a list, actually an abridged version. The slide could be a lot longer, because there are many different network clustering approaches out there. In fact, there are people that are studying beehives, the interaction between bees and how the bees swarm around the queen bee. There are interactions that you can study there, and there are network clustering approaches derived from that kind of work. I'm not saying they're going to be applicable here for protein interactions, but the list of algorithms that could be used to find those topologically unlikely modules is long, so I've focused on some of the most relevant ones. We're going to learn a bit more about Girvan-Newman in the Reactome lab later today. Basically the idea here is that you have your network and you chip away at it. You start removing the edges with the highest betweenness first, and as you chop away at the graph, you break it down into individual nodes, and as it breaks down further you see these tightly knit interactions appearing. You'll understand it better when I demonstrate it in the lab later. The Markov clustering algorithm, again, is a similar kind of approach of identifying these modules, but it looks at the graph in a slightly different way, and it's using these stochastic flow models. And again, I'm trying not to present you with statistics and equations here
because I'd probably lose myself in that as well. But I think the take-home message here is there's a variety of different algorithms that are used for network clustering. If you have a gene expression data set, you may want to use the Markov clustering algorithm, and in fact I'll maybe try to do that in the lab if we can. But for the typical gene lists that you have out there, the Girvan-Newman algorithm works really nicely at identifying modules. HotNet is interesting in the sense that it uses a heat diffusion model. It's a slightly different approach, but the advantage here is that it avoids some of the ascertainment bias that is usually associated with well-annotated genes. Certainly for genes that are well studied there are going to be lots of interactions, and they're going to automatically cluster into these nice little tight modules when you do the analysis. So you want to avoid that sometimes, because you're not always focused on the genes that you know the most about. You're interested in the rare mutations that are interacting maybe with one another or with some of these driver genes. There's a HyperModules app, which helps to find network clusters that correlate with clinical characteristics. So it's important to bring in the idea that once you've created your network based on your experimental data, you may want to bring in additional data sets like clinical data to try and relate some of those modules within the network to a phenotype or a clinical outcome, and that's the whole idea of predicting biomarkers. So the clinical data comes in what kind of format?
Well, the clinical data, I mean we'll do this in the Reactome lab later. The idea here is that in a lot of these large studies there are clinicians involved. So say you're studying the gene expression profile of a cancer patient. As well as the experimental data, you might also have information about survival, whether the patient succumbs to the disease and when that is, and you can use that information to make predictions. And actually the Reactome FI network and the Cytoscape application that we're going to show in the lab later integrates a number of these different algorithms, so depending on the type of data that you want to analyze, you've got these different opportunities to analyze the data using these algorithms. And essentially this is the outcome of the analysis. This is just a hypothetical subnetwork. It's composed now of six clusters. Oh, my slides have been screwed up a little bit, sorry. The point I want to make here is that this cluster six here has only got two genes, so you might exclude that from your analysis and focus more on these modules here and here. And then you also have to remember that it's mutually exclusive, so a gene only appears once in this network. There's some analysis you can do where the same gene can appear in more than one module. This is one gene per module. So I'm going to move on a little bit. Can I have permission to run maybe five minutes into the break? Are we okay with that? I'm sorry. Thanks. So now we're going to talk about the Reactome functional interaction network, and in fact I've
gone through some of these things here. The point is that analyzing mutated genes in a network context allows us to understand the relationships between the genes. So you can elucidate the mechanism of action of these driver genes and the rare mutations as well. You can facilitate some form of hypothesis generation. And most importantly, when you've got these large data sets with thousands of genes, maybe tens of thousands of genes, you can reduce that down to a handful of mutated genes and potentially mutated pathways, and that's there to generate your hypothesis. That's what this is, a hypothesis-generating tool. Or alternatively, you could use it to validate a hypothesis you already have. You've done experiments to demonstrate a particular phenotype or a perturbation of a biological event, and you can go out there, derive experimental data, throw it into the Reactome FI network and into the app, and see if you can validate your hypothesis. So we're going to talk about functional interactions here. The Reactome FI network is a functional interaction network. A functional interaction is an interaction in which two proteins... well, basically it's a reliable biological network based on manually curated pathways and extended with verified interactions. So this is just a typical reaction here, and you can break that pathway reaction down into a handful of binary interactions. Now, in order to have a larger network, we've parsed a lot of data from different pathway data resources and integrated these other protein-protein interactions. The extracted pathway interactions are called the annotated FIs, annotated functional interactions, and then what we can do is for subsets of that
data, we can use that as a training data set for a naive Bayes classifier, which you then feed the features for all these different interactions, and that creates a second data set of predicted FIs. So they're going through all the different protein-protein interaction databases and trying to figure out, okay, here's an interaction. Because the interactions derived from the pathway databases can all be linked back to the literature, and there have been additional contributions from authors and reviewers, there's a high degree of certainty that that interaction occurs within the cell. Some of these other interactions are derived from high-throughput experiments. There may be some that do actually occur within the cell; some could be, frankly, artifacts. So the naive Bayes classifier tries to identify the probability that a particular interaction is true, that it actually will occur in the cell. So we combine these two data sets, and we're left with a series of 328,000 interactions consisting of about 12,000 proteins. So that's just about 60 percent, I think, of human gene coverage. Okay? No, because when we curate pathways, we're not using protein-protein interaction data from any of these high-throughput databases. But you could potentially identify within some of these databases interactions that have literature references. So depending on the experimental approaches people have applied to study those interactions, you're going to decide whether there's a high degree of confidence. So you're only annotating Reactome based on single-gene experiments? Yes. And also, in Reactome and some other pathway databases there's also an
interaction: there's a curator who's building that record of the reaction, but they're also talking to an author, and the author is somebody who's an expert in the field, who's giving you a bit more information, saying, well, I think it's these papers you want to use to build this reaction. And then it's peer reviewed. So there's time and effort put into creating these records, so there's a high degree of confidence in the individual interactions. So you're essentially guaranteeing then that you have not a priori formed the model between these two guys, right? Yeah, we're trying to keep them... well, that's why we have these two groups. We combine them, and then we have this large network. So what I've talked about before is the idea that you're extracting networks from different databases and you create the network yourself. What I'm telling you about here is an alternative way of doing data analysis, and that is, here's the Reactome FI network, so why don't you just use that as your large data set to start with? And then you use these clustering algorithms to chip away at that network once you've uploaded your data, and you identify those modules. I do this better in the next slide, but just imagine this is a subset of the network here. There are going to be some degrees of connectivity, and you'll see these kinds of structures here. This is just an unweighted network just now. But if you start projecting your genes of interest from your data into the network, they're going to map to certain locations in the network. The yellow lines demonstrate that there's some form of interaction between those nodes. And then you can sometimes add in these things called linkers. These help to provide better connectivity between topologically distinct locations.
You don't have to add in linkers, but sometimes linkers could be things like transcription factors, which may not necessarily be perturbed within your gene expression profile, but they help to explain what's going on in your data. So there you get additional lines, and then you take away everything that's not really part of your data set, and now you're left with a subnetwork based on your data, and now you can do the network analysis. So just as an example, going back to the 127 cancer genes I was talking about earlier, you can upload that into the Reactome FI network, and then you see that there. And now we've done the clustering, so you can see these four distinct modules here. Then using pathway enrichment tools you can label these modules and say, well, this is receptor tyrosine kinase signaling, this is possibly crosstalk between Notch, Wnt and TGF-beta signaling, there's a TP53 module here, and there's a cell cycle module. Again, you have to look at the pathway names that you're overlaying, because it might not be obvious sometimes that you're talking about the TP53 pathway; it could be a sub-pathway of that. But the point is, there are the statistics to back up these particular annotations, but you yourself have to go in and tease out some of the pathways of interest. This is just an example of where you can combine the results of your experimental data with clinical data. This is getting to the question earlier. So basically what we're doing here is using the Reactome FI network to display gene expression data, and then within the network we're looking for modules that are related to overall patient survival. So there are two different types of survival analysis you can perform, and we'll be doing this in the lab. There's Cox
proportional hazards, and then this is the Kaplan-Meier approach here. With the Kaplan-Meier plots you draw probability of survival versus time. So basically we have sample information associated with all of the gene expression. The gene expression profiling is derived from patients, and we know about some physical characteristics of the patient and how they're responding to treatment. From the actual patient samples, genes are going to have low expression or high expression, and that's what you do: you split the samples into two groups. So you're going to have samples where there's no or a low level of expression, and that's the red line, and the green line is where genes have high expression. In this particular case there's a module of 31 genes that are involved in the mitotic cell cycle. The difference in color is just there to represent different annotations from different pathway databases, but essentially it's confirming that it's part of the cell cycle. And you could say that these 31 genes are significantly related to breast cancer patient survival in at least five different cases. That's what this Kaplan-Meier plot is showing. The point is that possibly patients with low expression are faring better than patients with high expression of these genes in this particular module. So the idea here is that a single network module, or maybe more than one module, could be used as a signature for cancer patient prognosis. We'll do this in the lab later today. There's quite a good data set to demonstrate this. So the last part of my talk, and I do apologize for running over time, I'll get through relatively quickly, because it's quite painful. This is more about looking at pathway modeling. So again you're
still going to have lists of genes, proteins, transcripts. We have preserved a lot of the information about the relationships between those entities, and what we're trying to do is integrate multiple data types, so things like copy number variation and somatic mutations together, to try and yield not just a list of significant pathways but a list of altered pathways. Many groups have tried different approaches, and I made a point earlier that people work on different types of things. Some focus mainly on metabolic pathways, there are also signaling pathways, and for those there are different tools and different resources. So what I'm trying to say is you can't necessarily apply all of these tools to your data set. If you've got a metabolic data set, you could probably use something like CellNetAnalyzer, because a lot of its algorithms have been developed for doing what they call computational strain design. So say somebody has a bacterial cell and they want to maximize the production of a particular product from a given pathway; well, you can do metabolic engineering and such. So CellNetAnalyzer may not necessarily be applicable to studying signaling networks in a human disease. There's only a handful of tools that work with multiple different pathway data sets, and we'll talk more about one in a moment. But there are these different types of approaches for pathway modeling, whether you're looking at metabolic pathways, whether you're looking at phosphorylation states within cells, whether you're looking at regulatory processes, transcriptional regulation and such. I'm going to talk a little bit more about PARADIGM, because I think that's quite a useful approach to looking at
different types of pathways, and in particular it's useful in studying disease data sets. Again, I apologize for focusing primarily on cancer, but it's probably the one which is generating useful multi-omics data sets. So probabilistic graphical models, in this case, attempt to integrate multiple molecular alterations to yield this list of pathways. As I said, you can have omics data for an individual patient or across multiple patients. Typically it's copy number variation, gene expression data and some kind of mutation data. And the approach to use PARADIGM is as such. Now, I've introduced you to a simple view of apoptosis: MDM2 inhibits p53. In order to perform PARADIGM you need to think about the fact that you're studying multiple simultaneous perturbations on that pathway, and it's not as simple. You have to create what we call a factor graph. So in this example a single protein in the pathway now becomes four independent but related entities: there's a gene, a transcript, a protein and an active protein in this case. And each of the components has a piece of experimental data that you can exploit. Now the next question is, once you take two of those different data sets, whether it's copy number variation or another data type, how does that get integrated into the analysis and displayed in the pathway diagram? So we're going to go back to the pathway diagram just as an example here. We have a simple gene expression regulatory pathway where CTGF and NAPA are regulated by YAP1, WTR1 and RUNX2.
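The step of turning one pathway entity into four related ones can be sketched in a few lines of Python. This is only a structural sketch under stated assumptions, not PARADIGM itself: the level names, the `expand` function, the rule that a regulator's active form acts on the target's transcript, and the single YAP1 to CTGF link used as input are all illustrative choices, and no probabilities or inference are involved.

```python
# Four levels per pathway entity, following the gene / transcript /
# protein / active protein split described in the talk.
LEVELS = ("gene", "transcript", "protein", "active")

def expand(genes, regulations):
    """Expand pathway entities into factor-graph-style variables.

    genes: list of entity names.
    regulations: list of (regulator, target, sign), where sign is +1 for
    activation and -1 for inhibition.
    Assumption: a regulator's active form acts on the target's transcript,
    which suits a transcriptional regulatory pathway only.
    """
    variables = [(g, level) for g in genes for level in LEVELS]
    edges = []
    for g in genes:
        # Chain each entity through its own four levels.
        for upstream, downstream in zip(LEVELS, LEVELS[1:]):
            edges.append(((g, upstream), (g, downstream), +1))
    for regulator, target, sign in regulations:
        edges.append(((regulator, "active"), (target, "transcript"), sign))
    return variables, edges

variables, edges = expand(["YAP1", "CTGF"], [("YAP1", "CTGF", +1)])

# Observed data attach to specific levels: copy number to the gene,
# expression to the transcript. The values here are made up.
evidence = {("YAP1", "gene"): +1, ("CTGF", "transcript"): +1}
```

The design point is that each data type has a natural place to attach: a copy number call lands on a `gene` variable and an expression value on a `transcript` variable, and the edges between them are what an inference method would reason over.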
Now what we're trying to do with this PGM work is to ask the question, if I convert this pathway into a factor graph... I'm not going to show you the factor graph for this individual pathway, just for you to understand. You want to ask questions like: if YAP1 copy number is up, how likely is CTGF upregulated? Or if NAPA activity is high, how likely is WTR1 expression upregulated, or maybe RUNX2 downregulated? So in this example here we're using the same diagram, with this probabilistic graphical modeling view. On the left panel you're seeing the pathway for two different patients. We're combining copy number variation and gene expression data from ovarian cancer samples, again from The Cancer Genome Atlas, and we've integrated that into the factor graph using this PARADIGM approach, and what you're viewing here is the inference. The point I want you to focus in on here is the comparison. If you look at... actually, sorry, I'm colorblind, I apologize. I'm red-green colorblind, which makes it very difficult sometimes when you're looking at slides, and because of the projector... Did I annotate this? I didn't. Oh, shoot. There is a difference in the colors. The thing is, what I'm seeing on my screen is different from what you're seeing here. And your notes, are they colored? All right, you can see it. So the point is, you can see that it's still green, but there's a difference in color here between these two nodes. In the first sample you've got lower expression of NAPA, and then you've got slightly higher expression here, and it's most likely because there's a copy number change in WTR1. You see here this line tells me there are two copies of WTR1. So the point is the copy number variation
predicts that NAPA is being upregulated. So the minus and the plus, the minus three and the plus three? That refers to the copy number or to the expression. I hope that's expression, or actually, the lighter green refers to a more negative value, so it could be the inferred statistic. I actually don't know; I should check that for you. Okay. Anyway, just to wrap up here. The good news is we've integrated the PARADIGM approach into the Reactome FI network, so you can do these probabilistic graphical models and pathway-based simulations using the Reactome FI network. It's still a work in progress, but it's a useful resource if you have that kind of data, you want to integrate it, and you want to try to make inferences from perturbing pathway activities. So the last few slides are just listing some of the resources for the different pathway databases, the network algorithms and some of the modeling approaches, so you can go to these websites. Okay. And we're on a coffee break. Again, I'm sorry for overrunning. Are there any more questions? I'm sure there'll be lots more questions during the lab.
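As a recap ahead of the lab, the Kaplan-Meier idea from the survival part of the talk (split the samples into low and high expression of a module, then plot probability of survival against time) can be sketched in plain Python. This is a minimal sketch, not what the Reactome FI app runs internally; the function names, the median split and the toy numbers are assumptions for illustration only.

```python
from statistics import median

def split_by_expression(module_expr):
    """Split samples into low and high groups at the median module expression.

    module_expr: dict mapping sample id to a summary expression value for
    the module. The median cutoff is an assumption; other cutoffs are possible.
    """
    cutoff = median(module_expr.values())
    low = [s for s, v in module_expr.items() if v <= cutoff]
    high = [s for s, v in module_expr.items() if v > cutoff]
    return low, high

def kaplan_meier(times, events):
    """Kaplan-Meier estimate of survival probability over time.

    times: follow-up time per patient.
    events: 1 if death was observed at that time, 0 if the patient was censored.
    Returns a list of (time, survival probability) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Toy data, made up for the example.
low, high = split_by_expression({"s1": 0.1, "s2": 0.9, "s3": 0.2, "s4": 0.8})
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

In practice you would compute one curve for the low group and one for the high group and compare them, which is the red line versus green line comparison on the slide.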