 So, my name is Robin Haw. I work at the Ontario Institute for Cancer Research. I'll just give you a bit of background on myself. I am probably one of the few people in this room who pre-dates Anne as part of the Canadian Bioinformatics Workshops. Back in 2003 I was one of the students on the course. We did an introduction to bioinformatics. You think about spending a week at Cold Spring Harbor; I had two weeks in Vancouver, in the same kind of environment. It was a nine-to-five class, and it was intense, but it was very valuable. My background is microbiology and genetics. I've dabbled in genomics, proteomics, and obviously bioinformatics. I have worked for a number of interaction databases and for Science Magazine, and I'm now working for Reactome. I have done data analysis in my past life as well. We were joking about this last night: I've used technologies which you have probably never heard of, so I have an appreciation of the data that goes in. I don't play around with all the tools that I'm going to talk about today. I have more experience with some, less with others. I'll do my best to handle your questions as well. Feel free to interrupt me while I'm talking. I tend to jabber a little bit, so I have notes here to try and keep me on schedule.
Your learning objectives today. We're going to continue along from the excellent talk from Yuri yesterday about some of the pathway resources out there, so you're going to learn more about pathway and network analysis. We want you to understand the sources of that information. That's very important when you're analyzing data, because you need to understand where the data comes from, and that can have a huge impact on the results that you're going to look at. We'll talk a little bit about some of the analytical approaches to analysis, visualization, and data integration.
That's a huge thing, particularly when you're talking about multiomics data sets and trying to understand what your data means. Then we'll do an overview, as I said, of the Reactome Functional Interaction network and the Cytoscape tool that we've developed.
Let's get our hands up first. Who uses pathways on a frequent basis for their work? Oh, okay. That's low. Who uses network data? I look at it differently than pathways. But who uses both? Then looking at the tools: had people heard of KEGG before? Obviously yesterday, but before you came here. Reactome, anybody? Good, good. That's not 10. That's good. It's always a good thing. I'll tell my boss that. Also, who uses Ingenuity here? Okay. At least somebody's honest. I'm going to take them down later. You have as well. Any other resources that people have used that they want to share in the pathway and network world? It's helpful feedback for me as well, just to get an idea of what your backgrounds are. All right. Okay, let's continue.
So really, what is pathway and network analysis? Well, it's a technique that makes use of either pathway or network information to gain some form of insight into a biological system or your data set. It's funny, I always put "rapidly evolving field". I have to say that it still is. There are certain areas, when you're looking at network topology and pathway topology, and a lot of variant information and how to link that to pathways, and pathways to disease. It's still changing. There are many approaches. Some are good, some are bad, some are awful, some you shouldn't ever try. I'm not going to tell you which ones are which. That's up to you to decide. Sorry, it all depends on your questions. The most important thing I think you should have is a series of questions before you start playing with these toys. Because without those right questions, you're going to go in every which direction.
And I'm sorry, my biggest headache, as somebody who's analyzing data, has been the PI not giving me the right question. Because without that, you're not going to get the right answer. And you may actually find when you're doing your analysis that you get a different answer than you'd expected. So you then have to go back to your PI and say, hey, this is what I found. This is really exciting. And then you have to convince them that they have to change track onto what you're suggesting. So you'd better have the right information there. And I think the biggest challenge is basically this need to analyze a huge amount of data and extract something meaningful from it. And hopefully there you can actually answer some fundamental questions in biology.
So realistically, what we're looking at is a huge size reduction. Initially, when I started, we were talking hundreds of genes. Now you're talking about thousands of genes. And if you look at the availability of data points now: in the good old days of doing yeast microarrays, you had maybe a thousand data points on your slide. Well, actually, your nitrocellulose membrane. Now you can be looking at billions of data points when you're sequencing genomes and things like that across many different samples. So it's a huge data reduction down to what I think of as dozens or a handful of pathways that you want to focus in on. You can increase statistical power by integrating multiple perturbations for testing in a high-dimensional space. Okay, nice photograph time. Thanks. Particularly when you're looking at variants, you're going to hit a number of driver mutations that you can quite easily identify or relate to in terms of your model. But there's also that long tail when you look at a distribution of variants, particularly in cancer, where you've got that kind of long tail of rare mutations.
And you're just trying to understand what the relationship is between the drivers and the rare mutations. And then there's a whole bunch of other stories. By far the most common is identifying a hidden pattern in your gene list. That's what you're trying to do. It's a great way to visualize and look at the mechanistic models that underlie some of the experimental observations that you have in the lab. It's also useful in predicting the function of unannotated genes. The graph itself can be used to establish quantitative models. And, as we'll talk about a little bit later with the Reactome FI network, it's useful for developing methods for identifying molecular signatures within a data set.
So, Yuri introduced you to this data set the other day. I'm using it again just because it's a nice little example where there are 127 cancer driver genes. You look at your genes, and you might find an interesting gene you know about. You might know a little bit more about the pathways that they connect to. But looking at this list here, you don't necessarily know what they're doing and why these mutations would be causing cancer. So pathway databases allow us to map these genes onto biological pathways and try to understand the roles of these genes within the pathways. And if you substitute the word network for pathway, it's pretty much the same outcome.
So, again, Yuri presented this slide. Yuri wrote this paper, and my boss, Lincoln Stein, was also on the paper, so we're kind of reusing slides, but it helps to reinforce some of these ideas and our different perspectives on pathways and networks. I think of a pathway as a series of actions among molecules in a cell that leads to a change in cellular state or the production of a certain cellular product.
I think most of you were introduced to metabolic pathways when you were in high school; you probably learned about glycolysis or something like that. You might have seen some basic chemical reactions that are part of life. They're basically the chemical reactions in the body that help us metabolize glucose to make energy, and so forth. Signal transduction pathways move a signal from the exterior of the cell into the interior, usually the nucleus. We're seeing an example of the EGFR signaling pathway on the left there. In this iconography, green represents proteins, blue represents, sorry, I'm colorblind, so if I'm saying the wrong color now, I'm in trouble. EGFR is a complex; ATP and the like are small molecules. You can see inhibitory as well as activating events, and the little boxes represent the reactions. There are also gene regulation pathways, where genes are turned on and off. But the reality of what's going on in the cell is that it's a big mishmash of metabolic, signaling and gene regulation events. We still classify these pathways as metabolism, signal transduction and gene regulation, but when you're looking at disease and a lot of biological processes, there's interplay between each of these. And when you think about that interplay, that crosstalk, the very act of joining pathways together in some ways does create a network. Networks tend not to have a start and end point. Pathways themselves don't necessarily have clear boundaries sometimes. And you can learn a lot by looking at both pathways and networks to understand more about disease.
Now, I'm going to talk specifically about reaction network pathway databases. If you go to Pathguide, I don't know if the link's at the back, there's the resource pathguide.org, which has reviewed all the interaction and pathway data resources out there. There are several hundred out there. Again, I'm going to get back to this.
Some good, some great, some awful, and some that are useless. And it depends on whether they're being funded or whether they're actually being actively updated. I think the important thing when you're going for data is to look at what we call the gold standard resources. Now, Reactome is one of them. I am proud to say that. But that's been decades of work. We've been going since an earlier version, when it used to be here at Cold Spring. It used to be called GKB, the Genome Knowledgebase. It started in '99, and around 2001 more data came online. Then in 2004 we renamed ourselves Reactome. So we've been releasing a lot of data in the last 10 years. It's really significant. What I was going to say is that we've worked hard to establish this gold standard database title. KEGG is another one, and resources like PANTHER and WikiPathways are coming up there.
So these are all examples of reaction network databases, where the basic unit of the pathway is the reaction. It can represent a variety of different reactions that you find in biology. The inputs and the outputs can be proteins, small molecules, complexes, microRNAs, therapeutics, whatever really feeds into a reaction. You can describe that using the Gene Ontology terms that Yuri was talking about yesterday, and you can link all this information to citations. It's an established system, it's very well used and very well accepted.
So one example of this is KEGG. It's probably one of the older data resources out there; I think it's been going since the late 1990s. Yuri mentioned the other day that its licensing had changed recently. It's still free for you to go to the website and use the data within the context of the website.
But once you actually want to download the data and do your own type of data analysis, you have to have a license to do that. If you do use KEGG and you haven't got the license, there's a very good chance you're using an old data set. There's nothing wrong with that, because there was a time when it was freely available for scientists. But if you try to publish that, well, people don't always make a note that their data is old, but a good reviewer would pick up on that and probably advise you to reanalyze your data with more recent data, for which you'll then need to get the KEGG license or go to another data resource. But KEGG is still a very valuable source of biological information. It's not just about pathways: it's got drug and variant information, and there's interaction data across a variety of organisms in the biological kingdom.
I would say that when they're curating pathways, they tend to focus on a species of interest, collect all that knowledge, and then create these reference pathways from which they can project that information into other organisms. And that's something we do at Reactome as well. It's something to bear in mind: you focus on the curation of one species, and then by using ortholog data you can project those pathways into other organisms based on the conservation of proteins and genes. One of the assumptions in that model is that small molecules are conserved 100% across all species, unless somebody tells you otherwise. Their goal at KEGG is to provide these relationships between all these components, and they organize them into a fairly standard style of diagram. This actually hasn't really changed in over a decade. So basically, the green boxes are proteins, white boxes are genes. You can see the pointer here. On the left here, we've got the MAPK signaling pathway.
This is basically an encapsulated pathway. The idea is that if you click there, you get linked to another pathway. In some ways, this is reflecting crosstalk. And then you've got a variety of different lines to describe the relationships between these components. You can download these diagrams. A lot of third-party tools use this for analysis. I don't believe KEGG has an internal service embedded in their website for you to just upload data and analyze it. You have to do that through another tool.
Reactome, which I work for, is open source and open access. That means pretty much everything that we do is transparent. We focus on the curation of human pathways, and then, as I said, we project that into 18 other model organisms. Typically, we focus on model organisms where there's Gene Ontology and additional annotation to support and enrich our own annotations. That also means there's actually a community out there to look over the data and use it. Every pathway we create is traceable back to the primary literature. We cross-reference to other databases, we provide analysis tools through our website, and we support a number of third-party resources. I'll use Reactome as an example of how things should be done in terms of providing open data exchange. Because when you're using pathway data from different sources, there's a variety of different exchange languages, there are visualization tools that we can use, and there's a variety of analytical approaches that are applicable to these data types.
This here is a screenshot of our pathway browser. It's undergone a few iterations over the last year. What we're seeing here is one of the illustration-like diagrams that we provide as well. We used to have these ugly green boxes, just little boxes with little arrows, and it wasn't really that interesting.
Not visually impressive to look at. So we replaced them with these lovely illustrations, and we've actually integrated our analysis program into these visualizations as well. Forgive me, you'll see in the next slide that we do have the traditional network views of pathways as well. So there's a hierarchy here on the left, which describes all the biological pathways and events in Reactome. We have these top-level pathways like signal transduction, and underneath there you've got signaling by a variety of different receptor tyrosine kinases and such. Basically, we've colorized the pathway diagram. If you look at these boxes here, this is the signal transduction pathway illustration, and these little white boxes represent all the different sub-pathways that signal transduction encompasses. And you see these little yellow colors. This is telling me that there are hits: the proportion of hits in your data set that correspond to these individual pathways. If it was a network map, you would see individual nodes being highlighted with similar colors.
The important thing about the results is that in this details panel below you're seeing a list of significant pathways. This is what you really want. There are a couple of slides further down where I'll talk a little bit more about the analysis approach, and I'll get back to this. But this is really what you want to see. You want to see a nice visual graph of the pathway with your data in it. You can use it in a manuscript. You can use it for your publications. And then you've got that list of significant pathways. As you interact with either the events here on the left or the list at the bottom here, the diagram updates with the new information.
The one thing I want to point out is the difference between Reactome and KEGG, and this isn't trying to say it's a bad thing.
What I'm trying to say is that when you look at different databases, the way in which they capture or curate that data can be different. So we're going to look specifically here at where active caspase-8 is involved in a reaction which activates BID. It's clearly seen here that these two things are linked. Now if I look at the equivalent event as shown in KEGG, KEGG is actually adding caspase-10 as activating BID. But you see how this caspase-8 here is floating in the diagram? There's no link between caspase-8 and BID here. So the point is that different data resources will look at data in different ways, and that will be visualized and could well have an impact on your results. This is not to say it's a bad thing; it's just to be aware of the differences between pathway resources. This to me is a core reaction. You're going to clearly see these components are shared between these different resources. If you go to PANTHER and WikiPathways, you're going to see the same thing. But let's get the information in the right presentation, because it's useful to know; you're relying a lot on the data being right.
Yes, you have a question? We don't curate it. We don't believe that that's necessarily the same mechanism. So this is the thing. Curation in Reactome is based on authors and curators, and we're peer reviewed as well. So it's gone through a system where curators who have some knowledge of the area will curate the information by reading papers. They're relying on the expertise of an author, who's a researcher in the field, to give us the right information or to correct us if we're wrong, because they are the reviewers. So data can be represented differently in different resources.
Yuri already mentioned this the other day: a really good source of both pathway and network information is Pathway Commons.
You can just search for your favorite gene or you can download data from them. This is a nice slide because it segues into the network stuff that I'm going to talk about next. So I think most of you are aware of what an interaction network is.
So your question is? Referring to the previous slide, is it possibly a different cell line? Is it possibly slightly different? It's possible, yes. The question is whether the interaction that I was showing you earlier could exist in another cell line or another cell type. That's absolutely true. We try our best to cover whatever we can when we're curating these events. Reactome typically takes a generic view of the cell. That's not to say that we don't curate specific cells; in Notch signalling, for example, there have to be two cells involved in the reactions. So it is possible. Very true. But when you actually start going through some of the data sources, you can find really big gaping holes in the knowledge. And it depends how far you need to go into that data to understand the mechanism behind your data set.
Anyway, back to interaction networks. They're essentially a collection of nodes and edges. Nodes typically represent proteins, genes, metabolites. Edges are the relationships between the nodes. Depending on the representation, you can sometimes flip that as well: nodes can become the relationships or some other information, and the edges themselves can be genes. It's kind of confusing, but the next slide demonstrates the different types of networks that exist. Typically, they're based on model organism or human data, where there were, excuse me, a number of high-throughput experiments to assess the interactions between different types of biological molecules within these model organism systems. So there are transcriptional regulatory networks, where there are interactions between transcription factors and the regulatory elements.
There are virus-host networks, where you have interactions between the virus proteins and the host proteins. There are metabolic networks, where you have enzymes interacting with substrates. And then you have disease networks, where the nodes themselves are the diseases and the edges are in fact the genes that connect those diseases. By far the most common is the protein-protein interaction network, or gene-gene interaction network if you want to look at it that way. We'll focus a bit more on that subset.
So in order to create these networks, you need to be aware of the network databases out there. Their curation model can be automated or manual, like pathway databases. Automated approaches are there to scrape data from other data resources, or, when somebody publishes a large interaction dataset, you can usually parse that data file and upload it directly into your database. Interaction databases typically have more extensive coverage of biological systems. Reactome has coverage of maybe about half of all human proteins, which is pretty high, but interaction databases have a better chance of having higher coverage of our known proteins. The one thing to point out is that the relationships between nodes in pathways are based on a lot more experimental data, more references and such, whereas in interaction databases the relationships between those interacting proteins can sometimes be tentative. There's less evidence, and it depends on how much weight you want to put on certain methods that were used to identify those interactions. There's a variety of different popular human interaction data sources.
BioGRID was developed by Mike Tyers, initially at Mount Sinai; he went over to Edinburgh and I think he's back at McGill now. And then there's IntAct, which is based at EBI and was developed by Henning Hermjakob. I apologize, the numbers might be a little bit out of date, but there's certainly very high coverage of physical and genetic interactions. And, as I said, there is some reference data there to support those interactions. But as we'll see when we talk about the construction of the Reactome FI network, there are ways in which you can play the two off each other, the pathway and the network, to actually build these nice networks of protein-protein interactions.
So this is just an example with p53. I just went to the IntAct website, typed in TP53, and you typically get these tabular results. So you'll get a molecule, TP53, and the corresponding interactor. In this case, for MDM2, there is a whole host of interactions based on different detection methods. Getting back to the confidence in that interaction, it depends on how confident you feel that one assay demonstrates the interaction compared with, say, some kind of immunoprecipitation. And then there is an identifier for the interaction source. EBI is telling me it's from the EBI database. MINT is telling me that it's actually from another interaction data source. IMEx is this consortium for data exchange. Actually, I can't remember what IMEx stands for now. But basically it's an organization where they've tried to harmonize the curation efforts of these different data resources and also provide an opportunity for data exchange between resources. And then you can see the source database. You can click on these different records and read more about the individual aspects of that interaction. And if you clicked on the graph tab, and I don't have it in the slide, you could see a network view of that interaction.
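To make the step from a table of interaction records to a network view concrete, here is a minimal Python sketch. It assumes each record is just (molecule A, molecule B, detection method, source database); the rows are made up for illustration and are not actual IntAct records, and `build_network` is a hypothetical helper, not part of any database's API. It collapses the rows into nodes and edges and keeps track of how many distinct detection methods support each pair, which is one simple way to think about confidence.

```python
from collections import defaultdict

# Illustrative rows only (not actual IntAct records):
# (molecule A, molecule B, detection method, source database)
rows = [
    ("TP53", "MDM2", "two hybrid", "IntAct"),
    ("MDM2", "TP53", "anti bait coimmunoprecipitation", "MINT"),
    ("TP53", "MDM2", "two hybrid", "MINT"),
    ("TP53", "EP300", "pull down", "IntAct"),
]

def build_network(rows):
    """Collapse evidence rows into an undirected graph.

    Returns (adjacency, evidence): adjacency maps node -> set of neighbours,
    evidence maps an unordered pair -> set of detection methods seen.
    """
    adjacency = defaultdict(set)
    evidence = defaultdict(set)
    for a, b, method, _source in rows:
        pair = frozenset((a, b))      # A-B and B-A count as the same edge
        adjacency[a].add(b)
        adjacency[b].add(a)
        evidence[pair].add(method)
    return adjacency, evidence

adjacency, evidence = build_network(rows)
for pair, methods in evidence.items():
    print(sorted(pair), "supported by", len(methods), "method(s)")
```

How much weight to give each detection method is a judgment call that this sketch deliberately leaves out; it only counts them.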
So I mentioned earlier that it's important to have ways in which to exchange data. That is because of the different tools that you will potentially use to construct a network, and because there are two ways in which you could do your analysis. One is that you actually create the network, which means you have to go out and fish for and integrate all the interaction data that you want to have, and make it available to one of the visualization tools I'll talk about in a minute. Or you go out there with your data and you try to integrate it into a pre-constructed network that's already out there. And we'll do that through the Reactome FI network. Reactome tries to support both ways. We give you the data so you can create your own network, or we give you the entirety of Reactome in a single network. So those are the two ways in which you can do it, and depending on the way in which you want to analyze your data, you get those options. That's typically how it works with a lot of other resources.
So there are a number of different data exchange languages, as we call them. Systems Biology Markup Language, SBML, is a way in which you can exchange graph data so that you can generate computer models of pathways and interaction networks. Systems Biology Graphical Notation, SBGN, is a graphical language for representing the actual layout of the diagram. Usually when you upload data into one of the visualization tools, you basically get a hairball type thing with nodes everywhere. SBGN takes the layout that's used at that pathway database or network database and converts it, so that when you see it in that data resource and you see it in a tool that's compatible with SBGN, you're basically looking at the same thing. BioPAX is another one of these exchange languages for biological pathway data. It's also useful for the exchange of interaction data as well.
It's compatible with Cytoscape, which you learned about the other day and will be using today. The IMEx consortium developed, excuse me, PSICQUIC, and this PSICQUIC effort has a particular format they call PSI-MI TAB, which is basically a tab-delimited file of interaction data. So it's very similar to that table of data that you just saw earlier, but with more information, more annotations. And you basically download that data. A lot of these data formats can be uploaded into a variety of different tools. As I said, BioPAX can go into Cytoscape, and PSI-MI TAB can be uploaded into Cytoscape as well. SBML can be uploaded into CellDesigner, and SBGN can be uploaded into BioLayout. These are just examples of some of the tools.
Some other tools. I mean, we're going to use Cytoscape today. I would say it's probably the most popular in bioinformatics. It's probably the most relevant in terms of analysis and visualization. Plus, they've got the App Store, sorry, the Cytoscape App Store, let's get that one right, which basically allows you to download and install plugins very easily. In the good old days, when you used Cytoscape, you didn't have that feature. So there were some interesting times when you were trying to install plugins and your only option was copying and pasting folders into other folders to try and get it to work. NAViGaTOR is out there. It's actually quite a powerful tool for looking at two- and three-dimensional visualizations of networks. There's a variety of rich algorithms there for layout. I would actually say that it can potentially hold larger data sets than Cytoscape typically can, but I still think there's value in using Cytoscape. And then VisANT is just another resource. It's probably one of the earlier tools still going that allows you to manipulate metabolic interaction networks.
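Coming back to the PSI-MI TAB format for a moment, here is a minimal Python sketch of reading it into an edge list. It assumes the 15-column MITAB 2.5 layout; the column names in `MITAB25_COLUMNS` are shorthand chosen for this example, the single record in `SAMPLE` is made up for illustration, and `parse_mitab` and `edge_list` are hypothetical helpers rather than functions from any existing library.

```python
# Shorthand names for the 15 columns of the PSI-MI TAB 2.5 layout.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_method", "first_author", "publication_ids",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_ids", "confidence",
]

# One made-up record for illustration; "-" marks an empty field.
SAMPLE = "\t".join([
    "uniprotkb:P04637", "uniprotkb:Q00987", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "-", "-",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', "-", "-", "-",
]) + "\n"

def parse_mitab(text):
    """Turn tab-delimited MITAB text into a list of dicts, one per interaction."""
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):   # skip blank lines and header comments
            continue
        fields = line.split("\t")
        records.append(dict(zip(MITAB25_COLUMNS, fields)))
    return records

def edge_list(records):
    """Reduce records to (A, B) identifier pairs with the database prefix dropped."""
    return [
        (r["id_a"].split(":", 1)[-1], r["id_b"].split(":", 1)[-1])
        for r in records
    ]

records = parse_mitab(SAMPLE)
print(edge_list(records))
print(records[0]["detection_method"])
```

In practice you would load a downloaded file straight into a tool such as Cytoscape; parsing it yourself is mainly useful when you want to filter by detection method or source database first.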
When we were first playing with interaction networks, there were tools like Pajek, tools that were developed for looking at social networks and other engineering networks. These tools are still applicable nowadays, but most people are using ones where you can integrate your own data or link out to other data sources. So again, I think Cytoscape is the best tool for you. Plus it's another open source tool, so you can play around with it a lot.
So several years ago, Khatri published this great little review on pathway analysis workflows. In fact, Yuri was talking about his own publication, which, I have to be honest, I wasn't aware of. So I will look at that and maybe update this slide. But I took this figure from their paper because it really nicely describes all the different analytical workflows for pathway analysis. Your starting point is your own input data. And the gene set, or the bucket of data that you're going to compare your data with, is sourced from a pathway database. Basically, the first-generation analysis is over-representation analysis, still used today. It's a great tool. I think where it works better now is when you don't just get a p-value at the end of the results implying the significance of an individual pathway; if you can actually get an FDR, where you've done some multiple testing correction, that's a much better outcome. So we've taken it one step further and improved upon that method. It's probably the most applicable method that's being used out there. If you go to the Reactome website and you analyze data, you're using ORA. Okay, now I'm bragging a little bit, but we do over a million analyses a year. So that's like 50, 60,000 analyses on a monthly basis. Unique datasets. That's the other point.
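As a rough illustration of over-representation analysis with multiple testing correction, here is a minimal Python sketch using only the standard library. It assumes a one-sided hypergeometric test and Benjamini-Hochberg adjustment; this is a common choice for ORA and is not claimed to be the exact procedure any particular website runs. The pathway names, gene names and universe size are made up, and every gene in the list is assumed to belong to the universe.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K are in the pathway."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (FDR) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def over_representation(gene_list, pathways, universe_size):
    """Score each pathway; returns (name, p-value, FDR) sorted by p-value."""
    genes = set(gene_list)
    names = list(pathways)
    pvals = []
    for name in names:
        members = pathways[name]
        hits = len(genes & members)
        pvals.append(hypergeom_pvalue(hits, len(genes), len(members), universe_size))
    fdr = benjamini_hochberg(pvals)
    return sorted(zip(names, pvals, fdr), key=lambda row: row[1])

# Toy input: hypothetical pathways and genes, universe of 100 genes.
toy_pathways = {
    "pathway_A": {"g1", "g2", "g3", "g4"},
    "pathway_B": {"g5", "g6", "g7", "g8", "g9", "g10"},
}
toy_genes = ["g1", "g2", "g3", "g5", "g50"]
for name, p, q in over_representation(toy_genes, toy_pathways, universe_size=100):
    print(name, p, q)
```

The output is exactly the kind of ranked list of pathways described above: one row per pathway, with both the raw p-value and the FDR.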
So people are widely using this as a way of analyzing data. It would just be nicer to see more of those publications. One of our challenges in the database world is getting the citations: having the people who are actually using our data and our resources cite us in the publication, and use those images that we have on our website in the publication.
Anyway, the second generation is functional class scoring. This is gene set statistics. The most popular of these is the Gene Set Enrichment Analysis tool, again very widely used. And then finally, the third generation takes into consideration the topological structure of your network to try and infer an impact on the pathway based on changes in variants, or loss of an interaction, or perturbation by a drug. With all three approaches, the usual outcome is a visual that allows you to understand what your data is doing, and then a list of significant pathways. Now, if you were to substitute the word pathway in all these cases with the word network, you're pretty much doing the same kind of analyses. The same approaches apply, albeit as we go into pathway modeling, which is this pathway topology, there are going to be different algorithms and different approaches. But that's basically the workspace.
Now Yuri again presented this slide. Again, it's taken from that paper that my boss and Yuri wrote. Looking at the same types of approaches, the Reactome Functional Interaction network that you'll be using later today falls into category number two. We can also do some aspects of pathway modeling with the Reactome FI network, but I won't talk too much about that just because, one, it's a Saturday, it's kind of crazy.
Plus, we're still working on some of the tools there. But I think that's the more exciting area, because that allows us to analyze multiomics data. So basically, these are the questions we ask. I'm not going to go through this. These are some of the issues with analyzing data in a pathway-based approach. I talked earlier about a hierarchy. The top level is kind of like Gene Ontology: at the top you have signal transduction, then you have a variety of sub-pathways underneath. So when you do your analysis, one gene is going to hit on multiple pathways at different levels. In the network approach, your gene is only going to exist in that network once. So flattening the hierarchy into a system-wide network might sometimes be better. How do we handle crosstalk? As I said, in pathway analysis, your gene is going to hit multiple pathways. When you use the interaction network, that gene is only going to appear once in the network. There are also challenges in integrating multiomics data sets into pathway analyses. In the good old days, people just had a gene list. Nowadays you've got copy number variation, methylation data, somatic mutations. How do you integrate that all together? And then, how do we use some of those topological structures? And if you have drug data as well, how does that have an impact on the pathway? And can we then actually perform pathway simulations? That's where I say it's rapidly evolving. And I think that's the more exciting area to be in.
So network-based approaches are, I think, in some ways, if you've got large data sets, sometimes better than simply looking at an individual pathway approach, because you've got better coverage of the genes. They're typically protein-protein interaction networks, but they can also be gene-gene interaction networks, or, what was I going to say, chemical genomics, where you have interactions between genes and small molecules or genes and drugs. The other approach you could look at is taking a modular approach. We'll talk more about that in a minute, where you're trying to identify these topologically unlikely configurations within the network. This is useful because you can then annotate those tightly connected genes with pathway annotations, so there's that interplay between the network analysis and having pathway information. You can use that to identify gene signatures and biomarkers. And when you have access to clinical data, you can start looking to see if there's a relationship between cancer drivers and mutations. And you can start labeling these modules with disease annotations as well, which is quite interesting. So de novo subnetwork construction is the approach that the Reactome FI takes. You take your list of genes, proteins or RNAs. You apply that to a larger biological network. You can identify these topologically unlikely configurations, which essentially are a subset of altered genes that are more closely connected in the network than you would expect by chance. You can extract these clusters. When I say extract, I mean that inside Cytoscape you can move them around a little bit, so it's better in terms of visualization. And then you can annotate these clusters with Gene Ontology annotations or pathway annotations. So network clustering is this approach where you're trying to group objects together. Call them sets, clusters, communities, modules; there are different names.
So each cluster basically consists of elements that have something in common or are similar in some way. There's a variety of different network clustering algorithms out there. I'll try to describe a few in simple terms on the next slide. I don't want to show you crazy equations or any math because I'll lose you at that point. I've been trying to think of nice ways to describe these clustering algorithms, and it's not easy. But what I can tell you is that they work, because they've been used not just in the biological setting but also in social networks and other fields where people need to look at information as a network. So this is the Girvan-Newman method. It's the one that we use in the Reactome FI network, and it's very good at identifying these modular connections. The analogy I try to use is this: you're in a bar in Japan and you order a whiskey, and the bartender puts this nice rock of hand-chiseled ice in your glass and pours the whiskey over it. That's the kind of analogy it is. It's a craft. He's given a block of ice, he chisels away at it, and he produces this lovely representation of Mount Fuji in the glass for your whiskey. I was trying to think of a better way to describe it, but it is literally like that. You have this huge network with all these interactions, and the algorithm is basically chipping away at the bad bits. So, for example, you might have a node which is very highly connected, and it's basically sucking the air out of the room, okay? You don't necessarily want to focus in on that. You take that out of your network, and it could well be an important protein. There could be some aspects where it's important, but it's maybe not the focus.
You're looking at interactions where there are certain hops between this side and that side of the network. I'm probably not explaining this well. And the way that Girvan-Newman is different from Markov clustering is, they try to use heat diffusion models, sorry, heat diffusion is HotNet. This is where I get confused. HotNet is like a metal lattice of interactions, like a grill, and you heat it up. When you heat things up in one area where it's highly connected, you're going to get a hot spot. That's what you're looking for: where you've got more interactions, that's a hot spot in your lattice. And I'm trying to think about how to describe MCL. See, there's no simple way to describe these algorithms to you, other than to say that they do actually work, and they really do work with biological data. I think you need to actually use them to understand them. The thing is, this is really what you want to see at the end of the day. You don't want to see the hairball. You want to see discrete clusters where genes are more tightly connected within a cluster than the individual connections between the clusters. That's what you want to see. That's the outcome. Okay, let's move away from the algorithms then. Here you've got an output of a network clustering, with six clusters. Cluster six only has two interactions. You might want to remove that from your analysis, unless that's the key gene that you're looking for, and then you tell your boss, well, maybe that's not the gene we should be looking at. Maybe it's these five other clusters that we should be looking at. And the clusters are mutually exclusive. That is to say, no nodes are shared between the individual clusters; they're colored differently just as a visual representation.
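For those who would rather see the chiseling than the whiskey, here is a rough sketch of the Girvan-Newman idea: repeatedly remove the edge that the most shortest paths run through, until the network falls apart into the number of clusters you asked for. This is only an illustration under simplifying assumptions, not the code used in the Reactome FI tool: the graph is an unweighted, undirected adjacency dictionary with no self-loops, the stopping rule is a fixed cluster count (the published method chooses the split by modularity), and the function names and toy graph are made up.

```python
from collections import deque

def edge_betweenness(graph):
    """Brandes-style edge betweenness for an unweighted, undirected graph.
    Each node pair is counted from both ends, which does not change the ranking."""
    score = {frozenset((u, v)): 0.0 for u in graph for v in graph[u]}
    for source in graph:
        # Breadth-first search: count shortest paths and record predecessors.
        dist = {source: 0}
        sigma = {node: 0.0 for node in graph}
        sigma[source] = 1.0
        preds = {node: [] for node in graph}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, crediting edges on shortest paths.
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += credit
                delta[v] += credit
    return score

def components(graph):
    """Connected components as a list of node sets."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(graph[node] - comp)
        seen |= comp
        result.append(comp)
    return result

def girvan_newman(graph, n_clusters):
    """Remove highest-betweenness edges until n_clusters components remain."""
    g = {node: set(nbrs) for node, nbrs in graph.items()}  # work on a copy
    while len(components(g)) < n_clusters:
        score = edge_betweenness(g)
        if not score:
            break  # no edges left to remove
        u, v = max(score, key=score.get)
        g[u].discard(v)
        g[v].discard(u)
    return components(g)

if __name__ == "__main__":
    # Hypothetical toy network: two triangles joined by a single bridge C-D.
    toy = {
        "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
        "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
    }
    print(girvan_newman(toy, 2))
```

The clusters that come out are mutually exclusive, as on the slide: every node ends up in exactly one component.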
But once you've clustered them, you can start labeling these clusters with pathway annotations or Gene Ontology annotations, because the assumption is that genes that are tightly connected together are probably involved in similar biological functions. So the Reactome FI network is a tool that we've developed that exploits a lot of these different algorithms. And I apologize, this is used mainly for analyzing cancer data sets, but there are examples in the published literature where people have taken diabetes data or cardiovascular data and analyzed data sets there as well. So the assumption is that no single mutated gene is necessary and sufficient to cause cancer. Typically, you have a handful of common mutations, plus many hundreds or thousands of rare mutations. So analyzing those mutated genes in a network context will reveal the relationships between these genes. You can potentially elucidate a mechanism of action. And you can facilitate some form of hypothesis generation on the roles of these genes in a disease phenotype. Likewise, it could go the other way: you're not necessarily generating a hypothesis, you could be using it to support something you've already seen in another experimental approach. So basically, like the pathway analysis I described earlier, you're reducing hundreds of mutated genes down to a dozen or so mutated pathways. And we'll do this in the lab in probably another 15 or 20 minutes. So I want to explain to you how this network was created. As I mentioned before, when you incorporate both pathway and network information into the same network view, you can get these large, useful networks. So what we've done is we've broken down all the pathway reactions into pairwise relationships. So this is just an example of a reaction here on the left.
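The step of breaking a reaction down into pairwise relationships can be pictured with a very small sketch. This is a simplification and not how the Reactome FI network is actually built: it assumes every pair of gene products taking part in the same reaction becomes one candidate functional interaction, and it simply records which source supports each pair. The function names, source labels and gene names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def reaction_to_fis(participants):
    """All unordered gene pairs from one reaction, as sorted tuples."""
    genes = sorted(set(participants))
    return set(combinations(genes, 2))

def collect_support(reactions_by_source):
    """Map each pairwise interaction to the set of sources that contain it."""
    support = defaultdict(set)
    for source, reactions in reactions_by_source.items():
        for participants in reactions:
            for pair in reaction_to_fis(participants):
                support[pair].add(source)
    return dict(support)

if __name__ == "__main__":
    # Hypothetical reactions from two hypothetical pathway sources.
    toy = {
        "source_1": [["GENE_A", "GENE_B", "GENE_C"]],
        "source_2": [["GENE_A", "GENE_B"], ["GENE_C", "GENE_D"]],
    }
    for pair, sources in sorted(collect_support(toy).items()):
        print(pair, sorted(sources))
```

A pair that turns up in more than one source is the kind of reinforcement that makes an interaction more believable.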
And these are all the corresponding functional interactions. Some of these could be direct interactions and some more like associations. But once you start applying that to other data, if you take pathway data from Reactome, and you also take it from Panther, NCI PID and KEGG, you start looking for similar types of pairwise interactions, and that reinforces that the interaction does exist in the cell. And when we take that and compare it with less reliable data from the interaction databases, you can create two classes ultimately. So basically we started with these pathway data sources here on the right. We created a subset of annotated FIs. We then used some of the data here from Reactome to train a naive Bayes classifier to predict this group of functional interactions from these predicted FI data sources. And then we combine these two datasets into this large functional interaction network. We've actually just redone a newer version of this, and it was almost 370,000 interactions. I believe the protein coverage is about 13,000, so it's maybe 60 percent of the human genome. So in this next slide here, I'm showing you how the FI network works. Imagine this was the entire network. Now, the reality is if you looked at the entire network of all the Reactome FIs, it would be like a big ball. You wouldn't see much. So I'm just creating a hypothetical view of it. And these may not be the best colors in this room, but this is purple and red. So these are hits in your gene list, genes that are low expression, high expression, or mutated, not mutated, part of your dataset. And then obviously these genes are connected by some relationships, the interaction data. Now, you're going to see here that there's an area over here. This node is not connected to anything. And there's an area here.
And some down here, which are not connected. So the point is, with the yellow lines, it's nice to see some interactions, but you need to create connections here and here in the network. So what we can do sometimes is add genes that are not necessarily part of your dataset. These are called linkers. These could be, for example, transcription factors. If you've got a gene expression dataset and you're analyzing it, you won't necessarily see an expression change in your transcription factor, but you're looking at the downstream effects of that transcription factor in the dataset. So these triangles could in fact be things like that. So you use these as connectors and then subtract away the part of the network that's not being used. And now you're left with a subnetwork that's based on the data from your experiments. In this case, we've used linkers to provide the connectivity between sparsely connected sections of the network. It's not always necessary to add linkers. When you're using the Reactome FI network tool, what I suggest is to just play around: run it once and see how your network is created. If it's not big enough, try adding linkers and see what happens there. So, getting back to the 127 cancer genes we looked at before, this is a clustered view of the network. Now there are nicely colored modules. You can then perform pathway enrichment analysis, and then you can label them. This is somewhat arbitrary. You have to look at the data. And sometimes it's not obvious; it could be something within the pathway that you should focus on. But it's telling you it's signal transduction. The reason is that a lot of these genes up here on the right probably have components that are shared by a lot of receptor tyrosine kinase signaling.
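Going back to the linker step for a moment, the idea of pulling a subnetwork out of the big network, with or without linkers, can be sketched like this. It is a toy rule and not the selection logic of the Reactome FI tool: here a linker is assumed to be any gene outside your list that touches at least two of your hits. The function name, the rule and the gene names are all assumptions made for the illustration.

```python
def build_subnetwork(network, hits, use_linkers=False):
    """Return the part of `network` spanned by the hit genes.

    network: dict mapping gene -> set of neighbours (undirected, symmetric).
    hits: genes from your own dataset.
    use_linkers: if True, also keep non-hit genes adjacent to two or more hits.
    """
    hits = set(hits) & set(network)  # ignore hits that are not in the network
    nodes = set(hits)
    if use_linkers:
        for gene, neighbours in network.items():
            if gene not in hits and len(neighbours & hits) >= 2:
                nodes.add(gene)
    # Keep only edges whose two ends both survive.
    return {gene: network[gene] & nodes for gene in nodes}

if __name__ == "__main__":
    # Hypothetical network: TF is not in the gene list but connects G1 and G2.
    toy_network = {
        "G1": {"G3", "TF"}, "G2": {"TF"}, "G3": {"G1"},
        "TF": {"G1", "G2"}, "X": set(),
    }
    print(build_subnetwork(toy_network, ["G1", "G2", "G3"]))
    print(build_subnetwork(toy_network, ["G1", "G2", "G3"], use_linkers=True))
```

Without linkers, G2 sits on its own; with linkers, the transcription-factor-like node TF pulls it into the subnetwork, which is the effect the triangles have on the slide.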
And likewise here, I used to work on Notch signaling, and I know that there's this connectivity between Notch, Wnt and TGF-beta signaling. So that sort of makes sense. And then down here, there's TP53 signaling, or effectors of TP53. And then finally here on the right, sorry, the left, is cell cycle. So once you have those modules identified, you perform pathway enrichment analysis and you potentially identify annotations that you can label these modules with. Another approach is combining the network with gene expression data to possibly identify a network module related to overall patient survival. So one of the modules that was identified in the clustering had annotations from two different data resources. Sorry, I wanted to just check the colors. So orange was NCI PID and green is from, sorry, purple is from Reactome. And there's a component within the Reactome FI tool that allows you to perform either Cox proportional hazards or Kaplan-Meier analysis. So basically, you're doing survival analysis. And this is one of the Kaplan-Meier plots, probability versus survival time. You can see that the tool tries to split the samples up into two groups. The red line represents samples where there's low gene expression, and the green is high. And then it works out, what's the word here, the log-rank test, to see if the distance between these two lines is significant. So basically, the outcome is, sorry, I'm relying a little bit on my notes to remind me of the slide. There we go. So basically, it would imply that patients with low expression of these module genes are doing slightly better than patients that have high expression of these genes.
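The log-rank comparison behind those two curves can be sketched in a few lines. This is a bare-bones illustration, not the survival analysis code inside the Reactome FI tool: it assumes two groups, right-censored data, and the standard log-rank statistic compared against a chi-square distribution with one degree of freedom. Splitting samples at the median module expression is also an assumption, as are the function names.

```python
from math import erfc, sqrt

def split_by_median(expression):
    """Split sample ids into (low, high) groups at the median value."""
    values = sorted(expression.values())
    mid = len(values) // 2
    median = values[mid] if len(values) % 2 else (values[mid - 1] + values[mid]) / 2
    low = [s for s, v in expression.items() if v <= median]
    high = [s for s, v in expression.items() if v > median]
    return low, high

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. events: 1 = death observed, 0 = censored.
    Returns (chi-square statistic, p-value)."""
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in samples if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for s, _, g in samples if s >= t and g == 0)
        at_risk_b = sum(1 for s, _, g in samples if s >= t and g == 1)
        n = at_risk_a + at_risk_b
        deaths_a = sum(1 for s, e, g in samples if s == t and e and g == 0)
        deaths = sum(1 for s, e, _ in samples if s == t and e)
        observed_a += deaths_a
        expected_a += deaths * at_risk_a / n
        if n > 1:
            variance += deaths * (at_risk_a / n) * (at_risk_b / n) * (n - deaths) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = erfc(sqrt(chi2 / 2))  # chi-square survival function, 1 degree of freedom
    return chi2, p_value

if __name__ == "__main__":
    # Hypothetical survival times (months); every event observed.
    low_expr = [10, 11, 12, 13, 14]
    high_expr = [1, 2, 3, 4, 5]
    print(logrank_test(low_expr, [1] * 5, high_expr, [1] * 5))
```

A small p-value says the gap between the two curves is unlikely to be chance; it does not by itself say which group does better, so you still look at the plot.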
So this is just an example of where a single network module could be used to identify a signature to predict cancer patient prognosis. Yes, question? So when we use the Cytoscape tool, you're going to create the networks based on a data set, you can then cluster the network into modules, and you'll annotate those modules. As I said, you've got to look at the gene lists and the annotations to figure out what those genes could be representing in terms of biological function. But once you have a clinical data set, you can integrate that into the tool, and then you create these plots through the survival analysis. You can also overlay drug information onto the Reactome FI network, so you see the interactions between particular cancer drivers and particular therapeutics. So in the last remaining few minutes, I'm going to move through pathway-based modeling. This approach attempts to infer how pathway states are disrupted in disease. It uses both qualitative and quantitative measurements to infer the activities of a variety of different components within the pathway and how that impacts a particular series of events, in these examples in cancer biology. This is particularly useful when you have not just a simple gene list, but you're trying to integrate multiple data sets describing these molecular alterations. And it moves a little bit away from a simple network analysis approach toward more of a systems biology approach. Sometimes you could be looking at one pathway or a smaller network view of things, rather than these big global views of networks. I'll try to explain this a little bit more in some of the other slides.
But there's a variety of different tools that look at different things. There is a tool called CellNetAnalyzer. It's probably one of the older ones. Is anybody familiar with MATLAB? It has a graphical interface and uses a variety of different algorithms to explore the structural and functional properties of not just metabolic, but also signaling and regulatory networks. It's mostly used for metabolomics data and is probably more applicable to people who are interested in metabolic engineering. So if you're trying to make a strain more efficient at metabolizing or producing a particular product, this is probably a way to do that. NetworKIN is kind of an update on NetPhorest. It provides tools for phosphoproteomics and looks specifically at kinase cascades. You can use it to elucidate how particular phosphorylation events are associated with a given phenotype or may have an impact on a disease condition. There's ARACNE, again one of the older tools that still stands out today. It's got a novel algorithm for analyzing microarray gene expression data, and it can scale up to large mammalian regulatory networks. And then finally there's this newer area, probabilistic graphical models. I think this is most applicable to a lot of cancer data sets. There's a particular tool called PARADIGM that was developed by Josh Stuart's lab. It's actually what we've integrated into the Reactome FI network. So they're becoming more widely used in, I would say, cancer data analysis, but also in disease data analysis more generally. We've applied it to cancer network models. The goal there is to be able to integrate different types of omics data into your analysis and ultimately to be able to visualize those impacts on the pathway.
And basically the goal is that once you have access to multiple patient data, you can find significantly impacted pathways for a disease. You can link the activities of individual pathways to patient phenotypes. And if the patients are receiving drug treatments, you can see what effects one drug or multiple drugs have on that particular pathway, and ultimately on the phenotype of the patient. So here, using PARADIGM, this is just a simplified molecular event that you might see in a reaction pathway. But when you're trying to integrate multiple omics data into this graph, you've got to think about creating what they call factor graphs. Each node in this factor graph represents a particular type of molecular data. The gene can describe the mutation or potentially the copy number variation, those are two different components; the RNA can describe the microarray data; for the protein, you can use mass spec data; and if you've got active protein, you could consider phosphoproteomics as well. So with traditional analyses, you're taking one gene list and just applying that to your analysis. Here, you're taking all of the data as separate data sets and feeding that into the algorithm. And this is where I start banging my head off the wall, because trying to explain that algorithm is, again, really painful. So I'm trying to use the examples here visually to describe how this tool would work. We have a simplified gene expression regulatory pathway here where you have CTGF and NPPA. These are two transcription factors that regulate cell proliferation. There are some upstream components here that regulate these two transcription factors. So the expression of these two proteins is controlled by these proteins here on the left.
I'm just going to continue this in the simplified pathway view, but if you were to create a factor graph to run the analysis and do the pathway inferences, you could ask questions like: if YAP1 copy number is high, is CTGF expression high? Or, if NPPA activity is high, how likely is it that WWTR1 expression is up or RUNX2 expression is down? These are the things that you can potentially do when you have all that expression data and copy number variation data: you feed it into the model and you ask relevant biological questions. Here's just another example of it. This is where it gets a little more complicated, and I apologize to those of you who are red-green color blind. I'm one of them, so I'm looking at a green screen right now on the left. But this is an example where copy number variation data and expression data from ovarian samples have been integrated into the factor graphs using the PARADIGM approach. After the inference is performed, the results with and without copy number variation are visualized on the left here. And then on the right you're seeing the observed data, the copy number variation data and the expression data. So what you're interested in seeing here is that in the first sample, NPPA in fact has lower expression. It's the green here; in the second sample, it's a darker green here. And this is most likely because the WWTR1 CNV here is 2, and here in the first sample it is 1. So you're seeing upregulation of an upstream component here, which is impacting on this protein here. But when you look at it in another patient sample, it's just normal expression. So there's no impact on the pathway in one, but there is an impact on the pathway in the other, based on being able to integrate two different types of data. Now, we've implemented PGM models into the Reactome FI network.
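A question like the one posed above, if YAP1 copy number is high, is CTGF expression high, can be made concrete with a toy factor graph. This is emphatically not PARADIGM and not the model in the Reactome FI tool: it is a three-variable sketch with two states per variable, made-up factor values, and brute-force enumeration instead of a real inference engine. The variable names and every number in the factors are assumptions chosen only to show the mechanics.

```python
from itertools import product

# Each variable is 0 (normal) or 1 (high). Names are illustrative only.
VARIABLES = ["yap1_cnv", "yap1_expr", "ctgf_expr"]

def prior_cnv(a):
    """Made-up prior: copy number gain is the less common state."""
    return 0.2 if a["yap1_cnv"] else 0.8

def cnv_to_expr(a):
    """Made-up factor: a copy number gain makes high expression more likely."""
    p_high = 0.8 if a["yap1_cnv"] else 0.2
    return p_high if a["yap1_expr"] else 1 - p_high

def regulator_to_target(a):
    """Made-up factor: high regulator expression tends to switch the target on."""
    p_high = 0.9 if a["yap1_expr"] else 0.1
    return p_high if a["ctgf_expr"] else 1 - p_high

FACTORS = [prior_cnv, cnv_to_expr, regulator_to_target]

def joint_weight(assignment):
    """Product of all factors for one complete assignment of the variables."""
    weight = 1.0
    for factor in FACTORS:
        weight *= factor(assignment)
    return weight

def conditional(query, evidence):
    """P(query | evidence) by enumerating every assignment (fine for toy sizes)."""
    numerator = denominator = 0.0
    for values in product([0, 1], repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        weight = joint_weight(assignment)
        denominator += weight
        if all(assignment[k] == v for k, v in query.items()):
            numerator += weight
    return numerator / denominator

if __name__ == "__main__":
    print(conditional({"ctgf_expr": 1}, {"yap1_cnv": 1}))
    print(conditional({"yap1_cnv": 1}, {"ctgf_expr": 1}))
```

The second call runs the question backwards, from an observed expression change to the likely copy number state, which is the kind of inference that becomes possible once each data type has its own node in the graph.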
We have also got a beta version of the Boolean network modeling approaches in there. We're working on other differential equation models as well. So the idea is you can potentially throw different types of data at Reactome pathways. If you go through the website, you can just do regular pathway analysis. If you come through the Reactome FI network tools, you can do network analysis as well as pathway analysis. And when you get into that network analysis mode, you've got different options. You can do simple analyses or you can go for the more complex ones: probabilistic graphical models, Boolean networks, ODEs. All right. So in the next slides, I highlight some of the resources out there for the pathway and network databases, and all the different analysis tools that are available.