All right, so hi there, I'm Robin Hall, and I'm going to continue today by talking a little bit more about pathways and networks. Basically, I'm going to expand on what Veronica and Ruth introduced to you today. I may repeat a few aspects of that, and that's just to reinforce those ideas. I should start by saying that these slides are based on resources created by myself and others, particularly Veronica, Gary, and some of the EBI training resources as well.
Our learning objectives for this lecture are as follows: to further understand the principles of pathway and network analysis; to talk a little bit more about the sources of pathway and network data; to talk about some of the different analytical approaches; and to introduce, as a use case, the Reactome Functional Interaction network, or the FI network for short.
I'm aware that Gary might have explained this a little bit yesterday, but this is my definition of what pathway and network analysis is. Even after 10 years of doing these workshops, I think these statements still fit. I might adjust them ever so slightly, but essentially it's an analytical technique that makes use of biological molecular network or pathway information to further an understanding of biological systems. The scale is rapidly evolving, I think, with the changes in high-throughput data capture. There's always an application for pathway and network analysis, and there are many different approaches available. I'll talk about some of them in a few minutes.
Why do we do this type of analysis? I think pathways and networks are very intuitive to scientists.
They provide a rather useful display for biological data, and I would argue chemical data as well, because a lot of the omics data sets now are multi-omics: we're looking at genes, proteins, small molecules, drugs, chemicals, and other environmental toxins. So there's a whole host of data that you can display on pathways and networks. Veronica and Gary were talking about how useful these types of analysis are for increasing statistical power by reducing the number of hypotheses tested. For a lot of the tools out there, there are ways to automate the analysis when you're processing large data sets, whether you're using APIs or features built into desktop software. This can certainly make these analytical approaches much more efficient.
Pathway and network analysis also satisfies a number of common use cases in biological research: identifying hidden patterns within your gene lists, and explaining experimental observations. We can predict the function of unannotated or understudied proteins, and I'll talk later about another project we're working on in Reactome that best illustrates this. Pathway and network graphs are also very useful in establishing a framework for quantitative modeling and in assisting with the identification of molecular signatures.
Here's a real-world example. Several groups that make up part of The Cancer Genome Atlas project identified 127 genes, which they classified as cancer driver genes based on mutation frequency. If you look at these genes as a list, we don't really know what these 127 genes are doing and why these mutations may be causing cancer. Pathway and network analysis tries to relate these genes to pathway and network annotations and other functional interactions.
I'll talk a bit more about this example with some of the tools that I'm going to demonstrate later.
Let's take a moment to remind ourselves what a pathway is and what a network is. I'll further define some of the characteristics of pathways and networks in subsequent slides. Essentially, a biological pathway is a series of actions among the molecules within the cell that leads to a certain product or a change in the cell. There are metabolic pathways, which are very much like the chemical reactions that occur within our bodies (the conversion of glucose to energy, for example); signal transduction pathways, which move a signal from the cell's exterior to its interior; and gene regulation pathways, which turn genes on and off. With a pathway, you classically think of there being a start point and an endpoint, top to bottom, as you can see on the left side. Networks don't necessarily have a start or an endpoint. Some people think that pathways have no real boundaries, and pathways often work together to accomplish different tasks. When biological pathways interact with one another, they form a biological network.
Researchers are able to learn a lot about human disease from studying biological pathways and networks. Identifying which genes, proteins, and other molecules are involved in a biological pathway or network can provide clues about what goes wrong when disease strikes.
Pathways have the following advantages. They're usually curated: that is to say, a researcher reads a paper, identifies the knowledge within that paper, and translates it into a computable form. So it's essentially data entry; that data becomes available through a database, and the resources are available online.
Typically, pathways are a biochemical view of a biological process, so you can capture cause and effect. And there's traditionally some form of human-interpretable visualization, either a textbook-like illustration or a more technical network view, which I'll describe shortly.
Sorry, can I interrupt here? Could you give a quick definition of what curation is?
Yes, as a former member of the international curation community and host of a curation conference. Yes, that's very true, thanks Francis for the reminder: co-host of it. Francis and I worked together to host that biocuration conference a number of years ago. Essentially, a biocurator is an individual who usually has postgraduate qualifications. They typically have a biology background, or they might have a background in chemistry or some other science discipline. Over the years they've become very experienced in reading papers and extracting information from those papers. Basically, what a curator does is read papers, identify the information, and capture it in a database, typically through a web form or some other tool. A peer curator will also review that information for other curators, which is a kind of peer review. Some other organizations have an external peer review system, so that knowledge is reviewed very much like a paper publication. And that data is made available online, either for download or through different web tools. Does that help answer that?
Yeah, I think it's important to state that there is such a job out there, with people with PhDs doing this kind of work. So that's another avenue for people to think about. And a good thing to do is to give an example of a specific database and how that varies from one database to another. Reactome, which we're going to hear about today, recruits many biocurators of different types. Some are full time, some are part time.
Right.
And likewise, I used to work at GenBank at NCBI, and I had about 20 to 25 curators there, and about 95% of them were PhD-level scientists who worked on the mechanics. They're reviewing and curating, and there are a lot of things to worry about, like identifiers and understanding the data model of the database, but also ensuring that there's a human biological review going on in the process. So I think that's an important part of databases in general, and of biological databases, be it pathway or sequence or chemical: there's this human review that often takes place that people don't think about very much.
No, I think it's a very good point. I do come to that further down in my talk, but I will make it now and repeat it again later: manual curation is human intensive. It does take time, but if it's done right and consistently, it typically creates a kind of gold-standard data set.
Which, from an experimental biologist's perspective, is presumed: most experimental biologists who use databases assume that this takes place. It does, but it's not always the case.
No, exactly. You have to be careful about that, and I will touch upon this later. But thanks for bringing that up, Francis. There are common disadvantages of pathway databases.
I just mentioned one there: curation is time consuming. The focus is also rather smaller scale, and in terms of the curation there can be sparse coverage of the genome. Different pathway databases also disagree on the boundaries of pathways. Some of these disadvantages are addressed by using networks.
Continuing along, the gold-standard pathway databases out there are Reactome, which obviously I work for, but also KEGG and WikiPathways. These are reaction-based pathway databases, where the unit of the pathway is essentially the reaction: each node represents a biomolecule, and each edge represents the conversion of one or more biomolecules into another via a reaction. We can use multiple entity databases to reference and describe these different molecules, and we can use things like Gene Ontology terms and PubMed citations to explicitly describe key elements of that reaction. These reactions then become modules, like building blocks that fit together like jigsaw pieces to build the pathway.
One other good pathway resource out there is KEGG, which has a broader focus on other types of knowledge capture. It is actually a collection of biological information. It's a manually curated resource, but it doesn't just focus on pathways. The KEGG pathways themselves contain information on genes, proteins, molecular interactions, and other reactions associated with multiple organisms. It's probably one of the few databases that captures a whole host of species; Reactome and WikiPathways are a little more focused. So this raises a question: how can you manually curate hundreds of pathways across hundreds of species? And the answer is, you don't. You can't.
So you focus on a handful of key organisms, which could be bacteria or human or one of the key model organisms, which you manually curate, and you use a process called orthology prediction. That's based on the fact that genes and proteins are conserved across species. You build up a model based on conservation: you take, say, your human pathways and project that information, based on orthology data, into another species to create a predicted set of pathways.
Sorry, Robert. The KEGG database, which is out of Kyoto for those people that don't know: it used to be that it was not really good at separating organisms, and would mix up multiple organisms in one pathway. Does it still do that?
They still have a kind of reference pathway, which could be considered a kind of hybrid pathway. When you're looking at pathway databases, we always talk about the evidence, the source of the interaction. For example, in Reactome, even though we're talking about a human pathway, certain reactions have only ever been demonstrated in mouse. We can infer that the chances are that the mouse reaction does occur within the human cell. But we have to say: look, this is where the evidence lies, and we will make a prediction, a strong prediction I would say, and we will clearly label that. So when you're navigating through the human pathway, you'll see that information. It's not as clear, I think, in KEGG: the distinction between what is human and what is mouse and what is rat, for example.
Anyway, they do provide these nice, simple pathway diagrams. Green boxes represent the proteins; the white boxes represent the genes. You can have encapsulated pathways, that is, pathways that are embedded within, say, your MAPK signaling pathway.
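As an aside on the orthology-based projection described above, here is a minimal sketch of the idea in Python. Everything in it is an assumption made for illustration: the reaction names, the tiny ortholog table, and especially the rule that a reaction is projected only when every participant has an ortholog. The talk does not say which rule any particular database applies.

```python
def project_pathway(human_reactions, orthologs):
    """Project curated human reactions onto another species.

    Illustrative rule (an assumption, not any database's actual rule):
    a reaction is inferred only if every participating protein has an
    ortholog in the target species.
    """
    projected = {}
    for reaction, proteins in human_reactions.items():
        if all(p in orthologs for p in proteins):
            projected[reaction] = [orthologs[p] for p in proteins]
    return projected


# Hypothetical curated human reactions: reaction id -> participating proteins
human_reactions = {
    "R1": ["EGFR", "GRB2"],
    "R2": ["GRB2", "SOS1"],
    "R3": ["TP53", "MDM2"],
}

# Hypothetical human -> mouse ortholog table (MDM2 deliberately left out)
orthologs = {"EGFR": "Egfr", "GRB2": "Grb2", "SOS1": "Sos1", "TP53": "Trp53"}

mouse_reactions = project_pathway(human_reactions, orthologs)
print(mouse_reactions)
```

In this toy case R3 is not projected, because one of its participants has no ortholog in the table. A real resource would also record that the projected reactions are predictions rather than curated evidence.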
And then the lines here provide different levels of representation of the reactions that link different entities together.
Reactome, on the other hand, unlike KEGG, which is licensed, is an open-source, open-access pathway database. We focus on manual curation of human pathways involved in metabolism, signaling, and other biological processes. All that knowledge is traceable back to the primary literature, and we provide tools for data analysis and visualization. Here's an example. This is the pathway browser that you use to navigate through biological pathways in Reactome. It also allows you to analyze experimental data, whether that's a gene, protein, or small molecule list. You see the pathway hierarchy here on the left, which lists all the different pathways and molecular events in Reactome. As you interact with it, you see these textbook-like pathway illustrations in the viewport here, and you see other pathway annotation information in the details panel below.
In this example, we're looking at the 127 cancer gene list that I introduced at the start. It's been uploaded into Reactome to perform over-representation analysis, and the results have been overlaid onto the signaling pathway. Anywhere there's a gene hit, you'll see it highlighted, like this particular pathway here, which is integrin signaling. The yellow lines illustrate a hit: a gene from the list has an annotation in this pathway. And the details panel below shows the list of significant pathways. I'll talk a little bit more about this in the lab later this afternoon, or actually later this morning.
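To make the over-representation step concrete, here is a minimal sketch of one common way such a test is computed: an upper-tail hypergeometric probability. The talk does not state which test the Reactome analysis uses, so treat the choice of test as an assumption. The function name and all the numbers (pathway size, background size, number of hits) are hypothetical.

```python
from math import comb


def overrepresentation_pvalue(hits, list_size, pathway_size, background_size):
    """Upper-tail hypergeometric probability P(X >= hits).

    X is the number of pathway genes seen when list_size genes are drawn
    without replacement from a background of background_size genes, of
    which pathway_size belong to the pathway.
    """
    max_hits = min(list_size, pathway_size)
    favourable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(hits, max_hits + 1)
    )
    return favourable / comb(background_size, list_size)


# Hypothetical numbers: 127 submitted genes, 8 of which fall in a
# 150-gene pathway, against a background of 20,000 annotated genes.
p_value = overrepresentation_pvalue(
    hits=8, list_size=127, pathway_size=150, background_size=20000
)
print(f"p = {p_value:.3g}")
```

Because the same test is repeated for every pathway in the database, the raw p-values would normally be corrected for multiple testing before a list of significant pathways is reported.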
Now, as a segue into talking about biological networks: there is one large, really useful repository of pathway and network information out there, called Pathway Commons. If you don't necessarily want to use just the Reactome pathway annotations, it's a really convenient access point to biological pathway knowledge. With the Pathway Commons resource you can search, visualize, and download a whole host of pathway and network information.
So, moving into talking a little bit more about biological networks. Many different types of information can be represented in networks. I think Ruth gave a very good overview of this yesterday. Just to remind you, nodes represent many different types of entities, which can be genes or proteins, and the edges traditionally convey information about the links between these nodes. You have to be aware of the knowledge within the network, particularly when you're performing the analysis, since different algorithms will be applied to different networks, and different types of data will produce different network characteristics in terms of connectivity, complexity, and structure. I'll talk about that in the remaining slides, particularly in the context of protein-protein interaction networks, since they are probably the most commonly described networks out there.
These are graphical representations of the physical contacts between proteins within the cell. Protein-protein interactions are essential to almost every process in the cell, so understanding these interactions is crucial for understanding cell physiology in normal and disease states. Protein interactions can be both transient and stable. Stable interactions are formed in protein complexes.
The ribosome is one example. Transient interactions are what I call brief interactions that may modify or carry a protein, leading to further change. A good example of that would be something like a protein kinase. I think the transient interactions constitute the most dynamic part of the interactome. The interactome is essentially the totality of the protein-protein interactions that can happen within a cell, or in a specific biological context, for example when you're studying the interactome of an organelle.
The development of large-scale protein-protein interaction screening techniques has created a large volume of interaction data that's available through a variety of molecular interaction databases. Traditionally, the first step in performing protein-protein interaction network analysis is, of course, to build the network, and there are plenty of different sources of protein-protein interaction data. You can source that data yourself from your experimental work, and you will choose how that data is represented and stored. You can also extract the information from the literature manually, through your own review of publications. But probably the easiest way is to derive the experimental data from the host of protein-protein interaction databases, because their job, just as we were saying earlier about the curation process, has been to extract those protein-protein interactions from the experimental evidence reported in the literature. I've listed some of the different protein-protein interaction sources that are out there, and they capture a great deal of information about the interactions.
They may differ a little in terms of the quality of the data they capture, the amount of metadata they store, and the species they focus on, whether that's human interactions, mouse interactions, or yeast interactions, for example. Just one second. As a user of interaction data, you have to be aware of where this data comes from, because I think no single interaction database... Excuse me one second. Okay, sorry about that, my son is in class today. Sorry about that interruption. So, I was talking about being aware of the source information.
So is there one database that has everything? You were about to answer that question.
I was going to say that it depends on the organism you're interested in. Databases that I've worked with in the past, which will remain nameless, have usually had a large quantity of data. If you're looking at interaction data, I would say IntAct and MINT have quite a large amount. But there are primary sources of interaction data, and sometimes you have to look at the websites that aggregate interaction data from different sources.
So does IntAct aggregate things from MINT?
IntAct and MINT do aggregate from others. IMEx, which I didn't explain, is an international collaboration of interaction databases to exchange interaction data. The partners are supposed to curate their own data, but also provide data from the other sources as well. I haven't done a comprehensive evaluation of how each of these partners actually aggregates data from the other sources. But some sources may not necessarily focus on human data.
They could be focusing on yeast data, or data from Arabidopsis. So the point I'm trying to make is that it's sometimes important to combine data from multiple sources. You have to consider what type of data they're capturing so that you avoid things like redundancies or inconsistencies within that data. As much as we like to think that the data is correct, and most of the time it is, and we put quality control measures in place to confirm that the information accurately captures the right molecules, there can be some mistakes. It does occur.
Sabri asked an interesting question on Slack about whether you know of any AI projects that are extracting information from publications, from the literature in general, to include in certain databases.
I should know about this, and I can't think of any off the top of my head right now that are AI based. There are text-mining type approaches that I'll talk about in a moment. In fact, it's probably the next slide, if I can move on to that right now. There are resources out there that use text mining as a way to extract knowledge from papers automatically. Are they looking at deep learning techniques? I would say yes. Off the top of my head I can't think of any resources specifically doing this right now, although I should be able to. I could look into that. I think text mining, plus whatever machine learning tools become available, will definitely support and help biocuration activities across the board.
Absolutely. I'm trying to think of one specifically for pathways, and the answer is probably yes, there is, but it's not yet available.
I know at the Chan Zuckerberg Initiative there's a resource called Meta that is mining data from publications and making that data available. Also, within Gary's group, there's a project called Factoid. It's AI-assisted, meaning that you give the person who wrote the publication the interactions and have them curate them for you. It's a mixture of AI and involvement by the author. So there are different types of tools that people are trying to develop.
Perfect. Thanks, Ruth. And there are different organizations out there that are using deep learning and deep mining techniques to capture different types of information generated from high-throughput experiments and/or published in the literature. They're looking particularly right now at variant annotations and the relationship between those variants and particular interactions. So I think AI tools are certainly going to be very important for the future.
Anyway, we have to move on.
Yeah, we're going to move on here. I think we've covered this area: with the way that we generate network data, or the data that becomes part of these protein interaction databases, you have to be aware of the experimental approaches and methods that are used, because these can point to limitations in the availability of some protein-protein interaction data and in how much weight you should give to trusting that those interactions actually occur within the particular system you're studying. There are limitations in these detection methods as to whether these are truly physiological interactions that occur within the cell, and there are times when these large-scale experiments do generate false positives and false negatives in their results.
And there's always a question, when you perform these kinds of experiments, as to whether the in vitro experiment mirrors what occurs in the in vivo interaction.
So, moving along, we'll talk a little bit about some of the principles of network analysis. One of the key principles in working with the complexity of the network is to extract useful information that you would not otherwise have learned by studying the individual components. Analyzing the topological features of a network is useful for identifying relevant participants and substructures that may have some biological significance. The topological properties can be computed for the whole network, or for individual nodes, edges, or different parts of the network, and there are different strategies we can use to do this.
There are also some principles that help us understand the topology of networks and how the analytical approaches work. One is called the small-world effect. I think Ruth introduced this a little bit yesterday, and this is my slightly different take on it. It's basically that the maximum number of steps separating any two nodes is typically small, no matter how big the network is. This is the whole notion of six degrees of separation. The reason we study this is that this level of connectivity allows for efficient and quick flow of information within a biological network. But it does pose an interesting question: if a network is so tightly connected, why don't perturbations in a single gene or protein have more dramatic consequences for that network? Biological systems are extremely robust and can cope with a relatively high degree of perturbation in single genes or proteins.
In order to explain this, we need to look at another fundamental property of protein-protein interaction networks: they are what we call scale-free networks. The majority of nodes in a scale-free network have only a few connections to other nodes, whereas some nodes, the hubs, are connected to many other nodes in the network. What this means is that when failures occur randomly within the network, the vast majority of proteins have a small degree of connectivity, so the likelihood that a hub would be affected is small. Say, for example, we lost this node here: you would lose this interaction, but this major hub here would still have all these other interactions, and the rest of the network would be unaffected. Even if you start losing some of these larger hubs, say you lost this one, generally the network will not lose its connectedness, because all these other hubs are still performing their interactions. So this might not necessarily affect the functionality of the remainder of the network. But when you see major failures in other hubs, then you're going to see the appearance of more isolated graphs: this may become an isolated graph over here, and this, and this, and so forth, as you start losing these interactions. Essentially, what we find is that these hubs are enriched with essential or lethal genes. For example, cancer-linked proteins such as p53, EGFR, or PTEN are hub proteins.
Another concept to think about is the path. A path is basically a sequence of connections that occur within the graph.
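To make the robustness argument from the scale-free discussion concrete, here is a minimal Python sketch that removes either a low-degree node or a hub from a toy network and counts what is left connected. The network is hypothetical (three linked hubs with five low-degree partners each) and far too small to be genuinely scale-free; it only illustrates the contrast between the two kinds of failure.

```python
from collections import deque


def build_toy_network():
    """Three connected hubs, each with five low-degree partners (hypothetical)."""
    adj = {}

    def add_edge(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    add_edge("H1", "H2")
    add_edge("H2", "H3")
    for hub in ("H1", "H2", "H3"):
        for i in range(5):
            add_edge(hub, f"{hub}_p{i}")
    return adj


def remove_node(adj, victim):
    """Return a copy of the network with one node and its edges deleted."""
    return {n: nbrs - {victim} for n, nbrs in adj.items() if n != victim}


def component_sizes(adj):
    """Sizes of the connected components, largest first (breadth-first search)."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)


network = build_toy_network()
after_leaf_loss = component_sizes(remove_node(network, "H1_p0"))
after_hub_loss = component_sizes(remove_node(network, "H2"))
print("lose a low-degree node:", after_leaf_loss)
print("lose a hub:", after_hub_loss)
```

Losing the low-degree node leaves a single connected component of 17 nodes, whereas losing the central hub splits the toy network into seven pieces, five of them isolated nodes.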
And then the distance between two nodes is defined as the number of edges along the shortest path connecting them. You can see here in the middle, there's one hop to this node here, two hops to this node here, and so on. We can use these distances and connectivities as a means to measure centrality. This was initially developed for social network analysis. What centrality gives is an estimate of how important a node or an edge is for connectivity and for information flow through the network, and there are different metrics we can use to calculate it.
One of them is degree. Degree describes local centrality and doesn't take the rest of the network into consideration, and the importance we give to the degree value depends strongly on the size of the network. I'll use this graph here in a moment to describe each of these terms. There are also more global centrality measures that look at the whole network. One of them is called betweenness, or betweenness centrality. Here the question is whether a node lies on the shortest paths between other nodes, and on how many of those shortest paths. Such nodes are powerful in the sense that they are needed for information to be conveyed between nodes. I'll show this in a moment on this slide. The other one is closeness. This is measured by how close a node is to the other nodes, and it's useful for estimating the flow of information from a given node to the other nodes in the network.
So let's look at everything in the context of the blue node here. Degree tells me about the immediate connections: these are the local nodes.
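The three centrality measures just defined can be sketched on a toy graph. This is a minimal illustration, not the slide's actual network: the graph (two triangles joined through a node called "blue") is hypothetical, the closeness formula used is the common "(n - 1) divided by total distance" normalization, and the betweenness is an unnormalized brute-force count that assumes a connected, undirected graph.

```python
from collections import deque
from itertools import combinations


def shortest_paths(adj, source):
    """BFS from source: hop distance and number of shortest paths to each node."""
    dist, count = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                count[nb] = 0
                queue.append(nb)
            if dist[nb] == dist[node] + 1:
                count[nb] += count[node]
    return dist, count


def degree(adj, node):
    """Local centrality: number of direct neighbours."""
    return len(adj[node])


def closeness(adj, node):
    """(n - 1) divided by the summed distance to all other nodes."""
    dist, _ = shortest_paths(adj, node)
    return (len(adj) - 1) / sum(dist.values())


def betweenness(adj, node):
    """Sum over node pairs of the fraction of shortest paths passing through node."""
    dist_v, count_v = shortest_paths(adj, node)
    others = [n for n in adj if n != node]
    total = 0.0
    for s, t in combinations(others, 2):
        dist_s, count_s = shortest_paths(adj, s)
        if dist_s[node] + dist_v[t] == dist_s[t]:
            total += count_s[node] * count_v[t] / count_s[t]
    return total


# Hypothetical graph: a left triangle and a right triangle joined via "blue"
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "blue"),
         ("blue", "D"), ("D", "E"), ("D", "F"), ("E", "F")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

print("degree:", degree(adj, "blue"))
print("closeness:", closeness(adj, "blue"))
print("betweenness:", betweenness(adj, "blue"))
```

In this toy graph the "blue" node has a low degree (2) but sits on every shortest path between the left and right sides, which is the situation where degree and betweenness disagree about how important a node is.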
If you follow the cursor to the single node here, that's degree. Closeness is basically the closeness to all other nodes. So there are two hops to this node here, and one hop here, so that's the distance you need to go to get to any node. Whoops, lost my cursor there. It's going one, two, or you could go one here, depending on the flow of the information. Betweenness is basically looking at the fact that this blue node here connects the right side of nodes to the left side, and vice versa.
There are other network features to consider. For example, you can consider the size of the network, which is the number of nodes within it, and the density of the network, which is the proportion of possible connections that exist. Then there are the higher-order organizations that you see within the network, such as motifs and feedback loops. We sometimes call them cliques, or other small network patterns, and we try to identify those that are over-represented when compared to randomized versions of the same network.
One other idea, which particularly touches upon the tools we're about to discuss, is the clustering coefficient, or transitivity. This describes the modular connections that occur within networks as a whole: high transitivity or a high clustering coefficient means that the network contains communities, or groups of nodes that are densely connected internally. Looking at these communities in a network is a nice strategy for reducing the network complexity and extracting functional modules, for example protein complexes, that reflect the biology of the network. There are several terms that are commonly used when we're talking about these different clustering analysis approaches, and I'll talk about them in a moment.
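Density and the clustering coefficient can be sketched in a few lines of Python. The graph is hypothetical (two triangles joined through a bridging node), and the function names are illustrative. Note one assumption: the sketch computes the local clustering coefficient per node and its average, which is related to, but not the same number as, the global transitivity ratio.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (node, node) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def density(adj):
    """Fraction of all possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2 * m / (n * (n - 1))


def local_clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Hypothetical graph: two triangles joined through a bridging node
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "bridge"),
         ("bridge", "D"), ("D", "E"), ("D", "F"), ("E", "F")]
adj = build_adjacency(edges)

print("density:", round(density(adj), 3))
print("clustering of A:", local_clustering(adj, "A"))
print("clustering of bridge:", local_clustering(adj, "bridge"))
print("average clustering:", round(average_clustering(adj), 3))
```

Nodes inside a triangle score 1.0 because their neighbours are all connected to each other, while the bridging node scores 0.0. That contrast is what community-finding methods exploit when they look for densely connected groups.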
The other thing to think about here is that no assumptions are made about the internal structure of the communities. We're just looking for high-density regions within the graph. And it's important to note that finding the best community structure is algorithmically complex and is only possible for very small networks. And there's a whole host of other tools, sorry, algorithms out there. For the benefit of time I'm not going to spend a lot of time talking about these: things like the Markov clustering algorithm, fuzzy c-means, Chinese whispers clustering, the Newman-Girvan algorithms, and HotNet. There's a whole range of approaches to identify, within these larger networks, modules of tightly linked genes, on which you can then perform annotation enrichment analysis to characterize these pathway modules. So annotation enrichment analysis, which Veronica and Ruth talked about the other day, is not strictly speaking a network analysis tool. But it is one of the more important methods for understanding the biological context of protein interaction networks. There is a variety of analysis tools out there. The most basic form is annotation enrichment analysis, which uses gene or protein annotations provided by a pathway knowledge base or maybe the Gene Ontology to infer which annotations are overrepresented in a list of genes. That list can be taken from a network, and essentially these annotation tools perform a statistical test that tries to provide us with a list of terms that describe the whole network, or part of that network. And so, just in summary, these are the kinds of steps that you may use in network visualization and analysis. We talked about creating that network of protein-protein interactions. Sometimes you can use some of the software tools out there.
And the next slide will show some of them. You upload your experimental data, which is usually in a table format. Through the software tools you can navigate through the network you've created to understand the relationships between the nodes and the edges. You can perform different types of network analysis, using clustering tools to identify modules of interest. You can annotate these modules with pathway or GO annotations, and then the idea is to export this as an image for publication. Yesterday, and we will continue today, we focused on using Cytoscape, but there are other open source tools out there for network-based analysis. A lot of data science is now focused on R and Bioconductor, and a lot of data analysis workflows for bioinformatics are already established through R and Bioconductor. But there are other graph-based tools out there to explore data. I think the majority of people focus on things like Cytoscape for the ease of use, and also for the fact that it provides programmatic solutions for scripting and for linkages between the platform and some of these others, the R platform for example. But there are different tools. Some of you are familiar with Python or C. There are network analysis platforms and tools out there for Python. I think there's NetworkX. NetworkX, yes. And for Python and C I think there's something called igraph. I'm not a C or Python person, so I can't comment specifically on the individual tools that are out there, but they are being used.
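To make the annotation enrichment step concrete, here is a minimal Python sketch of a one-sided hypergeometric test, which is a common choice for this kind of overrepresentation analysis. It is an illustration under stated assumptions, not the method of any particular tool mentioned in the talk: the function name, the gene identifiers, and the toy pathway are all hypothetical.

```python
from math import comb

def enrichment_p_value(n_background, n_pathway, n_list, n_overlap):
    """Probability of seeing at least n_overlap pathway genes when drawing
    n_list genes at random from a background of n_background genes,
    n_pathway of which are annotated to the pathway."""
    upper = min(n_pathway, n_list)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_list)

# Hypothetical toy data: 20 background genes, a 5-gene pathway, a 4-gene module.
background = {f"g{i}" for i in range(1, 21)}
pathway = {"g1", "g2", "g3", "g4", "g5"}
module = {"g1", "g2", "g3", "g10"}

overlap = len(module & pathway)
p = enrichment_p_value(len(background), len(pathway), len(module), overlap)
print(f"overlap={overlap}, p={p:.4f}")
```

In practice this test is repeated for every pathway or GO term, so the resulting p-values need a multiple-testing correction before a list of enriched terms is reported.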
So what we're going to talk about today in the lab, and I'll start introducing it now because there's a little bit more about network analysis approaches to describe here, is the Reactome Functional Interaction network and the ReactomeFIViz app for Cytoscape. Basically the idea here is that analyzing your list of genes or mutated genes in a network context allows you to understand the relationships amongst these genes, to validate the mechanism of action of drivers, or maybe the interaction between these drivers and the rare mutations that are out there, and to facilitate some form of hypothesis generation on the role of these genes in a disease phenotype. Essentially, network analysis is reducing the hundreds or thousands of genes within your list down to a dozen or so altered or mutated pathways. So what is the functional interaction network? It's a reliable biological network that's based on manually curated interactions derived from manually curated pathways and extended with verified interactions. The starting point is breaking down the reactions of a pathway into binary interactions. Once you do that across a variety of different data sources, shown here on the right, you can create this group of what we call annotated functional interactions, or annotated FIs. Then you can train and use a naive Bayes classifier to look at the features of other protein-protein interaction databases to identify what we call predicted FIs. And you basically combine that information together to create the functional interaction network. As it stands, the network consists of 436,000 interactions and 13,000 proteins, and we rebuild this network every year. Basically how this app works is that you start by projecting your genes onto the large FI network.
So the red and purple nodes are just representing genes from your experimental data. Basically these nodes will have relationship information so that you can start creating the subnetwork, but you can still see that there are some sparse connections. So what you can do is potentially add a linker gene. A linker is basically a gene that's not necessarily part of your data set, but it's added by the app because it adds a link between two genes in your list, and it increases the power for data interpretation and enrichment analysis. So these nodes have those connections, and you basically remove the remainder of the network that is not part of your data set, and what you're left with is a subnetwork based on your data. I do apologize, I've realized this is yellow lines on a white background, but this is the idea of creating a small subset, sorry, a subnetwork based on your experimental data from a much larger network. So just as an example here, we're going to run through generating a subnetwork, trying to identify modules of interest within the network, and then annotating those modules with pathway annotations, and this is using the 127 cancer gene list that I introduced at the start of the presentation. So I uploaded the gene list, created the network, performed the network analysis, and annotated the modules. This is the network that's created based on these genes, with these four modules: one, two, three, and four. And then through enrichment analysis we're annotating all these modules with different pathway annotations, so we're looking at receptor tyrosine kinase signaling here, signaling by Notch, Wnt, and TGF-beta, components of the cell cycle, and TP53 signaling as well here. So basically what you've done is reduce those 127 mutated genes down to a handful of altered pathways.
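The idea of projecting a gene list onto a larger network and optionally adding linker genes can be sketched in a few lines of Python. This is only an illustration: the rule used here for choosing a linker (a gene outside the list that is directly connected to at least two genes in the list) is a simplifying assumption and is not necessarily how the ReactomeFIViz app selects linkers. The function name, gene labels, and edges are all hypothetical.

```python
def build_subnetwork(fi_edges, gene_list, use_linkers=False):
    """Return (edges, linkers) for the subnetwork induced by gene_list.

    With use_linkers=True, genes outside the list that touch at least two
    listed genes are kept as well (an assumed, simplified linker rule)."""
    genes = set(gene_list)
    adj = {}
    for a, b in fi_edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    keep = set(genes)
    if use_linkers:
        for node, nbrs in adj.items():
            if node not in genes and len(nbrs & genes) >= 2:
                keep.add(node)

    # Keep only edges whose two ends both survive; drop the rest of the network.
    sub_edges = sorted({tuple(sorted((a, b))) for a, b in fi_edges if a in keep and b in keep})
    return sub_edges, sorted(keep - genes)

# Hypothetical toy network: A, B, C, D are "your" genes; X, Y, Z are not.
fi_edges = [("A", "B"), ("B", "X"), ("X", "C"), ("C", "Y"), ("Y", "Z"), ("D", "Z")]
my_genes = ["A", "B", "C", "D"]

print(build_subnetwork(fi_edges, my_genes))
print(build_subnetwork(fi_edges, my_genes, use_linkers=True))
```

Without linkers only the A-B edge survives. With linkers, X is added because it connects B and C, so the subnetwork becomes less sparse, while D stays isolated because no single gene links it to another listed gene.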
You can also combine gene expression data with the Reactome FI network to search for network modules related to patient survival. So just in this example, the first step is to calculate the gene expression correlation for the genes involved in the functional interactions, and then you can assign those correlations to the FI network to weight it. Then the MCL network clustering algorithm is used to identify the modules of interest. And within the ReactomeFIViz app, you can choose between two types of survival module analysis: one is called Cox proportional hazards and the other is called Kaplan-Meier. In the case of the Kaplan-Meier analysis, clinical samples are divided into two groups. In this case, it's based on expression values. So there's going to be one group where you see low expression of the genes in the module, and that's the red line, and the other group has high expression of the genes in the module. Basically the result here was the identification of a 31-gene module whose expression was significantly related to breast cancer patient survival across five independent samples. And this module here is actually involved in cell mitotic apparatus assembly. There are basically two different types of annotation: the purple came from the Reactome pathway database, and the orange was from the NCI Pathway Interaction Database. Basically the conclusion of the study was that patients with low expression of these module genes fared better than patients with high expression of the module genes. And the take-home message here is that a single network module, rather than multiple modules, could be used as a signature of patient prognosis. So in the final section of the talk I will talk a little bit about pathway modeling, which is sometimes referred to as network-based modeling as well.
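The Kaplan-Meier analysis described above, where samples are split into low and high module expression groups and a survival curve is drawn for each, can be sketched as follows. This is a minimal standard-library illustration, not the implementation used by the app: the median split, the function names, and all of the data values are assumptions made up for the example.

```python
from statistics import median

def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival probability) at each
    time where at least one event occurred. events: 1 = event, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored samples both leave the risk set
    return curve

def split_by_median(expression):
    """Indices of samples at or below the median (low) and above it (high)."""
    cut = median(expression)
    low = [i for i, x in enumerate(expression) if x <= cut]
    high = [i for i, x in enumerate(expression) if x > cut]
    return low, high

# Hypothetical data: mean module expression, follow-up time, event flag per sample.
expression = [0.2, 0.9, 0.4, 1.5, 0.1, 1.1, 0.3, 1.8]
times = [5, 3, 8, 4, 12, 6, 15, 9]
events = [1, 1, 0, 1, 1, 0, 0, 1]

low, high = split_by_median(expression)
km_low = kaplan_meier([times[i] for i in low], [events[i] for i in low])
km_high = kaplan_meier([times[i] for i in high], [events[i] for i in high])
print("low expression:", km_low)
print("high expression:", km_high)
```

A real analysis would follow this with a statistical comparison of the two curves, typically a log-rank test, before concluding that the module is related to survival.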
Essentially there are different computational approaches for modeling pathways, based upon either network-based methods or mathematical modeling. Network-based methods apply graph theory to discover the relationships amongst nodes in the pathways, where each node represents a biological entity like a gene or a protein, and each edge represents the interaction between a pair of nodes. One example of this is probabilistic graphical modeling, which I'll talk a bit more about in a moment. The important idea here is to preserve some of those detailed biological relationships in the modeling process. The other approach is mathematical modeling, which learns and analyzes the underlying network by transforming the reactions and entities into a matrix form. There are several different approaches here to study biological pathways, such as Boolean networks, which can be used to represent large-scale signaling networks. There are ordinary differential equations, which can be used to provide quantitative models of small gene regulatory networks. There are also stochastic and flux balance analyses, which are usually used for the study of metabolic pathways. Basically, the idea here is that both these computational approaches can be used to infer how pathway states are disrupted within a disease, for example in cancer biology. This involves both qualitative and quantitative modeling, using both qualitative and quantitative measurements to infer the activities of various genetic components of the pathway, and it's somewhat akin to systems biology. And in this graph here, I'm just showing you the flow of information within pathway modeling using different computational approaches. Bioinformatic methods for pathway network modeling typically start with a hypothesis, which can be derived from your experiments or from some theory.
You can then use computational methods, based on your experimental data, which could be gene expression information, and including knowledge such as functional annotations, to model the biological pathways. Then you basically create a predicted pathway model, which can be refined by evaluating each model against the experimental data and the hypotheses. You can collect further information by searching the literature or databases, retrieving additional pathway or interaction data to further define or modify the models. You can run different types of simulations using different mathematical modeling approaches. You can compare the simulations. And then in some cases you will perform additional analysis. It's a multi-analytical approach, and then you try to create a figure for publication. So just continuing along, these are the types of classical pathway modeling tools that are out there. There are tools to study metabolic pathways. There's CellNetAnalyzer. It's a MATLAB tool for analyzing the structure and function of biological networks. It provides a variety of algorithms for computational strain design and metabolic flux analysis within regulatory and cellular networks. If you're looking at disease or computational modeling of signaling pathways, there are tools like KinomeXplorer, NetPhorest, or NetworKIN. These tools can elucidate the phosphorylation events associated with a given phenotype or disease condition. And then if you're looking at things like gene expression studies, there are tools like ARACNE, which allow you to process microarray expression data to model regulatory networks in mammalian cells. And then there's a whole host of other applications that are available through Cytoscape, like Amino Petri dish, that perform different types of network inference, network modeling, and pathway modeling approaches.
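As one small example of the simulation step, here is a minimal sketch of a Boolean network, one of the mathematical modeling approaches mentioned earlier. Each node is either on or off, and all nodes are updated together at each step until the system settles into a repeating state. The pathway (ligand, receptor, kinase, inhibitor, target) and its update rules are entirely hypothetical and chosen only to show the mechanics.

```python
def step(state, rules):
    """Synchronous update: every node's next value is computed from the current state."""
    return {node: rule(state) for node, rule in rules.items()}

def simulate(state, rules, max_steps=20):
    """Run until a state repeats. Returns (trajectory, index where the cycle starts)."""
    seen = [state]
    for _ in range(max_steps):
        state = step(state, rules)
        if state in seen:
            return seen, seen.index(state)
        seen.append(state)
    return seen, None

# Hypothetical signaling cascade: ligand -> receptor -> kinase -> target,
# with an inhibitor that blocks the kinase. Inputs simply keep their value.
rules = {
    "ligand": lambda s: s["ligand"],
    "inhibitor": lambda s: s["inhibitor"],
    "receptor": lambda s: s["ligand"],
    "kinase": lambda s: s["receptor"] and not s["inhibitor"],
    "target": lambda s: s["kinase"],
}

def run(inhibitor_on):
    start = {"ligand": True, "inhibitor": inhibitor_on,
             "receptor": False, "kinase": False, "target": False}
    trajectory, _ = simulate(start, rules)
    return trajectory[-1]  # the steady state reached

print("no inhibitor:  ", run(False))
print("with inhibitor:", run(True))
```

Comparing the two steady states shows the kind of qualitative question such a model can answer: with the inhibitor on, the signal never reaches the target.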
Now, I realize I'm running close to 11 right now. Am I okay to continue a little bit more in the talk? I should be wrapped up within five minutes, I think. Not a problem, Robin. Yeah. Okay, thanks. And we're going to do the picture afterwards and Gary is late, so it's all good. Okay, great. So getting back to probabilistic graphical model based pathway analysis. It's one of these widely used techniques in machine learning and statistics for modeling complex dependencies amongst multiple variables. The idea here is to integrate multiple molecular alterations to yield a list of altered pathway activities. So the idea is to apply multiple omics data types to understand more about maybe a single patient or a group of patients, for example in cancer. In this case, this is an example of cancer network analysis. And they'll use methods such as naive Bayes, sorry, I was going to say naive Bayes classifier, that's wrong. I should say the methods that are being used are Bayesian networks, to learn cellular networks from gene expression data. And the goal is basically to integrate multiple data types into the model to find significantly altered pathways and to link pathways and their activities to patient phenotypes. That's what I was trying to say. PARADIGM is a factor graph framework for pathway inference using high-throughput genomic data. The first step is to convert a typical pathway into a factor graph. So we're just using this individual, rather simple reaction step here. It's expanded initially into four nodes, to represent gene copy number, gene expression, protein level, and protein activity. A small fragment of the p53 apoptosis pathway is shown here on the left, and on the right is the converted factor graph. This representation allows us to model many different high-throughput data sets, such as gene expression data. Variant annotations.
You know, protein-level information that's being captured from MS experiments, and changes in gene expression. In one of the classical studies that developed the PARADIGM approach, they were looking at glioblastoma multiforme data. The PARADIGM approach identified informative subtypes in the GBM cancer data. Basically it produces a matrix of integrated pathway activities, or IPAs. Samples and entities were then clustered using hierarchical clustering, and visual inspection revealed four obvious subtypes based on the IPAs, with the fourth subtype clearly distinct from the other three. One example is HIF1: one of the major conclusions of this paper was that HIF1-alpha is a master transcription factor involved in regulation under hypoxic conditions. And two of the other clusters showed an EGFR signature and an inactive MAP kinase cascade involving the GATA interleukin transcriptional cascade. Basically the conclusion was that this approach discovered that mutations and amplifications in EGFR were present, and this obviously has previously been shown to be associated with high-grade glioblastomas. In fact, we've implemented the PARADIGM approach. Here it's just using a very simple gene expression regulatory pathway as an example. The example here is where CTGF and NPPA are two transcriptional targets which regulate cell proliferation. These two proteins are controlled by YAP1, WWTR1, and RUNX2, and by taking this PGM approach, converting Reactome pathways, you can ask questions. You can ask: if YAP1 copy number is higher, is CTGF upregulated? Or, if NPPA activity is high, how likely is it that WWTR1 is upregulated, or maybe that RUNX2 is downregulated? So that's basically where we're going to leave the talk.
There's a list of different open source pathway and network databases here that are useful to look at, along with some other network construction and clustering approaches that are out there for pathway modeling. There are links here. And we're off on a coffee break.