So this part of the workshop is called Pathway Network Analysis. From now until the end of the workshop, all the labs will use Cytoscape, and we are going to represent all our results as networks in Cytoscape. For this reason, I need to give you a brief introduction to networks and to Cytoscape. Then we will go back to the gene set enrichment results that we just generated with GSEA and represent them as a network.
The learning objectives of the module: I hope that by the end of the lecture you will understand the advantages of network visualization for biological data. What is the advantage of using a network rather than a table to analyze the results? And I hope that by the end of all the labs, and of these last two days of the workshop, you will be able to use Cytoscape, install the apps you need, and create and navigate an enrichment map.
So let's start with an introduction and a few definitions of terms. The first one is the different types of networks. We can have social networks, biological data networks, any type of network. In the social network example, the entities we compare are people. In this example, which was created with the social network app, we are studying researchers. Researchers may collaborate and publish a paper together. If that's the case, our two researchers are two points on the network, and they will be connected by an edge. So we are studying how much the researchers work together, and if they work together a lot, we get a dense region of the network that we call a cluster. But we can also build a network with biological data. In this example, it's a network of genes. We have one gene that is related to another gene, and we see a line here. We put a line if the two genes are known to physically interact with each other.
Or rather, if the gene products are known to physically interact with each other. The common thing between these two networks is that we are studying the interactions. We are not only interested in the elements of the network, we are interested in how the elements of the network connect to each other.
To create a network, first we need software. In this workshop, we are going to use Cytoscape. Cytoscape is a standalone application, which means software that you need to download onto your computer in order to use it. It's open source and free. It can handle any type of network: social networks, biological data networks, any network you want to create. With Cytoscape we can play with visual styles, we can move elements of the network around, and we can do network analysis such as clustering the network. Cytoscape also has many apps that we can use in addition to the basic features. Other software exists, and I could cite Gephi and NAViGaTOR. With Gephi you can also build any type of network, like social networks or biological data networks, and NAViGaTOR is more focused on biological data.
This slide, we saw it this morning, but it's important to repeat the difference between what we mean by pathway and what we mean by network. When we say pathway, it's a detailed pathway that has been studied over years. In this example, that would be the EGFR pathway. Researchers have studied it a lot, there are many papers on it, and a scientific curator would read all these papers and put the pathway together with downstream events, upstream events, and feedback loops. When we say network, it's something we get from genomics data. We know the elements of the network, and our data is going to tell us how the elements of the network interact with each other, but it's simpler than the real pathway.
Yes, you need to look at the features.
So I think you need to compare the features, and that's where you can decide what you need. With NAViGaTOR, for example, you can't customize it with apps. And Gephi is more general, not only for biological data. Yep, exactly. The thing that happens here is that you don't need to be able to buy it. Yeah, exactly. But that is a good point to bring up, because there will always be biases. Sometimes two tools can both be used. It's more in the finer details, but it's also in the practice: if you're used to one, you're going to go to that one more often and you're just going to use it. And the big thing about Cytoscape is that it's much, much more customizable, and you can make it do many, many more things. It's also the one that has received the most funding, and the one that has received the most support from the community. There are a lot of pros to Cytoscape. But if I had the developer of the navigator in the middle of the video that I said. And he's not teaching us for himself. And that's why I put in a slide, because I wanted to mention that it's not the only one. The only one I use because Gary Bader. Yeah.
So why would you use network visualization for biological data? Because we want to represent the relationships between the molecules. In our case, the biological molecules will most often be genes or proteins, but they can be anything else. And it's easier to see the relationships this way. If you look at an Excel table, you may not see these relationships. You really need to display them as a graph to be able to see them. Another advantage is that you can add multiple layers of information to one network, and I think this is very important. For example, you can compare expression data, mutation data, or any other phenotype in one network. I hope we will see a little of that in the next three days, so you get an idea of it. And once we have our network, we can analyze it in an unbiased way using mathematical methods, that is, network analysis.
So it's not biased towards one point of the network. We can really measure which regions are more connected than the others, to find hubs in the network, for example.
So what are the simple steps when you do network visualization and analysis? Well, first you download your software, and then you can create your custom network or use an app to create your network. At some point, you need to upload your data into the software, and it is usually in a table format. Again, you can prepare it in R or you can prepare it in Excel, but you save it as a tab-delimited file. This is the file that you are going to import into Cytoscape, or any other software, first to create the network and then to add attributes, such as expression data or mutation data, that will all be combined in one network. The simplest format you could have is a gene list, a list of genes in a tab-delimited text format, and sometimes you can copy and paste a gene list. But more often, it's a table with a few columns and the list of genes. Then you create your network. You navigate the network to understand the relationships that you are studying. You analyze your network. And the last step is either to export the network table that you have created, maybe to give it to someone else, a collaborator who will continue the work, or, very likely, to produce an image for publication. So you create a nice image, you export it, and that's the end of your analysis.
Now some definitions of terms, very important. You need to know what a node is and what an edge is. Nodes and edges are the elements of the network. The node is generally the circle, this red circle, and it's the molecule. In our case that is probably genes, proteins, or pathways, so the element of the network that you are studying. And the lines are called edges. The lines represent the interactions, the relationships between two nodes. And graph is another word for network.
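To make the terms concrete, here is a minimal sketch in Python of what a tab-delimited edge table looks like once it is read into memory as nodes and edges. The gene names, the column names (`source`, `target`), and the helper functions are all hypothetical, chosen only for illustration; this is not how Cytoscape stores a network internally. It also counts the connections per node, which is the simplest way to spot a hub.

```python
import csv
import io

# Hypothetical tab-delimited edge table: one row per edge (source, target).
EDGE_TABLE = (
    "source\ttarget\n"
    "geneA\tgeneB\n"
    "geneA\tgeneC\n"
    "geneB\tgeneC\n"
    "geneC\tgeneD\n"
)


def read_edges(text):
    """Parse a tab-delimited edge table into a list of (source, target) pairs."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [(row["source"], row["target"]) for row in reader]


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbouring nodes."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


edges = read_edges(EDGE_TABLE)
adjacency = build_adjacency(edges)

# Degree = number of edges touching a node; the highest-degree node is a hub candidate.
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
hub = max(degree, key=degree.get)

print("nodes:", sorted(adjacency))
print("degree:", degree)
print("most connected node:", hub)
```

Each row of the table becomes one edge, and the set of distinct names in the two columns becomes the node list, which is the same idea as importing a network from a table.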
Mostly I think you are going to see undirected graphs. In an undirected graph, you can see that the edges don't have any direction. They don't have any arrows. But sometimes you see arrows like this. In general it means gene A activates gene B, something like that. So you need to understand what the direction means. And very, very often you will see that the edges are different: they don't have the same width. One is thicker than the other. When you see this, it's because a weight has been put on the edges. You do it when you want to emphasize that the relationship between A1 and A3 is maybe stronger than the relationship between A1 and A2.
In this example, let's say we have samples. We have different patients and it's gene expression data. Then we can construct this heat map to see the correlations between the patients. If the correlation is high, the correlation coefficient is going to be, let's say, between 0.8 and 1. That is this dark orange color. So maybe this is A9, and A9 and A8 have a strong correlation. For a strong correlation, we put a weight of 3. If there is a weak correlation between A9 and A3, maybe we put a weight of 1. Then we go back to our table and we construct a table with one row for each edge. For the edge between A1 and A2, the weight is 1. For the edge between A8 and A9, the weight is 3. Then we save this as tab-delimited text, we import it into the software, and this way we can put weights on our edges.
Now for the network layout. There is no network without a layout. If you don't apply any layout, what you have is a hairball. In a hairball, everything has the same distance. The nodes are overlapping, the edges are overlapping. There is no topology. You can't see anything, you can't interpret anything, and you can't analyze your network. So you need to have a network with a topology, and for that we apply a layout. In this case it's the force-directed layout.
And how it works is that the nodes are treated like negative charges. If two negative charges come together, they try to repel each other. But the edges are like springs and they pull, so they try to bring the connected nodes together. If you have a lot of edges, they will pull and the nodes will form a group. But if two nodes are not connected by many edges, the distance between them is going to grow. That's the way it works to lay out the network, so that it's not overlapping and you can see the regions that are more connected than others. And here are three slides that I took from a video where a force-directed layout is being applied. You see the different steps, because it's an iterative process.
Yes? Yes. So let's say you have gene expression data, and you have 300 genes that are significantly differentially expressed. You have the 101 that are up-regulated and 101 that are down-regulated. Then the same genes will be up-regulated in sample A8 and in sample A9, and the same for the down-regulated ones. So you have an overall good correlation between the two samples. Not really, because the correlation coefficient is going to look at all the genes together. So it's like an overall... Yeah, it depends on your question, the biological question you want to answer. But if it's a positive correlation, it means that the genes that are up in patient A8 are also up in patient A9. Globally, it's a trend. And if a gene is down-regulated in A9, then it's down-regulated in A8. A perfect correlation of 1 would mean that all the genes go in the same direction in A8 and A9. If it's negative... Okay, if it's a negative correlation, if r is less than zero, then maybe you set the weight to 0.5. It depends on your question. In this example, you want to join the patients that have a strong correlation. But if it's a negative correlation, then they are not correlated in that sense. You may not want to draw any edge in this case.
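The weighting scheme from this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample values are made up, the helper names are hypothetical, and the rule that any positive correlation below 0.8 gets a weight of 1 is my reading of the example (the lecture only fixes 0.8 to 1 as strong, with weight 3, and suggests drawing no edge for negative correlations).

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def edge_weight(r):
    """Map a correlation to an edge weight; None means 'draw no edge'."""
    if r <= 0:
        return None  # negative or zero correlation: no edge in this example
    if r >= 0.8:
        return 3     # strong correlation (0.8 to 1)
    return 1         # assumption: every weaker positive correlation gets weight 1


# Made-up expression values for four genes in four hypothetical samples.
samples = {
    "A2": [1, 3, 2, 2],
    "A3": [4, 3, 2, 1],
    "A8": [1, 2, 3, 4],
    "A9": [2, 4, 6, 8],
}

rows = []
for (name_a, values_a), (name_b, values_b) in combinations(samples.items(), 2):
    weight = edge_weight(pearson(values_a, values_b))
    if weight is not None:
        rows.append((name_a, name_b, weight))

# Tab-delimited edge table, ready to be saved and imported as a network.
print("source\ttarget\tweight")
for source, target, weight in rows:
    print(f"{source}\t{target}\t{weight}")
```

Whether negative correlations are dropped, kept with a small weight, or shown as a different edge type is a choice that depends on the biological question, as discussed above.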
It depends on the question. The question that I put here for the example is: we have patients, the patients are represented by nodes, and we try to connect the patients that are correlated with each other. That means a positive correlation, from 0 to 1. If they have a negative correlation, then just don't draw any edge, because they are not correlated. So there will be no edges for negative correlation coefficients. That's just an example to see how you could use weights for the edges. Yeah, you can do your correlation in different ways.
Okay, so I think I was here. There are three snapshots, and you see that at the beginning the nodes are overlapping. But as the algorithm progresses, you see that the nodes are repelling each other. And at stage number three, the network is nicely stretched out and you can continue your analysis. Network analysis would mean finding the subnetworks in the network, for example clusters or modules. You can find a path between nodes, such as the shortest path between two points of the network. Or you may want to find the central nodes in a network, like a hub gene.
So all these slides could be applied to any software. Now we are going to speak more about Cytoscape and what we can do with it. As was said before, Cytoscape is open source software used for the visualization of complex networks. The main feature of Cytoscape is that there are a lot of apps available, and that's because Cytoscape is open and free. There are many communities working together to develop Cytoscape. And any one of you could develop an app with some programming skills. So whatever you need to do, you can build the app and use it in Cytoscape.
And this is an example of a network built with Cytoscape. I am going to use this example to show you how we can merge different layers of information in one network. The data is from copy number variants in a study of autism spectrum disorder.
These are rare copy number variants, so gains or deletions of DNA, when we compare autism cases to normal cases. The researchers retrieved a list of the genes that were in these regions of rare copy number variants associated with autism cases versus normal cases. So the starting point was a gene list. Not a very large gene list, but a medium-sized gene list. And then they asked: in my gene list related to autism, do I have genes that belong to the same pathways? So they did a pathway enrichment analysis using the same technique as you explained using g:Profiler, and they found significant pathways. The genes are enriched in some of the pathways, and this is the right part of the network, where you see the red and the pink nodes. Each node, pink or red, is a pathway enriched in genes from the autism list.
That's a lot of results already. But when the researchers looked at the results, well, they didn't know where to start. They didn't know which pathway would be best or how it could relate to autism. So they got the idea to add another layer of information. They took some databases of genes known to be involved in intellectual disability, in autism, or both combined, and they merged these results with the pathways. So they took these known genes, they did a pathway enrichment to see in which pathways these genes were enriched, and they put this on the network. And what they could see were overlaps, and that's what they found very interesting. For example, this cluster, which is CNS development, is where the known genes overlap with the new data. I think in a way that gave them confidence, and then they could start to focus on this cluster of pathways to further study the mechanism of autism. So this is an example where you can merge two layers of information to further interpret your data.
But before building such a complicated network, let's start with the basics. What you could do is create your own network using your table.
Let's say you have a pull-down experiment, so maybe proteomics, or maybe a microRNA, if you have CLIP-seq data and you want to see where the microRNA binds. Anything. So here you have a pull-down experiment, and you know that BAR1 in your experiment is known to interact with MCM1. You can create this table in Excel and save it as a tab-delimited file. And maybe you can add extra layers of information. Maybe you know the mutation status of these genes, and maybe you know their expression level. That is your second table, which you would import into Cytoscape as node attributes. And then you will have your network here. The first table is used to create your network: it creates the nodes and the edges. The second information, the mutation data, you could use to put colors on the network. If your gene has many mutations, maybe you associate this with a darker color. And finally, the third information you could map to node size. If your gene has a high level of expression, then you could give it a large node. So it's all relative to your table. That's the way you can add three layers of information.
If you don't want to create your own network, then you can use an app to do it, and there are many, many apps. The first thing to do is to go to the App Store. You can follow this link, or you can Google "Cytoscape App Store". And this is the world of apps. I counted them yesterday, and I think it was 314 apps. So it's a lot. It looks overwhelming like this, but actually they are organized by categories, like data visualization, network generation, graph analysis, online data import, network analysis. I'm sorry. I could. That would be good. Using the categories. Yeah. Personally, I most often use the gene ontology and gene set enrichment categories. But here, for example, are three categories that you could use to generate your network.
First, you choose an app to generate your network, so you don't have to do it manually. Then maybe you want to cluster your network, so you use an app to cluster your network. And maybe you have built two, three, or four networks and you want to compare them. Then you use an app to compare the networks.
So what would be the advantage of using network biology? I can give you a description of a few apps. For example, GeneMANIA. We are going to do GeneMANIA tomorrow afternoon, and it's in the gene function prediction category. What GeneMANIA does is try to predict the function of a gene. It works by guilt by association. For example, if you don't know me, but you get to know six of my best friends, maybe you would have a better idea of what I like and who I am. It's the same for a gene. You don't know anything about the gene, but if you can find six genes that are related to the unknown gene, maybe you can make a better guess at its function. That's the way GeneMANIA works, and we are going to have a lecture and a lab on it tomorrow.
This one, MCODE, detects structure in a network, like complexes. To put it very simply for you, it does clustering. Besides clusters, it can also detect motifs.
ANIMO is the one to use if you really want to dig into a pathway. You are studying your favorite pathway and you have multiple data on it. Let's say it's a cascade of reactions, like enzymatic reactions, and you want to study it at multiple time points and you want to do modeling. Or maybe you have mutated one enzyme in this pathway and you want to understand how the mutation of the enzyme affects the flux through the pathway. There are some apps to do that, and ANIMO is one of them.
PEPPER is for proteomics data. I don't know if you have proteomics data, but if you do, then you know that you have missing data.
In proteomics, some peptides are better detected than others. Some peptides fly well and some peptides don't fly well, so you have missing points. What you could do is create your network using your proteomics data and add an extra layer of information from public databases, like protein-protein interaction databases. PEPPER is going to combine the two to infer the missing links, so you can have your full protein complexes. Another one, PathBLAST, is in the category of network evolution. It can compare networks across species.
Sometimes network biology is used with a focus on disease. For example, it could be used as a classifier. We build a general network, and we see how the network differs when we have a disease case compared to a normal case. Then when we add a new sample, we can classify it as a disease case or a normal case.
So those are examples of how we can use network analysis for biological data, but what is missing? What is missing is the dynamics, and it's the same point as the difference between a pathway and a network. Our networks are actually very simple compared to reality, and they represent static processes. If you want to do real modeling of a pathway, you can do it, but you need to have the data for it. You need to take multiple data points or something like that and do some modeling. If you build a protein-protein network from proteomics, what is missing is the atomic structure of the proteins. The 3D structure of the protein, we don't see it in the network. And also, most important I think, is the context. Usually when we draw a network, we take all the genes in a genome. But in your cell, not all the genes in the genome are going to be expressed. It could also be different if your cell is cycling versus quiescent. It can also differ with the developmental stage. And usually we don't take this into account. We just take all the genes.
So if you know that you are studying brain, or a certain developmental stage, and you are able to filter your network down to the genes that are expressed in this tissue only, then it's good to do it. But that's going to be a second step.
So what have we learned? We learned that networks are useful for seeing relationships in large datasets. It's important to understand what the nodes and the edges mean. Each time you have a network, you see the circles and you see the edges, and you need to ask the question: do the nodes represent genes, proteins, or pathways? In our case, today the nodes are going to represent pathways, but tomorrow they are going to represent genes. So it's different. And the edges, what are the edges? Today the edges are going to be the number of genes that two pathways have in common. But tomorrow the edges may be the physical interaction between two proteins. So yes, many methods and many apps are available to study your data. But what do you need? I think to answer this, you need to define your biological question first, and try to define one biological question at a time. When that is clear, you can look for the apps you need. And if you don't know the answer, if you don't know which app you need, then you can use a forum or mailing list like Biostars to get your questions answered.
Okay, so do you have questions about this introduction? No? So you're ready to move on and start a little bit?
Before starting Enrichment Map, I will try to present the main features of Cytoscape. I think it was in the prior readings and the tutorials, but just in case, I'm going to review some of them. When you open Cytoscape, you have the main window, and it's divided into four parts. On the left you have the control panel. Below that you have the table panel. On the right you have the results panel. And here you have a toolbar. For example, you can save your session and reopen it later.
Or you can save an image of your network. And of course you can navigate through a network. To navigate through the network, you click, so you do a left click, and you move the network around. Or you can use the bird's-eye view, which is at the bottom right side of the window. Or you can zoom in and zoom out. You can have different layouts. I showed you the force-directed layout, but there are other ones that you can find in the layout menu, like the circular layout or the organic layout. The organic layout is my preferred layout in Cytoscape 3.4. It's a kind of force-directed layout, but it stretches the network even more than the force-directed one.
You can change the visual style. You go to the control panel, and in the control panel you have this Style tab. Here you can change the color of the nodes. You can change the shapes of the nodes, and also the edges and the background. This is an example of how to play with visual styles. First you create your network. Then, as a second step, you import your table of attributes. Here it's gene expression, so we are going to color the nodes based on gene expression. We've uploaded our table of gene expression. We go to Style here. Then under fill color, we choose the right column, and we can set up a color scale. It will look like this: if it's high expression, the nodes will be red, and if it's low expression, the nodes will be green. I think it's the opposite here, low is red and high is green, but I usually do the opposite.
Once your network is created and you want to make a nicer figure, you can rotate, you can scale, and you can align your nodes. When you have a big network, maybe you want to focus on one region of the network. What you can do is select these nodes using your mouse, so they become yellow, and after that you create a subnetwork. You can also filter the network automatically.
For example, you can select only the nodes that have a positive enrichment score. You can do that automatically in the Select tab.
So now, going back to earlier today: where are we in the workshop? If we're here, it's because we have obtained our gene list already. Maybe we've collected it from genomics data, for example RNA-seq, and we did a good job. We did normalization. We did statistical testing to get the differential expression. And we have our gene list. Uri showed us that we could compare our gene list to pathway databases to do pathway enrichment, to see whether our genes are enriched in certain pathways, like signaling pathways or biological processes. And now we're here. We want to visualize and identify interesting pathways and networks. Later on, we may want to add extra information to the network that we have generated, like microRNA targets, transcription factors, or drugs that have targets in the network, and so on.
Depending on the experiment you have, the gene list can be very large, with many, many results from your differential expression analysis. Maybe you have a medium-sized gene list, or you have a small gene list. If, as in the example in the lab, you are comparing two populations that are very, very different, you expect a lot of genes that are significantly differentially expressed and consequently a high number of pathways that are significant. In this case, you go directly to Enrichment Map to represent the results as a pathway network, so the nodes are going to be pathways. If you have a medium-sized gene list, so not too many genes, maybe you have just one, two, or three pathways that are significant. Well, then you don't need to represent the results as a pathway network.
You are going to go directly to a gene-gene network, which we are going to see tomorrow, where you represent each gene in the network and then you can color the genes depending on the pathway they belong to. If you have a very small gene list, which happens sometimes with mutation data, people come and say, hey, I have my gene list, 20 genes, can you do a pathway enrichment analysis for me? I say, yeah, sure. But I'm not going to use g:Profiler or GSEA, because what you want is almost to predict the function of your genes, or to predict how they could interact. So you want to expand your list in this case, and you can add linkers to your network. We are going to see tomorrow that we can use ReactomeFI with the linkers option, or GeneMANIA, to expand the network. So those are the three modes.
But mostly today, we are going to follow our basic example: we are studying population A and population B. They are really different. We did RNA-seq on them. We are going to run GSEA, and then from the GSEA results, we are going to create an enrichment map. So again, the output of GSEA is two tables: one for the pathways that are enriched in genes that are up-regulated in population A versus B, and one table for the pathways enriched in genes that are down-regulated in population A versus population B. Then we are going to upload our results into Enrichment Map, and each pathway is going to become a node. Each of these pathways is going to be a red node, and all the pathways here in the blue table are going to be blue nodes. If these two pathways here share a significant number of genes, then they are going to be connected by an edge. The thickness of the edge is calculated using the overlap coefficient, which basically takes the number of genes that overlap, normalized by the size of the pathway. So the thicker the edge, the higher the number of genes in common between the two pathways.
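As a worked illustration of the overlap coefficient, here is a minimal Python sketch. It uses the standard definition, the size of the intersection divided by the size of the smaller gene set. The gene sets, the pathway names, and the 0.5 cutoff are hypothetical, chosen only for this example, and the code does not reproduce the Enrichment Map app's actual implementation or default settings.

```python
from itertools import combinations


def overlap_coefficient(set_a, set_b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|); 0.0 if either set is empty."""
    set_a, set_b = set(set_a), set(set_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Hypothetical gene sets, one per enriched pathway (one per node).
pathways = {
    "DNA replication": {"MCM2", "MCM3", "PCNA", "POLA1"},
    "DNA repair": {"PCNA", "POLA1", "XRCC1", "LIG1", "RAD51"},
    "Cytoskeleton": {"ACTB", "MYH9"},
}

CUTOFF = 0.5  # assumption: illustrative threshold for drawing an edge

pathway_edges = []
for (name_a, genes_a), (name_b, genes_b) in combinations(pathways.items(), 2):
    score = overlap_coefficient(genes_a, genes_b)
    if score >= CUTOFF:
        # The score would drive the edge thickness in the drawn network.
        pathway_edges.append((name_a, name_b, score))

for name_a, name_b, score in pathway_edges:
    print(f"{name_a}\t{name_b}\t{score:.2f}")
```

Here "DNA replication" and "DNA repair" share 2 genes, and the smaller set has 4, so the coefficient is 2/4 = 0.5 and the two nodes are connected; "Cytoscape-style" clusters arise when many redundant pathways pass the cutoff with each other.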
So from the tabular format, we go to our network, and the color of the nodes can indicate the significance score of the pathway results. In your GSEA results, you have the top pathway, pathway number one, with an NES that is maybe 2.3. Then 2.3 is going to be bright red. If we have a pathway that is enriched but with a lower score, maybe 1.4, then we are going to make it pink. So we use the node color to indicate the significance of the pathway enrichment.
This is an example of an enrichment map. You see that all the nodes are pathways. If the pathways have many genes in common, they form a cluster on the network, and then we can label the cluster using the gene set names. All these individual nodes are pathways. Because the databases are redundant, they are all related to the DNA metabolism function. If we could read the individual names of the pathways, we would see that it's really all DNA metabolism. So we put a circle around them, which the app does automatically, and we label it DNA metabolism. So maybe the GSEA table gives us 300 results, but basically it's 1, 2, 3, 4 biological functions. We first summarize our gene list into gene sets, and then we create this network, which does some clustering to summarize further into biological functions. Even with 600 gene sets, I would say that at the maximum I can identify 20 different biological functions.
Just as an example, here we have breast cancer cells that were treated with estrogen, so control and treated, and we had two time points, 12 hours and 24 hours. The first thing to do is to create one map at 24 hours, and this is it. We have all the pathways in red that are enriched in up-regulated genes, and a few pathways enriched in down-regulated genes, at the 24-hour time point. What we can also do is add the additional time point to the same map, the same network, and then it looks like this. And the difference you see is this white spot.
What we did here is upload two different data sets, data set 1 and data set 2, and we assigned data set 1 to the node center and data set 2 to the node border. If you see a node that is all red, it means the results are similar at 12 hours and 24 hours. But if you see a difference here, because the inner center is 12 hours, as we can see here, then it means there was no statistically significant difference at 12 hours. But when we wait until 24 hours, the pathway becomes significant. So with Enrichment Map, you can compare two time points, and you can do more using the pie chart options.
Now, when you click on a node, the node is a pathway, and a pathway contains many genes. So you can click on the node and actually see the genes. And if you have uploaded what we call an expression file, for example, if you have RNA-seq data, let's say you upload the normalized count data, or better, the counts-per-million data for each of your patients, each of your samples, then when you click on a node you can see the expression pattern for all your samples, and it's a good quality control. When I click on this node, I see that at 12 hours my treated samples have an up-regulation of the genes compared to my controls, but in this case the difference was not obvious at 24 hours. And in this other case, it was only at 24 hours that my treated samples showed an up-regulation of the genes in this particular pathway here.
Another way to overlay extra information on the network is to use what we call the post-analysis feature of Enrichment Map, and here we add an additional gene set to the map. Here it's, I think, proteomics data or RNA-seq. This one is RNA-seq data, and it's a knockdown of a microRNA. If you knock down a microRNA, what you expect is that the genes that were silenced by the microRNA are going to be up-regulated compared to the control.
So now we add the predicted microRNA targets to the network, and we see that they overlap significantly with many of these pathways that were up-regulated. This is something that we do with microRNAs or transcription factors to identify where the targets are in the network. And this is an example of an application to proteomics data, to show that we don't need to have gene expression data. We can build the network with any data, so for example here, proteomics.
This is the WordCloud application. When you have a cluster of pathways, each pathway has a label, and we want to summarize this cluster using the most frequent words. We can use this application for that, and it is used in what we call AutoAnnotate. When you've created your network and you want to draw these circles around the clusters, each with a label, you can use the AutoAnnotate application, which will use WordCloud to do it.
Okay, we saw this slide already; it's just to mention that it was created with the Enrichment Map application. And this slide is to show you how it looks in the real Cytoscape. This is the input panel where you are going to upload the GSEA table, the GMT file, and the rank file. Then you build your map and you see your network, and if you click on a node, only if you click on a node, then at the bottom, in the table panel, you are going to see the heat map. The node heat map or the edge heat map shows you the genes that are included in the pathway that you have selected, and highlighted in yellow is the leading edge, the GSEA leading edge. I don't know if you remember the ones in green, when we had this GSEA table and it said core enrichment, yes, yes, yes, yes. Now they are here in yellow. So we know that these genes in yellow are the most important, because they are the ones that contribute to the pathway enrichment score.
So, tips for publishable enrichment maps. When you've created your enrichment map, you have interpreted it, and you are satisfied with the results, maybe you want to use it as a figure in your publication, so you need to make it a bit nicer. You are allowed to move nodes around. If they are too dense, you can move them apart, and you can play with the visual styles. When you are happy with it, you export it as an image, and sometimes you can edit things like the labels using a graphics editor. And that was my colleague, the one who developed Enrichment Map, who made a cake of it, but it was before my time, before I joined the lab, so I couldn't taste it. Any questions on Enrichment Map, or do you want to try it now?