So, we are going to start Module 8 today, which is called Genes to Pathways. This work is licensed under a Creative Commons license. This means that you are able to copy, share, and modify the work as long as the result is distributed under the same license.
Here are the learning objectives. During this lecture, you will learn how to biologically interpret a gene list derived from various omics experiments, the main concepts behind pathway enrichment analysis, and some concepts of network visualization. This module naturally follows Modules 6 and 7, as both bulk RNA-seq and single-cell RNA-seq end up with a list of differentially expressed genes that we would like to functionally interpret. The concepts that we are seeing in this module are more generic and can be applied to several types of omics data, like proteomics, somatic mutation, ChIP-seq, microRNA, or GWAS data, for example.
This is the course outline, and we see that we have three parts. The first part is going to explain pathway enrichment analysis, the second part is going to be about Cytoscape and EnrichmentMap, and then we are going to open it up to more generic information about Cytoscape and network visualization to make the transition between Module 8 and Module 9.
So, let's start the first part, which is pathway enrichment analysis. Here is the general workflow of pathway and network analysis, and we see that we have three steps. The first step is the generation of the omics data. After the alignment of the data to the reference genome and the statistics, we get a gene list that we would like to functionally interpret. Then, during the second step, we use bioinformatics tools to interpret our data. In order to perform pathway enrichment analysis, we query our gene list against biological processes, which are what we call the pathways.
But we can also use other sources of information, for example drug targets, if this is important for our project. And then finally, the third step is where we represent the results that we obtained at step 2 as a network, because this is easier to interpret. It helps us to elaborate new hypotheses, and it opens up future directions in our project. The next step after that is to validate the pathways that we found, using drugs or enzyme inhibitors that can block the pathway of interest.
This is another representation of the analysis workflow. We see the two main elements, which are the gene list and the pathways. The gene list comes from the raw data that we analyzed and normalized, and then we get the gene list. So that's one element. The other element is the prior knowledge about the function of genes, which is collected and stored in different pathways. A pathway is a group of genes that are functionally related. An example of a pathway could be the cell cycle, and in the cell cycle we know that there are about 500 genes involved. These two elements, the gene list and the pathways, can talk to each other only if we use the same gene identifiers. So if the gene list is formatted as gene names, then the format that you need to use for the pathway database is also gene names. And if you use Ensembl IDs in your gene list, then you need to have the pathway database in Ensembl IDs.
The next step, when we have these two elements, is to run the pathway enrichment analysis. What we are looking for is enrichment of the pathway in our gene list. After this step, we visualize the pathways as a network, and we can also add extra layers of information on top of the network.
Sometimes we think that pathway analysis is complicated, but it's actually just a way to organize our gene list into different categories that are biological processes. Let's say that on the left is the gene list that we would like to interpret. What we can see is that the black genes maybe belong to the axon guidance pathway, the green genes to aging, the purple ones to stem cell development, and the blue ones to cell migration. What we did is summarize the list into four pathways. So now we can concentrate and focus on these four pathways instead of the large gene list, and we can generate new hypotheses based on these four pathways. We understand here that the need to summarize comes from the fact that our gene list is very large and that we got a lot of hits from our omics experiment. If we had a very small gene list, then we might want to interpret it in other ways.
We are querying our gene list against pathways that are stored in pathway databases. This element is very important, because we need accurate and comprehensive information to use for a fine interpretation of the data. Accurate information in a pathway database is very often due to manual curation by scientists whose task is to read the various papers and enter the information in the database, although some text mining methods are now being developed. One of the oldest databases is the Gene Ontology, or GO, and GO is divided into three categories, which are cellular component, molecular function, and biological process. When we do pathway enrichment analysis, we are more interested in the third one, the biological process. And we can see here, for example, the different stages of cell division. The one thing to know about GO is that it is organized in a hierarchical structure, so there are parent and child terms. The parent terms are very generic pathways and they contain a lot of genes.
For example, they could be "biological process" and "cellular process". And then the smaller gene sets, the smaller pathways that are more specific, the child terms, contain fewer genes, but they are also more informative. We can see here, for example, B cell apoptosis. It doesn't contain a lot of genes. We say that B cell apoptosis is part of apoptosis, so we can see here that apoptosis is a more general term compared to B cell apoptosis. And then apoptosis is part of programmed cell death, which is part of cell death. In our pathway analysis workflow, what we usually do is set a threshold of about 500 or 1,000 genes per pathway. We do that to eliminate the very large pathways that are very generic and not very informative.
Another database is Reactome, and Robin is going to present it in Module 9. The strategy that we apply in our workflow is to use multiple databases together when we perform pathway enrichment analysis, to get the most comprehensive and precise information about the pathways that can be contained in our gene list. This Bader Lab GMT file, as we call it, is available and updated each month. It's publicly available because we use it for publications, and everyone is welcome to use it if they wish to.
Here is the presentation of our standard workflow for pathway enrichment analysis. We have basically one workflow for a defined gene list. A defined list would typically contain a fixed number of genes, for example 200 or 500 genes, that we have selected using a threshold. An example of such a list would be a list of genes that were found frequently mutated in a set of patients. And then we have a ranked gene list. A ranked list is a list of all genes in the genome that we were able to rank using a score that comes out of the omics experiment. A typical example would be RNA-seq data where we have two groups and we compare control and treated.
And we can rank all genes using the differential expression score, from top up-regulated to top down-regulated genes. We have two different protocols because the statistical test is going to be different if we use a defined gene list or a ranked gene list. Because of that, we use two different tools to perform pathway enrichment analysis. For a defined gene list, the tool that we are going to use is g:Profiler. And for a ranked list, the tool that we are going to use is GSEA. But both tools generate a similar output table that contains the list of the pathways that were tested, with a p-value that estimates the probability that this pathway is enriched in our data by chance only.
In this table, which presents the pathways that were significantly enriched in our data, what happens is that these pathways can be related to each other. They can have a similar function and they can share genes in common, and we don't see that in the table. So what we do is use the Cytoscape EnrichmentMap app to create a network of these pathways, and the pathways that are related to each other will form a cluster. Then we can identify the main functions that were enriched in our gene list.
So we said that we have two possible kinds of gene list, the defined gene list and the ranked gene list. The question we are going to ask when doing enrichment analysis is slightly different in each case. For a defined gene list, the question is: are any pathways surprisingly enriched in my gene list? And if we have a ranked gene list, then the question is: are any pathways ranked surprisingly high or low in my ranked list of genes? For the defined gene list, the statistical test is going to be a Fisher's exact test. And for a ranked list, the statistical test is going to be a modified Kolmogorov-Smirnov test, included in the GSEA tool. But then we are going to start by explaining the different...
Veronica, may I ask you a question?
Can you define this q-value? What is a q-value?
Yes. The q-value is the corrected p-value, because we correct for multiple hypothesis testing. We are going to see it in detail in the next part of the section. We are going to first explain the Fisher's exact test and GSEA, and then I'm going to explain the q-value.
So let's start with the Fisher's exact test and the defined gene list. This slide illustrates the important concept of overlap that is used to calculate the enrichment score, which is, I would say, the first step of the pathway enrichment analysis. Here on the left, we have our gene list, and our gene list contains 41 genes. And then we have the pathway that we are testing, which is called axon guidance. It comes from the Gene Ontology database, and this pathway contains 39 genes. What we see here is that we have 13 genes that overlap between my gene list and my pathway. So this 13 is my overlap size, and this is my enrichment score. What we can estimate is that 13 out of 41 represents about one third of my gene list, and 13 out of 39 is about one third of the tested pathway. And then the next question is going to be: is this overlap significant? Is this overlap large enough to say that the pathway is enriched in my gene list?
So how do simple enrichment tests work? Here in brown, this whole rectangle represents the background. The experimental background can be defined as the sum of the experimentally detectable genes. The background is an important concept when we use the Fisher's exact test. If you have RNA-seq data or other omics data that cover all genes in the genome, you don't need to worry too much about the background, because the background represents all genes in the genome.
But if you do pathway enrichment analysis from a custom array, or a NanoString array that has just a portion of the genome on the array, you need to reduce the background to only the genes that you could measure in your experiment. For example, if you did an array containing only immune genes, then the background is only these immune genes that you placed on the array, and the gene list will be only the genes from these immune genes that are differentially expressed.
The first step is to calculate the overlap between your gene list and the pathway. We said during the previous slide that this overlap was 13. And then the next question is: is this overlap as large as or larger than expected by chance? So how do we calculate that? The question here will be answered by a p-value, and the p-value is going to assess the probability that this overlap is due to random chance only. What you can do here to solve this is random permutation. You use random gene lists, which are the same size as your gene list but with random genes, and you do it 1,000 times. Each time you have a random list, you calculate the overlap between your random list and your pathway. This way you build your null distribution. And then you compare your observed overlap to your null distribution to see if your overlap is larger than what the null distribution gives. You can calculate an empirical p-value by counting the number of times that a random overlap was as large as or larger than your observed overlap. The p-value that you get assesses the probability that this pathway appears enriched in your gene list by chance.
A p-value can range from zero to one. If you have a p-value equal to zero, it means that there is zero chance that the results are caused by random chance, and you can be confident to report the pathway as enriched. And if it's one, then it's a 100% chance to be random. So zero is good and one is bad.
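To make the permutation idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the procedure of any specific tool: the gene names and the background of 2,000 genes are made up, the function name `empirical_p_value` is hypothetical, and the add-one correction (so the estimate is never exactly zero) is a common convention that the lecture does not mention.

```python
import random

def empirical_p_value(gene_list, pathway, background, n_perm=1000, seed=0):
    """Estimate how often a random gene list overlaps the pathway
    at least as much as the observed gene list does."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    pathway = set(pathway)
    observed = len(pathway & set(gene_list))
    background = sorted(background)    # stable order before sampling
    hits = 0
    for _ in range(n_perm):
        # Random list of the same size as the real gene list
        random_list = rng.sample(background, len(gene_list))
        if len(pathway.intersection(random_list)) >= observed:
            hits += 1
    # Add-one correction (an assumption of this sketch)
    return observed, (hits + 1) / (n_perm + 1)

# Toy data mirroring the slide: 41 genes in the list, 39 in the pathway, 13 shared.
background = [f"gene{i}" for i in range(2000)]      # hypothetical background
pathway = background[:39]
gene_list = background[:13] + background[100:128]   # 13 in pathway + 28 others

observed, p = empirical_p_value(gene_list, pathway, background)
print(observed, p)
```

With a toy background this small relative to the overlap, random lists essentially never reach the observed overlap, so the estimated p-value sits at the smallest value the number of permutations allows.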
The problem with permutation is that it can be time- and resource-consuming for the computer. Luckily, under some conditions, we know what the distribution of the random samples looks like. In this case, we don't have to do the random permutation, and we can use a statistical test instead. This is the case for enrichment analysis. We know that the shape of the null distribution is a hypergeometric distribution, and the test that uses this distribution is the Fisher's exact test.
So for the p-value, do we use the traditional 0.05 as a cutoff, or...?
We are going to see that we use an FDR of 0.05 as the cutoff. The p-value is the first p-value, for one pathway. But because we are going to test many, many pathways, we will correct for multiple hypothesis testing. And then, after the FDR correction, we are going to select a q-value of 0.05 to consider the pathways that are significantly enriched.
All right. Okay. Thank you.
Sorry, I just have a question. Your starting gene list, that's derived from your omics data. So is this from stuff like DESeq or edgeR, where you get differentially expressed genes?
Yes. So it's very generic here. When you say DESeq and edgeR, that would be RNA-seq data, and for RNA-seq data we would rather use the ranked list approach that I'm going to explain very soon. The Fisher's exact test is really for when you have a gene list that you cannot rank. It could be RNA-seq where you just select a threshold and you have your 500 genes, it could come from ChIP-seq data, it could come from, for example, immunity genes.
Okay. Thank you.
And we will see, in practical lab number one, I took single-cell RNA data from Ravir and I extracted a defined gene list, and I used g:Profiler and the Fisher's exact test in this case. I'm going to explain it during the lab.
The Fisher's exact test is better understood with red balls and black balls. And if you like, you can also imagine that they are M&M's.
So we have 5,000 genes in total, and only 500 are black. We can say that the black genes represent one biological pathway. Now we do one random draw, and what we get is four black genes and one red gene. We know intuitively that it's not easy to get four black genes when we do one draw, because we have many more red genes in the box. So now we want to calculate a p-value associated with this result. What we can do is use the hypergeometric distribution to build the null distribution, and then we can calculate the p-value associated with this result, the four black genes and one red gene. What we get is a p-value of 0.001, and it's very close to zero. It means that the chance of making a type 1 error is extremely low, 0.1%. So we can conclude that this black pathway is significantly enriched in our gene list, and it's probably not due to chance only.
On the left side, this is the formula of the hypergeometric probability function, and this is what is used to calculate the null distribution. But what I want to show here is the values inside the formula. We have m, x, k, and t, and those numbers come from here. So m is going to be the size of the pathway, k is going to be the size of the gene list, x is going to be the overlap between the pathway and the gene list, and t is going to be the background. That's why we say that for a defined gene list and the Fisher's exact test, the concept of the background is important: you can see that t is in the formula, so it is going to change the p-value.
You can also put those numbers in a two-by-two contingency table. I'm showing it to you in case you have one pathway that you would like to test with your data. In this case, you can just use R, and you can use the Fisher's exact test function by entering these numbers. So you enter this x, this k, this t, and this m, and then you can test one pathway of interest manually, I would say.
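As a rough illustration of how the four numbers m, k, x, and t enter the calculation, here is a small Python sketch of the one-sided hypergeometric tail probability (the quantity a one-sided Fisher's exact test reports for over-representation). It uses only the standard library. The function name `hypergeom_tail` is hypothetical, and the background of 20,000 genes used for the axon guidance example is an assumption, since the lecture does not give that number.

```python
from math import comb

def hypergeom_tail(x, m, k, t):
    """P(overlap >= x) when k genes are drawn at random from a background
    of t genes, of which m belong to the pathway."""
    total = comb(t, k)          # all possible gene lists of size k
    upper = min(m, k)           # the overlap cannot exceed either set
    favourable = sum(comb(m, i) * comb(t - m, k - i) for i in range(x, upper + 1))
    return favourable / total

# Balls example from the lecture: 5,000 genes, 500 black, draw 5, get 4 black.
p_balls = hypergeom_tail(x=4, m=500, k=5, t=5000)

# Axon guidance example: overlap 13, pathway 39, gene list 41.
# t=20000 is an assumed background size, not a value from the lecture.
p_axon = hypergeom_tail(x=13, m=39, k=41, t=20000)

print(f"{p_balls:.6f}", f"{p_axon:.3e}")
```

Under these assumptions the exact tail for the balls example comes out slightly below the rounded 0.001 quoted on the slide, and changing t changes the result, which is the point made above about the background.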
g:Profiler is the web-based tool that we are trying in the practical lab, and it uses a Fisher's exact test to calculate the pathway enrichment from a defined gene list. We can see here the output table. The output table is a list of pathways, and we can see here that they are all coming from the Gene Ontology database. And then T, Q, and U are the same as the m, k, and t that I just showed you. So T, for term, is the pathway. Here we have a pathway where the size of the pathway was 17, and the size of the gene list, Q for query, was 20. In this case, we had an overlap of five genes between the pathway and our query gene list. And the background, the universe, was about 70,000 genes. All these four numbers were entered in the formula of the Fisher's exact test, and then we got a p-value. Then the p-value was corrected for multiple hypothesis testing, and we have here the adjusted p-value. We are going to select the pathways that have an FDR or adjusted p-value of less than 0.05.
OK, so that was for the defined gene list. But now we are going to speak about the second statistical test to perform pathway enrichment analysis, using a ranked list. A ranked list, I would say, I use typically when I have bulk RNA-seq data and a two-class design where I compare control and treated samples. So this is the first matrix, where we have the samples for the control and the samples for the treated. Then we use R packages like edgeR or DESeq2 to estimate the differential expression of the genes. Using this DESeq2 or edgeR output, we are going to calculate a score, and this score will enable us to rank the genes from up-regulated in treated versus control to down-regulated in treated versus control. We leave the genes that are not significantly differentially expressed in the middle. So we don't remove any genes.
In order to do that, we take the DESeq2 or edgeR output, and in this output we look at two columns. One is the log fold change. The other is the p-value, so not the corrected p-value, but the nominal p-value. For example, this gene had a log fold change of plus 3.4 and a very low p-value. Then we use this formula here to calculate the score: the score equals the sign of the log fold change multiplied by minus log10 of the p-value. The sign of the log fold change is just for the direction. We just want to extract the direction to see if the gene is up-regulated or down-regulated. So if the log fold change for this gene was plus 3.4, then the sign of the log fold change is going to be plus 1. And the minus log10 will just transform a very small p-value, close to zero, into a high score, because a very small p-value is a very significant p-value. This one is the top gene, so it is getting the highest score, 32. This way, with this value, we rank the genes from top up-regulated to top down-regulated, with the non-significant changes in the middle, and we don't remove them.
Sorry, there's a question that Jose had in Slack.
Yes.
Is it possible to use Cytoscape to analyze exome data?
I don't know. Sorry, I don't know.
We'll look into this then.
Yeah, so this is the first step of the pathway enrichment analysis. We are going to get to Cytoscape, but this first step is done outside Cytoscape. Actually, this module has two parts. There is the pathway enrichment analysis that we calculate once we get the gene list, and that is outside Cytoscape. We use g:Profiler and GSEA. Some Cytoscape apps can do the pathway enrichment, but Cytoscape is really the downstream part, after we have the pathway enrichment results, to visualize the results. So maybe we can go back to this question later. So okay, we have this formula.
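The ranking formula described above, sign of the log fold change times minus log10 of the nominal p-value, can be sketched in a few lines of Python. The gene names and values below are hypothetical, chosen so that the first gene reproduces the score of 32 mentioned in the lecture. The sketch assumes every p-value is strictly greater than zero, since log10 of zero is undefined.

```python
import math

def rank_score(log_fc, p_value):
    """sign(log fold change) * -log10(nominal p-value)."""
    sign = (log_fc > 0) - (log_fc < 0)    # +1, 0, or -1
    return sign * -math.log10(p_value)    # assumes p_value > 0

# Hypothetical differential expression output: gene -> (log fold change, p-value)
de_results = {
    "GENE_A": (3.4, 1e-32),    # strongly up-regulated
    "GENE_B": (0.1, 0.8),      # not significant, stays in the middle
    "GENE_C": (-2.5, 1e-12),   # strongly down-regulated
}

# Rank from top up-regulated to top down-regulated, keeping every gene.
ranked = sorted(de_results, key=lambda g: rank_score(*de_results[g]), reverse=True)
for gene in ranked:
    print(gene, round(rank_score(*de_results[gene]), 2))
```

No gene is removed: the non-significant gene simply ends up near zero, in the middle of the ranked list.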
You can calculate this formula in R, which is better because we don't want to mess up with the dates in Excel, but you can also do it in Excel. Just be careful that your gene names are not going to be transformed into dates by Excel. Okay, so now, yes?
Where is the rank score coming from? Is it something that you define in your project, or is it something that people use?
The formula?
Yeah.
So who found the formula, if you want?
Yeah.
I think we really tried several things. Some people tend to rank by the log fold change, but the log fold change is not very precise, because it doesn't take into account the reproducibility or standard deviation within your groups. The p-value is really a reflection of both the difference in the average between the two groups and the reproducibility within each group. So it's more accurate to use the p-value. And it's very standard in bioinformatics to use minus log10 of a p-value to get a score, because it's easier to handle a score compared to these very small p-values. So that's why we do it that way.
Could you just use both the p-value and the log fold change? Maybe in the formula?
Yeah. I mean, we use the p-value. That's what we recommend. Depending on the project, sometimes I do different things, like log fold change multiplied by minus log10 of the p-value. So I would say it's sometimes case by case. What you need to check is that your ranking reflects very well what's going on in your data. So you look at your data, you look at your matrix, and then you look at the ranking. For example, you can do a heat map of your matrix with the ranking, and you see that there is a very nice logical ranking from genes that you think are really up-regulated to down-regulated. And it can depend on the data.
So just a question. Why are we using p-values instead of the adjusted p-values?
Yes. Yeah. That's a very good question.
It's because here we don't select genes. With the corrected p-value, we correct for multiple hypotheses in order to select a group of genes or a group of pathways. But here we don't select. We just keep them all, so we don't need to correct for that. And the p-value has fewer ties. When you look at FDR values, sometimes the way they are calculated gives you ties, and ties are not good for ranking. Basically, here it's just a ranking that we need, and the p-value is going to be more precise.
Veronica, can I ask you a question?
Yes. Sure.
So with the p-value, in regular observational studies, like epidemiological studies, we notice that an increase in sample size will give you a significant p-value. And let's say you have a smaller sample size and you want to see a difference, then something like p-hacking, does that concept also apply here? Do you think the p-values could change in relation to the sample size?
Yes. For me, I would call it the biological replicates. The more biological replicates you have per group, the more precise the p-value is going to be. So in our case, it's not a problem, it's good. The more biological replicates you have, the more sensitive your p-value is going to be. And we see it. Sometimes we try with three biological replicates, it could be for ChIP-seq for example, and we cannot separate the signal from the background noise. So then we add biological replicates. We don't change anything else, we just add biological replicates. And then we can see that now we are able to extract the signal from the background noise.
And in that case, do you notice a difference in the ranking score? Does the ranking order stay the same, or does that change?
Well, the ranking is going to change, but usually for the better.
Okay, okay. Maybe this is for the break time, but I was just wondering if there's a way to calculate the sample size to make sure that we are getting the right p-values?
There are analyses for that, like power analysis. Basically, people do a pilot experiment. We do it when we think that we have noisy data in terms of variation between the samples, or that the signal is a bit low. And then, from that pilot experiment, you can use statistical tools to predict if it's going to be enough or not.
Thank you, Veronica.
Thank you.
Veronica, we have another question in Slack from a student. Sneha has asked: a gene can be present in more than one pathway. How is that accounted for in the over-representation test? If it isn't accounted for in the over-representation test, how can we account for that? And they have a follow-up question.
Sorry. Maybe one at a time. So the gene redundancy problem is real, and many pathways have genes in common. This is a weakness of pathway enrichment analysis. But the way we deal with it is by using EnrichmentMap, which I'm going to explain. Basically, at the first step, we test pathways one by one as if they were independent. Then we have this table with enrichment results where we have all the pathways, and sometimes they contain genes in common. Then we use EnrichmentMap to cluster these pathways, and the pathways that are really related, with a lot of genes in common, are now going to be considered as just one cluster instead of 10 pathways. So we try to reduce the dimension like that. In fact, in the statistical test, we say that they are independent, but they are not quite independent: if a pathway A is significantly enriched and a pathway B has many genes in common with it, we know that if A is enriched, then B is enriched. So that's the statistical part. But in practical terms, we deal with it by using EnrichmentMap, which will cluster the pathways that have genes in common. So what was the follow-up question?
Okay.
The next question was: is there a minimum and maximum number of genes that can be used in an over-representation analysis?
For a defined gene list, I would say the minimum would be 50, and the maximum would be maybe 500. 1,000 is really big already, so 500. But for a ranked list, if you can rank your list, then there is no threshold. We try to avoid thresholds. You just put in all the genes in the genome. That's why we recommend a ranked list whenever you can.
Just a follow-up from me. Why do you set that threshold to 50 and 500?
It's a bit arbitrary. It's just that if you have fewer than 50 genes, then you really don't need to do pathway enrichment analysis, because you can almost analyze your genes by eye only. Pathway analysis is really there to summarize your results. And with 500 or 1,000 genes, you are going to have tons of pathways. You can do that, but what you are going to do is look at the top pathways anyway. If you have more than 500 or 1,000 genes, in g:Profiler, for example, you can use the ordered query option, which orders the genes and considers the top ones as the most significant. I mean, you can try 1,000. There is no problem with g:Profiler in terms of statistics, but it's about how you are going to interpret it. Are you going to be overwhelmed or not? And again, if you have a long gene list like this, try to see if you can use a ranked list.
Thank you.
So for GSEA, we run pathway analysis, but in this case we have two directions. We have the pathways enriched at the top of the list, and we have the pathways enriched at the bottom of the list. We are going to have the p-value for all the pathways, but also the sign of the enrichment score. A positive enrichment score would be a pathway significantly enriched in my up-regulated genes, and a negative enrichment score would be a pathway enriched in my down-regulated genes. It's Mootha et al. who developed GSEA in 2003.
They were studying diabetes, and they came up with this GSEA algorithm, which showed the down-regulation of the oxidative phosphorylation pathway. What was interesting is that if they looked individually at the genes of oxidative phosphorylation, none of these genes were significantly down-regulated. But when they calculated the sum of the small changes of the genes in the oxidative phosphorylation pathway, they found that this pathway was actually strongly affected. GSEA could capture that, and it was further validated.
So now we are going to see how the GSEA running sum is calculated. On the left, we have the rank file, with up-regulated genes at the top and down-regulated genes at the bottom. We place this rank file horizontally, with the up-regulated genes on the left and the down-regulated genes on the right. The genes that are not significant stay here in the middle, and they do not contribute to the enrichment score. The black bars are the genes in the pathway that we are testing. What we can see here is that we have a higher density of these genes on the left side, which is the up-regulated genes. It means that this pathway is enriched in our up-regulated genes.
The GSEA running sum starts at zero at gene one, and then you go gene by gene. The running sum increases each time there is a gene that is in the tested pathway, and decreases when the gene is not in the pathway. So you have to have a lot of genes from the tested pathway at the beginning of your ranked list to have a sharp increase in the running sum. Then it's going to decrease, because you don't have that many genes from the pathway any more. The maximum here is what we call the enrichment score, the GSEA enrichment score. And GSEA has a weighting system that gives more weight to the genes at the two extremes of the rank file, so on the left the very top up-regulated genes, and on the right the very top down-regulated genes. So you cannot have a peak in the middle.
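The running sum can be sketched in Python as follows. This is a simplified illustration of the weighted running-sum statistic, not the GSEA software itself: the function name `enrichment_score`, the `weight` parameter, and the ten toy genes with their scores are all assumptions made for the example. Genes in the pathway push the sum up in proportion to the absolute value of their ranking score, and genes outside the pathway push it down by a constant step.

```python
def enrichment_score(ranked_genes, scores, pathway, weight=1.0):
    """Walk down the ranked list and return (enrichment score, running-sum profile).
    ranked_genes and scores must already be sorted from top up-regulated
    to top down-regulated."""
    pathway = set(pathway)
    hit_weights = [abs(s) ** weight if g in pathway else 0.0
                   for g, s in zip(ranked_genes, scores)]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for g in ranked_genes if g not in pathway)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("pathway must cover some, but not all, ranked genes")

    running, best, profile = 0.0, 0.0, []
    for gene, w in zip(ranked_genes, hit_weights):
        if gene in pathway:
            running += w / total_hit     # step up, weighted by the gene's score
        else:
            running -= 1.0 / n_miss      # constant step down
        profile.append(running)
        if abs(running) > abs(best):     # keep the largest deviation from zero
            best = running
    return best, profile

# Toy ranked list (hypothetical): g1 is the top up-regulated gene.
genes = [f"g{i}" for i in range(1, 11)]
scores = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]

es_up, profile_up = enrichment_score(genes, scores, {"g1", "g2", "g3"})
es_down, profile_down = enrichment_score(genes, scores, {"g8", "g9", "g10"})
print(round(es_up, 3), round(es_down, 3))
```

A pathway concentrated at the top of the list gives a positive score and one concentrated at the bottom gives a negative score, while the profile returns to zero at the end of the list.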
So then you can have pathways like this. This is the name of a pathway that is enriched in genes up-regulated in my data, so in this case I have a positive enrichment score. And you also have pathways like this one, which is the name of another pathway, that are enriched in down-regulated genes. In this case, you are going to have a negative enrichment score.
Now that we have calculated the enrichment score, we still need to estimate whether the enrichment score that we got is equal to or larger than the one that we could have obtained by random chance only. The GSEA method uses permutation to build a null distribution and to calculate an empirical p-value. For each tested pathway, we have the observed enrichment score, and then we see if it's far away from the mean of the null distribution, which should be equal to zero because it is random. GSEA calculates an empirical p-value by counting the number of times the random scores were as extreme as or more extreme than the observed score. Okay, so then, the same as with g:Profiler, we have an enrichment score and we have a p-value.
So, for the overall summary: if you have a defined gene list, like 200 or 500 genes, then you use a tool that uses the Fisher's exact test. But if you have RNA-seq with a two-class design, you try to rank your gene list to avoid an arbitrary threshold, and in this case you can use the GSEA tool. So now we are going... yes, sure.
I'm just confused. In the beginning, for the Fisher's test, we had two different groups: those genes that were in our pathway, and the other was those genes that we are interested in. For the second one, where we have this ranked gene list, we have a group of genes that we are interested in, but instead of having just the gene list, we have the ranks for those genes based on, for example, gene expression level. And then where do we use the genes of the pathway in the formula? I don't understand that.
Yeah. So for the Fisher's exact test, you understand, yeah.
You have selected genes of interest. So for RNA-seq, you use your top ones and that's it, yeah. But for GSEA, you need a two-class design, treated versus control, yes. So you have two classes for GSEA, treated and control, and you are interested in the genes that are more highly expressed in treated versus control. You are also interested in the genes that are down-regulated in treated versus control. So we don't have any set of genes in our pathway analysis? We don't have any gene list that belongs, for example, to a specific pathway? Yes, yes, we do. So this is a pathway, yeah. You see my slide. In the pathway database, we have maybe 5,000 pathways. For one pathway, for example cell cycle, we know 500 genes in this pathway. So then GSEA is going to look, for this cell cycle pathway, at how many of its genes are among the up-regulated genes at the top of the list. And then another pathway, maybe the pathway for apoptosis, is going to be enriched in genes down-regulated in my list. So I have pathways that contain a lot of up-regulated genes and pathways that contain a lot of down-regulated genes. But would you go to the next slide? And next. In these figures, we don't use the pathway data? You are just using the up-regulated and down-regulated scores for the genes that we have in our list? So that's the list. And, for example, this one, signaling by FGFR1, comes from the pathway database that we are testing. So you are going to have a plot for each pathway, for the 5,000 pathways that are in the database. Those black bars are the genes that are in this pathway. And this one also, antigen processing, is a pathway that is in the pathway database. Okay, so all the genes are in order on the x axis, but those black ones are genes that we have in our specific list. So when you speak about gene lists, those are the rank-ordered gene lists.
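The way one ranked gene list and one pathway combine into an enrichment score and an empirical p-value can be sketched roughly as below. This is a simplified illustration, not the GSEA implementation: the running sum here is unweighted (GSEA by default weights each step by the ranking metric), the null distribution is built by shuffling the gene order (an assumption made for simplicity), and the function names and the toy genes `g0` to `g99` are hypothetical.

```python
import random


def enrichment_score(ranked_genes, pathway):
    """Unweighted running-sum score: walk down the ranked list, step up at
    pathway genes and down at all other genes, and return the largest
    deviation from zero (positive near the top, negative near the bottom)."""
    hits = [gene in pathway for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def empirical_p_value(ranked_genes, pathway, n_perm=1000, seed=0):
    """Build a null distribution by shuffling the gene order (assumption:
    a gene permutation, chosen here for simplicity) and count how often a
    random score is at least as extreme as the observed one, on the same side."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, pathway)
    shuffled = list(ranked_genes)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_score = enrichment_score(shuffled, pathway)
        if observed >= 0 and null_score >= observed:
            as_extreme += 1
        elif observed < 0 and null_score <= observed:
            as_extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return observed, (as_extreme + 1) / (n_perm + 1)


# Toy data: 100 ranked genes, most up-regulated first (hypothetical names).
ranked = [f"g{i}" for i in range(100)]
pathway_at_top = {f"g{i}" for i in range(10)}
pathway_at_bottom = {f"g{i}" for i in range(90, 100)}

print(empirical_p_value(ranked, pathway_at_top))
print(empirical_p_value(ranked, pathway_at_bottom))
```

In this toy case the first pathway gets a positive score and the second a negative one, which mirrors the red and blue pathways described in the lecture. The same function is simply called once per pathway in the database.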
And then you have one element, which is the gene list, and the other element is the pathway. And then you just try to see, for the pathway that I am testing, where its genes are: among the up-regulated genes of my gene list, of my data, or among the down-regulated ones? Yeah, I got it. Thank you. So now, we had a few questions about the Q value, and it's true that the Q value is very important. So far, what I have shown you is one pathway; the examples that I was showing you tested one pathway at a time. But in reality, we are testing many pathways at the same time: we are basically testing all the pathways that are in the pathway database to see how they overlap with my gene list. Therefore, because we are testing so many pathways, we also need to correct for multiple hypothesis testing. So this slide is going to explain the concept behind multiple hypothesis testing. We are going back to our example of red and black genes. We had 5,000 genes containing 500 black genes, and the result that we hoped to get was at least 4 black genes. The p-value was telling us that we only had a 0.01 chance to get this result by random chance only. But this is only if we do one trial. Because if we try and try again until we succeed, maybe 10,000 times, at some point we get the result of 4 black genes and 1 red gene. So even if an event is unlikely the first time, it becomes more and more probable when we try multiple times. That's what people mean by multiple hypothesis testing, and that's why we need to correct for the number of tests that we are making. If we did not correct for multiple hypothesis testing, we would generate a high rate of type 1 errors, also called false positives.
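To put numbers on "it becomes more and more probable when we try multiple times", here is a small sketch. It assumes that every tested pathway is truly unrelated to the gene list and, for the first function only, that the tests are independent, which real pathways with shared genes are not. The function names are hypothetical.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """Chance of seeing at least one p-value below alpha when all n_tests
    are pure noise (assumes independent tests)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(alpha, n_tests):
    """Average number of pure-noise tests that pass the threshold alpha."""
    return alpha * n_tests


for n_tests in (1, 10, 100, 5000):
    print(
        n_tests,
        round(prob_at_least_one_false_positive(0.01, n_tests), 4),
        expected_false_positives(0.01, n_tests),
    )
```

With a single test the risk stays at the nominal 0.01, but with the roughly 5,000 pathways of a database, a false positive somewhere in the table becomes almost certain if the p-values are left uncorrected.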
So we need to correct for the number of pathways that we have tested. There is actually a simple way to correct for multiple tests, and intuitively we could think of it ourselves: we just multiply the nominal p-value, so the first p-value that we obtained, by the number of pathways that we have tested. In this case, the corrected p-value is always higher than the original p-value. So if you have an original p-value of 0.01, maybe after the correction your p-value is going to be 0.05. This correction is known as the Bonferroni correction, and it is very stringent. So it could be that, if you use the Bonferroni correction, you won't have a lot of pathways that pass the corrected p-value threshold of 0.05. So there is another method that is widely used, and it's called the false discovery rate. The false discovery rate is the expected proportion of the observed enrichments that are due to random chance only. So let's say we run the pathway enrichment analysis. We get the p-value, and then we get the corrected p-value, which for the FDR is called a Q value. Then we select the pathways that have a Q value under 0.05. And what we say is that, at the 0.05 threshold, we expect 5% of the selected pathways to be due to random chance. The main method to calculate the FDR is called the Benjamini-Hochberg method. The result is sometimes just called the FDR, but it's the FDR Q value. Because the number of tested pathways is included in the equation to calculate the FDR, the more pathways we are testing, say 10,000 pathways, the more strongly the p-value is going to be corrected. So one way to decrease the stringency of the correction is to limit the number of tests that we are making. That's why, for example, when we apply our GSEA protocol, we also remove the small pathways that contain fewer than 10 genes and the pathways that contain more than 500 or 1,000 genes.
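The two corrections just described can be sketched in a few lines. This is a minimal illustration of the Bonferroni and Benjamini-Hochberg procedures on made-up p-values, not the code used by GSEA or g:Profiler (which may compute their FDR differently); the function names are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each nominal p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg Q values: p * n_tests / rank, then made monotone
    by taking the running minimum from the largest p-value downwards."""
    n_tests = len(p_values)
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    q_values = [0.0] * n_tests
    running_min = 1.0
    for rank in range(n_tests, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n_tests / rank)
        q_values[index] = running_min
    return q_values


# Made-up nominal p-values for four pathways.
nominal = [0.001, 0.01, 0.03, 0.5]
print(bonferroni(nominal))
print(benjamini_hochberg(nominal))
```

On these four values, two pathways stay under 0.05 after Bonferroni and three stay under 0.05 after Benjamini-Hochberg, which illustrates why the FDR is described as the less stringent option. In both functions the number of tests enters the formula directly, so filtering out pathways before testing makes the correction milder.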
We sometimes remove the very large pathways because they are not informative, but removing pathways also reduces the number of pathways that we are testing, so we don't have to correct the p-value so much. Okay, so that's the end of pathway enrichment analysis, which is basically: take the gene list as a ranked list to avoid an arbitrary threshold, or just take the 500 genes that we want to interpret, and then see which pathways overlap with our gene list. But then, when we have these tables, we use Cytoscape and EnrichmentMap to visualize the results. And this is the question I answered already: the reason we do it is that we get these very, very long tables, with many, many enriched pathways, but actually some of these pathways are related. Some share a very similar biological function and have genes in common. The reason is that the pathway databases are redundant, and also that we like to start with a very large number of pathways in the database to get very precise information. So to address this issue, EnrichmentMap was developed. In EnrichmentMap, we are going to create a network, and a pathway is going to be what we call a node. So every pathway that passed the FDR threshold of 0.05 will be a node in the enrichment map. And if, maybe, pathways 4, 10, 15, and 20 share a lot of genes because they share the same biological function, they are going to be connected by lines that we call edges. So these pathways, which were scattered here and there in the list, will now form a cluster on the map that we will be able to see. On this big table, maybe we have 50 pathways, but after the visualization using EnrichmentMap, they may be summarized by five biological functions. So this is what I said: the goal of EnrichmentMap is to address the redundancy problem. That's what I just explained to you.
So EnrichmentMap is a Cytoscape app that is compatible with many enrichment tools through the generic format, as long as the output table has the pathway name, the p-value, and the genes included in each pathway. In the practical lab, we are going to use EnrichmentMap with the GSEA tool and the g:Profiler tool. One advantage of EnrichmentMap is that we can upload more than one dataset. So if you have different time points or different conditions, you do your pathway enrichment independently for each, and then you can upload them together on the same map. If you create an enrichment map with the GSEA output, you will have pathways that are enriched in your up-regulated genes; they will be red nodes on the map. And you will also have pathways that are enriched in your down-regulated genes; in this case, you have blue nodes. So red is pathways enriched in up-regulated genes, and blue is pathways enriched in down-regulated genes. We have seen that if two pathways share a significant number of genes, they are going to be connected by lines. But we use an overlap coefficient score to decide how many edges, or lines, we display on the map, because if two pathways were connected as soon as they share at least one gene, there would be too many connections and the network would look like a hairball. So this is an example of an enrichment map created with GSEA results. Again, the blue nodes are pathways that are enriched in, meaning that they contain a lot of, genes down-regulated in my experiment. And the red nodes are the pathways that contain a lot of genes that are up-regulated in my experiment. And so EnrichmentMap is doing this clustering.
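As a rough sketch of how an overlap coefficient can decide which edges are drawn: the score for two pathways is the number of shared genes divided by the size of the smaller pathway, and an edge is kept only above a threshold. The gene sets below are toy examples, not real pathway definitions, and the 0.5 threshold and function names are assumptions for illustration rather than a statement of what the EnrichmentMap app does internally.

```python
from itertools import combinations


def overlap_coefficient(genes_a, genes_b):
    """Shared genes divided by the size of the smaller gene set."""
    smaller = min(len(genes_a), len(genes_b))
    if smaller == 0:
        return 0.0
    return len(genes_a & genes_b) / smaller


def build_edges(pathways, threshold=0.5):
    """Return (pathway_a, pathway_b, score) for every pair whose overlap
    coefficient reaches the threshold."""
    edges = []
    for (name_a, genes_a), (name_b, genes_b) in combinations(sorted(pathways.items()), 2):
        score = overlap_coefficient(genes_a, genes_b)
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


# Toy gene sets (illustrative only).
toy_pathways = {
    "cell cycle": {"CDK1", "CCNB1", "CDC20", "PLK1"},
    "mitosis": {"CDK1", "CCNB1", "PLK1"},
    "apoptosis": {"CASP3", "BAX", "CDK1"},
}

print(build_edges(toy_pathways, threshold=0.5))
print(len(build_edges(toy_pathways, threshold=0.01)))
```

With the stricter threshold only the two closely related pathways are linked, while a threshold that accepts any shared gene connects every pair, which is the hairball situation on a real map with hundreds of nodes.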
And then we use another app called AutoAnnotate to draw these circles around the clusters and to automatically annotate each cluster using the top three words that are included in the names of its pathways. And then, when you prepare your figure for publication, you can further edit the labels. So this is a map that we call publication ready. We have annotated our clusters. We made sure that the clusters do not overlap. And then we removed the labels of the pathways using an option called publication ready. And when we make a figure, we manually add a legend to indicate what the nodes and edges are. Another feature of EnrichmentMap is that we can add an additional gene set to the map. So these pathways are the result of pathway enrichment on gene expression. But in addition to that, we added microRNA targets to the map, and we can see here that the microRNA targets mostly overlapped with, or were contained in, the focal adhesion pathway. So a network visualization has the advantage of being able to carry different layers of information. If your enrichment map is too busy because you had a lot of pathways enriched in your data, there is also an option to collapse the network, so that each cluster that contained a lot of pathways is now represented as one node. So this is an example of an enrichment map that was created using single-cell RNA data. Here we see the t-SNE plot of the single-cell RNA data with the different clusters. What they did is they took different clusters of interest, ran pathway enrichment analysis on each individual cluster, and then created an individual enrichment map for each of these clusters. We are going to do something similar in practical lab number one. So the steps were cluster identification, then enrichment analysis in each of the clusters, and then enrichment map. So this is an enrichment map that I find very nice in terms of design and content.
It is one that we show a lot as an example. The data were copy number variants in autism. This map shows the relationship between the pathways enriched in CNV deletions, which are here the red and the pink nodes, that is, the pathways enriched in genes that they found to carry deletions. What is interesting is that they merged this map with another source of information, which was genes and pathways known to be related to autism. So that was prior knowledge on genes known to be related to autism and to intellectual disability. What we can see here is that they found some pathways in common between these known genes and the pathways that came directly from their experiment. And they really made good use of the visual style, because when we overlay different information on a network, we can use different visual styles to make this clear and obvious. So here they used red and pink for their data and yellow for the prior knowledge. They also played with the shape of the nodes, and with the color of the edges. So, the tips for a publishable enrichment map. First you get your map, and you try to avoid a map that would be too busy, so you make it clear. You move your clusters apart, then you annotate your clusters, and then you edit the automatic annotation. You play with the connectivity parameter, dense or sparse, to make sure that the map is clear. Then you play with the visual style, the colors and shapes of the nodes. And when you are ready, you can export the image as a PDF and polish the last details using a graphics editor. So now we are going to start the last section, which is not going to be a lot of slides, but it is going to be an opening on module 9 with Robin. We are going to talk more generally about the Cytoscape software that we use for EnrichmentMap, and also about the advantages of network visualization, and Robin will go into more detail.
So Cytoscape is a standalone application. It's open source and free. And it is a collaborative project involving multiple groups. When we introduce Cytoscape and network visualization, we usually start by explaining the concept of interconnectivity. Interconnectivity is defined as how two or more things or people are connected to each other. This concept was first described for social networks, with the idea that there is a maximum of six degrees of separation between two people using a friend-of-a-friend chain. The number of steps that separate two people is called the social distance. So the first software to build networks was made for social networks, based on graph theory models. But in life we can build a lot of different types of networks, because a lot of things are interconnected, and they are connected by what they have in common: there are similarities and there are correlations. That's why I put this network below, which shows soccer players in a team; this network was also made with Cytoscape. Each node is a soccer player of the team, and if a node is large, it is because that player received and passed the ball a lot during the game. So basically these are two different types of networks, but Cytoscape was mainly developed with the aim of creating networks related to biological questions, and of finding and visualizing relationships between the biological entities that we are studying, which are genes, proteins, metabolites, or pathways. So why would we use network visualization for biological data? It is because we want to discover and represent the relationships that are present in our data. We usually receive our data in very, very long tables or Excel spreadsheets. These tables might contain relationships between the data points, but we cannot see them. But when a network is created, it is usually very easy to detect patterns with our eyes only.
So in order to understand a network, there are two terms to learn: node and edge. We have already seen them with the enrichment map, where a node was a pathway and an edge was the genes in common between two pathways. But we can have multiple types of networks, and we can also have networks where a node is a gene or a protein. Then the edge is going to represent the relationship between these two genes or proteins. So if the nodes are proteins, the edges can represent the fact that the proteins are known to physically interact with each other. So for each network, we need to understand what the nodes are and what the edges are before trying to interpret the results. So, a question maybe. There's a really good question in. They're all good. All the questions are good. Fair, very fair. So Arianna said she's a bit confused about the biological implications of a pathway having some genes that are up-regulated and some that are down-regulated. Why is it okay to determine whether a pathway is generally up-regulated or down-regulated based on the proportion of genes that are up-regulated or down-regulated? Is there weighting based on which genes are more important than others? And I'm just going to add to the end there: what about negative feedback loops in pathways? Yeah, yeah. Yeah, I agree. So a pathway, yeah. When we do GSEA, we usually differentiate between up-regulated genes and down-regulated genes. And basically, if we have a pathway enriched in up-regulated genes, and you look at the results, it means that maybe one-third, to be realistic, it's not 80%, I'd say one-third of the genes in the pathway are up-regulated, and we say the pathway is enriched in up-regulated genes. But I never say the pathway is up-regulated or the pathway is activated, because I know that it's only a portion of those genes that are up-regulated. It's significant. It's more than random chance. So something is going on.
But I know that if I look further down in the results for this pathway that is enriched in up-regulated genes, I also have genes that are down-regulated. And depending on whether it's a small pathway that is very specific or a larger pathway, sometimes you have different branches of the pathway. You can have three branches of the pathway where all the genes are up and then one branch where the genes are down. So some people prefer to rank the genes only by whether they are differentially expressed or not differentially expressed. Then you would just say your pathway is enriched in genes that are altered in your model versus not altered. At this point, it doesn't matter to you whether they are up-regulated or down-regulated, because you don't know which branch is an activator and which branch is an inhibitor. So that's one way to answer the question. And then for the feedback loop, the only thing I can think of, but that's very different, is that sometimes when we apply a drug, or even overexpress a gene, the target pathway, instead of being repressed at the RNA level as we thought it would be, is expressed, because there is a feedback loop. But that is not related to the pathway enrichment results; it's just related to your model. If you target a gene, the cell is going to try to compensate. So if you try to target a pathway to inhibit it, the cell may have different routes for that, and at the end the pathway that you wanted to repress is actually enriched in up-regulated genes. So I don't know if I answered the question. I thought that was great. Thank you so much.
So going back to Cytoscape: we showed EnrichmentMap, but what I'm trying to tell you now is that Cytoscape is very generic. You can make a network with a lot of types of ideas in mind, and you can create your custom network with no need for apps, and this is what I'm showing you here. The advantage of creating a network in Cytoscape is that you can add different layers of information. So here we have a very small gene list, and we have two types of information. One is the number of mutations in each gene in my cohort of patients, and the other is the expression level of the gene. I was able to add these two pieces of information to my network: I made the node size proportional to the number of mutations, and I mapped the node color to the expression of the genes. So, two kinds of information on one network. And then the edges come from another table that tells me which of the proteins encoded by those genes are known to physically interact with each other. So we were able to put three pieces of information on this network. So what are the advantages of networks? Networks are powerful tools to reduce complexity. They are more efficient than tables. They are great for data integration, and they offer an intuitive visualization. Here on the right is an example to illustrate this concept. This network was built in 2020 and represents the SARS-CoV-2 proteins. I think it's 26 out of 29, and they were tagged and pulled down to detect the human proteins that could physically bind or interact with them. So this network is what we call a protein-protein interaction network, and they used different visual styles to represent the different information. They used the node shape, here the diamond, and the color. The red diamond nodes represent the SARS-CoV-2 proteins, and the small round nodes represent the human proteins that interacted in their experiment with the SARS-CoV-2 proteins.
So then the edges represent this interaction, and if you have a thick edge, it means that they think the interaction was very strong, because the human protein was more abundant in the proteomics data that they got from their pull-downs. The additional information is that where a node is orange, it is because the human protein is known to have a drug that targets it. And the last one is that these yellow edges are the physical interactions between the human proteins. So I think they were able to put at least five pieces of information on this network and still maintain a clear image that we can interpret and understand. So in Cytoscape, once you have created your network, it can be like a hairball, it can be a bit messy, but you have multiple tools in Cytoscape to manipulate your networks, and one important thing is to play with the layout. You try different layout algorithms. So this would be a network before applying the layout, and then after the layout you can see that your clusters are now far away from each other. So today we are focusing on pathway enrichment analysis, and we are showing you EnrichmentMap this morning and then the ReactomeFI plugin this afternoon, but many other apps exist. You can go to the App Store, look at the categories, and read the descriptions of the apps to see if there is one that is suitable for your project.