Okay, so let's start with the first lecture of module two. Again, this lecture is under the Creative Commons license, which means that you are free to copy and distribute the work under the same license. The lecture is named Finding Over-represented Pathways in Gene Lists. During this lecture we are going to cover the theory behind the tools that we are going to use. We do that because we think that if you understand how the tools work, it is going to be easier for you to choose the right tools after the workshop, because many enrichment tools exist for pathway and network analysis. You can then choose the tool that is best for your data and project, and it is also going to be easier for you to adjust the parameters and, at the end, to correctly interpret your data. So this is our lecture, and it is going to be followed by a practical lab. In the practical lab, we are going to use two tools called g:Profiler and GSEA.
So let's start with the general workflow of enrichment analysis and some definitions. We have several learning objectives for this lecture. I hope that we will be able to understand the difference between a defined gene list and a ranked gene list, the concept of p-value and FDR in the context of pathway enrichment analysis, and the output of the enrichment test.
So here is the general workflow of pathway enrichment analysis, to illustrate the general concept. We can see that we have three steps. The first step is where we get our hits from our omics experiment. This is where we use statistical analysis to estimate which data points (genes, proteins or lipids) in our experiment are different from the background noise.
So basically, if we have RNA-seq data, this is where we select the genes that are significantly differentially expressed. In this step number one, we try to remove as much noise as possible. Our advice is that you take a lot of care when you do step one, so that you don't carry the noise into your pathway analysis.
Step two is really the focus of the current module. This is where you use a bioinformatics tool to interpret your data, so this is where you do the pathway enrichment analysis. To do it, we query our list of genes against biological processes stored in databases. These biological processes are what we call pathways, and that's why we say pathway enrichment. But sometimes in your project you may also use other information, such as information about diseases, drug targets or transcription factors, which is also stored in these databases. At the end of step two, you will have your list of enriched pathways.
Step three is when we do the network visualization to better interpret the results. And actually it's an opening: after the network visualization, we usually generate new hypotheses, and this is also where we need to validate. Once we have a new hypothesis and new pathways of interest, we go back to the bench and try to validate the pathways of interest using, for example, a drug or an enzyme.
So here is how the module integrates with the full workshop. Gary talked today, and in the pre-recorded video, about the gene list, how to get your gene list, the different pathway databases and some examples of pathway analysis. For the pathway enrichment, we start with these two elements. The first element is the gene list which, as I said, comes from your statistical analysis where you remove your noise. The second element is the pathways, which come from prior knowledge stored in the pathway databases. And here I put gene set.
So the gene set is a format to store the pathway information. Basically, if we say cell cycle, the gene set will have the name cell cycle and will contain all the genes that are known to be involved in the cell cycle. Now, in your analysis tools, for the gene list and the gene sets to talk to each other, you need to use the same gene identifiers. This is a common mistake. If you have gene symbols in your gene list, pay attention that the database you are using is formatted with gene symbols. If you are using Ensembl IDs in your gene list, make sure that your pathway database is also formatted with Ensembl IDs.
Then we do the pathway enrichment analysis. We are looking for over-enrichment of some pathways in our gene list, and we do it to interpret the data in an unbiased way. So this is module two. Afterwards we will do the visualization using Cytoscape, and we will also see in the other modules how it is possible with Cytoscape to integrate multiple layers of information.
Here is another representation of pathway analysis, because sometimes we can think it's a bit complicated, but it's actually just a way to organize a gene list into some categories. I have my gene list on the left here, and the genes in my gene list belong to different categories. I have the genes in black that might be part of the axon guidance pathway, the genes in green that are part of aging, the genes in purple that are part of stem cell development and the other ones in blue that are part of cell migration. So what I have done is summarize my gene list into four biological functions.
It's much easier for me now to find new hypotheses for my model and to design new experiments, because I just have to focus on these four functions instead of looking at my 500 genes. But we understand that we need to perform pathway analysis on a gene list because the gene list is large and we need to summarize it. If you start with a very small gene list, you may need to interpret it in other ways, and pathway analysis might not be the best way for you. We are going to see tomorrow some protein-protein interaction networks that you could use instead.
This slide illustrates the concept of overlap, which is used to calculate the enrichment score when we talk about over-representation analysis. We have the same gene list on the left, and the gene list has 41 genes. We are testing a pathway called axon guidance, so we want to see if axon guidance is enriched in our gene list. What we see here is that the original size of the pathway is 39 genes, so in the original pathway database this pathway has 39 genes. The overlap, meaning the genes that are in common between my gene list and my pathway, is 13 genes. So the overlap size is 13. What we can see is that 13 divided by 39 is one third, so one third of the pathway is in the overlap, which is quite large. And for the gene list it is about the same: 13 divided by 41, so about one third of the gene list is shared with this pathway. Let's say this is the first measurement and the starting point for calculating the enrichment.
In addition to the simple concept of overlap, what some methods also add is a score associated with the genes. For example, for RNA-seq you could use the differential expression score, and then you can rank your genes from high score to low score. Depending on the method, this score is going to be taken into account when you calculate the enrichment score.
Another element that is important when you do pathway enrichment analysis is the concept of background. The background is all the genes that we could measure during the experiment. If we do RNA sequencing and we work on the whole genome, then the background could be all the genes in the genome. But if we have, for example, a custom array on which we put only immune genes, then we could only measure immune genes, so the background should be reduced to only those genes. This is a very important concept. For RNA-seq, we usually use the whole genome, but I would say that, more accurately, we could reduce the background to only the genes that are expressed in our model. So we could remove the genes with zero counts or low counts.
Okay, so some definitions. When we speak about over-representation analysis, sometimes we say that the pathway is over-represented in our gene list, or we say that the pathway is enriched in our gene list. Simply, it just means that there are several or many genes from this pathway in our gene list. A more accurate definition would be that there are more genes from this pathway in our gene list than what we could have obtained by chance only.
So in the second part of the lecture, we are going to learn about enrichment analysis using a defined gene list. That brings us to the outline of this lecture, which describes the workflow of enrichment analysis. We see that there are two workflows: one if we have a defined gene list and another one if we have a ranked gene list. If we have a defined gene list, then the statistical test that we are going to use is Fisher's exact test. If we have a ranked gene list, then the tool that we are going to use is GSEA, which uses a rank sum test. But the output is very similar: the output of the enrichment analysis is a p-value associated with each of the tested pathways.
And then, because we are testing many pathways at the same time, we need to correct for multiple hypothesis testing. So during the lecture, we are going to see the calculation of the Bonferroni correction as well as the false discovery rate.
So what is the difference between a defined gene list and a ranked gene list? A defined gene list is typically a fixed number of genes, so you could have selected 200 or 500 genes. A typical example would be that you have selected genes that are frequently mutated across some patients. The question that you are going to ask is: are any pathways surprisingly enriched or depleted in my gene list? Usually we only ask about enrichment: are any pathways surprisingly enriched in my gene list? And the statistical test is going to be Fisher's exact test.
Now a ranked list, which, as we will see, is recommended if you can have one. A ranked list is a list of all the genes in the genome that you were able to rank using a score coming out of your omics experiment. A typical example would be an RNA-seq experiment where you compare two groups, treated and control. You can rank all genes using the differential expression score, from top upregulated to top downregulated. The question that you are going to answer in this case is: do any pathways rank surprisingly high or low in my ranked list of genes? The tool that we are going to see is GSEA, and we are also going to talk briefly about the Wilcoxon rank sum test.
So we are going to start with example one, for the defined gene list. For this example, the data come from a single-cell proteomics experiment, and you have the reference of the paper about this data set here. For this data set, we use a cell line that is derived from a patient with acute myeloid leukemia, which is a type of blood cancer. And this is the flow cytometry plot.
What we see in this plot is that the cell line is actually a mix of populations, and we have three groups. The first group is the LSC, the leukemia stem cells, which are very primitive cells. Then we have the progenitors, which are intermediate cells, and then we have the blasts, which are more mature cells. All these cells are going to be analyzed at the single-cell level. Using proteomics, all the proteins are extracted from each single cell. At the end, we have for each cell all the proteins that were detected, with their abundance, and we can use the same techniques as for single-cell RNA-seq. So we can cluster and group our cells.
We can see here on this two-dimensional graph, which is a UMAP, that we have four groups of cells. For this example, we are going to be interested in cluster A and cluster B. Here on the left is the matrix of the cells, containing all the proteins that were detected. The cells are the columns, the proteins are the rows and the values are the protein abundances. What we can do, now that we have done this clustering, is label our cells using the cluster labels. So we are going to label the cells that belong to cluster A, and we are going to label the cells that belong to cluster B. This is the blue and red matrix here: we put all the cells from cluster A on the left and all the cells from cluster B on the right.
What we can do here is apply the tools that are made for RNA-seq to calculate differential expression. We are going to calculate the differential expression between cluster A and cluster B to get our two protein lists. For this we use the Seurat pipeline and the function FindMarkers to find our lists of proteins. So we have a list of proteins for cluster B and a list for cluster A. These are the two lists that we want to analyze functionally, by doing some pathway enrichment analysis on them.
So we use the tool g:Profiler. We just copy and paste our gene list into g:Profiler. g:Profiler gives us the list of enriched pathways that are significant, and then later on we will do the network visualization. In the next slides, we are going to see how g:Profiler does this enrichment.
As I said previously, the first step of over-representation analysis is to find the overlap between the genes present in our gene list and the pathway database. The pathway database is a very important element, so I wanted to show you its structure. During the lab you will have some pathway databases as GMT files, and you can open them in a text editor. They might be too big to open in Excel, but try to open them in a text editor so that you understand the structure. Basically, the first columns are the names of the pathways, followed by the genes that are associated with each pathway. What we do is look for the overlap between the genes in the pathway and our gene list. We see here in yellow, for this pathway (I think it's regulation of transcription), that we have four genes that are in this pathway and in our gene list. So that's what we are doing: for each pathway we look for the overlap and we count the overlap size. And when I say we are doing it, actually the tools are doing this for us.
Now we are slowly getting into more detail on how the simple enrichment test works. We have our gene list here in pink. It's 41 genes, if you remember from the first example, and the pathway that we were testing, I think it was axon guidance, has 39 genes, and the overlap size was 13 genes, so 13 genes in common. That was the first step. Now we go to the second step: we need to find a way to calculate how likely it is to get the same overlap size, or a larger overlap, by chance only.
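The overlap counting described above can be sketched in a few lines of Python. This is only an illustration, not how g:Profiler is implemented. It assumes the usual GMT layout, where each line is tab-separated and holds a pathway name, a description and then the genes. The pathway names, gene symbols and the helper names `parse_gmt` and `overlaps` are made up for the example.

```python
# Minimal sketch: parse GMT-formatted text and count the overlap with a gene list.
# Pathway names and gene symbols are hypothetical.

def parse_gmt(text):
    """Return {pathway_name: set_of_genes} from tab-separated GMT lines."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def overlaps(gene_list, gene_sets):
    """Return {pathway_name: sorted list of genes shared with the gene list}."""
    query = set(gene_list)
    return {name: sorted(query & genes) for name, genes in gene_sets.items()}


toy_gmt = (
    "PATHWAY_A\tfirst toy pathway\tGENE1\tGENE2\tGENE3\tGENE4\n"
    "PATHWAY_B\tsecond toy pathway\tGENE5\tGENE6\n"
)
toy_gene_list = ["GENE2", "GENE4", "GENE6", "GENE9"]

gene_sets = parse_gmt(toy_gmt)
shared = overlaps(toy_gene_list, gene_sets)
for name, genes in shared.items():
    # pathway name, overlap size, pathway size, genes in the overlap
    print(name, len(genes), len(gene_sets[name]), genes)
```

Note that the intersection only works if the gene list and the GMT file use the same type of identifier, which is the point made earlier about gene symbols versus Ensembl IDs.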
We need this to be able to calculate the p-value, because the result of the enrichment analysis is a list of p-values, one associated with each pathway. What we could do at this point is random permutations. We could generate 1,000 random gene lists and calculate the overlap between each of these random gene lists and the pathway that we are testing. This will help us build the null distribution. Then we see if our observed overlap of 13 is higher than the null distribution, because if it is much higher than the null distribution, then the enrichment is unlikely to be what happens by chance only. When we do that, we say that we calculate an empirical p-value, and this is the formula that is here below.
So the p-value assesses the probability that the overlap between our gene list and our pathway is observed by chance only. A p-value can range from zero to one. If the p-value is close to zero, there is a low chance that the results are caused by random chance only, and we can be confident to report the pathway as enriched. But if the p-value is one, the overlap is likely due to random chance. So close to zero is very good and we keep the pathway, and one is bad and we don't take this pathway into consideration.
The problem with random permutations is that they take time, and they can also be resource intensive for the computer. What is good in the case of this simple enrichment test is that, instead of doing the random permutations, we can use a statistical test, because we can model the null distribution using the hypergeometric probability distribution. The test that uses the hypergeometric probability distribution is Fisher's exact test.
So here is the hypergeometric probability distribution, and it gives the probability for each of these cases. In this example we have 5,000 red balls, and we adapt it to pathway enrichment analysis by saying that they are 5,000 genes.
And what we can see here is that we also have black balls, and we say that these are the genes that are in the pathway we are testing, like the axon guidance pathway. So what we see is that we have many more red genes than black genes. And this is our gene list: we randomly pick five balls, five genes, and the result that we got was four black balls and one red ball. I think we understand intuitively that it was not easy to get this result, so if we get this result on the first try, it is not very likely to be by chance. Using the hypergeometric probability distribution, we can assess the probability in an exact way. For example here, the probability to get zero black balls is very high, it's 0.6. Then we get the probability to get one black ball and four red balls, which is lower. And the probability to get four black balls and one red ball is very low, it's actually 0.001. From these probabilities, we can derive the p-value by summing up the probability of this result and of the more extreme one, five black balls. So the p-value to get four black balls and one red ball is actually 0.001. We know that it's very low, so we can conclude that this pathway, the black pathway, is significantly enriched in our gene list, and it's probably not due to random chance.
To calculate Fisher's exact test without using tools, we can build a two-by-two table. The numbers that enter the formula are: the size of the overlap between your gene list and the pathway; the pathway size, which gives the number of genes that are in the pathway but not in the overlap; the gene list size, which gives the number of genes that are in the gene list but not in the pathway; and the background size, which gives the number of genes that are neither in the pathway nor in the gene list. These four elements are entered in the formula. I'm just showing this so that you understand that these four factors are important for the p-value.
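To make the two ideas above concrete, here is a small Python sketch of the hypergeometric p-value next to the random-permutation estimate. It is an illustration under assumptions, not the code of any enrichment tool. The background size of 20,000 genes, the toy urn of 50 balls and the `(hits + 1) / (n_perm + 1)` convention for the empirical p-value are assumptions made for the example; only the numbers 13, 39 and 41 come from the lecture.

```python
from math import comb
import random


def hypergeom_pmf(k, background, pathway, drawn):
    """P(exactly k pathway genes among `drawn` genes picked without replacement)."""
    return comb(pathway, k) * comb(background - pathway, drawn - k) / comb(background, drawn)


def enrichment_p_value(overlap, background, pathway, drawn):
    """One-sided p-value: P(overlap >= observed) under the hypergeometric null."""
    upper = min(pathway, drawn)
    return sum(hypergeom_pmf(k, background, pathway, drawn)
               for k in range(overlap, upper + 1))


def empirical_p_value(overlap, background, pathway, drawn, n_perm=2000, seed=0):
    """Estimate the same p-value by drawing random gene lists (permutation idea)."""
    rng = random.Random(seed)
    pathway_ids = set(range(pathway))  # label the first `pathway` genes as pathway members
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(range(background), drawn)
        if sum(1 for g in sample if g in pathway_ids) >= overlap:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # assumed convention, avoids reporting exactly 0


# Lecture numbers: overlap 13, pathway 39, gene list 41; background 20,000 is assumed.
p_lecture = enrichment_p_value(overlap=13, background=20000, pathway=39, drawn=41)
print("axon guidance example:", p_lecture)

# Toy urn (hypothetical): 50 balls, 10 black, draw 5, observe at least 2 black.
p_exact = enrichment_p_value(overlap=2, background=50, pathway=10, drawn=5)
p_random = empirical_p_value(overlap=2, background=50, pathway=10, drawn=5)
print("toy urn exact:", p_exact, "permutation estimate:", p_random)
```

The exact calculation needs only the four numbers listed above, which is why it is so much cheaper than generating thousands of random gene lists.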
So g:Profiler is going to calculate all of that for us. This is the output of g:Profiler with the cluster B list. I just copied and pasted the cluster B list from the single-cell proteomics example into g:Profiler, and these are the top four pathways. These top four pathways were significantly enriched, and we look at the statistics here. We have a column T for term, a column Q for query, and a column for the overlap between T and Q. Term is the pathway, so these numbers are the original size of the pathway, the number of genes in the pathway that was tested. Query is the size of my gene list, so my gene list was 51 genes for cluster B. And the overlap for this one was 21 genes, so 21 genes were in common between my gene list and the pathway.
The other element, as I said, is the background. Here we see that the background is 21,000 genes, and this is because we used the default background. But as I was telling you, for the single-cell proteomics we did not measure all the genes in the genome; only a portion of the proteins were identified. So I think it would have been more accurate to reduce the background to only the proteins that were detected in this experiment. But this is your decision. Each time you have a project, you need to think about your data and your project and ask yourself: do I use the full background, or do I upload a custom background because I could not measure the full genome? You will see that there is an option in g:Profiler to upload your custom background.
Here we are going to see the differences between these top four pathways and which of the four elements that I told you about are important for the p-value calculation. The first one, actin filament, has the lowest p-value, so it is the most significant. What we see is that the overlap is 21, so there are 21 genes in this overlap.
The other ones just have an overlap of 10. So we think that the overlap size was the main factor for this pathway to get the lowest p-value. But now what is the difference between these three pathways? The three pathways have an overlap of 10, so the difference is the size of the original pathway. This one has the lowest p-value, and you see that the size of the original pathway was 408. For the other ones the original size was bigger, above 700, for example 788. So if we take the ratio of the overlap to the pathway size, 10 divided by roughly 400 versus 10 divided by roughly 700, the first one has a higher value, and that's why we get a lower p-value. I'm showing this to you so that you understand the results and you don't only look at the p-value, but you also look at your overlap size and your pathway size when you decide whether to keep this pathway in your further analysis.
And this is the difference between the first one, the actin filament pathway, which is the most significant, and two pathways that were not significant. You see that for these two the overlap was very small: we have an overlap size of four for this one and an overlap of one for this one. When you have a very small overlap, then even if the p-value is on the edge, almost significant, you have to be very careful.
The p-value assesses the probability that the tested pathway is enriched in our gene list by chance only, but we are testing many pathways at the same time. Therefore, we need to correct for multiple hypothesis testing, and that's what we are going to see in the next slide. Just a note to tell you that g:Profiler outputs the adjusted p-value directly. That's why I copied in the nominal p-values that I calculated myself, because in g:Profiler we don't see the nominal p-value.
So why do we need to correct for multiple hypothesis testing? We go back to our example of red balls and black balls to explain it. You remember that it was very unlikely to get this result by chance only: we got four black balls and one red ball, and the p-value was, as I remember, 0.001. But 0.001 is not zero. It means that there is still a chance to get this result randomly, and it can happen if you try multiple times. If I tell you to try until you get this result, maybe you will pick five balls 10,000 times and at one point you will get four black balls and one red ball. So even if it's unlikely, if you repeat the test many times you have a chance to get it. That's why we need to correct for multiple hypothesis testing. If we don't do that, we are going to increase the number of false positives in our results.
There is actually a simple way to correct for multiple hypothesis testing, and intuitively you could think of it yourself: multiply the p-value that you got by the number of pathways that you have tested. This correction exists; it is the Bonferroni correction, and we are going to see how it works. You remember these four pathways coming from the cluster B example that were significant, and these ones that were barely significant. The total number of pathways in the database was 349. They were not all significant, but in the original database there were 349 pathways. So we multiply the nominal p-value by this number to get the adjusted p-value. What I would like you to understand is that the adjusted p-value is never smaller than the nominal p-value, and usually it is much larger. This is our goal when we correct for multiple hypothesis testing. And the ones that were barely significant are now equal to one. Bonferroni is very stringent.
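The Bonferroni calculation described above is short enough to write out. This is a minimal sketch: the nominal p-values are made up for illustration, and only the number of tested pathways, 349, comes from the lecture example.

```python
def bonferroni(p_values, n_tests=None):
    """Multiply each nominal p-value by the number of tests, capping the result at 1."""
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * m) for p in p_values]


# Hypothetical nominal p-values for four pathways out of 349 tested.
nominal = [1e-6, 0.0004, 0.003, 0.04]
adjusted = bonferroni(nominal, n_tests=349)
print(adjusted)  # approximately [0.000349, 0.1396, 1.0, 1.0]
```

In this toy example only the first pathway would still pass a threshold of 0.05 after correction, which shows how stringent the method is.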
So Bonferroni is a good correction, but it could be that when you apply it none of the pathways pass the significance threshold. There is another method that is widely used, and it's called the false discovery rate. The false discovery rate is the expected proportion of the observed enrichments that are due to random chance only. Let's say you use an FDR of 0.05. You are going to select a list of pathways, maybe 30 pathways, that pass this threshold of FDR 0.05 or less. It means that among these pathways, you expect about five percent to be due to random chance only. The method that we are going to see now to calculate the FDR is called the Benjamini-Hochberg procedure, and the result is a q-value.
We go back to our example with the cluster B cells. We had 349 tests, so 349 pathways. I showed you the top four, and I showed you the bottom two that were not significant. The first step when you calculate the FDR is to rank all the pathways using the nominal p-value, with the small p-values at the top and the large, non-significant p-values at the bottom. Then we start to calculate our adjusted p-value. The same as for the Bonferroni correction, we start by multiplying by the number of pathways, 349. But now we also divide by the rank. So the difference with Bonferroni is that we correct the p-values in a more stringent way at the top of the list: the p-values that are very significant are corrected more than the values at the bottom that are not significant. That's why Benjamini-Hochberg is more permissive than Bonferroni.
Once we have this adjusted p-value, the last step is to get a q-value from it. It's basically almost the same as the adjusted p-value. We go row by row, starting from the bottom. So this value, 2.06, is the same here. But for example, let's say we want to calculate the q-value for this one.
We look at this row and the row below. If the row below has a smaller adjusted p-value than this row, then we take that value instead. And we do it like this up to rank number one. This is how we get our FDR. Again, the tools are doing this for us, but we need to understand the theory behind it. Once we have the q-values, we select the pathways with our threshold, let's say FDR 0.05, or we can be more stringent and select all the pathways under FDR 0.01, and continue with our analysis.
And again, this is the output of g:Profiler. I think that now you can understand all the terms, and the FDR is here if you chose that method, because in g:Profiler you can use Bonferroni, Benjamini-Hochberg or another method that they provide. Really, my goal is that now you can use any enrichment tool that exists, look at the output and understand the results.
So this is the output table of another tool called Enrichr, and we can see if we recognize all the elements. One element is the list of the pathways, and we see that these pathways come from the Gene Ontology database. The other important element is the overlap, the overlap size between your gene list and the pathway that has been tested. Here it's 85. The second number in this case is the size of the original pathway. So they show us the ratio between the overlap and the pathway size, because we know it's important for the p-value. Here the second column is the nominal p-value and the third column is the adjusted p-value, and this is the column that we are going to use to select our enriched pathways. And this column here is also nice when the tools have it: it gives the names of the genes that are in the overlap for each pathway.
Okay, so let's try another one. I hope that now you can recognize the elements. PANTHER is another tool.
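Before moving on, the Benjamini-Hochberg steps described above (rank by nominal p-value, multiply by the number of tests, divide by the rank, then walk up from the bottom keeping the smaller value) can be sketched as follows. The p-values are made up, and the function name `benjamini_hochberg` is just a label for this example, not the API of a particular tool.

```python
def benjamini_hochberg(p_values):
    """Return q-values in the same order as the input nominal p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the bottom of the ranked table (rank m) up to rank 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        adjusted = p_values[i] * m / rank          # multiply by tests, divide by rank
        running_min = min(running_min, adjusted)   # keep the row below if it is smaller
        q_values[i] = running_min
    return q_values


# Hypothetical nominal p-values for four pathways.
nominal = [0.001, 0.02, 0.021, 0.5]
q = benjamini_hochberg(nominal)
print(q)  # approximately [0.004, 0.028, 0.028, 0.5]
```

In this toy example the second pathway has an adjusted value of 0.04, but the row below it is 0.028, so its q-value becomes 0.028. That is the row-by-row step from the bottom of the table.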
In the PANTHER output, we have the list of the pathways and the origin of the database; I think it's the GO slim from PANTHER. Then we must have a measure of the overlap, which is this column, so the overlap between your pathway and your gene list. Then we have the original size of the pathway. And then we have the p-value but, most importantly, we have the FDR column, and this is the column that we are going to use to select our enriched pathways.
So now we have finished explaining the enrichment analysis using a defined gene list. But at the beginning I told you that there are two protocols, one for the defined gene list and one for the ranked gene list. We also told you that whenever you can have a ranked gene list, it's recommended to use the ranked gene list protocol. The main step for the ranked gene list is to first generate a ranked list, and we are going to see how to do it. The tool that we are using is GSEA, which uses a rank sum test.
So why test enrichment in a ranked gene list? It's to avoid the problem of selecting the genes using an arbitrary threshold. If you have a differential expression analysis and you want to select your genes, and you are too stringent in the way you select them, you are going to lose a lot of information. On the contrary, if you are too permissive and you select a lot of genes with an FDR of, I don't know, 0.06, which you should not, then you are going to allow too many false positives in your data. But if you have a method that uses a ranked gene list, you don't have to select your genes, so you avoid this issue, and this is why it is recommended.
For example number two, using a ranked gene list, we have a different data set, and this example comes from bulk RNA sequencing. The reference paper is here below.
We are working with normal blood cells, and in these blood cells we are overexpressing a transcription factor called TFEB. So we have two conditions, the control and the treated, and the treated, which is labeled OE here, is the condition where we overexpress the transcription factor. What we can do is create a rank file that ranks our genes from top upregulated to top downregulated, and, which is very important, we leave the genes that are not significant in the middle of the rank file. The rank file is the input format for GSEA, so we input the rank file into GSEA. GSEA gives us the list of enriched pathways, and we create a network for visualization. In this paper, from the enrichment map, we focused on two particular pathways, which were the lysosome and the MYC targets. So that's to show you an example of the workflow.
The first step is to create this ranked gene list. So how do we do that? Here we have the example of bulk RNA sequencing. We have a matrix with our samples that are treated and our samples that are controls. We do a differential expression analysis using an R package like DESeq2 or edgeR. Using the output of the differential expression, we rank our genes from top upregulated in treated to top downregulated in treated, leaving the non-significant genes in the middle. The score that we use for this ranking comes from this formula: the sign of the log fold change multiplied by minus log10 of the p-value. Basically, for each gene we have a log fold change column and a p-value column. If a gene is very significantly differentially expressed, the p-value is going to be close to zero. Minus log10 transforms this small number into a large number, because we want a high score for these very significant genes.
And then for the sign of the log fold change: the log fold change indicates whether the gene has a high expression in the treated versus the control, or a low expression. So basically the sign of the log fold change tells us if the gene is up-regulated or down-regulated, and that is why we also use it. So now we have the score for all the genes, and we basically rank all the genes from high to low. We are going to save two columns, the gene name and the score, to create the rank file for GSEA, and we are going to see this in more detail in the practical lab. Usually we create this rank file in R, but for sure you can do it in Excel as well. So, the tool GSEA. It is Mootha et al. who developed GSEA in 2003, and they were studying diabetes. They came up with the GSEA algorithm, which showed the down-regulation of the oxidative phosphorylation pathway in their model. The particularity is that, of all the genes in the oxidative phosphorylation pathway, none was significantly differentially expressed. So if they had selected their genes using a threshold, they would have lost all those genes. But what happened is that all the genes in the oxidative phosphorylation pathway were down-regulated by a subtle amount, and the addition of these subtle down-regulations across the genes had a strong impact on the pathway activity. This is what the GSEA algorithm could measure, and they also did some experiments to validate this finding. The algorithm uses a modified Kolmogorov-Smirnov test, which is a rank-based test, and we are going to see how it works. So now we have a gene list here, with genes that are ranked from high to low, so from up-regulated to down-regulated, using the score that we have calculated. It is important to understand that all the genes are in the rank file, so the non-significant genes are in the rank file too.
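The rank score described above (sign of the log fold change multiplied by minus log10 of the p-value) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names and values in `de_results` are made up, the `min_p` floor is an added safeguard against a p-value of exactly zero, and the two-column tab-separated layout follows the gene name and score columns mentioned in the lecture.

```python
import math

# Hypothetical differential-expression output: gene -> (log2 fold change, p-value).
de_results = {
    "GENE_A": (2.5, 1e-8),    # strongly up in treated
    "GENE_B": (0.1, 0.80),    # not significant
    "GENE_C": (-1.8, 1e-5),   # strongly down in treated
    "GENE_D": (-0.2, 0.50),   # not significant
}

def rank_score(log_fc, p_value, min_p=1e-300):
    """sign(logFC) * -log10(p-value); min_p (an assumption) guards against log10(0)."""
    sign = (log_fc > 0) - (log_fc < 0)
    return sign * -math.log10(max(p_value, min_p))

# Keep every gene (no threshold) and sort from most up- to most down-regulated,
# so the non-significant genes end up in the middle of the list.
ranked = sorted(
    ((gene, rank_score(lfc, p)) for gene, (lfc, p) in de_results.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

# Two tab-separated columns: gene name and score.
rnk_text = "\n".join(f"{gene}\t{score:.4f}" for gene, score in ranked)
print(rnk_text)
```

In this toy example the two non-significant genes (GENE_B and GENE_D) sit between the strongly up-regulated and the strongly down-regulated gene, which is the property the lecture stresses.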
So now what we need to understand is that we rotate this rank file to put it horizontally here. Now the up genes are on the left, the down genes are on the right, and the non-significant genes are in the middle. We are testing one pathway at a time. Here we are testing this pathway, which is antigen processing and presentation. What we see is that we have a lot of black bars here. The black bars represent the genes that are in the overlap, so the genes that are both in my rank file and in the pathway that we are testing. A high density of black bars towards the left of the rank file means that this pathway is enriched in genes that are up-regulated. And what we can see is that GSEA is calculating a running sum. The running sum starts at zero and goes gene by gene, so gene one, gene two, gene three in the rank list. If gene number one is in the pathway, the sum is going to increase a lot. If gene number two is not in the pathway, the sum is going to decrease slightly. So if you have a lot of genes that are in the pathway, you are going to see that your running sum increases very rapidly, and the maximum of the running sum is what is called the enrichment score. This is how GSEA works. GSEA also has a weight system: there is more weight for the genes at the very top or the very end of the rank file. It means that we should not get a peak in the region of the genes that are not differentially expressed. And this is a zoomed image of the GSEA running sum so that you can see it better: this is gene one, gene two, gene three, gene four of the rank file. Each time the gene is in the pathway the running sum increases, then it decreases slightly when a gene is not in the pathway, and then it increases again, so when there are a lot of genes in the pathway the running sum increases rapidly.
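The running sum just described can be sketched as follows. This is a simplified illustration of a weighted Kolmogorov-Smirnov-like statistic, not the GSEA source code: the function names, the toy gene names and the `weight` exponent are assumptions made for the example. A gene in the pathway moves the sum up in proportion to its absolute score, and a gene outside the pathway moves it down by a small constant step.

```python
def running_sum(ranked_genes, scores, pathway, weight=1.0):
    """Running-sum profile of one pathway along a ranked gene list.

    ranked_genes: gene names ordered from most up- to most down-regulated.
    scores: rank scores in the same order.
    pathway: set of gene names in the pathway being tested.
    weight: exponent applied to |score|; 0 gives an unweighted statistic.
    """
    hits = [gene in pathway for gene in ranked_genes]
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit_weight = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(hits)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("pathway must overlap the list without covering all of it")
    profile, current = [], 0.0
    for w, h in zip(hit_weights, hits):
        # Up by the gene's share of the pathway weight, or down by 1 / (number of misses).
        current += w / total_hit_weight if h else -1.0 / n_miss
        profile.append(current)
    return profile

def enrichment_score(profile):
    """Largest deviation of the running sum from zero, keeping its sign."""
    return max(profile, key=abs)

# Toy ranked list: g1 is the most up-regulated gene, g10 the most down-regulated.
genes = [f"g{i}" for i in range(1, 11)]
scores = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]

es_up = enrichment_score(running_sum(genes, scores, {"g1", "g2", "g4"}))
es_down = enrichment_score(running_sum(genes, scores, {"g8", "g9", "g10"}))
print(es_up, es_down)
```

The first toy pathway sits at the top of the list and gets a positive score; the second sits at the bottom and gets a negative one.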
So in GSEA you can have a positive and a negative enrichment score. We can see these two plots here. The first plot is a positive enrichment score. It means that the pathway is enriched in the up-regulated genes, so the left part of the rank file. The plot on the right is a negative enrichment score, so it means that the pathway is enriched in down-regulated genes. So now we have the enrichment score, but we still need to assess the significance, so we need to go from the enrichment score to a p-value. GSEA does this by random permutation, to build a null distribution and to calculate an empirical p-value, as I explained at the beginning. In the case that we use most of the time, the permutation is done by replacing the genes in the pathway. For each pathway, GSEA will do 1,000 random permutations, so the pathway now contains random genes. But there is another permutation technique that consists of shuffling the samples at the beginning, before creating the rank file. So you basically create 1,000 random rank files, which takes the dependence between genes into account. So for each of the random gene sets that we have tested we have an enrichment score, a random enrichment score, and these build the null distribution. And now we have our observed enrichment score. Let's say our observed enrichment score is 0.8, and the mean of the null distribution is usually zero. What we want to assess is how far the observed enrichment score is from the mean of the null distribution, so that we can get the p-value. We do that by calculating the empirical p-value, that is, by counting how many times a random score was at least as extreme as the observed score. So that is for the p-value, but we still need to correct for multiple hypothesis testing, and GSEA also calculates an FDR from the normalized enrichment score. So now we have seen the workflow for the defined gene list and for the ranked gene list.
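The gene-set permutation described above can be sketched like this. It is a minimal illustration, not GSEA itself: the enrichment score is the simplified weighted running sum, the toy scores are made up, and adding 1 to the numerator and denominator of the empirical p-value is a common convention assumed here so that the p-value is never exactly zero. Normalization of the score and the FDR step are left out.

```python
import random

def enrichment_score(scores, hit_flags):
    """Signed maximum deviation of a weighted running sum (weight exponent 1)."""
    total = sum(abs(s) for s, h in zip(scores, hit_flags) if h)
    n_miss = hit_flags.count(False)
    best = current = 0.0
    for s, h in zip(scores, hit_flags):
        current += abs(s) / total if h else -1.0 / n_miss
        if abs(current) > abs(best):
            best = current
    return best

def gene_set_permutation_p(scores, pathway_idx, n_perm=1000, seed=0):
    """Empirical p-value from random gene sets of the same size as the pathway."""
    rng = random.Random(seed)
    n = len(scores)

    def es_for(indices):
        members = set(indices)
        return enrichment_score(scores, [i in members for i in range(n)])

    observed = es_for(pathway_idx)
    # Null distribution: the pathway now contains randomly drawn genes.
    null = [es_for(rng.sample(range(n), len(pathway_idx))) for _ in range(n_perm)]
    if observed >= 0:
        extreme = sum(es >= observed for es in null)
    else:
        extreme = sum(es <= observed for es in null)
    return observed, (extreme + 1) / (n_perm + 1)

# Toy ranked list of 200 scores, from 100 (most up) down to -99 (most down).
scores = [100 - i for i in range(200)]
pathway_idx = list(range(10))   # hypothetical pathway: the ten most up-regulated genes

observed_es, p_value = gene_set_permutation_p(scores, pathway_idx)
print(observed_es, p_value)
```

With only 1,000 permutations the smallest p-value this sketch can return is 1/1001, which is one reason the number of permutations matters.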
In this last part of the lecture we will see whether or not we can rank our gene list depending on the omics data that we have, and how to choose a tool. It is not exhaustive, but these are a few examples that we have. I am going to start with the data that are easier to rank. First, bulk RNA-seq data, which is the example I took for example two, with the TFEB overexpression. When you have bulk RNA-seq data and your experimental design is control versus treated, then it is easy to make a rank file as I explained: you rank all the genes and you can input this into GSEA. It is also possible with single-cell RNA-seq data. For example, you first cluster your cells, and I find it easier when you have multiple biological replicates in your data. Let's say you have single-cell RNA-seq data with three biological replicates, and you put all your replicates together but you still know which cells come from replicate one, two and three. And let's say you want to compare cluster A and cluster B. What you do is create a pseudo-bulk from the data: you gather all the cells from replicate one in cluster A, then replicate two in cluster A and replicate three in cluster A, and then you do the same for cluster B. Now you have six columns, three for cluster A and three for cluster B, and you can apply the same technique as for bulk RNA sequencing: do your differential expression, create your rank list, and then do GSEA. Another case where you can make a rank list is label-free proteomics. If you have a sufficient number of proteins, I would say 5,000 or more, then you can also apply the same technique as for RNA-seq: you do your differential expression, treated versus control, you create your rank list and you do your enrichment analysis. And these are three examples where I do not use a rank list; I don't know if you have other examples yourselves.
So the first one is when the starting point is DNA. For example, you have data for somatic mutations or CNVs, and what you get is a list of the genes that are frequently mutated in some cohort of patients. In this case, it is just a defined gene list. There is no way to rank it, so I use the g:Profiler workflow. Another one is when I have bulk RNA-seq but I don't have the control versus treated experimental design. What I have is a time course: for my sample I have time points at zero, 12, 24, 48 hours and so on. What we are looking at in this case is the profile of the genes. For example, we can use k-means clustering, and we want to retrieve all the genes that go up in my time course, all the genes that go down, or all the genes that first go up and then go down. So we have different profiles, and from the k-means clustering what we get is different gene lists, let's say three defined gene lists. I cannot rank them, so I use them in the g:Profiler workflow. And another example: let's say you work with ChIP-seq data and you have a transcription factor. From the ChIP-seq output you basically have a BED file that contains the chromosome region for each of the peaks that were detected. And let's say you know that your transcription factor binds at the promoter regions of genes. What you do is associate each chromosome region, so each peak, with the gene that is nearby, because you are looking at the promoter region. At the end you have a list of genes, which can be quite large with ChIP-seq data, of all the genes where you have a peak in the promoter region. In this case it is a defined gene list. We have an optional module seven, for those who have ChIP-seq data, that explains the workflow, and a tool that is good to use is called GREAT, for the enrichment analysis of ChIP-seq data.
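For defined gene lists like the three examples above, the enrichment test is an over-representation test followed by a multiple-testing correction. The sketch below is a toy illustration and not the g:Profiler implementation: it uses a hypergeometric tail probability (equivalent to a one-sided Fisher's exact test) and a Benjamini-Hochberg adjustment, and all gene names, pathway names and the 20-gene background are hypothetical.

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(overlap >= k) when n genes are drawn from a background of N genes, K of them in the pathway."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical background, defined gene list and pathways.
background = {f"g{i}" for i in range(1, 21)}
gene_list = {"g1", "g2", "g3", "g4", "g5"}
pathways = {
    "pathway_A": {"g1", "g2", "g3", "g4", "g11"},
    "pathway_B": {"g5", "g12", "g13", "g14", "g15", "g16"},
}

names = list(pathways)
p_values = []
for name in names:
    members = pathways[name] & background          # restrict the pathway to the background
    overlap = len(members & gene_list)
    p_values.append(hypergeom_tail(overlap, len(members), len(gene_list), len(background)))

fdr = dict(zip(names, benjamini_hochberg(p_values)))
for name, p in zip(names, p_values):
    overlap = len(pathways[name] & gene_list)
    print(name, overlap, len(pathways[name]), p, fdr[name])
```

Each printed row has the same columns as the output table discussed in the lecture: pathway name, overlap, pathway size, p-value and adjusted p-value.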
And if you have other data in your project, then we can discuss: can you rank it, or do you have to use the defined gene list protocol? As I said previously, we are presenting a few tools during this workshop, but many other tools exist for enrichment analysis. They can be web-based like g:Profiler. They can be included inside Cytoscape as apps; I think that is the case for BiNGO and ClueGO. They can be standalone applications like GSEA that you download onto your computer, or they can be included in R packages and in Python. But I hope that at the end of this lecture you have understood that any of these tools will have a typical output, which is this table listing the enriched pathways with the pathway names, the number of overlapping genes, the number of genes in the pathway, the p-value and, most importantly, the adjusted p-value. This is the column that you take to select the pathways that are enriched and to continue your analysis. Depending on your experimental design, you may have a lot of significant pathways in your output table, and sometimes it is not easy to understand this table. What you can actually see when you look at this table is that those pathways can share a lot of genes because they have related biological functions. That is why we do the network visualization, to see the interconnections between these pathways so that they are easier to understand, and that is going to be in module three. So here are some questions that might guide you when you need to choose your tool. First, does it cover your model organism? Some tools only cover human and mouse, and some tools cover a lot of model organisms. Is there a good choice of pathway databases? Are the pathway databases up to date? Which statistics do they use? Now you can recognize whether a tool is for a defined gene list or for a ranked gene list. Some tools have options for both, so you can choose either of them.
Is the description of the statistics clear enough? Do you like the output style? And can you connect it with network visualization tools like Cytoscape? So I have done a little bit of a comparison here, but it is also to tell you why we are using g:Profiler in this workshop. Does g:Profiler have an updated database? Yes, they update it very regularly, on a monthly basis, and they make sure that they use the latest version of Ensembl. The choice of databases: yes, we will see that they have a good choice of databases. You can use GO BP, you can use Reactome or WikiPathways. And what is very useful is this: sometimes you just want to choose one database, so you just want to test your gene list with GO BP, but sometimes you want to combine databases, for example Reactome and GO BP together, to get more comprehensive results, and you can do that with g:Profiler. Does it cover multiple organisms? Yes. Is it possible to upload your own custom database? Yes, and that is very useful. Let's say you have your project and your own gene list, but you want to compare with another paper. In that paper they did a differential gene expression analysis and they got a signature, a gene list signature, and you want to see whether your data are also enriched in this signature. You can build a small GMT file for this signature, upload it into g:Profiler, and test only this pathway of interest. For sure they correct for multiple hypothesis testing. It is also possible to upload your own custom background. You remember that for single-cell data and proteomics, we said that maybe the full genome is not the adequate background, so we could copy and paste the list of proteins that were detected at least once in the cells to reduce the background. That is also possible. And g:Profiler is a web app, there is also an R package, and for sure it is possible to connect it with Cytoscape EnrichmentMap.
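A small GMT file for a published signature, as mentioned above, is just a tab-separated text file with one gene set per line: the gene set name, a description, and then the genes. The sketch below builds such a line; the signature name, the description and the gene names are hypothetical placeholders.

```python
def gmt_line(name, description, genes):
    """One gene set per line: name, description, then the genes, all tab-separated."""
    return "\t".join([name, description, *genes])

# Hypothetical signature taken from another paper.
signature_genes = ["GENE_A", "GENE_B", "GENE_C"]
gmt_text = gmt_line("PUBLISHED_SIGNATURE", "signature from another paper", signature_genes) + "\n"

# Saving gmt_text to a file with a .gmt extension gives a custom gene set file to upload.
print(gmt_text, end="")
```

A file with several signatures simply contains several such lines, one per gene set.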
So, as some final notes: we usually test for over-enrichment of the pathway in our gene list, so we want to see that many genes of this pathway are in our gene list. Sometimes people are interested in under-enrichment. Basically they want to know which pathways are not present at all in the gene list, and that is possible if you are interested. Also, Fisher's exact test is often called the hypergeometric test in the tools. So if you look at the statistics of a tool to see whether they are doing Fisher's exact test and you see hypergeometric test, it is essentially the same thing: the one-sided Fisher's exact test is equivalent to the hypergeometric test. Sometimes they use Monte Carlo simulation to approximate the test. And some other tools in the same category will also use the binomial test or the chi-squared test. Okay, so for rank lists, I didn't find so many tools, so I am listing GSEA and PANTHER. GSEA uses the modified KS test, whereas PANTHER uses a Wilcoxon rank sum test, which is also very appropriate for pathway enrichment in a rank list. For sure they both correct for multiple hypothesis testing. With GSEA, you can choose among the pathway databases that are ready for you in MSigDB, but there is also the possibility to upload your custom GMT file. The GMT file is the pathway database, and this is good because you can then upload the GMT file that corresponds to your model organism. And there is the possibility to visualize the results with Cytoscape EnrichmentMap. So I mentioned the Wilcoxon rank sum test, which is also interesting. It is a nonparametric test in which only the order of the genes matters, and this is a difference from GSEA, whose weighted running sum also uses the score values. It is not the expression score that counts; you just need to order the genes one, two, three, four. Only the order is taken into account, and what we are going to look for is a shift in the distribution.
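The point that only the order of the genes matters can be illustrated with a small rank sum test. This is a minimal sketch and not the PANTHER implementation: it uses the normal approximation of the Wilcoxon rank sum (Mann-Whitney U) statistic, assumes there are no tied scores, and the scores themselves are made up.

```python
import math

def rank_sum_test(pathway_scores, other_scores):
    """Two-sided Wilcoxon rank sum test, normal approximation, no tie correction."""
    n1, n2 = len(pathway_scores), len(other_scores)
    combined = sorted(
        [(s, True) for s in pathway_scores] + [(s, False) for s in other_scores],
        key=lambda pair: pair[0],
    )
    # Rank 1 is the lowest score; only these ranks enter the statistic.
    rank_sum = sum(rank for rank, (_, in_pathway) in enumerate(combined, start=1) if in_pathway)
    u = rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value from the normal distribution
    return z, p

# Hypothetical scores: the pathway genes sit near the top of the ranking.
other_scores = [float(x) for x in range(-20, 20)]
pathway_scores = [15.5, 16.5, 17.5, 18.5, 19.5]

z, p = rank_sum_test(pathway_scores, other_scores)

# Cubing every score changes the values but not the order, so the result is identical.
z_cubed, p_cubed = rank_sum_test([s ** 3 for s in pathway_scores],
                                 [s ** 3 for s in other_scores])
print(z, p, z_cubed == z)
```

A positive z here means the pathway genes are shifted towards the high scores, which corresponds to the shift in the distribution mentioned in the lecture.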
So for example, in rank space, this is the distribution of all the genes from one pathway, and we see that there is a shift to the right compared to the null distribution of all the genes. The shift of this distribution is there because the genes in this pathway were all up-regulated. So then, as a summary, we have seen Fisher's exact test for calculating the p-value for defined gene lists, and GSEA or the Wilcoxon rank sum test for computing the enrichment p-value for ranked gene lists. So, the recipe for defined gene lists: first you define your gene list and your background, then you select the pathways that you want to test, then you run your enrichment test and you correct for multiple hypothesis testing. For rank lists it is not very different, except for the first step, where you create your rank list. One advanced topic that was not covered in this lecture is that the tests we are mentioning do not correct for the correlation between gene sets or the dependency between genes. Some statistical methods try to do that, but not really the two tests that I am mentioning. Some other tools are more advanced: they are topology-aware, so they take into consideration whether, for example, gene A activates gene B or inhibits gene B. There is another method where they look at the genes in the overlap and at the functional relationships between those genes. If the genes in the overlap are connected to each other, they add a weight to the enrichment score, so this connection increases the enrichment score compared to background genes that are less connected. This is also an interesting method, I think. And more and more tools are starting to include some network visualization in their output. I am thinking about clusterProfiler in R: you do your enrichment test and then you get a little bit of network visualization.
So if you want more detail on these advanced topics, you can go and look at this Nature Methods paper. So, last tips: be precise at each step of your analysis, especially step one, when you select your genes for your analysis, and try to answer one biological question at a time. And the very last two slides. If you want to know more about enrichment analysis, I really like this example from StatQuest, so you can look at the video. I think it is very visual. Basically they have this bag of M&M's and they say that these are all the genes in a genome. This is the background, the pathways are the different colors of the M&M's, and this is the gene list, and we see for sure that our gene list is enriched in the blue M&M's pathway. I really like StatQuest; it is very easy to understand. And the same goes for the FDR: I think they have a good explanation of the FDR. It means that in your output you have a mixed distribution: you have the true positives mixed with random results that we could not discard, so the random results are still there. If we could remove the random results from the true positives, then we would have only true positives and more accurate results, and this is what the FDR is trying to do. Okay, so thank you to the sponsors.