Okay, so hi everyone. We are about to start the second lecture of this workshop, module two, which is named finding over-represented pathways in a gene list. This lecture is going to be focused a lot on statistics and theory, but after lunch we are going to try out these concepts in a practical lab, so it's going to be easier after lunch. During the lab you will be recommended to follow our steps, but you will also be free to try with your own gene list. The reason we teach you the concepts and the statistics used by these enrichment tools is that there are many tools available, new tools are constantly being developed, and other tools are becoming deprecated. So we think that if you know the theory, then you will be able to choose the tool that is the best fit for your data. Also, sometimes it happens that you work with a model organism that is not very common, and in a few cases you need to develop your own custom tool. So if you are an advanced R scripter and you understand the theory, then you will probably be able to build your own tool.
So this is the list of all learning objectives for this module. We are going to learn how to select the appropriate enrichment test for our data. We are going to learn how to compute Fisher's exact test, which is the test that we use when we have a defined gene list. When we have a ranked gene list we will use some other statistical tests, and one that we are going to see is the minimum hypergeometric test. We will also talk about multiple test correction, and we will learn how to calculate the Bonferroni correction and the false discovery rate.
So here is how module 2 integrates within the workflow. Earlier, in module 1, Gary talked about the different gene lists that we get from our omics data. So we get our raw data, and then we analyze and normalize the data to get a gene list. And then we would like to functionally interpret the gene list.
And the other element that Gary mentioned earlier is the annotations that are constantly gathered into pathway databases. The format used to store a pathway database is the gene set. For a gene set, we can take the example of the cell cycle. Let's say we know that 500 genes are involved in the cell cycle. So the cell cycle gene set will be the name of the gene set, cell cycle, followed by the list of the 500 names of the genes involved in the cell cycle. So then we have two elements, the gene list and the gene set. And these two elements can talk to each other and connect to each other only if we use the same gene identifiers in the gene list and in the gene set. So if you use gene names for the gene list, then in your pathway database you also use gene names, so the gene set should have the format of gene names. If you use Ensembl IDs in the gene list, then make sure that your gene set is also in the format of Ensembl IDs. And then the pathway enrichment is simply to test if I have an over-representation of a pathway in my gene list. And then in the other modules of the workshop, we are going to visualize the result of the pathway analysis.
So here is a workflow that Gary presented in module one. I just wanted to present this again to recall that our gene list can have different origins. We do different omics experiments, we use different statistics to analyze our data, and then we get our gene list. So I wanted to mention that the pathway analysis is really the third step. It's really once you did all your statistical analysis. And module two is within these red lines. And the two tools that we are going to see for this workshop are going to be GSEA and g:Profiler.
So the order of the lecture will follow the steps of an enrichment analysis. First, we will describe that there are two major types of enrichment tests, depending on the type of gene list we could get from our omics experiment. So we could get a defined gene list.
And if we are able, we can get a ranked gene list. So depending on whether we have a defined gene list or a ranked gene list, the statistics are going to be slightly different. But they will all result in a p-value associated with each gene set. And because we don't have just one pathway, so one gene set, but multiple pathways that are stored in the pathway database, we need to correct for multiple hypothesis testing. And we will see the Bonferroni correction and the false discovery rate.
So we describe two types of enrichment analysis based on whether we have a defined gene list or a ranked gene list. A defined gene list will be a fixed number of genes, could be 200 genes, 500 genes or more. And a typical example would be a list of mutated genes, for example genes that are frequently mutated in a set of patients. So that would be the typical example of a defined gene list. And a ranked gene list is a list of all genes in the genome that you were able to rank using a scoring system. This scoring system should be a value derived from your genomics experiment. And a typical example comes from RNA-seq data. So you have two classes, two groups. You compare your two groups. And you can use the p-value and the log-fold change to score all the genes in the genome, so you don't remove anything. You score all the genes, from the top upregulated to the top downregulated. Why is the distinction important? Because the statistical test is going to be different. And if you are able to do a ranked gene list, you may have a bit more power. So if you have the choice to do a ranked gene list or a defined gene list, for RNA-seq for example, just go with the ranked gene list. And you are going to use the appropriate test for that. Yes? Yeah, we are going to see. I have a slide with the formula.
So if we have a gene list, the question that we are going to answer when we do enrichment analysis is: are any gene sets or pathways surprisingly enriched in my gene list?
And if we have a ranked gene list, then the question we are going to try to answer with enrichment analysis is: are any gene sets or pathways ranked surprisingly high or low in my ranked list? And you see here, again, for the gene list, the test is Fisher's exact test. And for the ranked gene list, we will see the minimum hypergeometric test and a modified Kolmogorov-Smirnov test for GSEA.
OK, so we are going to start with the test that is very simple, the gene list enrichment test. We have a defined gene list. So given a gene list and given gene sets, are any of the gene sets surprisingly enriched in the gene list? We are going to see, again, where the gene list can come from and, more importantly, how to assess "surprisingly" in statistics. And then we will see the multiple testing correction.
So this is a classic design, which is called the two-class design for gene lists. We have two groups of samples, which we name here class one and class two. That could be cases versus controls, treated versus control, or that could be two subgroups, for example, within a cancer type. Within each class, we have multiple measurements that we did independently. So we've collected different samples, different measurements, that we call the biological replicates. So each group contains the biological replicates. And then we do a statistic to calculate differential expression. For example, for RNA-seq, the data follow a negative binomial distribution, and the tools that exist to calculate differential expression are, for example, edgeR and DESeq. And then you get your list of genes with the associated corrected p-value, the FDR. And you select the ones that are differentially expressed, significant. You use a threshold. You use an arbitrary threshold, for example, FDR 0.05. And in this case, you get two gene lists, one for the genes that are higher expressed in class one compared to class two.
And you have a gene list for the genes that are down-regulated in class one compared to class two. So that's not the ranked list. And actually, that's a classic example of how to obtain a gene list. But if you have this design, we will see that you would use the ranked list. So this is just an example of how to get the defined gene list, but we will see that it's preferable to use a ranked gene list.
But with this one, the time-course design for gene lists, you cannot generate a ranked gene list. What you do is you have different time points for your measurements, and you use a clustering algorithm to get the genes that have the same profile over time. So then for each profile, you get a gene list. In this case, there is no way to do a ranking. So you get three different gene lists, and you are going to do your pathway enrichment independently on the three gene lists, and then you can compare the results after.
So how does a simple enrichment test work? Here in pink, we have our gene list corresponding to our hits, so our differentially expressed genes. And in brown, we have the background. The background is an important concept. We call it the experimental background. And it's all the genes that your genomics experiment, or your experiment, could measure. So if you are doing a whole genome experiment, then the background is the whole genome. One case where your background is not the whole genome is when you do a custom microarray. So let's say you do an immune array where you just put your immune genes on the array, and then you have different conditions. Then your experimental background will be all the immune genes that you put on the array, and the gene list will only be the ones that are differentially expressed. That could also be if you use a custom array, and on your array, you only put the transcription factors.
So the experimental background would be all the transcription factors that you wanted to measure, and the gene list will be only the ones that are differentially expressed, or hits. For RNA-seq, if you use RNA-seq, it's like a whole genome experiment. So usually, you use all genes in the background. You could restrict to only the genes that are expressed in your cells, but in reality, it is very difficult to put this boundary and see which ones are really expressed or not. And then, when you have defined your gene list and your background, you test each gene set that you have in the pathway database, and the output result is a table. So it's always a table with gene sets as rows and with a p-value associated with each gene set that tells you if the result is due to random chance or not.
So what exactly is the p-value assessing in an enrichment test? It's assessing the probability that an overlap at least as large as the one between your gene list and your gene set would be obtained by randomly selecting a gene list from the background. So if you take 500 genes randomly from your background, do you get this overlap? One way to calculate this is to do random sampling. So you could do 1,000 random samplings of a list of genes that is the same size as your original list. Let's say your original list is 200. So 1,000 times, you pick 200 random genes and you calculate the overlap between your random gene list and the pathway that you are testing. And that will help you to build a null distribution. And compared to this null distribution, you will see where your observed data, your observed overlap, is. And if it's very far away from the mean, then you will say, well, it's very difficult to get this result by random chance only. So you could do random sampling, but it uses a lot of resources. It's time consuming. And in this case, we know what the distribution is. For random sampling like this, the distribution is the hypergeometric probability distribution.
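As a side note to the random sampling idea, here is a minimal sketch of how such a null distribution could be used to estimate a p-value. The function name, the toy background of 2,000 made-up gene names, and the add-one correction are assumptions for illustration; nothing here comes from a specific enrichment tool.

```python
import random

def sampling_p_value(background, gene_list_size, gene_set, observed_overlap,
                     n_draws=1000, seed=0):
    """Estimate P(overlap >= observed_overlap) by drawing random gene lists.

    Hypothetical helper: each draw picks `gene_list_size` genes from the
    background without replacement and counts how many fall in the gene set.
    """
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    gene_set = set(gene_set)
    at_least_as_large = 0
    for _ in range(n_draws):
        random_list = rng.sample(background, gene_list_size)
        overlap = sum(1 for gene in random_list if gene in gene_set)
        if overlap >= observed_overlap:
            at_least_as_large += 1
    # Add-one correction (an assumption of this sketch) so the estimate
    # is never exactly zero with a finite number of draws.
    return (at_least_as_large + 1) / (n_draws + 1)

# Toy data, all made up: 2,000 background genes, a 100-gene pathway,
# and a gene list of 200 genes. The expected overlap by chance is about 10.
background = [f"gene{i}" for i in range(2000)]
pathway = background[:100]

p_enriched = sampling_p_value(background, 200, pathway, observed_overlap=25)
p_typical = sampling_p_value(background, 200, pathway, observed_overlap=10)
print(p_enriched, p_typical)
```

With only 1,000 draws the smallest p-value this can report is about 0.001, which is one reason the exact hypergeometric calculation is preferred when it is available.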
So we can directly use Fisher's exact test, which uses this hypergeometric probability distribution. That's why we don't have to do random sampling. The easiest way to understand Fisher's exact test is to take the example of marbles or balls. So let's say your background would be 5,000 genes, or balls. And the majority of the balls, 4,500, would be red. And then you just have 500 black balls. And you draw five balls from the box, randomly. I mean, blindly. And your result is four black balls and one red ball. So I think that intuitively, you can already think that you were lucky. It's not easy to get this result. But we are going to do it the proper way. And in statistics, we need to do hypothesis testing. So we are going to work with a null hypothesis, which is: my list is a random sample from the population. And we are going to work with the alternative hypothesis: we have more black balls than expected in my list. And the only thing that we can do is to reject or fail to reject the null hypothesis, because we know more about the null hypothesis than we know about the alternative hypothesis. We can kind of figure out what it is to not be enriched, but it's very difficult to understand what it is to be significantly enriched, more than by random chance. Do we need three black balls? Or do we need four black balls to set the cutoff? So if the null hypothesis can explain the data, we don't reject it. The result of the draw is random. If the null hypothesis cannot explain the data, we reject the null hypothesis. And it means that our result is larger than expected by chance only. Therefore, the p-value is the probability of getting a result at least as extreme as ours if the null hypothesis is true. So if I have a p-value at or under my 0.05 threshold, I reject the null hypothesis. I say, well, it's not random. But it means that I still accept up to a 5% chance that a result like this occurs by random chance only.
So the p-value ranges from 0 to 1. A p-value close to 0 is the best: a result like mine would almost never come from random chance alone. And a p-value of 1 means the result is fully compatible with random chance, so rejecting the null hypothesis would be a type 1 error. As I said, we could build the null distribution from random sampling, but we don't need to. We apply the hypergeometric probability distribution. For example, here in this histogram is the probability to get 0 black balls out of the 5. Here is the probability to get 1 black ball out of the 5, then 2, 3, 4, and 5. So you see that the probability is decreasing and is getting closer to 0, because it's very difficult to get 4 or 5 black balls in one draw. And so the p-value associated with 4 black balls, my result, is the sum of the probabilities of getting 4 or more than 4. So it's the sum of the probabilities of getting 4 or 5.
So now if we adapt our example to a gene set enrichment, it could look like this. The black balls could be a gene set, the gene set that we are testing. And this on the right could be our gene list. Now we have in our gene list 4 balls that correspond to the one gene set that we are testing, apoptosis. We directly calculate Fisher's exact test. We get a p-value of 0.01. It means I reject the null hypothesis. So I think it's not by random chance only, but I still have a 1% risk that, although it looks significant, I make a type 1 error, meaning it's still random. So sometimes, for example if you want to calculate Fisher's exact test using an R function, you still need to give the R function some values. And the way to do it sometimes is just to draw a two-by-two contingency table where you put the size of your gene list, the size of the gene set that you are testing, and the size of your total background. These are the numbers x, k, m, and t: t is the total, m is the size of the gene set, x is the number of genes that are in my gene list and in my gene set. And you put these numbers into the Fisher's exact test function.
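To make the marble example concrete, here is a minimal sketch of the one-sided calculation using only Python's standard library. It reuses the x, m, t notation from the slide and assumes k is the size of the gene list (the number of balls drawn); the function names are made up for this illustration and are not those of any R or Python statistics package.

```python
from math import comb

def hypergeom_tail(x, k, m, t):
    """P(overlap >= x) when k items are drawn without replacement from t,
    of which m belong to the gene set (the black balls).

    This is the one-sided (over-representation) Fisher's exact test p-value.
    """
    total = comb(t, k)                      # all possible draws of size k
    upper = min(k, m)                       # largest overlap that can occur
    favourable = sum(comb(m, i) * comb(t - m, k - i)
                     for i in range(x, upper + 1))
    return favourable / total

def contingency_table(x, k, m, t):
    """Two-by-two table: rows = in list / not in list,
    columns = in gene set / not in gene set."""
    return [[x, k - x],
            [m - x, t - m - k + x]]

# Marble example from the lecture: 5,000 balls, 500 black,
# 5 drawn blindly, 4 of them black.
p_marbles = hypergeom_tail(x=4, k=5, m=500, t=5000)
table = contingency_table(x=4, k=5, m=500, t=5000)
print(p_marbles)   # sum of P(4 black) and P(5 black)
print(table)
```

The four cells of the table always add up to t, which is a quick sanity check before passing them to any Fisher's exact test function.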
This function, this formula here, is the cumulative distribution of the hypergeometric function. It's just to remind me to tell you that this is a lot of combinations, and the combinations contain factorials, and actually it looks like an easy thing for a computer, but it's not. It takes a lot of resources for the computer to do that, and until recently the computers were not able to calculate Fisher's exact test from this exact formula, and they would do an approximation. So some approximations would be to use the binomial distribution, or to use a chi-square, or to do a Monte Carlo simulation. But nowadays the computers are powerful enough, so I think that the tools that are developed now and in the future are able to use the exact Fisher's exact test. And sometimes, in the enrichment tools that we are using, it's not called Fisher's exact test, it's called the hypergeometric test, or the binomial, or the chi-square. So when I'm not sure and I have two choices, for example the tool is saying use the Fisher's exact test or the hypergeometric test, then sometimes I guess the hypergeometric test is going to be an approximation, but the Fisher's exact test is going to be like this formula, so I choose the Fisher's exact test. Isn't the Fisher's exact test the hypergeometric test? Yeah, so the statisticians will say a Fisher's exact test that uses the hypergeometric probability distribution, and sometimes the bioinformaticians make a shortcut and they call it the hypergeometric test. Which is an approximation. And very often, because the computers were not powerful enough, it was actually an approximation of the Fisher's exact test.
So just to remind you that in our case, we ask for over-representation or over-enrichment of a pathway in our gene list. So do we have more genes in this pathway than by random chance? But if you're interested, you could also look for under-enrichment. So is my pathway less represented than by random chance?
It's very rare but you could do it. You could do it, for example, by doing a two-sided Fisher's exact test, where you could look for over- and under-enrichment. And remember that the background population is very important if you are not doing a whole genome experiment, and the results can be drastically different between a background that is set correctly or not if you do a custom microarray. And remember that the Fisher's exact test is applied on one gene set, so the overlap between your gene list and one gene set. But what we are going to do is to test many, many gene sets that are in the pathway database. So these are independent Fisher's exact tests. We do one Fisher's exact test, another one, another one, another one. So at the end, we need to correct for multiple hypothesis testing.
So the recipe for the gene list enrichment test. Step one is to define your gene list and your background. Then you select your gene sets and pathways, you run your enrichment test using the Fisher's exact test, and you correct for multiple testing. So that was the gene list enrichment test.
Now, if you remember, we spoke about the ranked gene list. If we have a ranked gene list, it's recommended, and we apply different statistics. There are a few that are available. One, which is very simple and which we are going to see, is the minimum hypergeometric test, which is used in the tool g:Profiler that we are going to see in the practical lab. And then we are going to see a modified Kolmogorov-Smirnov test that is used in the method called GSEA. And then listed here is the Wilcoxon rank sum, which is the same as the Mann-Whitney U test. So two different names for the same test, and the difference between the Wilcoxon-Mann-Whitney and the Kolmogorov-Smirnov is that Wilcoxon-Mann-Whitney is assuming a normal distribution, whereas the KS test is non-parametric. That's not true? Oh, Wilcoxon, yeah. Okay, okay. Thank you. So in the classic two-class design, which is the best example for RNA-seq.
So we have two groups, and now instead of defining two different gene lists we are going to generate a ranked gene list. So here is the formula that we talked about. If we use edgeR or DESeq, we always have as output the p-value and the log-fold change. From the log-fold change we just extract the sign, so plus or minus, to indicate if our gene has a higher expression in class one compared to class two, or, if it's a negative log-fold change, that the expression of the gene is lower in class one compared to class two. So the sign of the log-fold change we just transform into a plus one or a minus one. Positive means up-regulated, negative means down-regulated in class one compared to class two. And minus log 10 of the p-value just transforms a small number, so a p-value close to zero, into a high score. So let's say I have a very small p-value like 0.001, it will transform into a score of 3. So then maybe it's my top significant gene and it will be ranked number one. So it does look complicated, but actually, let's say this is my top gene, it's up-regulated, and I know it's a positive log-fold change and a very small p-value, then it will be number one in my ranked list. Then I have another gene that is really down-regulated in class one compared to class two, then it will be at the last position of my ranked list. So then all the genes that I could measure in my whole genome experiment will be ranked from the top up-regulated to the top down-regulated, with the non-significant genes in the middle. Yeah? So in this case, do you still have the background? No, no, because we are going to use a different statistical test and so we don't need to define the background. But I would say that usually we restrict to the genes that are present in our experiment. You know, in RNA-seq, we remove the low counts. Yeah, exactly. Even if they're not part of the up-regulated data?
Exactly, the non-significant genes will have a lower score and they will be in the middle of the ranked list. And we can only do it if we measure the whole genome, because if we have a small custom array we don't need to do a big ranked list. Yeah? Can you explain the fold change? So you would take the ratio from class one to class two and log it, and then you would get the p-value from the test? So, yeah. Okay, just to give you an example for RNA-seq. In the case of RNA-seq, the output of the results is from edgeR and DESeq, which are using the negative binomial distribution, so it's not a t-test, but we still have a log fold change. It's like a ratio. For example, on average in class one for my gene one, I have an average read count of 200, and then for my class two, I have an average of 400. So if I do the ratio, 200 divided by 400, I have a negative log fold change. And the log fold change, in RNA-seq, is really calculated by edgeR and DESeq. And yeah, if you had microarray data, microarray being continuous data, then you could use a t-test, and in this case you could directly use the t-value to rank your genes from top up-regulated to top down-regulated. Yes? So the value, no. We don't need it because we think that the p-value is a better estimate, and so we just use the sign of the log fold change. It's really to rank from top up-regulated to top down-regulated. We could remove the sign of the log fold change and just do the minus log 10 of the p-value, and that would rank the genes from the most significant to the least significant, and we would not look at the direction of the changes. But usually, in this ranked gene list, we like to separate the genes that are up-regulated in class one compared to class two from the down-regulated ones to facilitate our interpretation.
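The ranking score described above, the sign of the log fold change times minus log 10 of the p-value, can be sketched in a few lines. The four genes and their values below are invented for illustration, and the function name is hypothetical; in practice the log fold changes and p-values would come from the differential expression output.

```python
import math

def rank_score(log_fold_change, p_value):
    """sign(logFC) * -log10(p-value).

    Assumes p_value > 0; a p-value reported as exactly 0 would need to be
    replaced by a small positive number first (not handled in this sketch).
    """
    sign = 1.0 if log_fold_change >= 0 else -1.0
    return sign * -math.log10(p_value)

# Made-up differential expression results: gene -> (log fold change, p-value)
results = {
    "geneA": (2.1, 1e-8),    # strongly up-regulated in class one
    "geneB": (-1.7, 1e-6),   # strongly down-regulated in class one
    "geneC": (0.1, 0.8),     # not significant
    "geneD": (-0.2, 0.5),    # not significant
}

# Rank from top up-regulated to top down-regulated;
# the non-significant genes end up in the middle.
ranked = sorted(results, key=lambda gene: rank_score(*results[gene]), reverse=True)
print(ranked)
```

Dropping the sign and sorting on minus log 10 of the p-value alone would instead rank from most to least significant, ignoring the direction of change.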
So for the biological interpretation, it's different if a gene is up-regulated compared to a gene that is down-regulated. That's why we make this distinction, but we also could rank from most significant to non-significant. Yes? Yeah, so that's the idea of the p-value, that it looks at the amplitude of the changes between the two classes and also at the variation within each group. So that's why we take this into account rather than the log fold change: the log fold change usually just takes into account the amplitude of change between the two groups but not the reproducibility within groups. So the p-value should take into account the between-group and the within-group variability. So for the genomes and the genes? Yes. For this one, we actually pick all the genes and we generate the ranked list. Yes, so people are used to doing gene lists with an arbitrary threshold. And what we say is, when you have RNA-seq, try to avoid it, because an arbitrary threshold is arbitrary. It could be that you are too permissive when you choose a threshold, or you are too stringent and you are going to lose information. And in the data, we don't really know where to put the threshold. So if you use a ranked list and a statistical test that uses a ranked list, the statistical test is going to work out the threshold by itself. So it's probably going to be better at finding the natural threshold. I just showed the two gene lists because this is what people used to do, but we recommend doing a ranked list when you can. For example, for RNA-seq, for sure you can generate a ranked list. There are other cases, like the time course experiment or if you have a list of mutated genes, where you cannot get a ranked list. In this case, you have to use tools that take as input a gene list. And the p-values for the genes are actually the adjusted p-values for the genes? So for you... Actually it's a good question for the ranked list. We don't use the adjusted p-value.
So the adjusted p-value, we would use it to select genes, but we don't use it here because of the way it's calculated, and we are going to see this in a few slides: it has a lot of ties. Like you can have a lot of genes that have the same corrected p-value of 0.4. So then 0.4, 0.05, you have a lot of ties when you get to the adjusted p-value, which is, for example, the FDR. So that's why it's not the best thing to use in a ranked list. The false discovery rate, the FDR, you use it when you want to select genes. Right. So I'm ranking genes, I have an FDR of 0.05, and yeah. So how do I know? So if I run a differential expression analysis, I get a list of genes, right? Yes. So the list of genes is very, very long. Yes. And for each one of those genes, I have a p-value and an adjusted p-value. And an adjusted p-value. But the p-value can range from 0 to 1. Yeah. So the genes that are not significantly different can have 1. Yes. So what's the cutoff for the p-value? So in the ranked list, there is no cutoff. You take all genes. If you have 15,000 genes, you rank all your 15,000 genes. 15,000 genes to begin with, like the background, okay? Yeah. So all the background will be, okay, got you.
Okay, so the recipe for the ranked list enrichment test is: rank your genes, then select your gene sets, run the enrichment test, and correct for multiple testing if necessary. So here, we speak about the multiple testing correction after the gene set enrichment. So the FDR of the gene sets, not of the genes. So then it's the same diagram as we saw for the gene list. But now instead of having a pink gene list and a background, we have a ranked gene list.
And what we look at is whether our gene set is more at the top or more at the bottom of the list. We use different statistics, and then we have a table at the end, which is the same, with the gene sets listed and the p-value associated with each gene set, to see if it happened by random chance only and the probability of making a type one error. Yes. So it means that those genes haven't changed that much. Like a small change is what we consider as part of the gene set. Here? Yes. So for this gene set to be, let's say, significantly enriched in my data, many genes in this gene set, or the majority of the genes in my gene set, should be at the top of the list. And if this gene set is significantly enriched in my down-regulated genes, then it should be at the bottom of the list. So the majority of the genes that are in the gene set, for example the cell cycle genes, I should have many of these genes at the bottom of the list. So, yeah. I'm thinking of genes like transcription factors, where a slight change causes a lot of changes downstream. Yeah. So maybe you don't see such a huge change, but it's important that the small change is significant. Yes. Yes. It depends on the rank in the list. Yes. And GSEA is going to be a good method for that. So even subtle changes will add up. For example, GSEA is going to calculate a cumulative sum. So even if we have a small change in gene 1, plus a small change in gene 2, plus a small change in gene 3, it is going to increase your enrichment score. So it's going to get significant because of the addition of the small changes.
So we go first to the minimum hypergeometric test that we are going to see in g:Profiler. What this test is doing is calculating a Fisher's exact test, but multiple times.
So we're going to have our ranked list, for example ranked from the most significant genes to the least significant genes. And each time we get to a gene that is in our gene set, a Fisher's exact test is going to be calculated. And each time we are going to get a p-value for this Fisher's exact test, and the minimum hypergeometric test retains only the threshold with the lowest p-value. And because we test multiple gene sets, we need to correct for multiple testing.
So here's the graph to try to explain it. The black bars are illustrating the ranked list. So for example, the first black bar here is my gene number one, my most significant gene, and then it goes on to the least significant genes. And here is the gene set, the pathway that we are testing. And the red bars are all the genes that are in my ranked list and in my gene set. So now for the first red bar, the test is going to calculate a Fisher's exact test, and we are going to have a p-value. So it's going to calculate it for one gene that is overlapping between my gene set and the first five genes of my ranked list. And I get a p-value. Then it's going to go on, and at each point calculate a Fisher's exact test. And maybe here it is going to be, well, I have 10 genes in my gene set, and it contains the top 20 of the gene list. So 10 out of 20 gets a p-value. It could be that this p-value is better than the p-value at the beginning because you got more overlap between your gene set and your gene list. So this is the way the minimum hypergeometric test is working.
So these are the black bars. This is the ranked list. Each bar is a gene. So let's say this one is a ranked list that contains 200 genes. So that's not the full whole genome ranked list. That's the differentially expressed genes. And we are going to see this in g:Profiler. So let's say you've selected 500 genes.
But you can rank them using your p-value, from the top significant, with the best p-value, to the least significant. So you have gene one, which is the best, and then gene number 500, which is the least significant. And then you are testing this pathway, this gene set, that contains maybe, I don't know, 30 genes. So 30 genes that are in the gene set and in my ranked list. This is my overlap. And now the minimum hypergeometric test wants to see what is the best threshold to stop at and get the best p-value to say that this gene set is enriched in my list, and it will tell you the threshold. It will tell you, well, the best threshold, the p-value of 0.001, I will get it when your gene list has the size 150. The more genes you get, the more likely you will have overlap with the genes, right? You could think the more genes you get, the more overlap you have. But you need to think also about all the genes that did not overlap. So at the beginning of the list, you may have two genes that are overlapping between your gene set and your ranked list and 10 genes that did not overlap. So it's not so good. And then you continue, and then you may have 15 genes that overlap and 20 genes that are not overlapping. So the ratio is going to increase, and you are going to get more overlap compared to non-overlap. But then you have a bunch of genes here that are non-overlapping. So even if you increase the size of your gene list here, you are not going to increase the ratio between the number of genes that are overlapping and the number of genes that are non-overlapping. And that's what the minimum hypergeometric test is trying to do: it is trying to tell you the threshold where you have the most enrichment. And you do it when you choose a permissive threshold. So you have your FDR of 0.05, and you selected your 500 genes, but you're still more confident about the genes that are at the top of the list. So you want to put more weight on these genes that are at the top of the list.
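Here is a minimal sketch of the threshold search just described: walk down the ranked list, compute a one-sided hypergeometric p-value each time a gene from the gene set is reached, and keep the cutoff with the lowest p-value. This is an illustration only and not g:Profiler's implementation. It assumes the ranked list itself is used as the background, and the function names and the 20-gene toy list are invented.

```python
from math import comb

def hypergeom_tail(x, k, m, t):
    """P(overlap >= x) when k genes are taken from t, of which m are in the gene set."""
    return sum(comb(m, i) * comb(t - m, k - i)
               for i in range(x, min(k, m) + 1)) / comb(t, k)

def minimum_hypergeometric(ranked_genes, gene_set):
    """Return (lowest p-value, list size at which it is reached).

    Assumption of this sketch: the background is the ranked list itself.
    """
    gene_set = set(gene_set)
    t = len(ranked_genes)
    m = sum(1 for gene in ranked_genes if gene in gene_set)
    best_p, best_cutoff = 1.0, 0
    x = 0                                    # gene set members seen so far
    for k, gene in enumerate(ranked_genes, start=1):
        if gene not in gene_set:
            continue                         # only test at gene set members
        x += 1
        p = hypergeom_tail(x, k, m, t)       # top-k genes versus the gene set
        if p < best_p:
            best_p, best_cutoff = p, k
    return best_p, best_cutoff

# Toy ranked list of 20 genes; the gene set sits mostly at the top.
ranked_genes = [f"g{i}" for i in range(1, 21)]
toy_gene_set = {"g1", "g2", "g3", "g5", "g18"}

best_p, best_cutoff = minimum_hypergeometric(ranked_genes, toy_gene_set)
print(best_p, best_cutoff)
```

Because the lowest p-value is picked among several cutoffs, it is optimistic if read as an ordinary p-value; how a given tool accounts for that search is outside what this sketch covers.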
So that was the simplest method. But one very popular method is the GSEA method. And this works best with the whole genome. So when you can rank your genes from the whole genome, so not your 500 genes, but the whole genome. It uses a modified Kolmogorov-Smirnov test, and it works by calculating a running sum. So now, the black bars will be all the genes in the genome, ranked from top up-regulated to top down-regulated. So again, the red bars indicate the genes that are in the gene set that we are testing, and where they are located in our ranked gene list. And so we start at gene number one. At gene number one, the running sum is going to be 0. Gene number one is not in the gene set we are testing, so the running sum is going to stay 0. And then gene number two is still not in the gene set that we are testing. The running sum stays 0. But then gene number five is present in our gene set, so the running sum is going to increase. And if you look at the graph, you can see there is a higher density of genes in this region. So that's why the running sum here is going to increase a lot. So, gene in the gene set, gene in the gene set, gene in the gene set, gene in the gene set. So it increases a lot until it reaches a peak. And now we see a decrease. And we see a decrease because the density of the red genes is decreasing. So now these genes in my ranked list are not in my gene set. So it's decreasing, decreasing. I have a little peak here because we have this red gene here. But after that, it's decreasing, decreasing, decreasing. And it goes up again and then decreases, decreases. And here I have no genes in my gene set, so it's a complete decrease. So that's the way the running sum works for GSEA. And so the maximum point here is called the enrichment score.
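The running sum walked through above can be written as a short Python sketch. It follows the usual published formulation as an assumption, not the GSEA source code: a gene in the set raises the sum by its share of the total absolute score of all genes in the set, and a gene outside the set lowers it by one over the number of genes outside the set. With a whole genome those downward steps are tiny, which is why the sum looks as if it stays at 0 at the start. The function names and the `weight` exponent are illustrative.

```python
def running_sum(ranked_genes, scores, gene_set, weight=1.0):
    """Running sum over a ranked list. `scores` are the ranking scores in
    the same order as `ranked_genes`; `weight` is the exponent applied to
    the absolute score of each hit (0 gives an unweighted walk)."""
    in_set = set(gene_set)
    hit_weights = [abs(s) ** weight if g in in_set else 0.0
                   for g, s in zip(ranked_genes, scores)]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for g in ranked_genes if g not in in_set)
    if total_hit == 0 or n_miss == 0:
        return [0.0] * len(ranked_genes)  # nothing to walk over
    values, current = [], 0.0
    for gene, w in zip(ranked_genes, hit_weights):
        if gene in in_set:
            current += w / total_hit   # step up at a gene-set member
        else:
            current -= 1.0 / n_miss    # small step down otherwise
        values.append(current)
    return values


def enrichment_score(values):
    """The enrichment score is the point where the running sum is
    furthest from zero (positive peak or negative trough)."""
    return max(values, key=abs)


if __name__ == "__main__":
    genes = [f"g{i}" for i in range(1, 11)]
    scores = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]  # up-regulated first
    walk = running_sum(genes, scores, {"g1", "g2", "g3"})
    print([round(v, 2) for v in walk])
    print("ES =", round(enrichment_score(walk), 2))
```

A gene set concentrated at the top of the list gives a positive score, and the same call with a set concentrated at the bottom gives a negative one.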
So usually when we run GSEA, the first result we get is the enrichment score and the normalized enrichment score. So this is another example for GSEA. I have here the ranked gene list. The genes are ranked from top up-regulated to top down-regulated, with the non-significant genes in the middle. And I am testing three gene sets. Gene set number one contains eight genes. And we can see from the red bars that the eight genes are mostly located at the top of the ranked list. So it's going to be a positive enrichment, an enrichment toward the up-regulated side of the ranked gene list. And on GSEA plots, it looks like this. So now my ranked list, you flip it. So here is my gene number one. This one is my last gene. This is the side of the top up-regulated genes. And this is the side of the down-regulated genes. And the black bars are all the genes that are in the gene set that I'm testing. And the name of the gene set is here. Gene set number two also contains eight genes. But it's not significantly enriched, so I don't have a GSEA plot for that. It's scattered randomly throughout the ranked list. So you can see a gray gene at the top. You can see other gray genes in the middle of the ranked list, where the non-significant genes are located. And you can see a few at the bottom of the list. So it's not significant. There is no direction toward the up-regulated or the down-regulated side. And the third gene set, number three, contains eight genes. And we'll see that it's enriched toward the down-regulated side of the ranked list. It will look like this on a GSEA enrichment plot. And here is the zero. So it's negative. It's a negative enrichment score. Yeah. So the bars here, are they in the gene set and in our ranked list? Yeah. Yeah, so exactly. So that's all the genes that are contained in the gene set that we are testing. And that tells you their position in the ranked list. That contains all the genes.
Usually, they are filtered out by GSEA at the first step, so we don't see them. And there is also one thing to mention about GSEA. We say it's a modified Kolmogorov-Smirnov test because it has a weight. And the weight is there to favor the genes that are at the very top or the very bottom of the ranked list, because we don't want to have a peak in the middle. And we use the expression value. So in our case, it's the minus log 10 of the p-value that gives this weight. So at the beginning of the list, when the running sum is increasing, it's increasing more. For example, it's increasing by 0.2. And when we reach the middle of the list, then as it uses the score, like your expression score, the sum is not going to increase a lot. It's going to increase by 0.001. So it's a way for GSEA to avoid peaks in the middle of the list. So we always get peaks that are weighted toward the top up-regulated genes or the down-regulated genes. And there is also an option in GSEA, which is called weight. So if we are not sure, and we say, well, I'm not sure, GSEA maybe is going to take into account genes that are not very significant, you can increase this weight and say, I want to put more weight on the very significant genes. So you can do that by changing the weight parameter. [Audience question, partly inaudible: when running GSEA, plots are not produced for all the gene sets; is it possible to get them?] Yes, you can change the parameters. And instead of setting top 20, you set, for example, top 200. But what I do sometimes, I select my gene sets that are significantly enriched at my FDR. But I'm really interested in this one gene set. So I basically extract this gene set from my pathway database, and I rerun GSEA on this one only. I do it when I make a figure, something like that. But by default, I use top 200 instead of top 20. The only thing is that when GSEA has finished running the analysis and it creates the report, it takes maybe a few seconds more to finish.
Because sometimes you have like 500 gene sets that are significantly enriched, and you cannot see these plots. And it's very, very important for quality control to go back and look at these plots. Yeah? I think you already mentioned at the beginning that the over-representation analysis is more popular. And I think I've also seen it a lot in comparison to the GSEA analysis. Which one? The over-representation analysis, the enrichment analysis. But has anyone compared them, having the same data set, to say, OK, which one is best for explaining my results? So if I understand, you ask whether someone has compared them: you have differential expression data, you make your two discrete gene lists with your arbitrary threshold, you do an over-representation analysis, and you compare with GSEA. So yeah, we do it all the time. And that's why we recommend a ranked gene list. Because we get more results with the ranked gene list. Because it's very difficult to set an arbitrary threshold. And also, it is related to exactly what you mentioned with the transcription factors. Sometimes you have very, very subtle changes. A gene has just a little change. It's barely significant. But all the genes in this pathway are changed a little. And you are going to lose this information if you use an arbitrary threshold. But GSEA is going to get this information, and this pathway is going to be significant in GSEA. Or let's say you use the arbitrary threshold, and you get eight genes in a pathway that pass the threshold. But two were just under your arbitrary threshold. They were ranked just below. You missed them the first time, and your gene set did not reach significance. But if you use GSEA, then you get these two additional genes, and it's going to be better. The only time I don't use GSEA when I have RNA-seq data is sometimes when I have kind of noisy data. And I had to clean the data, maybe using the log fold change. Or maybe by using additional information, prior knowledge.
And I have maybe merged it with, like, mutation data. And then I have no choice. I just have a gene list. Sometimes when it's very noisy you say, well, I don't know about the other genes, but I just trust this top 50. Then I just stop here. But usually a ranked gene list is always better. Can you use GSEA with clean data? Oh, yeah, for sure. It's recommended. Let's say this one is very, very noisy. And I just select my top 30, top 50. And I say, well, the rest maybe yes, maybe not. But usually, like 90% of the time, whether it's clean or less clean, I use GSEA. Even for noisy data, GSEA is better. It's just when it's extremely noisy that you don't want it. Let's say you have a gene list from your experiment, and you want to clean it, filter it according to your log fold change. Yes. Can you give the filter to GSEA? You could. Instead of using the sign of the log fold change, you could also use the log fold change value. I never do it. But if you have a good argument to do it, you could, because you are building the scoring system. So if your argument and your calculation are valid, then you can do it. There is no single way to build your scoring system. So I give the algorithm a list with a p-value and a log fold change? You do it before GSEA. So the scoring system, you do it on your own. What you give to GSEA is two columns. One is the gene names, and the other one is the score. So the score I gave you is the minus log 10 of the p-value, multiplied by the sign of the log fold change, for example. That's what we do. But if you had a reason not to use this score, but another score, then you could use it. And there is one option we don't recommend. I know it's possible with the classic GSEA to input a matrix of RPKM, and GSEA itself is going to calculate the differential expression using metrics like a signal-to-noise ratio or a t-value. But we don't think it's the appropriate way to calculate differential expression in RNA-seq.
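The two-column input described above (gene name and score) can be built with a few lines of Python. This is a minimal sketch under assumptions: the function names and the input tuples are hypothetical, and the floor on the p-value is an added safeguard so that a reported p-value of exactly 0 does not break the logarithm.

```python
import math


def rank_score(p_value, log_fold_change, p_floor=1e-300):
    """Score = -log10(p-value) multiplied by the sign of the log fold change.
    `p_floor` is an assumption of this sketch: it avoids log10(0)."""
    if log_fold_change > 0:
        sign = 1.0
    elif log_fold_change < 0:
        sign = -1.0
    else:
        sign = 0.0
    return -math.log10(max(p_value, p_floor)) * sign


def make_rank_list(results):
    """`results` is an iterable of (gene, p_value, log_fold_change) taken
    from a differential expression table. Returns (gene, score) pairs
    sorted from top up-regulated to top down-regulated."""
    scored = [(gene, rank_score(p, lfc)) for gene, p, lfc in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    table = [("GENE_A", 0.001, 2.0),
             ("GENE_B", 0.5, -0.1),
             ("GENE_C", 0.0001, -3.0)]
    for gene, score in make_rank_list(table):
        print(f"{gene}\t{score:.3f}")
```

Strongly up-regulated genes end up at the top, strongly down-regulated genes at the bottom, and the non-significant genes in the middle, which is the layout the running sum expects.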
So because the count data are discrete values and not continuous values, we think it's better to use the negative binomial distribution in edgeR or DESeq and then to create our own ranked list. When you get the output of edgeR or DESeq, even if you don't script in R, you could do it in Excel. And we are going to show you an example in the practical lab. So now, in the running sum that we just explained, the peak, which is the point where the cumulative sum is the highest, is just a score. It's not a p-value. So how do we go from the enrichment score to the p-value? So here GSEA, which at the beginning was a new statistic, uses random permutation to get the p-value. So each time we test a gene set, we set the number of permutations to 1,000. What GSEA is going to do is replace the genes that are in the original gene set by random genes. And it's going to do it 1,000 times. And each time it's going to calculate an enrichment score for these random genes in the gene set that we are testing. And from all these random enrichment scores, it's going to build the null distribution. And then it is going to see where our observed enrichment score for the gene set that we are testing falls in the null distribution. And when we do permutations like this to calculate the p-value, we call it an empirical p-value. And here it says, for example, well, we got only four random results that were equal to or higher than my observed score, divided by the total number of permutations. So that's called the gene set permutation in GSEA. And this is the only permutation we can use when we use a ranked gene list in GSEA. So we have seen the hypergeometric test for a discrete gene list. We have seen the minimum hypergeometric test that is available in g:Profiler, for when you have, I would say, a list of 500 to 2,000 genes, but you want to put more weight on the most significant genes of your defined gene list.
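The gene set permutation described above can be sketched as follows. This is a simplified illustration, not GSEA's code: the function names are made up, the enrichment score uses the usual weighted running sum as an assumption, and the p-value simply counts random scores at least as extreme as the observed one in the same direction and divides by the number of permutations, as in the example on the slide.

```python
import random


def enrichment_score(scores, hit_positions):
    """Peak of the weighted running sum. `scores` is the ranked score
    column and `hit_positions` the indices of the gene-set members."""
    hits = set(hit_positions)
    total_hit = sum(abs(scores[i]) for i in hits)
    n_miss = len(scores) - len(hits)
    if total_hit == 0 or n_miss == 0:
        return 0.0
    current, best = 0.0, 0.0
    for i, s in enumerate(scores):
        current += abs(s) / total_hit if i in hits else -1.0 / n_miss
        if abs(current) > abs(best):
            best = current
    return best


def empirical_p_value(scores, hit_positions, n_perm=1000, seed=0):
    """Replace the gene set by random genes `n_perm` times, score each
    random set, and report the fraction of random scores that are at
    least as extreme as the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = set(hit_positions)
    observed = enrichment_score(scores, hits)
    as_extreme = 0
    for _ in range(n_perm):
        random_hits = rng.sample(range(len(scores)), len(hits))
        es = enrichment_score(scores, random_hits)
        if observed >= 0 and es >= observed:
            as_extreme += 1
        elif observed < 0 and es <= observed:
            as_extreme += 1
    return observed, as_extreme / n_perm


if __name__ == "__main__":
    # 100 genes scored from +5 (top up-regulated) to -5 (top down-regulated)
    scores = [5 - 10 * i / 99 for i in range(100)]
    es, p = empirical_p_value(scores, range(5), n_perm=1000)
    print(f"observed ES = {es:.2f}, empirical p-value = {p:.3f}")
```

With 1,000 permutations the smallest p-value that can be distinguished from 0 is 0.001, which is one reason the number of permutations matters.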
And then we have GSEA, where we can rank all genes in the genome using a score derived from our genomics experiment. So now, because we usually test multiple gene sets, we need to speak about multiple test correction. So back to our example of red and black balls. Our background is 5,000 balls, containing only five black balls. And the result we got is four black balls and one red ball. And we got, if you remember, a very, very low p-value for that, like maybe 0.01. So we had just a 1% chance of getting this result randomly. So it's very unlikely, when you do one draw, to get four black balls. Yes, but that's when you do one draw only. If you have the right to test it again and again and again, maybe 10,000 times, you are going to get the four black balls and the red ball. So even if it's unlikely, you can get this result. That's why we need to correct for multiple tests. So now if we adapt this to the gene set enrichment analysis, then the squares are one gene set. And we don't have many squares compared to the round balls. So it's very unlikely to get these squares. But if we test again and again all these gene sets containing these square genes, we increase our chance of getting them. So we increase our chance of making a type 1 error. So that's why we need to correct for multiple hypothesis testing. So one intuitive way to do it is basically just to take our original p-value and multiply it by the number of tests we did. So if we tested 10,000 gene sets, then we could multiply our original p-value by 10,000. And it actually exists. It's called the Bonferroni correction. And the Bonferroni correction is known as the most stringent method. With the Bonferroni correction, we say that we are controlling the family-wise error rate. And it means that when we select the pathways under a corrected p-value of 0.05, we say that the probability that any one of them is a type 1 error is at most 5%.
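The Bonferroni correction described above is short enough to write out. This is a minimal sketch with illustrative function names. The second function shows why the correction is needed; it assumes the tests are independent, which is not true for overlapping gene sets, so it is only there to give the intuition.

```python
def bonferroni(p_values):
    """Multiply each nominal p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


def chance_of_any_false_positive(alpha, n_tests):
    """Probability of at least one type 1 error among `n_tests` tests run
    at level `alpha`, assuming the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    print(bonferroni([0.00001, 0.01, 0.5]))
    # One draw at 1% is unlikely; 10,000 draws almost surely produce a hit.
    print(chance_of_any_false_positive(0.01, 1))
    print(chance_of_any_false_positive(0.01, 10_000))
```

With 10,000 gene sets a nominal p-value has to be below 0.000005 to pass a Bonferroni threshold of 0.05.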
So usually in the enrichment tools, you will see this Bonferroni correction. And you can try it. But what happens is that it's so stringent that usually no gene sets pass the Bonferroni threshold of 0.05. So there is another way to correct for multiple hypothesis testing. It's called the false discovery rate. And the false discovery rate is the expected proportion of the observed enrichments that are due to random chance. And the method to calculate an FDR that we are going to see is called the Benjamini-Hochberg method. And at the end, we obtain a q-value. So now we are going to see the calculation of the FDR. And that's the last step of the lecture. So first, we have a gene set and the nominal p-value, which is the original p-value that we got from the test. And the first step is to calculate the adjusted p-value. So we rank our gene sets from the most significant, so the lowest p-value, to the least significant. So here, the least significant has a p-value of 0.99. And we tested 53 gene sets. So to calculate the adjusted p-value, we multiply the nominal p-value by the number of gene sets that we've tested, divided by the rank. So here's rank number 1, 2. So here it's important to understand that the more gene sets we test, the more pessimistic the adjusted p-values are going to be. So yes, we need to correct for multiple hypothesis testing, but by increasing the number of gene sets here, we inflate our adjusted p-values. And then the last step is to get the q-value. So to get the q-value, we start from the bottom of the list and we see which one of the p-values is the lowest. So here, 0.99, there is nothing below that, so we put 0.99. Then we go up and here we have an adjusted p-value of 1.04. But because below that we had a value that was lower than this, we put 0.99. And we go up and up like this. So for example, here we had 0.99, but the adjusted p-value is 0.053, so we put it.
But here we also have this 0.053, but below that we had a lower number, so we still put 0.04. So now the adjusted FDR, the q-value, goes from 0.99 to the best one, which is 0.04. And you see all the ties here, so that's why we don't use it to rank the list. How did you get the 0.053? So here, for example, we had 0.053. Because we had here maybe 0.99, and 0.053 is less than 0.99, we put 0.053, because we know it's more significant than the one that is below. And here, just one rank above this one, we had 0.04. It's lower than this one, so we can keep it. And then we go one rank above this one, but in this case, it's higher than 0.04. So it doesn't make sense. So we take the smallest value, and so on. So we always take the value that is equal or smaller. Nominal p-value. So now, if we want to select gene sets that are enriched under an FDR of 0.05, then we get the threshold here. And we report this top number of gene sets as significant under the FDR threshold of 0.05. So again, this slide is just mentioning that the more gene sets you test, the more pessimistic the values are going to be. And one way to filter the pathways could be to remove, for example, pathways that contain two or three genes, because sometimes in the pathway database we have very small pathways that are not very informative. And we also have big pathways that contain more than 500 genes or more than 1,000 genes. And if you remember what Gary said about the Gene Ontology, which is very hierarchical, if you use the Gene Ontology as your pathway database, you may have parent terms like cell cycle or regulation of cell cycle, which maybe yes, but maybe are not the most informative. So a way to filter the pathways is to remove the small ones and the big ones, and it's going to reduce the number of tests that you are doing. OK, so we did the Fisher's exact test, the ranked gene list, and multiple test correction with the Bonferroni. So Bonferroni, very stringent.
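The FDR calculation walked through above (rank, multiply by the number of tests divided by the rank, then go up from the bottom keeping the smallest value seen so far) can be sketched in Python. The function name is illustrative, and the cap at 1 is a common convention that this sketch assumes.

```python
def benjamini_hochberg(p_values):
    """Return one q-value per nominal p-value, in the input order."""
    n_tests = len(p_values)
    # Step 1: rank the gene sets from the lowest to the highest p-value.
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    # Step 2: adjusted p-value = nominal p-value * number of tests / rank.
    adjusted = [p_values[i] * n_tests / rank
                for rank, i in enumerate(order, start=1)]
    # Step 3: walk up from the bottom of the list and keep the smallest
    # adjusted value seen so far (capped at 1).
    q_values = [0.0] * n_tests
    smallest_so_far = 1.0
    for position in range(n_tests - 1, -1, -1):
        smallest_so_far = min(smallest_so_far, adjusted[position])
        q_values[order[position]] = smallest_so_far
    return q_values


if __name__ == "__main__":
    nominal = [0.02, 0.9, 0.001, 0.021]
    for p, q in zip(nominal, benjamini_hochberg(nominal)):
        print(f"p = {p:<6} q = {q:.3f}")
```

In this toy input the gene sets with p-values 0.02 and 0.021 end up with the same q-value, which is the kind of tie visible on the slide.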
If you get results with the Bonferroni, you are confident to report this gene set. If nothing passes the Bonferroni, then you can try the FDR, which is usually what we do. And I think we covered all the objectives of module 2. And so what we learned is that the typical output of an enrichment analysis is always a table. And the table will always contain all the gene sets, significant or not, and always a measure of the size of the overlap between the gene set and the gene list. So if it's a defined gene list, that could be the number of genes shared between your gene set and your gene list. If it's GSEA, that's going to be the enrichment score. If you have a big enrichment score, a large value, it means you have a large overlap at the top of your list between your gene set and your gene list. And then the p-value. And you always have to get an adjusted p-value. It could be Bonferroni. It could be FDR. But this is the corrected p-value that you are going to use to select your gene sets. So it's the same. It's a tabular format. Yeah? You mentioned before an enrichment score, but also a normalized enrichment score. Normalized enrichment score, yeah. It's normalized by gene set size. Yeah. So the enrichment score is for when you test only one pathway. So let's say you've decided to test only cell cycle. You will just look at the enrichment score. But if you test multiple pathways and you want to rank the gene sets, then they have to be corrected for gene set size because some gene sets are smaller than others. So it's a normalization that GSEA does to correct for small pathways and big pathways. So if you run GSEA and you test more than one gene set, you should use the normalized enrichment score, just to compare gene set one and gene set two. So usually we have a table as the output of an enrichment analysis. What we need to know is that the gene sets frequently overlap. So some genes in gene set one are also going to be in gene set two.
But with a table it is not very easy to identify the functions and the gene sets that are related. So that's why we usually need to go to network visualization. And so many tools are available to do enrichment analysis. Some are web based. So you just connect to a website and you copy and paste your gene list. Some are within Cytoscape. So we'll see that some Cytoscape applications can do the enrichment analysis. Some tools are standalone. So you need to download them onto your computer. You probably did it today with GSEA. And some are available through R packages. So it depends on what you prefer. And I like this OMICtools website. There's a free registration, but you can search and it gives you a list of tools. So now we are going to see g:Profiler and GSEA as examples. It could be that depending on your data, or depending on your model organism, or by personal choice, you want to choose another tool. So when I look at a tool, I try to answer a few questions that are related to the concepts that we've seen today. And a few questions are, so does the tool cover your model organism? That's the first question. Is there a good choice of gene sets, of pathway databases? So we usually like to test different sources of pathway databases. Some tools just use GO or just Reactome. But we like it when the tools use multiple sources of pathway databases that are up to date. So sometimes a tool was developed years ago, but there is no one to maintain it and the pathway databases are not updated. And we are going to lose information if we don't use up-to-date databases. So which statistics? So now you know: is it a gene list and/or a ranked list? Usually tools have two options, like the basic gene list, where I just copy and paste my gene names. Sometimes there is an intermediate option, where I can put my gene list and a score beside it. Or sometimes they have full ranked list statistics. And then do I like the output style?
So we know it's a table, but sometimes some tools offer some network visualization or better tools. And then because of this workshop, one question is, can you connect the output table with tools like Cytoscape? Because as we will see, it's easier if you can visualize it as a network. And with many tools, if you can download the results with the gene set name and a p-value, then you can use the generic format, for example, of the application that we are going to see this afternoon, which is called EnrichmentMap. So usually, with most of the tools, you can adapt the output for use in Cytoscape. So these are topics that are not covered in this lecture, but if you want to know more about some issues with the enrichment tests, we could speak about correlation between gene sets and dependency between genes. There are some tools that use the topology of the networks. And so we don't cover this, but these issues exist. And if you want to dig more into these issues, then I can recommend this link to the protocol. Yeah, and if you want to know more about the Fisher's exact test, I like this little video with the M&Ms. It actually helped me a lot. The different colors of the M&Ms would be the different gene sets. And the whole bag would be our background universe. So I think I put a link to the video, and I think we are on a lunch break.