Hi everyone. We are starting the second module of this workshop, which is named Finding Overrepresented Pathways in a Gene List. This module covers the statistical tests that are used in pathway enrichment analysis. We are covering the statistics because this is the conceptual knowledge behind all the tools that we are going to present during this workshop, as well as the many other tools that are available for enrichment analysis. We think that if you understand the theory behind them, you will be able to choose the right tool for your project and your data, and you will also be able to interpret the results of the tests correctly. This lecture is going to be easier to follow because we are going to apply it in a practical lab, where we will use two tools, GSEA and g:Profiler.

This is the list of our learning objectives. First, we are going to learn how to differentiate between a defined gene list and a ranked gene list. Then we will review the concepts of p-value and false discovery rate, as well as the tests that we apply in enrichment analysis. At the end of this lecture, you should be able to interpret the results of the two enrichment tools that we are presenting, GSEA and g:Profiler.

Here is how this module integrates with the full workshop. Earlier today and in the pre-recorded video, Gary talked about the different gene lists that we can obtain from our omics experiments. We start by normalizing the data, we apply statistical testing, and at the end of all the steps we get the gene list that we would like to functionally interpret in an unbiased way. That's one element. The other element is the prior knowledge that we have about the function of these genes, which is collected and stored in different pathway databases. These two elements, the gene list and the pathways, can connect or talk to each other only if we use the same gene identifiers.
So, if my gene list is in the format of gene names, then my pathway database should also use gene names. When we have these two elements, we can perform pathway enrichment analysis: we are looking for over-enrichment of pathways in my gene list, and this is the focus of this module. In the next modules, we are going to learn how to visualize the results of this pathway enrichment analysis.

This slide represents the workflow of pathway analysis. Usually our end goal is to find a well-defined pathway that was activated or inhibited in our experimental model. Here we see the PI3K-AKT pathway, but usually, before that step, we extracted this pathway from a more global picture, as we see here in the enrichment map, with many pathways that were significantly enriched. And before getting to that step, we had to run the enrichment analysis itself, and this is what we are learning in module two.

Another way to represent pathway enrichment analysis is this one. Sometimes we think it's complicated, but pathway analysis is just a way to organize your gene list into different categories that are biological processes. I have my gene list on the left, and I can organize it based on the pathways. The genes in black would be part of the axon guidance pathway, the green genes part of aging, the purple ones stem cell development, and then cell migration. So now we can concentrate on these four biological processes only to interpret our data and generate new hypotheses. Pathway analysis has simplified data interpretation. Most importantly, we need to summarize into categories because our gene list is very large: from the omics experiment, we got a very large gene list. If we had gotten a very small gene list, we might want to interpret it in other ways.
Here we are going to talk about one important concept, the overlap, which is used to calculate the enrichment score. We have our gene list on the left, which I've represented here in this circle, and I have 41 genes. Axon guidance, which comes from the Gene Ontology database, is the first pathway that I'm testing. In the original database it contains 39 genes. What we see here is that 13 genes are in common between my gene list and my pathway. So 13 genes is the size of the overlap between my gene list and my pathway, and the size of the overlap is going to contribute to the enrichment score. We can see here that it's about one third of my gene list and about one third of the pathway that I'm testing.

In addition to the simple concept of overlap, we can also associate a score with the genes when we calculate the pathway enrichment score. If my genes in the overlap have a high score, that would increase the enrichment score for the tested pathway. So what we can do is rank all the genes in our experiment using a scoring system. For example, for RNA-seq data that could be the differential gene expression value, and for ChIP-seq and ATAC-seq that could be the p-value of the peaks that are associated with the genes.

Another important concept when we do enrichment analysis is the background. The background is sometimes called the universe, and it represents only the genes that could have been measured in my omics experiment. The genes that could not be measured, or that are not expressed in my cells, are not part of my background and should not be counted. An example of a restricted background would be a custom array, where on the array we just have transcription factor genes or immune genes. In this case, I have to restrict the background to only the genes that were placed on the array.
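As a small illustration of the overlap and background concepts above, here is a minimal Python sketch. The gene symbols and the tiny background are made up for the example; the only point is that the gene list and the pathway are first restricted to the background, and the overlap is then a simple set intersection.

```python
# Hypothetical gene symbols; any identifier type works as long as the
# gene list, the pathway, and the background all use the same one.
background = {"SLIT1", "ROBO1", "EFNA5", "SEMA3A", "TP53", "MYC", "GATA1", "IL6"}
gene_list = {"SLIT1", "ROBO1", "TP53", "ACTB"}           # ACTB is not in the background
pathway = {"SLIT1", "ROBO1", "EFNA5", "SEMA3A", "NTN1"}  # NTN1 is not in the background

# Keep only the genes that could have been measured in the experiment.
gene_list_bg = gene_list & background
pathway_bg = pathway & background

# The overlap is the set of genes shared by the gene list and the pathway.
overlap = gene_list_bg & pathway_bg

print("overlap:", sorted(overlap))
print("gene list size:", len(gene_list_bg))
print("pathway size:", len(pathway_bg))
print("background size:", len(background))
```

These four sizes (overlap, gene list, pathway, background) are the numbers that an enrichment test for a defined gene list works with.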
But for RNA-seq or other omics experiments that are genome-wide, we should not worry too much about the background, because it represents all the genes in the genome. If you want a rigorous background for RNA-seq, for example, what we do at a very early step of the analysis is remove genes with a very low count or a count equal to zero, so we restrict the background to genes that are expressed in our cell model.

That brings us to the outline of this lecture, which describes the workflow of an enrichment analysis. We learned that we have two different types of gene list, a defined gene list and a ranked gene list, and that the statistical test is going to be different. For a defined gene list it is going to be Fisher's exact test, and for a ranked gene list it is going to be a rank-based test that is included in the tool GSEA. Both of them give us a p-value associated with each tested pathway, and the p-value assesses the probability that this pathway is enriched in our gene list by chance only. We will see later in the lecture that we are actually testing many pathways, so we need to correct for multiple hypothesis testing. We are going to learn two methods: one is the Bonferroni correction, and the other is the false discovery rate using the Benjamini-Hochberg method.

As I said, there are two different kinds of gene list. The defined gene list typically contains a fixed number of genes, for example 200 or 500 genes, that we have selected using a threshold; for example, that would be genes that we found frequently mutated in a set of patients. The question that we answer with enrichment analysis is: are any pathways surprisingly enriched in my gene list? And the test is going to be Fisher's exact test. The ranked list is a list of all the genes in the genome that we are able to rank using a score that comes out of our experiment.
So for RNA-seq, when we compare two groups, let's say treated and control, we can rank all the genes from top upregulated to top downregulated. The question that we are going to answer is: are any pathways ranked surprisingly high or low in my ranked list of genes? And the test is going to be the one included in the GSEA tool, which tests ranked gene lists. If we are able to rank all our genes, this method is always recommended, because we try to avoid arbitrary thresholds. With a defined list it is very difficult to know where to put the threshold to select the genes. If we are too stringent, we are going to lose information, but if we are too permissive, we are going to include too many false positives. With a ranked list we don't have this issue.

Here are three examples of types of data where it's easy to get a ranked gene list. The first example is bulk RNA-seq, a classic two-group design where you have control and treated. You do your differential expression, and you rank all the genes from top upregulated to top downregulated to get the ranked list. You can do it in a similar way with single-cell RNA-seq data. First you get your cell clusters, and then you compare cluster one versus all the other clusters, or cluster one versus cluster two. You do your differential expression and, the same way as for bulk RNA-seq, you can rank all the genes from top upregulated to top downregulated. It's also possible with label-free proteomics: if you get a sufficient number of proteins, you can rank them the same way as you do for RNA-seq.

But how do we do this ranking? We need to learn it because we are going to do it in the practical lab. Let's say I have RNA-seq data. I have my matrix, I have two classes, let's say control and treated, and I do my differential expression using an R package like edgeR or DESeq2.
Then I rank the genes: the top upregulated in treated at the top, the non-significant genes in the middle (so I keep the non-significant genes), and at the bottom of my list the genes that are downregulated in treated. From DESeq2 and edgeR I get an output table, and there are always two columns that we are going to use: one is the log fold change and one is the p-value. From the log fold change we just take the sign, we just look at the direction. If it's positive, plus one, my gene is upregulated and is going to be at the top of my rank file. If the log fold change is negative, minus one here, the gene is downregulated and will be located at the bottom of my rank file. So the formula to calculate the ranking score is the sign of the log fold change, as I just explained, multiplied by minus log10 of the p-value. A significant p-value is very close to zero, meaning that the gene is highly differentially expressed, and when I take minus log10 I transform this very small p-value into a high score. In this way I can rank my file from high score to low score: high scores for the upregulated genes, the non-significant genes in the middle, and then the low scores for the downregulated genes.

That was for the rank file, and we said that it's recommended, but in some cases it's not possible to make a rank file, and then we have to use a defined gene list. Here are three examples where we use a defined list. The first is when our starting point is DNA. We are looking for somatic mutations, and at the end of the analysis we get the list of frequently mutated genes. This one is a defined gene list. Another example would be a time course: we have RNA-seq data, maybe we are studying developmental biology, so we have our matrix of samples with different time points. In this case, what we would like to extract is different gene lists based on the pattern of expression of the genes.
So maybe we want to extract genes that are going up at the end of the time course, or maybe genes that are going down. In this case we have three different gene lists that we will analyze independently using a defined gene list method. The third example comes from ATAC-seq and ChIP-seq. We get our peak regions, and maybe we have treated and control samples, so we select the peaks that are specific to the treated samples. Then we associate the peaks with genes, and we get a defined gene list. Even if this gene list is very large, it is still a defined gene list.

Now we are going to start explaining the statistics behind the simplest test, the defined gene list enrichment test. Given a gene list and given a set of pathways, are any of the pathways surprisingly enriched in the gene list? We are going to learn how to define and estimate "surprisingly". Here on the left, we can see the gene list. You remember that earlier I told you the gene list was 41 genes. And here, in brown, we define the background. The background is actually the whole rectangle, because the gene list is part of the background. The other element is the pathway. Earlier we said the pathway was 39 genes, and we had an overlap of 13 genes. Now we go to the next question: is this overlap larger than expected by chance? We are going to get a p-value out of this, and if the p-value is very close to zero, it means that the pathway is significantly enriched in my gene list and that this is unlikely to be due to random chance.

How do we get this p-value? One way is to draw 1000 random gene lists and compute the overlap size for each of them. When we do that, we are building a null distribution. Then we see whether our observed overlap of 13 is far away from that null distribution.
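The random gene list procedure described above can be sketched in a few lines of Python. This is only an illustration: the background size of 2000 genes is an assumption (the lecture does not give it), genes are represented by integer ids, and the pathway itself is picked at random from the background.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

BACKGROUND_SIZE = 2000   # assumption: not given in the lecture
GENE_LIST_SIZE = 41
PATHWAY_SIZE = 39
OBSERVED_OVERLAP = 13
N_RANDOM = 1000

background = range(BACKGROUND_SIZE)                      # genes as integer ids
pathway = set(random.sample(background, PATHWAY_SIZE))   # hypothetical pathway

# Null distribution: overlap sizes for random gene lists of the same size.
null_overlaps = []
for _ in range(N_RANDOM):
    random_list = random.sample(background, GENE_LIST_SIZE)
    null_overlaps.append(sum(1 for gene in random_list if gene in pathway))

# Empirical p-value: fraction of random lists whose overlap is at least
# as large as the observed overlap.
n_as_large = sum(1 for size in null_overlaps if size >= OBSERVED_OVERLAP)
empirical_p = n_as_large / N_RANDOM

print("largest random overlap:", max(null_overlaps))
print("empirical p-value:", empirical_p)
```

With 1000 random lists, an empirical p-value of 0 only means "smaller than 1 in 1000"; the resolution of the estimate is limited by the number of random lists.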
When you do that, you are calculating an empirical p-value: you count the number of times a random overlap was at least as large as your observed overlap, and you divide by the number of random lists that you tried. The p-value that we get assesses the probability that the overlap between our gene list and the pathway is observed by chance only. A p-value ranges from zero to one. If we have a p-value close to zero, there is a low chance that the result is caused by random chance, so we can say that the pathway is significantly enriched in our gene list. If it's close to one, the result is likely due to random chance. So zero is very good and one is bad.

The problem with the permutation method is that it takes a lot of time to do the permutations, so it takes a lot of computer resources. And sometimes, as in the case of the enrichment test, we know what the distribution of the random samples looks like; we know the shape of the distribution. In this case, it is the hypergeometric probability distribution, and the test that uses this distribution is called Fisher's exact test. So we don't have to calculate an empirical p-value; we can calculate the p-value analytically. This is what we are going to see next.

To understand Fisher's exact test, it is sometimes explained with balls in a box. Here we are going to change balls to genes. Let's say our background contains 5000 genes; most of the genes are red, but 500 are black. We take five genes randomly, meaning we don't look at them, and the result that we get is four black genes and one red one. We already know intuitively that it's not easy to get this result, because there are many more red genes in the box.
What we are going to do now is hypothesis testing. The null hypothesis is that the list we got is a random sample from the population, meaning that it was easy to get and that if we drew again we would get it again. We know that's not the case; we know that it was not easy to get this result, so we will probably have to reject the null hypothesis. If the p-value is close to zero, it means that there is almost no chance of getting our result under the null hypothesis, meaning that the result is not due to random chance. If the p-value is close to one, the result is fully compatible with the null hypothesis, meaning that the result looks random. So if we have a p-value of 0.05, which is low, we can reject the null hypothesis, meaning that it's not random. But even though we reject the null hypothesis, we still have a 5% chance of making a type I error.

As I said, the null distribution in this case is modeled by the hypergeometric probability distribution. So we can get the probability of getting five red genes, which is about 0.6; it's quite high because it's easy to get this result. Then the probability of getting one black gene, which is a bit lower, then two, then three, and so on. The result that we got was four black genes and one red gene. Now we have to go from the probability to the p-value. The p-value for this result is the probability of getting four or more black genes. So we sum up the probability of four black genes and of five black genes, and we get our p-value. Our p-value here is 0.001. It's very low, so we can reject the null hypothesis and say that we have a significant enrichment of this pathway. Let's say that the black genes are the pathway that we are testing: we have a significant enrichment of this black pathway in our list.
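The red and black gene example above can be reproduced with the hypergeometric formula using only Python's standard library. This is a minimal sketch; the function name `hypergeom_pmf` is made up for the illustration. With exactly 5000 genes, 500 black genes, and 5 draws, the exact tail sum comes out somewhat below the rounded figure quoted in the lecture.

```python
from math import comb

def hypergeom_pmf(k, total, n_black, n_draws):
    """Probability of exactly k black genes among n_draws genes
    drawn without replacement from a box of `total` genes."""
    return comb(n_black, k) * comb(total - n_black, n_draws - k) / comb(total, n_draws)

TOTAL, N_BLACK, N_DRAWS = 5000, 500, 5

# Probability of 0, 1, 2, 3, 4, and 5 black genes in the draw.
pmf = [hypergeom_pmf(k, TOTAL, N_BLACK, N_DRAWS) for k in range(N_DRAWS + 1)]
for k, p in enumerate(pmf):
    print(f"{k} black genes: {p:.6f}")

# One-sided p-value: probability of drawing four or more black genes.
p_value = pmf[4] + pmf[5]
print(f"p-value (4 or more black genes): {p_value:.6f}")
```

The first probability (zero black genes, so five red genes) is the large one, and the probabilities shrink quickly as the number of black genes goes up, which is why the sum for four or more is so small.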
Now, in case you need to do it yourself, and you have just one pathway that you would like to test, you can manually create a two-by-two contingency table like this and use, for example in R, the Fisher's exact test function. What you need to enter is four numbers. The first number is K, the size of your gene list. The second number is M, the size of the pathway that we are testing. Then T is the size of your universe. And the last one is X, the size of the overlap between your gene list and the pathway. You enter these into the formula of Fisher's exact test and you get your p-value. I'm showing this formula because we said that the background is an important concept: if you are not doing a genome-wide experiment, you need to restrict your background, and you can see here that, because it goes into the formula, it will strongly affect the resulting p-value. So be careful to enter the correct background.

Also, this formula has a lot of combinations and factorials, so it used to take a lot of computer resources. When people built their enrichment tools, they used approximations of Fisher's exact test. Nowadays computers are more powerful, so tools can really use Fisher's exact test, but on many occasions they still use an approximation for this reason.

If you want to learn more about how Fisher's exact test works, I really like this video from StatQuest. Josh Starmer explains enrichment analysis using a bag of M&M's. The full bag would be the universe, and the different colors would be the pathways. We draw, I think, eight M&M's, and we get seven blue, which is the blue pathway, and one red. And I think that the p-value is going to be very low.
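Going back to the four numbers K, M, T, and X described above, here is a hedged Python sketch of how they fill a two-by-two table and give a one-sided enrichment p-value. The function names are made up, the gene list, pathway, and overlap sizes are the 41, 39, and 13 from the earlier slide, and the two background sizes are assumptions chosen only to show how strongly the background affects the result.

```python
from math import comb

def contingency_table(k, m, t, x):
    """Two-by-two table: rows = in gene list / not in gene list,
    columns = in pathway / not in pathway."""
    return [[x, k - x],
            [m - x, t - k - m + x]]

def fisher_enrichment_p(k, m, t, x):
    """One-sided p-value: probability of an overlap of x or more by chance
    (sum of hypergeometric probabilities from x up to the largest possible overlap)."""
    largest = min(k, m)
    tail = sum(comb(m, i) * comb(t - m, k - i) for i in range(x, largest + 1))
    return tail / comb(t, k)

K, M, X = 41, 39, 13           # gene list size, pathway size, overlap

# Hypothetical backgrounds: a whole genome versus a restricted custom array.
p_whole_genome = fisher_enrichment_p(K, M, 20000, X)
p_restricted = fisher_enrichment_p(K, M, 500, X)

print(contingency_table(K, M, 20000, X), p_whole_genome)
print(contingency_table(K, M, 500, X), p_restricted)
```

The same overlap is far less surprising against the small background than against the whole genome, so entering a background that is too large makes the p-value look much better than it should.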
I think this blue pathway is significantly enriched in our data.

For the next three slides, we are going to try to apply what we just learned, because the goal of this lecture is that, after this workshop, you go and choose your favorite tool, you go to the output table, and you are able to see the elements that we are learning. We are going to apply it right away with g:Profiler, which is a web-based tool for enrichment analysis that we are going to use right after in the practical lab. We copy and paste our gene list, perform the pathway enrichment analysis, and this is the output table. What we can recognize here is that the pathways were tested one by one; they all come from the Gene Ontology database. What is interesting is this part of the screen, where we just need to learn the vocabulary. T is for the term, that is, the pathway: 17 is the size of the pathway that we are testing. Q is for query: this is the size of our gene list, 20. Then we see that five genes are in common between the two, so this is the size of the overlap. And U is the universe: this is the size of the background. g:Profiler takes these four numbers, puts them in the equation of Fisher's exact test, and gets this p-value. The p-value is very close to zero, meaning that the pathway is significantly enriched in our gene list. Just note that this is the corrected p-value, but we are going to see that right after. So whenever your tool is doing gene list enrichment, you should see these elements in the output table.

We are going to try another example with another tool called Enrichr. This is the output table, and we recognize again that we have a list of pathways that were tested one by one. They come from the Gene Ontology database, and here is the value that we are looking for.
So, the overlap is 85 divided by 230: 230 is the size of the pathway that we are testing and 85 is the overlap between my gene list and the pathway. My gene list was 200 genes here and my background 20,000. The background is not in the output table because it's a constant value, but Enrichr is going to enter these four values in the Fisher's exact test formula to get the p-value.

A third example is with PANTHER, which is also a web-based enrichment tool, and this is the output. Again, you can recognize the list of pathways tested one by one. This column here is the number of genes in the original pathway. The second column is the size of the overlap between the pathway and my gene list. And again, PANTHER puts these four values into Fisher's exact test to get the raw p-value that is here. So I hope that after this lecture you will look at the tables of any of these tools and try to find these elements.

We have finished the first part, which was the gene list enrichment. As a last note, we are looking for over-enrichment of pathways; in rare cases people look for under-enrichment of pathways, and that is possible by inverting the labels. Also, when you look at the description of the tool that you want to use, if it doesn't say Fisher's exact test, it might be named hypergeometric test, and some tools use an approximation of Fisher's exact test.

Now we are going to start the enrichment test for a ranked gene list. We said earlier that if you can make a ranked gene list, it is really recommended, to avoid arbitrary thresholds. What we are going to present is the tool GSEA, which uses a modified Kolmogorov-Smirnov test to calculate the pathway enrichment. Remember your rank file.
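As a reminder of how the rank file is built, here is a minimal Python sketch of the ranking score described earlier in the lecture, the sign of the log fold change multiplied by minus log10 of the p-value. The gene names and values are hypothetical; in practice the two columns would come from the edgeR or DESeq2 output table. The floor on the p-value is an assumption added so that a p-value reported as exactly zero does not break the logarithm.

```python
import math

# Hypothetical differential expression results: gene -> (log fold change, p-value).
de_results = {
    "GENE_A": (2.5, 1e-8),     # strongly upregulated
    "GENE_B": (0.1, 0.80),     # not significant
    "GENE_C": (-0.2, 0.50),    # not significant
    "GENE_D": (-3.0, 1e-10),   # strongly downregulated
}

def rank_score(log_fc, p_value):
    """sign(log fold change) * -log10(p-value)."""
    sign = math.copysign(1.0, log_fc)   # +1 for upregulated, -1 for downregulated
    p_value = max(p_value, 1e-300)      # assumption: avoid log10(0)
    return sign * -math.log10(p_value)

# Rank file: highest score (top upregulated) first, lowest score last.
ranked = sorted(de_results, key=lambda gene: rank_score(*de_results[gene]), reverse=True)

for gene in ranked:
    print(gene, round(rank_score(*de_results[gene]), 3))
```

The non-significant genes get scores close to zero and therefore end up in the middle of the list, which is the layout that GSEA expects.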
We were able to rank our genes with the top upregulated at the top, the non-significant in the middle, and the downregulated at the bottom of the rank file. Then we run GSEA, which is going to see whether each of our pathways is enriched at the top or at the bottom of our ranked list. GSEA gives us a p-value and a direction. The p-value assesses the probability that our pathway is enriched by chance only. With a small value, for example 0.025 here, it means that even if we reject the null hypothesis and say it's not random, we still have a 2.5% chance of making a type I error. The direction is indicated by the sign of the enrichment score. A positive enrichment score means that our pathway is enriched in upregulated genes, and a negative enrichment score means that our pathway is enriched in downregulated genes.

It was Mootha et al. who developed GSEA in 2003. They were studying diabetes, and what they found was the downregulation of the pathway called oxidative phosphorylation. What was interesting is that the genes contained in this pathway were downregulated, but only by a very small amount, so none of these genes was significantly downregulated individually. But when they summed up all these genes, they could find that the pathway was strongly downregulated. That could be captured by the GSEA method and was further validated.

Now we are going to see how the GSEA running sum and enrichment score are calculated. Our rank file is here, from top upregulated to top downregulated with the non-significant genes in the middle. We place our rank file horizontally, so we have the upregulated genes on one side and the downregulated genes on the other. We are testing this pathway, and the black bars here indicate the genes that are in the pathway and where they are in the ranked file.
It goes gene by gene. The running sum starts at zero for gene one, and then you go to gene two, gene three, gene four. Each time a gene is in the pathway, the running sum increases a bit, and if the gene is not in the pathway, the running sum decreases a little bit. To form a peak like this, you need to have a lot of genes that are in the pathway. The running sum reaches a peak here, which we call the enrichment score. We also have a weighting system: the genes on the left or on the right of the rank file have more weight than the genes in the middle. That's because we don't want these non-significant genes to count in the calculation of the enrichment score. So we can have a peak on the right and a peak on the left, but not a peak in the middle.

Here we have a zoomed image that shows the running sum, which starts at zero. We have a gene here, the running sum increases; no gene, it decreases. We have another gene, it increases, and here we have no gene, so we see the running sum decrease a little bit and then increase again. You have to have a high density of genes from your pathway to increase the running sum. The pathway can be enriched in upregulated genes, so we will have a positive enrichment score, but it can also be enriched in downregulated genes, and in this case we will have a negative enrichment score.

Now we need to go from the enrichment score to the p-value. We need to estimate whether the enrichment score is larger than the one that could have been obtained by chance only. The GSEA method uses permutations to calculate the p-value. In the case that we use most of the time, the permutation is done by replacing the genes in the pathway that we are testing by random genes, and we do this 1000 times. So we get 1000 random pathways.
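The running sum and the gene permutation described above can be sketched as follows. This is a simplified Python illustration, not the GSEA implementation: the rank file is a made-up list of 200 genes with linearly decreasing scores, the pathway is hypothetical, and the weighting (each hit steps up in proportion to the absolute score of the gene) is one common way to express the idea that genes at the two ends count more than genes in the middle.

```python
import random

def enrichment_score(ranked_genes, scores, pathway, weight=1.0):
    """Signed maximum deviation from zero of a weighted running sum."""
    hits = [gene in pathway for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(scores[g]) ** weight for g, h in zip(ranked_genes, hits) if h)
    running, best = 0.0, 0.0
    for gene, is_hit in zip(ranked_genes, hits):
        if is_hit:
            running += abs(scores[gene]) ** weight / hit_total  # step up, weighted by score
        else:
            running -= 1.0 / n_miss                              # small step down
        if abs(running) > abs(best):
            best = running
    return best

random.seed(1)
# Hypothetical rank file: 200 genes with scores going from +10 down to about -10.
genes = [f"g{i}" for i in range(200)]
scores = {g: 10.0 - 0.1 * i for i, g in enumerate(genes)}
pathway = set(genes[:10]) | {"g50", "g120"}   # mostly at the top of the rank file

observed = enrichment_score(genes, scores, pathway)

# Gene permutation: 1000 random pathways of the same size as the tested one.
null = [enrichment_score(genes, scores, set(random.sample(genes, len(pathway))))
        for _ in range(1000)]

# Empirical p-value, using only random scores with the same sign as the observed one.
same_sign = [s for s in null if (s >= 0) == (observed >= 0)]
p_value = sum(abs(s) >= abs(observed) for s in same_sign) / max(len(same_sign), 1)

print("enrichment score:", round(observed, 3))
print("empirical p-value:", p_value)
```

Because ten of the twelve pathway genes sit at the very top of the list, the running sum climbs quickly to a positive peak, and random pathways of the same size essentially never reach it.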
But there is another permutation method that consists of shuffling the samples before creating the rank file. So basically we create 1000 random rank files, which takes the dependency between genes into account. For each of these methods, we look at the observed enrichment score and how far it is from the mean of the null distribution, which should be close to zero. Then we can calculate the empirical p-value by counting how many times a random score was at least as extreme as the observed enrichment score, divided by the number of permutations that we did. So that was it for GSEA.

Just a note that there is another test, the Wilcoxon rank-sum test, that we can use when we have a ranked list. It's a nonparametric test and it only considers the rank of the genes. For example, on this graph we have the log fold change on the x-axis and the gene rank on the y-axis. We have the global distribution in blue, which is the null distribution, and in red we have the pathway that we are testing. As you can see, there is a shift toward the right, meaning toward higher log fold changes. It means that this pathway is enriched in genes that are upregulated in our experiment. This one is an output from PANTHER, which uses the Wilcoxon rank-sum test.

In summary, we have seen that we use Fisher's exact test to calculate the enrichment p-value for a defined gene list, and we can use GSEA or the Wilcoxon rank-sum test to compute the enrichment p-value for a ranked gene list.

The last part of the statistics is multiple test correction. In the examples that we showed you, we were testing one pathway. But in fact, we are testing all the pathways that are contained in the database, all at the same time. Because we are trying to test as many pathways as possible, we also need to correct for multiple hypothesis testing. We are going to go back to our example of red and black genes.
Our background is 5000 genes containing 500 black genes, and all the other ones are red. The result that we got was four black genes, and we said it's difficult to obtain; I think the p-value was 0.001. It's very unlikely to get this result, but only if we try once. If we really want to get this result and we try again and again, let's say 1000 times or maybe 10,000 times, we are going to get this result with four black genes and one red gene. So even if an event is unlikely, if you try multiple times, you may be able to get it. That's what people mean by multiple hypothesis testing, and that's why we need to correct for the number of tests that we are making. If we don't correct for multiple hypothesis testing, we are going to generate type I errors, also named false positives. So we need to correct for the number of pathways that we are testing.

Intuitively, then, what you could do is multiply the nominal p-value that you got by the number of tests that you've done, that is, the number of pathways that you have tested. This actually exists: it is called the Bonferroni correction, and the corrected p-value will always be at least as large as the original p-value. So if my original p-value is 0.01, maybe the corrected p-value is going to be 0.05. When we use the Bonferroni correction, we say that we are controlling the family-wise error rate. It means that when we select the set of pathways with a corrected p-value below 0.05, the probability that any one of them is a type I error is at most 5%. This correction is very simple, but it's extremely stringent. It could be that you do this correction and none of the pathways passes the threshold of 0.05. But there is another method that is widely used, and it's called the false discovery rate.
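Before moving on, the multiple testing problem and the Bonferroni correction described above can be illustrated with a short Python sketch. The nominal p-values are hypothetical, the function names are made up, and the first function assumes that the tests are independent, which is a simplification for pathways that share genes.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """Chance of at least one p-value below alpha among n_tests
    independent tests when nothing is truly enriched."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

# The more tests we run, the more likely we are to see an "unlikely" result.
for n_tests in (1, 10, 100, 1000):
    print(n_tests, round(prob_at_least_one_false_positive(0.05, n_tests), 3))

nominal = [0.0001, 0.004, 0.02, 0.03, 0.2]   # hypothetical nominal p-values
corrected = bonferroni(nominal)
significant = [p for p in corrected if p < 0.05]

print(corrected)
print("pathways passing 0.05 after correction:", len(significant))
```

Four of the five nominal p-values are below 0.05, but only two survive the correction, which shows how stringent this method is even with a handful of tests.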
The false discovery rate is the expected proportion of the observed enrichments that are due to random chance. If we set the threshold to FDR 0.05, what we say is that, among the pathways we select, we expect a proportion of 5% to be due to random chance. The method that we are going to show you today is the Benjamini-Hochberg method, and the result is a q-value. We are going to see it in an example because it's easier to explain, but I will just mention the steps here. To calculate the FDR, we first sort the p-values of all the pathways in increasing order. Then we calculate the adjusted p-value by multiplying the p-value by the number of tests, the same as in the Bonferroni correction, but divided by the rank. Then there is an additional step to assign the q-value from the adjusted p-value, which I'm going to show you because it's easier to understand. Then we get the q-value and we can select the pathways with FDR below 0.05.

In this toy example, we've tested 53 pathways. We have here the nominal p-values ranked from the smallest to the largest. Now we calculate the adjusted p-value: the p-value multiplied by the total number of pathways we are testing, 53, divided by the rank of the pathway in the ordered list. We get this adjusted p-value, which is already larger than the nominal p-value. What is most important on this slide is that you see that we multiply by the number of tested pathways. So the more pathways you test, the larger the adjusted p-value is going to be, and the more pessimistic this value is going to be. The more pathways we are testing, the more we need to correct for that.

The next step, as I said, is to assign the q-value from the adjusted p-value. For that, we start from the bottom. Here the column is empty and we want to assign the q-values.
What we need to do is to look at the adjusted p-value in the same row or in the row just below. Here we don't have a row below, so we just put 0.99. Then we go up and we want to assign the next value. We look at the same row, 1.004, or at the row below, 0.99, and we choose the smallest value. Here the smallest value is 0.99, so we put 0.99 in the table. And we proceed the same way until we get to the top. So now we have the FDR. We would select the top pathways because they have an FDR Q-value less than 0.05.
So now we have learned how to do the correction. For sure you don't need to do it by hand in your analysis because it's already computed by the tools. Or if you want to compute it in R, there is a function for it. But it's good to see once how it is done and, really importantly, to remember that you multiply by the number of pathways that you are testing. As we just saw, the more pathways we are testing, the more we have to correct our p-values. So one way to decrease the number of pathways that we are testing, and be less pessimistic in our corrected p-values, is to filter the pathways by size. For example, we can remove pathways that contain fewer than 10 genes, or more than 500 genes, or more than 250 genes, because large pathways are sometimes very generic and not very informative. That's what we are going to do when we use GSEA. Again, if you want to learn more about the FDR, I would recommend this video from StatQuest.
So in summary, what did we learn from this lecture? We learned that there is a typical output for enrichment analysis. All the tools that we are using give us a table with the list of pathways that we tested. What we can look for is a column that indicates a value related to the size of the overlap, then a p-value that assesses the probability that the pathway is enriched by chance only, and, very importantly, a corrected p-value.
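
The Benjamini-Hochberg steps walked through on the toy table (sort the p-values, multiply each by the number of tests and divide by its rank, then go from the bottom to the top keeping the smaller of the current adjusted value and the value below) can be sketched in Python as follows. This is a minimal illustration and not the code used by any of the tools: the function name and the input p-values are hypothetical, and capping the Q-values at 1 is an assumption of this sketch.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg Q-values, in the same order as the input."""
    m = len(p_values)
    # Indices of the pathways, sorted by increasing nominal p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0  # assumption: Q-values are capped at 1
    # Start from the bottom of the ranked table (largest p-value) and go up.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        adjusted = p_values[idx] * m / rank
        # Keep the smaller of this row's adjusted value and the row below.
        running_min = min(running_min, adjusted)
        q_values[idx] = running_min
    return q_values


# Hypothetical nominal p-values for four pathways.
nominal = [0.01, 0.04, 0.041, 0.5]
for p, q in zip(nominal, benjamini_hochberg(nominal)):
    print(p, round(q, 4))
```

In this small example the second pathway has an adjusted value of 0.04 × 4 / 2 = 0.08, but the row below it has a smaller adjusted value (0.041 × 4 / 3, about 0.055), so the bottom-up step assigns it the smaller one as its Q-value.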
It is also worth mentioning that many of the pathways in such an output table can have functions that are very, very close to each other. In this case, they contain genes in common, but we cannot see it as such in the table. That's why we do a network visualization, to get a better view of this output table.
We are presenting two tools, but many other tools are available. They can be web-based tools, Cytoscape apps, standalone applications, or they can be included in some R packages. So how are you going to choose a tool? Here we put a list of questions that might help you. The first is: does it cover your model organism? Is there a good choice of pathway databases, for example Gene Ontology or Reactome? Are the pathway databases up to date? And now you will be able to understand which statistics they use: is it for a defined gene list, for a ranked list, or for both? Is the description of the statistics clear enough? Do you like the output style? And can we connect it with network visualization tools like Cytoscape?
So we have several tools for enrichment analysis, and G-Profiler is the one that we have chosen, and you will see why. G-Profiler has an up-to-date database. It has a good choice of databases like Reactome, GO, and other ones, and we can test all these databases together rather than one by one, so it increases the power. It has a lot of model organisms that we can choose from, and we can also upload our own gene set database, our pathway database. It uses Fisher's exact test because it's for a defined gene list, but there is an option to do an ordered query. And we can upload and restrict our background, which is very important for a defined gene list. It's available as a web-based tool and as an R package, and we can connect it with Cytoscape and EnrichmentMap. For a ranked list, the tool that we are using is GSEA, which uses a modified KS (Kolmogorov-Smirnov) test. It corrects for multiple hypotheses.
For the pathway database, we can use the one that they maintain, MSigDB, but it's for human. But what is very good is that it's very generic, because we can upload our own custom database, and we can connect it with Cytoscape and EnrichmentMap.
Okay, so as a summary, what is the recipe for a defined gene list enrichment test? First we define the gene list and the background, then we select the pathways, then we run the enrichment test using Fisher's exact test, we correct for multiple testing to get our corrected p-values, and then we select the significantly enriched pathways to interpret the data. For a ranked gene list, the steps are very similar. First we rank all genes from our experiment; we don't remove any. We select the pathways, we run the enrichment test using a rank-based sum test, then we select the ones with FDR less than 0.05, and we interpret the results.
On the last slide, some topics that we did not cover are the issue of correlation between gene sets and dependency between genes, and some recent tools that are also topology-aware. If you want to know more about these topics, they are included in the protocol paper that was referenced in the pre-work. And that's it. So we went through the statistics, and I hope you are still here. As final tips, what I can say is to be precise at each step of your analysis, even at the earliest steps when you clean and analyze your data, and try to answer one biological question at a time.
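
As a recap of the defined gene list recipe, here is a small Python sketch of the first steps: define the gene list and the background, filter the pathways by size, and compute a one-sided Fisher's exact test p-value for each pathway (computed here as the upper tail of the hypergeometric distribution, which is equivalent). Everything in it is hypothetical and for illustration only: the gene names, the three pathways, the function names, and the size limits of 10 and 500 genes, which follow the example given earlier in the talk. The nominal p-values it prints would still need to be corrected for multiple testing before selecting pathways.

```python
from math import comb


def filter_by_size(pathways, min_size=10, max_size=500):
    """Keep only pathways whose number of genes is within the size limits."""
    return {name: genes for name, genes in pathways.items()
            if min_size <= len(genes) <= max_size}


def fisher_enrichment(gene_list, pathway_genes, background):
    """Return (overlap size, one-sided p-value) for over-representation
    of a pathway in a gene list, restricted to the given background."""
    universe = set(background)
    genes = set(gene_list) & universe
    members = set(pathway_genes) & universe
    N, K, n = len(universe), len(members), len(genes)
    k = len(genes & members)
    # Probability of an overlap of k or more genes by chance alone.
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return k, tail / comb(N, n)


# Hypothetical background of 100 genes and a gene list of 5 genes.
background = [f"g{i}" for i in range(100)]
gene_list = ["g0", "g1", "g2", "g3", "g50"]
pathways = {
    "pathway_A": [f"g{i}" for i in range(10)],      # 10 genes, 4 in the list
    "pathway_B": [f"g{i}" for i in range(40, 60)],  # 20 genes, 1 in the list
    "pathway_C": ["g0", "g1", "g2"],                # too small, filtered out
}

results = {name: fisher_enrichment(gene_list, genes, background)
           for name, genes in filter_by_size(pathways).items()}
for name, (overlap, p) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(name, overlap, p)
```

Restricting both the gene list and the pathway to the background is a design choice of this sketch; it reflects the point made above that the background matters for a defined gene list.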