 So, we will look at differential gene expression analysis and also enrichment analysis. So, of course, once we have identified different cell types, now we are happy with our clustering. We have decided which clusters we want to keep. We want to start seeing which are the genes that are differentially expressed between different conditions, but also maybe identify marker genes that are more expressed in one cell type compared to the other. And then once we have a list of genes, we want to start to annotate them and see what functions they're involved in. So, we'll do enrichment analysis. So, first, for the marker gene identification, the implementation in Surat is used towards a function which is called find all markers. And this, by default, will use a Wilcoxon test to detect genes that are markers for cell clusters or cell types. Basically, what you will do, so, for example, you have five clusters of cells in your dataset or five cell types, you will use find markers to find the marker genes that are more expressed in one of the clusters against all other cells. So, for example, if you're interested to see what are the genes that are more expressed in cluster zero compared to the rest of your dataset, you can use this find all markers function. So, it's really taking one cluster against all other cells. And it does this for all of your clusters, so cluster zero against all other, cluster one against all others, etc. You will see if you look at the help that you have the option to only output the genes that are upregulated in one cluster or to include both. So, of course, you can also have genes that are less expressed in one of the cluster compared to the rest of the cells. Then, there is another function which is very similar. It's called find markers. And this is to perform pairwise comparison of genes. So here, it's not taking one cluster against all others. It's taking one against another, so specifically two clusters. It's really a pairwise comparison, so between cluster one and clusters two. Here the default is also using the Wilcoxon test, and there are other possibilities. You can have a look at the help to see what other methods are implemented. So, as we have seen already. Can I just add something? Yes. You can also find markers compare one cluster versus three others, for instance. Yes. But it's two comparisons that you will do. Exactly, yes. So every time you, for example, if you want to compare cluster one against cluster two and cluster one against cluster three, you will have to run two comparisons. So running these comparisons twice. That you can do. It's just for you to know it's only pairwise. So we have seen already yesterday and the day before that for a single cell data analysis, every time you want to do a task, several options are available. And of course, this is also the case for differential gene expression analysis. And this is one paper that it can be quite interesting. It is from 2018, but I think it's still valid. And here it's an evaluation of 36 methods that can be used for differential gene expression analysis in a single cell data. So if you're familiar with bulk RNA seek data analysis, at the beginning of the single cell data generation, people would use methods that were designed even for micro rays like Lima or for bulk RNA seek like the SEC applied on single cell RNA seek data. And then there were some new methods developed specifically for a single cell data. And in this paper, the authors compared actually these methods. So using the bulk methods and also comparing some of the single cell methods and see which could be, you know, the most appropriate. And I think they used simulated data to see how the methods would recover the truly differentially expressed genes. And one of the interesting figures in that article is this one. And here on the rows, we have all the different methods that they tested. And they had several metrics here in the columns that were evaluated for each method. So I'm not going to discuss them all, but we'll have a look maybe at some that could be interesting like scalability or speed or complex design. So here some of the methods maybe are familiar to you. So for example, the T test, and this can be used in any analysis of any type of data. So not only bulk RNA seek or single cell, also any type of data. And we can see that in all of the metrics, it performs quite well, except for complex design. And that means that when you have a T test, it's really a pairwise comparison. So like you have two groups, control and treated, for example, and you want to compare. In this case, it will be totally fine. You could use a T test. On the other hand, if you have a more complex design, like suddenly you will have batch effects, or you have like, I don't know, for example, five different experimental groups where you had different combinations of drugs, for example, a factorial design. Then of course, T test is not going to be applied on a more complex design. There are other methods that perform very well. And so for example, HR. HR is a method also designed for bulk RNA seek data. And within HR, you have several methods of analysis. And here would be the one that they found as the top ranking. And there is another one also masked. I think that was developed specifically for single cell RNA seek. And then there is another tool, which is called Lima, which was designed at the beginning for microarray data analysis. And it's also used a lot for bulk RNA seek analysis. And one that I often use, for example, is the VUM Lima method. And we can see that it's ranking at the top here. So during the exercises, you'll be able to test VUM Lima. And in terms of scalability, so this is something also important to take into account that some methods do not scale very well. So the more cells you have, the slower it will be and also for the speed. The nice thing about this paper is that the authors made available all of the code to run all of these methods. So if, for example, you would want to test the HRQLFD method, you could have a look at the GitHub repository for that paper. And you have the R script that you can take the code from and apply it to your data. So it's really a nice resource if you're not familiar with these packages. For example, also DSEC2, we use a lot for bulk RNA seek. You could also have a look at the methods there in these R scripts. All right, so describing the figure of the paper, I described these Lima or HR packages. So as I said, these are methods that were designed for microarray or bulk RNA seek. And the nice thing is that we can include complex designs. So if we have a batch effect, for example, we can include it in our model as a covariate. And this we will test in the exercise, including the donor ID as a covariate using Lima. Of course, you can use also these packages to analyze factorial design. So if you have a combination of genotype per treatment, you can easily use Lima and specify the interaction, for example, in the model. So here is another approach at looking at differences in gene expression across different cell types, for example. And this is called a meta cell. And so here you have the link to the paper. And here they applied, for example, to CD8T cells. So here what it does here, we have a dimensional reduction image of some immune cells that were classified according to their different functionalities, or like if they're dysfunctional or if they're more memory, et cetera. And the idea is that the authors wanted to see if there was a relationship between an expression score for a cytotoxic pathway, comparing it to the expression score of a dysfunctional pathway. So if you try to calculate the score, as we did yesterday with addModuleScore and create a scatter plot with XY axis, since the single cell data is sparse, for example, you probably will see quite a lot of noise if you look at that at the single cell level. And even more if you plot one gene against the other. So if you're interested in checking what could be the correlation of expression between gene A and gene B, if you create a plot like that, you will see a lot of noise. So to sort of reduce that noise and trying to obtain better correlations that are a bit cleaner, the idea here is to aggregate several cells into what is called a metasell. So for example, you take like 20 to 50 cells and you create like an aggregate of their gene expression. And so the noise that is within each will be sort of corrected for. So here, this is what we have here represented. So here, every dot is like the aggregate expression of several cells. And you do that for your different cell types. So you have these dysfunctional in the dark purple and light purple, transitional and the yellow cytotoxic. So you can see now that you have kind of a much cleaner inverse correlation between these two scores and like the transitional or in the middle. So this can be a nice approach to sort of clean up your your correlations and have clearer views. So that's another approach that you can use for looking at different differential gene expression or different scores among yourselves. So now many people use, for example, find markers to look at differential gene expression between a control group and a treated group. So let's say here we have healthy donor A and a patient. So we have two conditions, healthy and patient. And we have found that we have one cell type X of interest. So patient A has 98 cells and patient A here. Well, healthy donor A has 98 cells and the patient has 105 cells. And so maybe sometimes you can try to look at differential gene expression analysis between healthy and patient by using find markers, using all these cells against all these cells. But imagine that the fact that all of these cells come from the same patient, they're actually not really independent. So you have to ask yourself the question, how many independent replicates do you have? And this is something that is considered a lot in more like ecology. So for example, if you would like to compare the number of species between forest and grassland ecosystem, if you choose one forest and one grassland and sample several spots in each, and so you just run a statistical test to compare the species between forest A and grassland. So forest A, you think you have three replicates and grassland you have four. But actually how many real independent replicates do you have? And in this case, forest A should rather be considered as one replicates and grassland A should be considered as one biological replicates. And so you go to another site and then you compare again another forest to another grassland. And of course, you can evaluate many spots within each ecosystem. But conceptually, each site is a different independent replicates. And so we can compare donors or individuals like individual mice as being the actual independent replicates. So in this case, instead of having like 200 cells of each, we actually have two independent replicates in each condition. And so what we will do to compare the differential gene expression between healthy condition and patient is to aggregate the gene expression for each patient to have one representative value for each independent replicate. And this can be done using a package which is called MOSCAT, which will do pseudo bulk analysis. So here we have our healthy donor A. We found 98 cells of cell type X and we create a pseudo bulk of the gene expression profile for the cell in this donor. So now we have one replicate and we do the same for the healthy donor B where we found 87 cells, for example, of cell type X. We have one replicate for this. So now we have two independent replicates for each condition. And if you have several cell types in your experiments, so in each patient, you didn't only evaluate one cell type, but for example, all these. So you had B cells, monocytes, et cetera. So here we have two conditions, control and stimulated, for example. So you will obtain one pseudo bulk per patient, per cell type, per condition. So here, for example, you can see that we have these FCGR3A monocytes that were aggregated. We had several patients. So for example, for the stimulated here, we had one, two, three, four, five, six, seven, eight donors. And then we have the same here in the control. So the same cells and also one pseudo bulk for every cell type for every donor. And after you obtain these aggregated gene expression per cell type per donor, you can use, for example, dimensionality reduction. So this is something similar to a PCA, for example. And then you can start to see if your conditions sort of diverge or not and how the cell type, where they're located in the dimensional reduction space. Then you can, of course, perform differential gene expression analysis, taking each donor as the independent replicate. And here we have an example, for example, of two genes that were differentially expressed in one of the cell types between control and stimulated. And of course, then you are free to go back to your single cell data and create a violin plot of these two genes within each cell. So here is going back to the single cell representation, showing in the single cell how this gene is distributed across donors, across the single cell of each donor in each condition. And you can, of course, create a heat map also at the donor level. So here we have some of the genes that were differentially expressed between the conditions and you have control and stimulated for one of the cell type, for example, for B cells. And these sort of figures you can create using this package, MOSCAT, and also the aggregation of the, so the creation of the pseudo bulk for every cell type, for every patient you can do with functions of the MOSCAT package. All right, so that was it for the background. So the question is to determine whether some genes are differentially expressed among cells coming from any of three groups, such as cells from newborn mice, middle-aged mice or old mice, which functions or methods would you use? So I haven't mentioned the F test, but this is basically like an ANOVA that you can do when you have more than two groups that is implemented in Lima package or defined markers function from Surat, which does pairwise comparisons, or the T test. One more second. Yeah, good. I'm glad to see the message mostly came through. So basically, if you have three groups, if you're working in a setting outside of single cell RNA-seq, you would probably use an ANOVA or an F test. And you could, if you have bulk RNA-seq, for example, or in this case single cell, you could use F test implemented in Lima, which is like an ANOVA, which will tell you for every gene if it is different in any of the three groups. So after you run an ANOVA, you could run a post-hop test to see which groups differ in terms of gene expression. If you use find markers, so that could also be an option, but you would have to compare one group against each other. And when you do that, you may be overestimate. So you don't correct completely the p-value because of the multiple comparison issue. And when you do an F test implemented in Lima, for example, this is already dealt with usually. With the find markers, you would have group one against group two. And then you adjust the p-value only within this comparison. So maybe you would have to further manually adjust for the multiple comparisons that you're doing. And then finally, a T test you cannot really use because here we have three groups. And T test could only be used for comparing newborn to middle-aged. But again, that's not what we want. We want to see if any of the three groups differ. All right. So once we have identified differentially expressed genes, what do we do? So I'm sure that if you have performed any bulk RNA-seq or even single cell RNA-seq before, you always end up with a long list of differentially expressed genes, sometimes hundreds of genes. And maybe you know what a few of them do. But how do you know what all of them do? So the idea is to gain biologically meaningful insights from long list of genes. And this is where we do enrichment analysis. And of course, several methods are available. And two that I will discuss are the overrepresentation analysis or ORA and the gene set enrichment analysis or GSEA. You can have different approaches to using these two methods. So first is if you have a particular expectation of what should be going on in your system. So for example, if you knock out a transcription factor that is involved in a pathway, you could have a look specifically if the genes that are involved in this pathway are enriched in your list of differential gene expression analysis or not. So you could just go ahead and test if this particular condition is enriched. The other approach is if you don't have an expectation and just want to see functionally what your differentially expressed genes are doing, you could test a large number of gene sets to see if you find interesting functions or new functions that you didn't expect. And you could test these gene sets or pathways that you found in collections online. And I will show you some ideas. So what is a gene set? And it can be, of course, many things. It can be the genes working together in a pathway. So here we have the citric acid cycle. And at each step, you have an enzyme involved. And so, of course, you can see if the genes that code for these enzymes are differentially expressed or not in your system. So, of course, a gene set could be all genes involved in a pathway. Or it could be all genes located in the same compartment in a cell. For example, you want to see if all the proteins that are usually located in the nucleus could be differentially affected by your condition. You could see if the proteins that are regulated by a same transcription factor are affected. So again, if you knock out the transcription factor and if you have a list of target gene of this transcription factor, are these differentially expressed or not? Then you could, of course, compile your own gene set by using a custom list that you found in a publication. So let's say you found an interesting publication of genes downregulated in a mutant. You could see if in your setting, you also see an enrichment for these genes in your system. If you're working on diseases, you could, of course, look if a list of genes that contain a SNP associated with this disease are also differentially expressed, et cetera, et cetera. So this is really not like a fixed thing. It's not just the gene that are involved in pathway, but it's any list of gene that do something together or are linked in some way. And in terms of where to find gene sets, if you don't have any as previous knowledge, you can find some that are grouped in knowledge bases or in databases online. One such example is the gene ontology. So it's a consortium who tried to develop a comprehensive model of biological systems. So you can find gene sets that are linked to the molecular functions at the organism level and they try to include a multiplicity of species in the tree of life. It's one of the I think it's the biggest knowledge base in terms of information on the function of genes and it is organized in what they call different ontologies. So different type of processes. One is called the biological processes. So here it's more linked in terms of maybe a bit more pathways and a function of genes. So specifically biological processes in the cell. Then you have the cellular components. So more where the different proteins and gene products are located in a cell. And then you have the molecular functions also. So you could have genes that have a similar like DNA binding domain, for example. And all of these genes that have this similar binding domain or sort of a similar enzymatic functioning, they would be grouped in a gene set of all of these genes under the molecular functions ontology. So the gene ontology gene sets are sort of organized hierarchically. At the top, for example, here is one sort of tree. You have one gene set that is called metabolic process. And for every gene set, you have what is called a go term. So it's like a number or an identity. So you have the name. It's metabolic process. And here is the the ID. And since it's at the top of the hierarchy, this gene set is very big. So you can have like 2000 genes, for example. So at the top of the tree, the gene sets are very big. And as you go down, they become more and more specific and specialized. So here at the bottom, we could have a very small gene set. Maybe could have like 20 genes. And it's very specialized and specific. And it's called hexos biosynthetic process. And it has its name and also its ID. And then you can see, so it's not always exclusive. Some gene sets can share genes with several other gene sets. And this creates the fact that some of the gene sets are a bit redundant. So you can have overlap of gene content among different gene sets. So if you run an enrichment analysis and find several of these gene sets that are significantly enriched, then it's probably because some of the genes that are differentially expressed are shared among several of these Go terms. So other sources of gene sets, if you don't have any previous expectation or want to browse a bit what type of functions you have enriched in your data set. Here you can find some online resources. So there is the molecular signature database for MCDB. It contains several types of gene set lists and collections. So it includes access to the gene ontology gene sets. But it also contains, for example, Hallmark. And these are 50 gene sets that were compiled by the authors of the gene set enrichment analysis method that I will describe a bit later. And these are sort of well-known gene sets. I would say they mostly apply to human and mouse. But they're well-known gene sets and well annotated and curated. So it can be interesting to have a look at these. And then in MCDB, you also have access to published gene sets. So you can download a list of gene sets that come from publications that are all linked to, for example, oncology or other types. I think now they have COVID also, but I'd have to check. So every time you have a publication that describes some genes that are involved in oncology, for example, it can be added to this collection. Then there is the KEG database. And this one mostly is linked and includes mostly pathways as we understand them. Like, for example, the citric acid cycle or other type of pathways. And one thing I like is that they have these images of how the genes are organized. So maybe we can have a look if I show you this. So the good thing about KEG is that it doesn't only include human and mouse, but you can have a whole list of different organisms. And if you want to search in KEG, you need to have an abbreviation of your species. So for example, if you're working with Arabidopsis, for example, you have it included here, Arabidopsis, two species or other species are also present. So you even lemur and other species. Let's have a look at human still. So for human, you have HSA, which is the abbreviation in KEG. And then you can, for example, type, let's see if we can find the cytokine pathway. Yes. For example, a signaling pathway in human. So you can have access to these maps here and see what are the genes that are upstream, what are the genes that are downstream. So I think it's quite interesting to have a look at these KEG pathways if you're not familiar with and it can give you some ideas whether they would be interesting for you to use or not. So this you can actually access also directly the list of all of these KEG pathways for human, I think, in MCDB. Then there is another database which is called the Reactome and there they try to classify the genes in terms of their, for example, reactions and thematic reactions or other type of reactions and how they interact. So each of these collections are compiled by a consortium of people that work towards one type of classification of the genes. So Reactome is one and there is the wiki pathway and wiki pathways is a bit like Wikipedia where anyone can suggest a pathway and make modifications. So it's more of a public effort to try to put genes together in terms of pathways. So these are a few that you can access directly from the MCDB web page. So here I would like to describe how the enrichment analysis works and the idea is that, so here we can have for example several donors that are healthy and some patients and we measure the expression of several genes and then we have some genes that are part of a pathway that we're interested in. We call it the blue gene set and we want to see if these genes are enriched in our differential gene expression results or not. So one option could be to sort these genes based on full change and so we have some genes that are down-regulated and others that are up-regulated and then we can see whether our blue genes are rather located at the top or at the bottom of this sorted list and this will tell us if the genes are enriched at the bottom or at the top that means maybe our pathway is enriched. So for the enrichment analysis method that's the overrepresentation analysis we actually look at the proportion of genes that are actually differentially expressed or not and that are part of our pathway or not and then we use a Fisher's exact test that here is an example of how it works in R. So there is an R function which is called Fisher test where you could run a Fisher test for any type of data as long as you have a count matrix. So what you have to count here is an example of this little code here. We create a contingency table of the counts of the differentially expressed genes so here we have two that are differentially expressed and that are part of our blue gene set of interest and five that are differentially expressed but they're not part of our blue gene set and then we have some genes that are not differentially expressed that can also be part of the blue gene set and 12 that are not differentially expressed and not part of the blue gene set. So the idea of the Fisher test is to see if this proportion two out of seven is different from this proportion three out of 15 and here if we run the Fisher test on our contingency table this is the output that we will obtain. So first the odds ratio will tell you whether this proportion is bigger or smaller than this one. So if it's higher than one then this proportion is bigger if it's lower than one this proportion is bigger. Then you can have a look at the confidence interval for your odds ratio and here we see that the confidence interval includes one. So basically we can expect that our odds ratio is not significant and indeed if we have a look at the p-value it's one so basically here we don't have an over representation of our blue gene set among the differentially expressed genes. So this is how it works in R. So in the example that I just showed we were testing if one gene set of interest the blue one was over represented in the differentially expressed genes but of course if you decide to test a collection of gene sets for example all of the pathways that are included in the CAG database then you could run individual Fisher exact tests for each gene set. So you would run one for the green gene set one for the pink gene set one for the blue etc. Once you obtain a p-value for every gene set since you run multiple tests you need to adjust for the multiple testing issue and so you adjust the p-value. This you will see when we do in the exercises we use a package which is called cluster profiler which has a function that allows you to run multiple Fisher tests for several gene sets at once. So basically the way that you run this analysis is by providing the list of differentially expressed genes that you had and the list of genes that were not differentially expressed and then you provide your collection of gene sets that you want to test and in the output you will obtain a list of all of the gene sets that were significantly enriched and the p-value will already be adjusted for the multiple comparisons. So you will see it's quite easy to do. One limitation of the Fisher test is that it's threshold based so basically what I mean by that is that each gene is classified as either a sort of black and white classification. The gene is either differentially expressed or not and this doesn't take into consideration the magnitude of the full change. So here we have three genes that were classified as being down-regulated but this gene is much more down-regulated than gene number nine for example and the same for the up-regulation. So there is another analysis that actually allows you to take into account the magnitude of the full change of the genes that are differentially expressed and this is the gene set enrichment analysis method that I will discuss just afterwards. So I just already mentioned this cluster profiler package so of course it is one possibility among several other packages to do enrichment analysis but why I like it is that these functions are more or less easy to use and it has also a nice tutorial online so you can see have a look at the vignette and it has a built-in access to gene sets for human, mouse, yeast and even other organisms and you can have access to gene ontology for example and keg. It also has a visualization function so once you have generated an enrichment analysis using cluster profiler the object that you create you can use in functions to create like bar plots or dot plots of p-values etc directly with the cluster profiler package so it can do several things which is quite nice. The only maybe difficult part of cluster profiler is to understand the arguments of the functions. So I wanted to take a moment to explain a little bit what are the arguments of the cluster profiler functions. So this is the function of the Fisher test that's part of the stats package so it's the one I just described before you have many arguments possible and so the important one is X which is the contingency table that you have to provide and maybe the alternative to either if you have if you don't have an expectation of whether your proportion should be higher or lower you can run a two-sided Fisher test but if you have an expectation you can either choose greater for example and so these are maybe the two options that you can choose in Fisher test. Now for cluster profiler functions so here's an example of one of the function which is called Enrich Go and this allows you to perform an overrepresentation analysis so it's like a Fisher test implementation in cluster profiler and the format as you can see is quite different from the Fisher test function because you don't provide a contingency table you provide just the list of genes that were significantly differentially expressed in this gene argument then you have to give an argument which corresponds to the species with which you are working with and this ORCDB actually this will call a package which are called ORCDB packages on Bioconductor which are like gene annotation packages for several species so if you want to run cluster profiler for human or mouse or other species you have to install the ORCDB package in R first and then in this argument you just provide the name of your ORCDB package that has to be used to find the correct gene annotations then you have another argument which is called key type which will actually be needed to inform the function how the genes are labeled in our gene argument so a gene can be labeled according to a symbol but also for example ensemble ID or other type of IDs you can even have uni-prots if you work on proteomics you could have enzyme ID even and the idea here is to tell the function how your genes are labeled here so if you use a symbol then you can change this key type to symbol then for the ontology so I have described to you that the gene ontology is organized in three ontologies the biological processes the cellular components and the molecular functions so here is the argument where you can select which ones you want to test so for example if it's here it's the molecular functions that is selected then for the p-value cutoff this is the for the output actually of the results do you want to output only the significant gene sets or all of them so by default it outputs only the significant one because the p-value is at 0.05 but if you want to output all even the non-significant maybe that can happen you can change the p-value cutoff to one for example then I told you about this multiple correction step that has to be done once you test several gene sets at once and this is performed directly by the Enrich Go function and the adjustment method is the Benjamini-Hockberg one and this of course if you have knowledge about different methods you can change you could use for example the Bonferroni correction if you only test like five gene sets so that's one of the options then the universe so the function needs a way to know which are the non-significant genes in your data set so you provide the list of gene symbols that were not significant as the universe argument. Finally the minimum gene set size and maximum gene set size so in some collections of gene sets some gene sets can be very small so I think in the Go collection some are less than 10 so we're not necessarily interested in testing for the small gene sets so we put a limit on at 10 and the same for the maximum gene set size I told you that some gene sets in the gene ontology are very long they have thousands of genes inside and so maybe they're not that interesting because they're very general so we put also a maximum set size threshold of 500 but that's of course up to you to decide if you want to change these. Now there is another function which is called enricher and this arguments are very similar to the one I just described to Enrich Go but this one you can use for a list of gene sets that you defined on your own so if you compiled a list of gene sets from several publications and want to perform over representation analysis for all of your 10 gene sets for example you could use the Enricher function. The first arguments are the same as I just described and there is an additional argument that's called term to gene that is actually a data frame that contains the list of your gene sets so you have 10 gene sets each one will have a name maybe with the author of the publication where you found it and will also have for every gene set the list of genes that it contains so this is a data frame that is formatted as two columns that you have to provide to this function and then the output will be very similar you have the list of significant gene sets you'll have the p-value for each and etc so with cluster profiler you can do a visualization of your results so here you can see I have a small example where we performed an over representation analysis of the gene ontology gene sets so the we generate an object that's called ego so in the format of this object is specific to the cluster profiler package and actually you can directly use this object with some visualization function and here for example is a dot plot of our results and we show for example 10 significant gene sets and here is an example so you have the different gene sets and the dot and is colored according to the p-value so you can see which are the most significant among all then the size of the dot is proportional to the number of genes that were significantly differentially expressed and that were part of the gene set and finally on the x-axis there is something that is called gene ratio and it's linked to actually the proportion that these genes represent so out of it it's it's not exactly the same information because if you have 10 genes in a gene set of 100 genes it's not going to be the same proportion as if you have 10 genes in a gene set of 20 genes for example so here we have this proportion that is represented so 10 out of 100 or 10 out of 20 for example so you have this on the x-axis then another representation that could be interesting is the enrichment map and for example I told you in the gene ontology collection some gene sets are maybe a bit redundant and they overlap in term of gene content and if you create an enrichment map you can start to have a look at the bigger picture of the type of functions or gene sets that are significant in your data set so here again we run an overrepresentation analysis providing our differentially expressed genes and we use the emap plot function which will create this enrichment map here the link actually shows that these genes share gene content these gene sets share some gene content so all of these gene sets are sharing some genes so they might be all linked as we can see here to some sort of cancer gene sets and then we have some other gene sets here more viral or disease linked and so you can start to see sort of what's happening at the bigger picture so maybe cancer related gene sets disease related gene sets were significant in my differentially expressed genes and then you have some that are more unique this one for example doesn't share any genes with the others another tool that I put here is called Revigo and this is a website where you can actually enter a list of go terms so all of these gene ontology gene sets will have a go term associated to each one so you can put the list of go terms on Revigo and it will create also a map a bit similar to that but the approach is different it's it will do a dimensional reduction analysis of the list of gene sets that you provide and put them in a 2D representation according to their gene content similarity and also on Revigo once you have created the plot you can actually download the R code to regenerate the same plot in R and if you want to customize it with like changing colors etc you can do it easily so it's also a nice tool to to try out all right now I will describe the gene set enrichment analysis method so I will try to explain it so that you can understand more or less how it works this one can be performed if you have some sort of statistical value for all genes that were detected in your single cell RNA-seq data set so in the previous over representation analysis I showed you that you just provide the list of significant genes and that this method is threshold based so you just have a gene that is significant or not and you sort of ignore the full change this analysis the over representation analysis you can do if you for example use find all markers and then you have the marker genes of cluster A and you provide this to the over representation analysis to see what functions these genes are involved in if you perform differential gene expression analysis between two condition and you use for example lima or hr in the output of these tools you will obtain for every gene statistic so if we compare phenotype A to B like in this example we will obtain a t statistic for every single gene even the non-significant ones so if we have 20 000 genes in our data set that will be assessed we will have a t statistic obtained for every single gene then what we do is we take our 20 12 000 genes and we rank them according to their t statistic or a full change it could also be so here at the top we have the genes that have a high full change for a phenotype A that means they're up-regulated in phenotype A then you continue and maybe in the middle you have genes that have full change close or around to zero so these are the genes that are not differentially expressed between the two conditions and then at the bottom you have the genes that have a negative value for the t statistic or the full change so these are the genes that are down-regulated in phenotype A compared to B and in the gene set enrichment analysis what you do is that once you have ranked your list of 12 000 genes you will walk along this list and every time you encounter a gene that is part of your gene set S so gene set S could be a pathway you're interested in so let's say pathway S so here you walk along your list and if the first gene you encounter is part of this gene set you draw a black line and if it's not part of the gene set you don't draw anything so you leave it white and here we see that if we walk along here now we encounter two or a few genes that are part of this gene set so we draw a black line and probably you have seen in publications this sort of chart or series of black lines which is sometime called the barcode plot if you take this little series of black line and you put it horizontally so this is exactly the same and here you will calculate an enrichment score for that gene set S this enrichment score actually starts at zero and then as you walk along your list of sorted genes so you have your 12 000 genes here this green curve actually shows the T statistic or the full change value for example so here at the top we have the genes that are upregulated in phenotype A and then in the middle we have the genes that have a full change close to zero and then finally the genes that have a negative full change so the ones that are down regulated in A and the enrichment score starts at zero and every time you encounter a gene that is part of the gene set the enrichment score increases and this increase is proportional to the magnitude of this value here so that's why I was saying that the gene set enrichment analysis actually takes into account the magnitude of the full change in that are included in in your gene set of these genes that are included in your gene set so you do this little walk you increase and here we encounter many genes that are part of the gene set so the increase is quite rapid and at some point most of the genes that we encounter are not part of the gene set so if we don't encounter the gene the enrichment score decreases again up to like reaching here zero for example then you will determine what is the enrichment score for your gene set of interest s and the enrichment score will be the maximum deviation from zero so here let's say we reach 0.3 for example so the enrichment score for my gene set of interest will be 0.3 it can of course also go down so if the gene that are the genes that are part of the gene set are rather located at the end of my ranked list then the enrichment score will start decreasing and going below zero because every time I encounter a gene that is not part of the gene set my enrichment score decreases and then it could go back up if most of the genes of the gene set are located here and then my enrichment score will be the maximum deviation from zero like it could be around here and in this case the enrichment score will be negative so the sign of the enrichment score can tell you if the gene set is upregulated in your condition A or if the gene set is downregulated in your condition A so you have an additional information based on the sign of the enrichment score. The last concept of gene set enrichment analysis is what is called the leading edge subset and the leading edge genes are the genes that most strongly contribute to the enrichment score and these are the genes that are located before we reach the maximum of the enrichment score so here we reach the maximum deviation and all the genes that are located here these ones and with the black bars are the genes that are the leading edge genes and this if you perform an analysis with cluster profiler you will get the list of these leading edge genes so maybe just make sure you don't when you obtain this list of leading edge genes you don't confuse them with all the genes that are significantly upregulated or not because a positive well a significant enrichment can be found by gene set enrichment analysis for a gene set even though the individual genes if you look at them individually maybe they're not statistically significantly differentially expressed but the fact that they're all located at like the top of the rank list is enough to have a significant enrichment of that gene set so that is the advantage of the gene set enrichment analysis compared to overrepresentation analysis is the fact that you can pick up a signal for a gene set even though at the individual level the genes are not significant in cluster profiler so I don't think we will perform this in the exercise but if you're interesting to perform a gene set enrichment analysis of the go gene ontology collection there is the gsc go function the important part is that you provide a list of genes so you have to provide the t statistic or the full change values for all of your genes that you have included in your data set so if you have 12 000 genes you have to provide the full change for all of your 12 000 genes because it requires the information also from the non-significant genes to calculate the enrichment score so and this list of full changes for all of your genes has to be ranked previously manually before you use the function so remember that you have to do that the gsc go function will not rank your full changes automatically and then you have the gsc keg function so it's exactly the same principle it's just to test the keg collection and finally if you have your own custom gene sets so you have your 10 gene sets that you compile from 10 different publications you can run gsa of these custom gene sets using the gsa function again you have to provide your list of full changes that are ranked here if you're interested in checking how this enrichment score is calculated in detail you can have the description in this paper here which is also the paper that compiled the hallmark gene sets that you can find on the mcdb website there is a question which is is there a python equivalent for testing user defined gene sets as with the enricher function and the honest answer is i have no idea i will see if i can find something i never use python personally so i don't know but i'm sure there is i know there is a package i have to check once i was teaching enrichment analysis and one of the participants told me that they were developing a python package for visualization of results similar to revigo for example so probably there is a hurt already posted so good gsa i just quickly google tia tia perfect i was gonna look already pretty good also great thank you perfect yes if you if you need info google it google doctor google has an answer all right so the vbox question i'll do it after the exercises so if i if i forget remind me to do it i think that's it for the presentation as usual i speak more than i wanted let's see so