Welcome to this short introduction to the core concepts of enrichment analysis. Enrichment analysis is most often used to characterize a gene list by looking for classes of genes, representing functions, that are overrepresented on the list and can therefore be associated with the study in question.

The basic idea is this. Imagine you've done an omics study and identified a list of regulated proteins, and you wonder whether this list might be enriched for mitochondrial proteins. You can now take all protein-coding genes in the genome and divide them into four classes: the proteins that are regulated and mitochondrial, regulated but not mitochondrial, mitochondrial but not regulated, and finally neither regulated nor mitochondrial. With this 2x2 contingency table, you can perform Fisher's exact test and that way obtain a p-value that answers whether mitochondrial proteins are in fact overrepresented on your list of regulated proteins.

Of course, mitochondrion is just one of many terms in Gene Ontology. Gene Ontology is a directed acyclic graph of terms that are commonly used to describe the functions, localizations and processes that proteins are involved in. The idea of enrichment analysis is that you don't just test one of them; instead you systematically test them all.

Of course, if you test them all you run into the problem of multiple testing. Multiple testing is best explained by XKCD. Some scientists were asked to test whether jelly beans cause acne, and they found that this was not the case. Then someone got the idea that maybe it's only a specific color of jelly beans. So they test 10 different colors of jelly beans and find that none of them are significant. They test another 10 and find that green jelly beans seem to have a significant association with acne at a significance level of 5%. If you don't correct for multiple testing, the next thing that happens is this.
The problem is of course that a p-value of 5% means that there's only a 1 in 20 chance of this happening at random. However, when you try 20 times, having it happen once is exactly what you would expect by random chance. Now, when we're doing enrichment analysis we're doing nothing like 20 tests. We're doing thousands of tests, and that of course makes it that much more important to correct for multiple testing, since even a p-value of 0.1% probably means nothing if you haven't corrected for multiple testing. You can do this either by Bonferroni correction or by calculating, in one way or another, the false discovery rate, and that is typically done by enrichment tools.

Another thing you have to worry about is so-called custom backgrounds. To understand this, let's look at an extreme example. Let's say I've done a study and I've identified a list of significantly regulated protein kinases. If I do a standard enrichment analysis, I would use the genome-wide background and ask what's special about these regulated protein kinases, and the analysis would with certainty tell me that the list is enriched for kinases. While true, that completely misses the point, since being a kinase was the filter that I used to get my list in the first place. So the right way to do the enrichment analysis is to not use the genome-wide background but to use the kinome as the background. That is, take the set of all kinases in the genome and ask what's special about the ones that are regulated compared to the others.

While this is an extreme example, you often find yourself in similar situations. Imagine that you're doing liver proteomics comparing two different groups of patients. If you do your standard enrichment analysis using the genome-wide background, the answer will be that you find enrichment for all kinds of liver functions. Again this misses the point.
It doesn't tell you what's different between the two groups of patients; it just tells you that your samples were liver samples, which you already knew. For that reason you will typically want to test against the observed proteome instead. That is, ask the question: what's special about the proteins that are regulated in my study compared to all the proteins that I actually saw in these liver samples?

You don't necessarily have to have the setup where you're comparing a list to a background. Another option is to work with a ranked list. Let's again imagine we have our liver proteomics study. We have two groups of patients, and when we compare them we find things that are up-regulated and down-regulated. We can take all the proteins and rank them, making a list where at the top you have the most up-regulated proteins and at the bottom you have the most down-regulated. I can now take different Gene Ontology terms and map them onto the sorted list. If I look at something like mitochondrion, and that is overrepresented among the up-regulated proteins, the list would look something like this. The genes that are localized to the mitochondrion will be near the top of the list. Similarly, if nuclear proteins are down-regulated, I'll see the term nucleus appear with genes that are near the bottom of this list. And if there's no association for cytosol, I'll see the genes with the term cytosol be seemingly randomly scattered over this list. This can of course be formalized, and you can use something like the Kolmogorov-Smirnov test to identify which terms show a non-random distribution across the sorted list. After that you again get a p-value, like for Fisher's exact test, which you have to correct for multiple testing, since again we're doing this systematically for every Gene Ontology term.

Enrichment analysis is not restricted to linking genes to functions. It can go beyond both genes and functions. Firstly, you can use enrichment analysis on any gene sets.
It doesn't have to be functions. It could be diseases, where you have a list of genes associated with each disease. It could be tissues, where you have a list of genes expressed in certain tissues. It could be a list of all the targets of a given transcription factor, so that you can figure out which transcription factors have their targets overrepresented on your list. That way you can find enriched transcription factors.

You can also use enrichment analysis on phosphoproteomics data, where your list is not even a list of genes. It's a list of phosphorylation sites that were significantly regulated. If you compare that to sets of phosphorylation sites associated with different kinases, you can do what is called kinase enrichment analysis to figure out, like for the transcription factors, which are the relevant kinases in this study.

Finally, you can use enrichment analysis even in microbiome analysis, where we're looking at neither genes nor proteins nor sites in proteins. We're looking at organisms. Say we're doing a gut microbiome study. We're again comparing maybe two groups of patients, and we see certain organisms, some of which show a differential abundance between the two patient groups. We can then look at, for example, organism-disease links and ask whether the organisms associated with certain diseases are overrepresented among the set of organisms that show differential abundance between the two patient groups.

I hope this gave you an idea of how broadly applicable enrichment analysis is. If you're interested in these topics of how to work with long gene lists, I suggest that you take a look at this presentation next. Thanks for your attention.
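
As a supplement to the talk, the list-versus-background test described earlier can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular enrichment tool works. The function names, the gene identifiers and the term sets are all hypothetical. The p-value is the one-sided (over-representation) form of Fisher's exact test, computed as a hypergeometric tail, and Benjamini-Hochberg is used as one possible way of controlling the false discovery rate. The `background` argument is where a custom background, such as the kinome or the observed proteome, would go.

```python
from math import comb


def overrepresentation_pvalue(k, K, n, N):
    """One-sided Fisher's exact test p-value, P(X >= k).

    N: genes in the background, K: background genes annotated with the term,
    n: regulated genes, k: regulated genes annotated with the term.
    """
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping values monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrichment(regulated, background, term_to_genes):
    """Test every term against the given background; return {term: (p, q)}."""
    background = set(background)
    regulated = set(regulated) & background  # ignore genes outside the background
    terms = sorted(term_to_genes)
    pvalues = []
    for term in terms:
        annotated = set(term_to_genes[term]) & background
        overlap = len(annotated & regulated)
        pvalues.append(
            overrepresentation_pvalue(
                overlap, len(annotated), len(regulated), len(background)
            )
        )
    qvalues = benjamini_hochberg(pvalues)
    return dict(zip(terms, zip(pvalues, qvalues)))


# Toy data (entirely made up): 20 observed proteins, 5 of them regulated.
background = [f"g{i}" for i in range(1, 21)]
regulated = ["g1", "g2", "g3", "g4", "g16"]
term_to_genes = {
    "mitochondrion": ["g1", "g2", "g3", "g4", "g5"],
    "cytosol": [f"g{i}" for i in range(6, 16)],
}

for term, (p, q) in enrichment(regulated, background, term_to_genes).items():
    print(f"{term}: p={p:.4g}, q={q:.4g}")
```

Passing the set of proteins actually observed in the experiment as `background`, instead of the whole genome, is what changes the question from "what kind of sample is this" to "what is special about the regulated proteins".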