Welcome to the dark side of enrichment analysis. The dark side is a new series in which I focus on major problems that are rarely talked about in bioinformatics, and I'm starting with gene-set enrichment analysis. If you're not already familiar with enrichment analysis, I strongly recommend that you go watch my introduction to the core concepts first.

The first problem I want to focus on is confusing term names. If you do Gene Ontology enrichment, you will very often find that the most significant terms are high-level terms with very long names. Examples include GO:0006796, phosphate-containing compound metabolic process, or GO:0051171, regulation of nitrogen compound metabolic process. If you're a biologist, you would be excused for thinking that this might relate to processes like phosphate homeostasis or maybe nitrogen secretion. But if you take a look at the child terms, you will find terms like GO:0006468, protein phosphorylation, and GO:0006355, regulation of transcription, DNA-templated. At this point alarm bells should ring. What you're looking at is protein kinases and transcription factors. In other words, you're looking at signal transduction. And you might ask yourself, is this even metabolism? Well, when you attach phosphates to proteins or form RNA, you are making and breaking bonds, which is the very definition of metabolism. So even if this is a confusing usage of the term that may lead to misinterpretation, it is correct.

The second problem is overlapping pathways. If you do pathway enrichment, you very often find surprising pathways showing up as enriched, often involved in various diseases that you're not studying. So it's not rare at all that you see asthma, inflammatory bowel disease, malaria, and tuberculosis all coming up as pathways enriched in the same study, which may seem absurd. It turns out the explanation is that the same genes can be in many pathways and are thus not at all pathway specific.
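To make the overlap point concrete, here is a minimal Python sketch that measures how many genes pairs of pathways share. The pathway memberships below are invented for illustration and are not taken from any pathway database; the function names are likewise hypothetical.

```python
from itertools import combinations

# Toy pathway memberships, made up for illustration only.
# Real memberships would come from a pathway database.
pathways = {
    "Asthma": {"IL4", "IL5", "IL13", "TNF", "IL10"},
    "Inflammatory bowel disease": {"IL4", "IL5", "IL13", "TNF", "IL10",
                                   "IL6", "IFNG", "NOD2"},
    "Malaria": {"TNF", "IL10", "IL6", "IFNG", "HBB"},
    "Tuberculosis": {"TNF", "IL10", "IL6", "IFNG", "NOD2"},
}


def jaccard(a, b):
    """Fraction of genes shared between two gene sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)


def pathway_overlaps(pathways):
    """Jaccard index for every pair of pathways, keyed by the sorted name pair."""
    return {
        (p, q): jaccard(pathways[p], pathways[q])
        for p, q in combinations(sorted(pathways), 2)
    }


def pathways_per_gene(pathways):
    """Count how many pathways each gene is a member of."""
    counts = {}
    for genes in pathways.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return counts


overlaps = pathway_overlaps(pathways)
membership = pathways_per_gene(pathways)

# Genes that sit in more than one pathway are not pathway specific.
shared_genes = sorted(g for g, n in membership.items() if n > 1)
```

In this toy data most genes belong to several pathways, so a gene list rich in those genes would hit all four pathways at once.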
This is why if you do enrichment analysis on a list with many cytokines on it, you shouldn't be surprised to see enrichment for pathways involved in every disease that has to do with inflammation.

The last problem I want to talk about today is study bias. We live in a world where disease genes attract more funding, and highly expressed genes are simply easier to study. For this reason, both of these classes of genes are going to have more annotations than the average gene. So if your gene list comes from a disease study, or was produced using techniques like transcriptomics or proteomics that have abundance biases, you will have the same biases on your list as you have in the database that you're using for doing enrichment analysis. Consequently, you will get inflated significance estimates and possibly even artificial enrichment of terms that would not be enriched if you did the analysis properly. You can address this by using a custom background, but this is only a partial solution.

I want to end on a positive note. It's not all bad. Yes, there are confusing term names. Yes, there are pathways that overlap in terms of genes. But there's a very simple solution to both problems, and that is that you should simply look at the underlying genes. If you look at which genes are responsible for the enrichment of a given term, it's immediately obvious if it's protein kinases or transcription factors. Disease bias is more difficult to deal with, but when it comes to abundance bias in transcriptomics and proteomics, there are correction methods that can deal with this, and these are implemented in tools like, for example, aGOtool. The most important tool, though, is critical thinking. You should always look at your results and think very critically about why the terms you're seeing come up as significant.

That's all I have to say about enrichment analysis today.
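The two practical suggestions above, using a custom background and looking at the underlying genes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a one-sided hypergeometric test, which is a common choice for over-representation analysis but is not named in the talk, and the gene identifiers, set sizes, and function names are all made up.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)


def enrich(gene_list, term_genes, background):
    """Return (p-value, contributing genes) for one term against a given background.

    Genes outside the background are ignored, so the study list and the
    annotation are compared within the same universe.
    """
    background = set(background)
    study = set(gene_list) & background
    annotated = set(term_genes) & background
    hits = study & annotated
    p = hypergeom_tail(len(hits), len(study), len(annotated), len(background))
    return p, sorted(hits)


# Toy data (assumed for illustration): 100 genes in the genome, of which only
# the first 40 are abundant enough to be detected by the experiment.
genome = [f"g{i}" for i in range(100)]
detectable = genome[:40]
# The term annotates 20 genes, all of them among the abundant ones.
term = genome[:20]
# The study list: 10 detected genes, 5 of which carry the term.
study_list = genome[:5] + genome[20:25]

p_genome, hits = enrich(study_list, term, genome)      # whole genome as background
p_custom, _ = enrich(study_list, term, detectable)     # detectable genes as background

print(f"whole-genome background: p = {p_genome:.3g}")
print(f"custom background:       p = {p_custom:.3g}")
print("genes behind the term:", hits)
```

With the whole genome as background the term looks enriched, simply because annotated genes are over-represented among detectable ones; against the detectable genes the signal disappears. The returned list of contributing genes is what you would inspect to see what a long term name actually stands for.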
If you want to learn more about how to actually carry out enrichment analysis, I suggest you take a look at this presentation. Thanks for your attention.