So in this module 10 we are going to take information from genes and move on to networks. Does everyone hear me well? I think the microphone is sort of on and off, so maybe I'll stand closer. Some of the objectives that we want to learn through these modules: before we even get into pathway and network analysis, we need to make sure that we're calling genes by the right names. Handling gene identifiers is a mundane task, but it's a really important one. The second is that we discuss which kinds of gene annotations are available, so what type of information we can use to interpret genes. Then, what is a basic gene enrichment test and how does it work? The point is to conceptually understand what we're doing in the context of pathway enrichment analysis; you can obviously use many different tools to run these types of tests. Then, why do I need to use multiple testing correction? Well, we always need to use multiple testing correction because we're dealing with big data. How do we analyze gene set enrichment results using EnrichmentMap? We do that because usually these types of results are very redundant, and there are many similar ideas that need to be clustered together. And then we briefly go through general principles of network visualization using the Cytoscape software; this is really just a few slides, but you can do some hands-on work later on. Here's an introduction and the motivation for doing this type of analysis, in case you have already produced your interesting gene list or protein list data; then you know that sometimes you get a lot of results. The main motivation is: my cool new screen produced a thousand hits, and now what do I do?
A lot of that preliminary work you have already practiced during the workshop: you need to analyze this data carefully to distinguish signals that rise above statistical noise, but then you need to go in and interpret those lists of genes through pathways and networks, and this can be a very challenging task, because if you go to PubMed and try to look up a few papers about each of your top genes, you can easily get into hundreds or thousands of papers to read. So pathway and network analysis is really a technique to automate that process and do a statistical analysis of keywords, functions, and processes that have been annotated to genes previously. So when you consider this pipeline: you have some raw input data that you have processed carefully using pipelines and algorithms. You perform some sort of statistical ranking or clustering algorithm on these data in order to extract genes that come out with the highest statistical signal in some sense; perhaps they're highly expressed in your disease of interest, or maybe they form specific clusters that you're interested in. And then you use prior knowledge about those genes that has accumulated over decades of research, and using certain analytical tools you may actually come up with a new hypothesis of what a gene might do in the context of a disease, annotate a previously unknown gene, or explain some mechanism underlying the disease. Maybe we'll try to clarify this first: what is a pathway and what is a network? I think the many people who work in this area will have their different definitions. My definition is the following. When you talk about pathways, you talk about a fairly small-scale system of maybe a few dozen genes. And those few dozen genes have their interactions, and those interactions are fairly well defined.
It may be a phosphorylation event or a transcriptional deactivation event, but each one of those relationships, or edges, between the different genes has been defined very carefully in experimental settings. So that could be an EGFR-centered pathway, with the various upstream and downstream members that control EGFR or are found in EGFR-related downstream signaling. Now, an EGFR-centered network would be a slightly different concept. You would also see several genes interacting with EGFR; however, the system may be a much larger-scale system, and the edges may be a little bit more vague. Perhaps they're not coming from individual experiments; these edges between EGFR and the additional genes or proteins would be coming from, say, a few large-scale omics screens. So the edges are perhaps not as well defined, they're not as high confidence, but those types of large-scale networks may still include additional information that we may find valuable for our data analysis and interpretation. So, fairly briefly, what types of network and pathway analysis techniques are available? This is a very diverse field; there are many types of technologies and methods, and even goals we can pursue to achieve better interpretation of data. This particular classification comes from a review paper in Nature Methods from 2015. The first type of pathway and network analysis is enrichment of fixed gene sets. Fixed gene sets are groups of genes that have been annotated previously as carrying out a particular function, and we perform an enrichment test to ask if our experimentally derived list is somehow characteristic of these gene sets. This is perhaps the most widely used and easiest method to apply, because it makes fairly few assumptions.
The second type of method would be de novo subnetwork construction and clustering, in which case we have a particular set of genes that we're interested in, and we're asking which kinds of networks connect these genes together. And then the third category would be a pathway-based modeling exercise, where we have an existing pathway diagram as a scaffold and we ask if our genes of interest behave according to the rules set by that particular pathway diagram. The third method is perhaps the most specific. It requires you to have a very detailed dataset, but it can potentially help you model your genes of interest using an existing pathway in high detail, assuming that you have a good diagram of the pathway and very high-throughput, detailed data. So if we talk about cancer research, what questions would these types of pathway and network analysis answer? The first one is fairly simple. If we have analyzed a set of cancer samples, for example, and we have come up with a list of genes that's really characteristic of a subtype of cancer, we could ask what kinds of pathways and processes are active in this subtype of cancer. When we do de novo subnetwork construction and clustering, we could ask: are there certain kinds of new pathways or new networks particularly altered in this set of samples representing a particular cancer subtype, and perhaps are some of those cancer subtypes clinically relevant, in terms of, say, different patient prognosis? And then when we do pathway-based modeling, we can ask how pathway activity is altered in a particular patient. Do the genes follow a particular regulatory pattern that has been predicted by a known pathway, or is the regulatory pattern a little bit different? So when you look at these questions, you see that some of them are fairly simple to ask, and others require very good insight into the genes and the pathways at hand.
So for the rest of this session we will actually focus on the first class of methods, which deal with the enrichment of gene sets in your high-throughput data. So what does pathway enrichment analysis really do? This can be visualized with a really simple Venn diagram, where on the one hand, in blue, you have a gene list that you identified as being particularly active, for example, in your experiment of choice. And in orange, you have a previously annotated list of genes that have a certain function or are known to be involved in a particular process. For example, that process could be neurotransmitter signaling, and those genes have been accumulated over time into a particular database. And then, using a statistical enrichment test such as Fisher's exact test, you ask: are the neurotransmitter genes particularly highly enriched in my experimentally derived gene list? And if the answer is yes, you may pose a new hypothesis that perhaps neurotransmitter signaling is somehow related to the experiment that initially derived that particular list of genes. So perhaps you were doing a drug sensitivity assay and then followed up with RNA-seq experiments; if genes are differentially regulated, and those genes are often related to neurotransmitter signaling, then maybe that poses a new hypothesis. In practice, we actually have many different gene sets, each representing a different functional category or molecular pathway, and we test many of those gene sets one by one. Each time we may get an enrichment or not. And then we need to do something called FDR correction, or multiple testing correction, to derive a more conservative list of findings. And then, based on those, we interpret our data, perhaps leading to a follow-up experiment, or at least explaining what we see in our high-throughput experiments. So much for the very basic background of pathway enrichment analysis.
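The FDR correction step mentioned above can be sketched in a few lines. This is a minimal Benjamini-Hochberg implementation, a standard FDR procedure, though not necessarily the exact one any given tool uses, applied here to hypothetical p-values:

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR adjustment; returns q-values in input order."""
    m = len(pvalues)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvalues[i] * m / rank)
        qvalues[i] = prev
    return qvalues
```

For example, `benjamini_hochberg([0.03, 0.001, 0.5, 0.01])` returns roughly `[0.04, 0.004, 0.5, 0.02]`: each raw p-value is pushed up according to its rank among the tests conducted.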
Now we need to go through a few details that you need to care about before going into the analysis. How do we identify genes reliably? These pathway enrichment analyses rely on particular gene identifiers and databases to run. Why is it important to identify genes in the right way? This is because genes do not follow one particular naming standard; genes will have many types of names, many types of identifiers. Importantly, these tend to change over time. So some identifiers might become obsolete, and even worse, the same gene may have multiple names, and the same name may actually belong to multiple different genes. So what are the issues to keep in mind? What is the output of the experiment that you initially performed? Proteins may have different identifiers compared to transcripts or genes. Non-coding RNAs will have yet another space of identifiers. If your original experiment produced a particular list of identifiers, you may need to convert that list to another one in order to enable pathway enrichment analysis. When you analyze a particular experiment, you may not even be able to directly convert your identifiers to the space that is used by a pathway tool; you may need to convert them twice or multiple times. Each one of those conversions may lead to a certain loss. The statistical test we will talk about a little later. So what are the identifier mapping challenges we need to worry about? We need to avoid errors; we have to map identifiers correctly. In many cases, there will be mappings between a single gene and multiple alternative identifiers. If you're using an older platform for experimental analysis, some of those identifiers may not map correctly to the current identifiers and therefore lead to losses in your pathway enrichment analysis, or worse, point to a different gene altogether. So gene names are ambiguous.
Therefore, you need to use a kind of identifier that provides just one identifier per gene and not multiple aliases. Here's an example of the famous tumor suppressor gene p53, which may be called P53, TP53, TRP53, LFS1 and so on and so forth. Some of those are historical, and some of them are used in parallel, so it's important to know about that. In the human space there is a standard nomenclature of gene symbols, and databases also assign stable alphanumeric identifiers. Unfortunately, those may also change: genes occasionally get assigned new identifiers, and that can confuse people and algorithms both. So it is sometimes better to use gene symbols in parallel with actual database identifiers to be able to map your identifiers consistently. If you have used Excel spreadsheets before, then you probably know that gene identifiers, if they resemble dates, may be converted to dates automatically. You definitely want to avoid that and be very careful, especially if you're not dealing with a dozen genes but maybe a few hundred genes; it's very easy to have these mistakes replicate automatically. There have been studies recently that screened supplementary tables of high-impact journals and showed how much these tables are affected by date conversion, so this is fairly widespread. And when you look at gene identifiers in large-scale experiments, you will most likely face issues with having 100% coverage, because if you have hundreds of genes to analyze, sooner or later a few of them might have obsolete identifiers or not be found by a pathway analysis tool. So if you really care about your gene list, you should go through it manually and make sure that all the identifiers are mapped correctly by the pathway tool, and if not, then you should maybe Google around a little bit and find what the actual identifier for a gene is.
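The date-conversion problem can even be screened for programmatically. Below is a small sketch of my own, not a published screening tool, that flags cells in a gene-name column that look like Excel-converted dates, such as the symbol SEPT2 becoming "2-Sep":

```python
import re

# Month abbreviations Excel produces when auto-converting symbols like SEPT2 or MARCH1.
_MONTHS = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
           "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"}
_DATE_LIKE = re.compile(r"^(\d{1,2})-([A-Za-z]{3})$")

def looks_date_mangled(value):
    """Flag spreadsheet cells like '2-Sep' that were probably gene symbols."""
    m = _DATE_LIKE.match(value.strip())
    return bool(m) and m.group(2).upper() in _MONTHS

def screen_gene_column(values):
    """Return the values in a gene-name column that look like converted dates."""
    return [v for v in values if looks_date_mangled(v)]
```

Running `screen_gene_column(["TP53", "2-Sep", "1-Mar", "EGFR"])` would report the two damaged entries, which you could then restore to their intended symbols by hand.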
There are public services and other tools that allow you to convert multiple kinds of identifiers. This is the g:Profiler server, where you can use literally hundreds of types of identifiers and map them to hundreds of other types, both in human and many other species. This will handle the majority of different kinds of identifiers for you; certain exceptions obviously exist. It will allow you to use a mix of identifiers and then get an output of one specific kind of identifier, and if there are multiple aliases, those will be revealed to you as well. So much for gene identifiers; I really hope that you won't need to spend a lot of time on them in your practical research, but this is definitely something to pay attention to. Are there any questions before I go to the next slide? Is there some sort of tooling, as opposed to manually editing each gene, if you have, let's say, a gene list of three genes? Is there any tool you can use where you put in this gene list and at least see if they're all being picked up, and then be able to look at the difference, so you can see if any of them are actually absent? The g:Profiler g:Convert tool might do that for you, and there's a feature where, if there are multiple equivalent identifiers, it may actually ask you which one you think it is. But it has its restrictions as well; it won't always work. And there are certain identifiers that are purely numeric, and for those you need to know in advance which database they're coming from, because numeric identifiers, at first sight, can't be told apart. So the second really important component, really the core component, of pathway enrichment analysis is gene sets and gene annotations. This is the public knowledge that allows us to do this type of analysis, coming from databases, previous studies, careful expert curation and so forth. So the cell is a complex machine.
It involves the concerted activity of many different genes and many different processes, functions and cellular components. In order to perform pathway enrichment analysis, we use gene sets that have been previously compiled using our knowledge of how the cell works. There are massive expert teams that curate existing knowledge and previous studies in order to compile gene lists that are representative of our understanding of biology, both on the cellular side and the organismal side as well. Gene sets are available from multiple sources. Among public databases, the Gene Ontology is one of the most important ones; we will discuss it next. There are different databases out there that provide you with this information, and different kinds of information as well. Pathways and biological processes are the most valuable type of resource and the most applicable for different kinds of omics analyses, but you can also use other kinds of annotations. In the Gene Ontology there are three groups of gene sets, representing biological processes, molecular functions and cellular locations. You could also imagine that in certain contexts you may want to look at different parts of the chromosome and which genes are annotated there, and do an enrichment analysis on that. You could collect gene lists corresponding to different human diseases, perform enrichment tests on those gene lists, and so on and so forth. But in practice we would recommend that if you do a pathway enrichment analysis, you start with a fairly basic set of pathways and processes. Those are represented in the Gene Ontology biological process branch, and then in pathway databases such as Reactome or KEGG, which are the most useful in the large majority of omics analyses. Depending on the additional questions you ask, you may want to look at things like genes targeted by the same transcription factor, or proteins targeted by the same kinase, depending on what your experiment is about.
So what is the Gene Ontology? The Gene Ontology is like a structured dictionary: it is a collection of keywords for various kinds of biological processes, molecular functions and cellular components that researchers have developed over time. For example, apoptosis would be a biological process, membrane would be a cellular component, and protein kinase activity would be a molecular function. Besides being a dictionary, the Gene Ontology is also structured: between these various terms there are associations, or links, because certain parts of the Gene Ontology represent fairly specific knowledge and other parts represent very general knowledge. So the dictionary is a hierarchical structure where more specific terminology links to more general terminology. Here's an example of the GO structure. The Gene Ontology is structured almost like a tree, except that it's a directed acyclic graph, which means that each node in the tree can have multiple parents. Here's an example as well: B-cell apoptosis is a very specific kind of cell death, and it's a specific node in the GO tree; I don't think I have a pointer for that. B-cell apoptosis is a kind of apoptosis, which in turn is a kind of programmed cell death, and further on it's a kind of physiological process. So scientists have developed this set of terminology, and each one of those terms has a list of genes annotated to it, which gives us the gene sets that we can use in a pathway enrichment analysis context. So what does GO cover? As we mentioned, it covers cellular components, molecular functions and biological processes, and they're structured in a way that is machine-interpretable and can be processed by enrichment analysis tools.
One of the components is the terms, and those terms come from expert curators who routinely screen the literature and add terms as our knowledge about biology gets more specific; there are also occasional large-scale developments of the Gene Ontology where major parts of it get deleted, and a lot of terms get added and rearranged. As you can see from the plot on the right, this is a highly dynamic system: as more and more studies get published, our knowledge of cellular organization increases, and therefore more and more terms get added over time. This also tells you that a pathway enrichment analysis performed today is likely to be outdated in a few years' time, or at least it will represent only part of our knowledge, because our knowledge increases and these results will change over time. The second part that we mostly care about here is how genes get annotated to these hierarchical dictionaries. Those annotations come partly from expert curation, but also from semi-automatic and fully automatic procedures. These are known as gene associations, or GO annotations, and they tell us how genes function, which processes they're involved in, in which cellular components they are found, and so on and so forth. As a rule, there are multiple annotations per gene: one reason is that genes are involved in multiple processes and functions, but another is the hierarchical annotation, where there are more specific and more general representations of the same process in the Gene Ontology. It's important to note that not all GO annotations are assigned by experts; experts actually assign only a minority of them, and there's a lot of statistical analysis, curation and automatic labeling happening, especially in non-human organisms. So here's an example of why a particular gene usually has many different GO annotations.
As we discussed earlier, GO is represented as a sort of tree where there are general and specific terms. Say a researcher has annotated a particular gene to a particular term in this hierarchical tree; say Aurora kinase B is known to be involved in B-cell apoptosis. Then, by the definition of the ontology, that gene is also annotated to each parent node above that specific B-cell apoptosis term, and this is kind of natural, because if a gene is involved in B-cell apoptosis, it's also involved in cell death in general. But this leads to a higher redundancy of pathway enrichment analysis, which means that if you have a rich gene list coming from a well-designed experiment, it's quite likely that you will receive a lot of pathway enrichment results, but many of them will resemble each other, because they represent fairly similar areas of the Gene Ontology. Why is it important to filter the gene sets? When you look at the tree, where there are specific terms and very general terms, it turns out that there are thousands of very highly specific terms with very few genes annotated to them, because our knowledge of biology is getting increasingly detailed; people can spend their entire career studying a particular gene or process of interest, and therefore these annotations can get very detailed indeed. On the other hand, because all of the annotations of gene function are propagated upwards in the tree, at the very top there are these very general nodes, or very general terms, that are not very informative but contain thousands or even tens of thousands of genes. As an extreme example, the GO root node "biological process" contains all annotated genes in that particular species.
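The upward propagation of annotations described here is easy to sketch. The mini-DAG below mirrors the example chain from the slides (B-cell apoptosis, then apoptosis, then programmed cell death, then physiological process); the gene assignment is purely for illustration:

```python
def propagate_annotations(direct, parents):
    """Propagate gene annotations from specific GO terms to all ancestors.

    direct:  {term: set of genes} assigned by curators
    parents: {term: list of parent terms}, the child-to-parent edges of the DAG
    """
    full = {term: set(genes) for term, genes in direct.items()}

    def ancestors(term, seen=None):
        seen = set() if seen is None else seen
        for parent in parents.get(term, []):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    for term, genes in direct.items():
        for anc in ancestors(term):
            full.setdefault(anc, set()).update(genes)
    return full

PARENTS = {
    "B-cell apoptosis": ["apoptosis"],
    "apoptosis": ["programmed cell death"],
    "programmed cell death": ["physiological process"],
}
# Hypothetical direct annotation, as in the lecture's Aurora kinase B example.
DIRECT = {"B-cell apoptosis": {"AURKB"}}
```

After `propagate_annotations(DIRECT, PARENTS)`, the gene appears in every ancestor term's gene set, which is exactly why general terms accumulate huge, redundant gene lists.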
Therefore it's important to apply certain filters when you do a pathway enrichment analysis: you may want to exclude very small terms, such as those that have only one or two or five genes, and you may also want to exclude terms that have maybe 1,000 or 5,000 or 10,000 genes. There are several reasons for those filters. One is a statistical reason: if you do an analysis over thousands and thousands of potential pathways, which are all highly interrelated, then you increase your multiple testing penalties, so you should be more cautious about each individual result because you conducted so many tests. On the other hand, there's a biological interpretation issue as well: say you have your enrichment results from pathway analysis and they're enriched in "biological process"; that doesn't help you much in understanding what your dataset is about. These large nodes at the very top of the tree can also lead to statistical inflation, which means that because they're so large, they may get very significant p-values purely because so many genes are involved. So a good practice is, before you even conduct your pathway enrichment analysis, to set aside the very, very general terms and the very, very small and specific terms. So where do the annotations come from? The best of them come from manual annotation of existing literature, where an expert evaluates the literature and assigns a score or a particular label to a gene being involved in a particular process, depending on what kind of evidence is available. Maybe there were a few mutant phenotypes and there's a fairly strong association between the gene and the process; or maybe it was only based on sequence similarity of that gene, and one could argue that the evidence is weaker.
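Returning to the size filters suggested above: they amount to one small function. The thresholds here (5 and 1,000) are just common illustrative choices; real tools let you tune them:

```python
def filter_gene_sets(gene_sets, min_size=5, max_size=1000):
    """Keep only gene sets whose size is informative for enrichment testing.

    Drops tiny terms (too few genes to test reliably) and huge, very general
    terms (prone to redundant, inflated enrichments).
    """
    return {name: genes for name, genes in gene_sets.items()
            if min_size <= len(genes) <= max_size}
```

Applied to a collection containing a one-gene term, a seven-gene term and a two-thousand-gene term, only the seven-gene term survives.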
A large amount of these annotations, or predicted functions of genes, comes from automated curation that evaluates all kinds of databases and literature and doesn't really involve expert curators. This is particularly common in less annotated model organisms, where a large bulk of gene annotations is simply inferred from other species where more careful experiments have been conducted. And it's generally assumed that many of those electronic annotations could be of lower quality, so that is something to be mindful of. When you're analyzing your pathway enrichment data and you see a lot of results unfolding, it might be worth evaluating whether the underlying evidence is entirely automated annotation, or whether there are actually some experimental annotations that come from detailed studies of that gene. So here's an example of the various evidence types attached to annotations of particular genes being involved in certain processes: they could be coming from experimental evidence, either high-throughput or low-throughput; from computational analysis of existing data; from sequence homology between related species; or just from literature curation, such as "author X in a particular study said that this gene is involved in that process". So in a particular pathway enrichment analysis, you want to pay attention and observe what your results are based on. Some tools, such as g:Profiler, allow you to inspect these annotations on the fly. For example, here we use colored evidence codes of gene annotations to understand, in a pathway enrichment result, which genes are annotated using which kind of evidence, for example whether it is experimental evidence that associates a gene with a particular function.
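Splitting annotations by evidence quality, as suggested here, is straightforward once each annotation carries its GO evidence code. The records below are hypothetical; the set of codes treated as experimental is a common, slightly simplified choice (GO's experimental codes include EXP, IDA, IPI, IMP, IGI and IEP, while IEA marks automated, uncurated assignments):

```python
# GO evidence codes considered experimental (a common, simplified selection).
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

def split_by_evidence(annotations):
    """Split (gene, term, evidence_code) records into experimental vs other."""
    experimental, other = [], []
    for gene, term, code in annotations:
        bucket = experimental if code in EXPERIMENTAL else other
        bucket.append((gene, term, code))
    return experimental, other
```

If an enriched term turns out to be supported almost entirely by the "other" bucket (e.g. IEA annotations), that enrichment deserves extra skepticism.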
I briefly mentioned already that gene annotations and pathways form a dynamic system that evolves over time, so one of the studies in my lab asked whether using pathway tools with recent versus not-so-recent databases of gene functions affects our results. And the first thing we asked was... yes, a question? [Question about the difference between gene set enrichment and pathway enrichment analysis.] I will use these terms here interchangeably. The difference would be perhaps that gene sets can be all kinds of gene sets, not necessarily pathways; you could look at microRNA targets or chromosomal regions, while pathway enrichment analysis would focus on gene sets that represent pathways. I would say that the definition depends on what type of input data you give and what type of results you would like to get out; the statistical methodology is mostly the same. Any other questions that I can answer quickly? Back to best-before dates of gene annotation databases. At this point a very popular tool was DAVID, and we asked when DAVID was last updated relative to the date of this analysis; it turned out that at that point DAVID had last been updated five years earlier, while it was by far the most cited pathway enrichment analysis tool, indicating that the tool was frequently used in various omics analyses. That leads to the question of whether the datasets representing pathways, networks and biological processes had evolved in the five years since DAVID was last updated. So we asked how much a practical researcher would benefit from having up-to-date databases in their pathway enrichment tools, compared to out-of-date pathways.
And then this was published shortly thereafter, where it turned out that if you use an outdated pathway enrichment analysis tool to analyze your current omics data, you may actually lose up to 80% of your pathway enrichment results, purely because pathway definitions and gene set definitions have evolved fairly rapidly over time; if you use a tool that employs these outdated databases, none of the new gene annotations are used, and you will just miss out on the gene interpretations that come from more recent studies. In other words, if you use one of the many pathway enrichment analysis tools available online, you should definitely look at when it was last updated and whether updates happen regularly, because this really matters for what kind of knowledge you can get out of your pathway enrichment analysis. Was DAVID updated after your paper? DAVID was updated after we released our preprint on bioRxiv, but I think that was it; since then they haven't updated it again, and it has been a few years. Why do people use it over the other ones? I would say that they have had the advantage of being first; I don't think they do anything statistically more sophisticated. It's about what the community uses, the ease of use, and what people recommend to others. It's often the case that among many equivalent tools, one of them will for some reason gain an advantage, and it will remain so and expand over time. I think it's also fair to say that updating tools requires resources and effort, and these resources are not always provided by funders; in order to maintain a tool you need to have people working on it and funding, and maybe DAVID didn't have that. Onwards. We have been talking about gene set enrichment analysis and pathway enrichment analysis; it is worth actually asking what this is about statistically.
There are many types of gene set enrichment tests. The most common would be Fisher's exact test, or a hypergeometric test, which evaluates whether your experimentally derived list is enriched in genes that share a certain annotated biological function. We already went through this slide once: we compare your gene list from an experiment, in blue, with a certain list of annotated genes from a particular database, and ask if the over-representation of that annotated function is statistically significant. And because you have many pathways, or many gene sets, as we saw with the Gene Ontology earlier, you conduct this analysis multiple times; and if you conduct a statistical test multiple times, you need to be more cautious about it, so you need a false discovery rate correction. Then, based on those results of enriched pathways and processes, you may derive your hypothesis for follow-up work or for publishing a paper. So what is a typical enrichment test about? Here, in this light brown or beige color on the left, you see the entire list of genes; that would be considered the experimentally detectable list of background genes. If you're doing an RNA-seq experiment, that might be all protein-coding genes in that species. A subset of that list is the experimentally positive genes, perhaps those that you detected as being significantly up-regulated in your case-control study. You then take those yellow genes, the up-regulated genes, and pass them on to a gene enrichment test that uses gene set databases or pathway databases from the literature, such as the Gene Ontology. The gene enrichment test accounts for the yellow genes but also the beige genes, so it evaluates the enrichments given the background set of genes; it conducts a separate statistical test for each one of those gene sets and comes back with a short list of gene sets, representing biological processes and pathways,
that are enriched in your experimental data. So, a little bit of detail: what does a typical enrichment test do? It assesses the probability that, by randomly sampling a gene list of equivalent size, you would see as many genes of a particular biological process as you actually saw. You could conduct this computational experiment easily using statistical programming, by just sampling random gene lists of that size and seeing how often a particular annotated function arises. Fisher's exact test doesn't require you to perform that random sampling; it has an analytical form to determine how many genes you would expect in an equivalent gene list and how unlikely the observation is by random chance. Therefore Fisher's exact test is really fast, and you can run it across many gene sets. So here's a contingency table for this particular enrichment test. You're essentially measuring two binary properties: yes or no, is a gene part of your experimental result, i.e. was it detected as significant; and yes or no, does that gene belong to a specific pathway? These categories are respectively A, B, C and D, and then using the hypergeometric distribution, whose details you don't really need to know, we determine how likely it is to have as many genes that are both in the pathway and in your experimental list of genes. It is also very important to pay attention to what the background is. When you run a pathway analysis online, by default the background usually is the list of protein-coding genes in that species, or maybe the list of protein-coding genes that have at least one or two annotations. That background set is appropriate in the vast majority of cases; however, it can lead to very dangerous statistical inflation of results if you actually weren't able to assay all background genes with your experimental procedure. So if you're unable to detect several genes using your
experimental procedure, you should make sure that your background list is correspondingly shorter. An example would be a phosphoproteomics experiment, where the results are phosphopeptides that in turn represent phosphoproteins, and those phosphoproteins are only a subset of all known proteins. To do pathway enrichment analysis appropriately, you need to limit your background list to all phosphoproteins and not all proteins, because against that larger list certain categories would systematically get inflated p-values. Depending on how you run the experiment, the background list may be fairly easy to determine or quite difficult. If your experimental platform only measures one type of gene or one type of protein, then that type should be part of your background and others shouldn't. So there are many types of experiments that are not truly genome-wide, and for those you need to deal with the background carefully. An audience question: if you don't define the background, all proteins is the baseline standard; so if you had a targeted panel, exactly how would you select the background, since the panel itself preselects the possible proteins? Well, if you know the composition of your panel, the panel becomes the background, because you didn't measure any gene that wasn't part of the panel; such a gene could never end up in your hit list, so the background is the panel. In theory the same statistical exercise applies even with a fairly small panel, say 500 genes, though obviously the numbers will shrink, and a good pathway analysis tool would also recognize the gene sets that contain no members of your panel; those would be set aside and not analyzed. A panel is a fairly easy example, because you know what is on it, and therefore you can put exactly those genes into your background and nothing else.
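As an aside, the hypergeometric tail calculation behind this kind of enrichment test, and the effect of choosing the wrong background, can be sketched in a few lines using only the Python standard library. All counts here are invented for illustration, and a real analysis would use a dedicated pathway tool rather than this hand-rolled sketch.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """P(X >= k): probability of drawing at least k pathway genes when
    sampling n genes at random from a background of N genes, K of which
    belong to the pathway (the hypergeometric tail)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# Hypothetical phosphoproteomics-style numbers: 15 of 100 hits fall in a
# 300-protein pathway. Against the full proteome the overlap looks extreme;
# against the smaller detectable phosphoproteome it is far less impressive.
p_all_proteins = enrichment_p(15, 100, 300, 20000)    # background too large
p_phosphoproteins = enrichment_p(15, 100, 300, 3000)  # appropriate background
```

With the inflated background the expected overlap is only 100 × 300 / 20000 = 1.5 genes, so 15 looks wildly significant; with the correct background the expectation is 10 genes and the same overlap is unremarkable.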
But there can be more subtle definitions of the background, such as genes you would never be able to detect even though you are attempting a genome-wide analysis. In an RNA-seq experiment, for example, there will be genes below your detection limit, whose transcripts you cannot capture. Is it correct to exclude those genes from your background, or should you keep them in? I don't think there is a clear-cut answer to that. Now we are going to go through multiple testing correction, which is a tool everyone should use whenever they conduct more than a handful of statistical tests, and that is usually the case whenever you are looking at genomics data or any type of omics data: you have so many measurements that you need to be cautious about any single finding. So, how do we win the p-value lottery? Imagine this simplified example of pathway enrichment analysis: there are 500 black genes representing a particular pathway and 4,500 red genes that are not in that pathway. You do a pathway enrichment test and ask whether the pathway is enriched, but you could do the same on completely random genes. You have this urn of genes, and you randomly pull out a few. In the majority of cases you will see that the red balls, the red genes, dominate, because they make up the majority of the gene population, and the corresponding p-value will not be significant, because you see a balanced, expected proportion of red and black balls. But you can do this many times, and in genomics you do this many times, and every once in a while, maybe 8,000 draws later, you will end up with a result that looks nominally significant, because most of the drawn genes are black even though black genes are a minority in the population. If you take this at face value
you would say this is a very strong enrichment, because you wouldn't expect to see that many black balls, or black genes, given the population. But because you did so many random draws, this happened purely by chance. So if you run many analyses, every once in a while something will emerge that looks very significant at face value, and therefore you need to be more conservative about each test you do, downgrading each result to account for these attempts to win the p-value lottery. Here is the practical pathway analysis case: you have a fixed list of genes, say RRP6, MRD1 and so forth, and you conduct many tests, each test being a different type of annotation. First you test the black versus red balls, then the shape, whether a gene is circular or rectangular, and so on. Because you do many tests, you are bound to find some results that look significant, and this is the reasoning behind multiple testing correction, which in essence takes every p-value from a particular test and makes it somewhat weaker, or downgrades it. The simplest p-value correction for multiple testing is the Bonferroni correction, which was developed before the Second World War; it's here mostly to demonstrate how simple the idea is, because in practice the Bonferroni correction is considered too stringent and is rarely applied in genomics these days. The basic principle is the following: we conduct some number M of tests, where M may represent all the different pathways we analyze for a particular gene list, and each test yields some original p-value. To correct the p-values we multiply each of them by M, so they become M times less significant. If we run 100 tests, each p-value is multiplied by 100, which obviously makes the majority of results much less significant, and anything that survives the Bonferroni correction and is still
nominally significant, say at 0.05, will then be considered a significantly enriched pathway after multiple testing correction. This kind of correction controls what is called the family-wise error rate: the probability that your results contain at least one false positive, a result that is there only due to random chance. After a Bonferroni correction at 0.05, you are guarding against even a single false positive remaining among your results, regardless of how many results you have. That is actually a very stringent criterion, because even with a very large number of results, insisting that not even one of them is wrong is a strong condition to impose, right? An audience question: is M the total number of genes observed across the gene sets, or the specific number of gene sets being compared in the statistical tests? If you had, say, 100 gene sets, each one representing a different pathway, which is normal when you do a pathway enrichment analysis, then M would be the number of different pathways you tested. You always have one list of genes, coming from, say, an RNA-seq experiment, but you interrogate that list using a large number of pathways to find the smaller subset that is interesting; you test many of them at a time, and M is the number of pathways you test. You end up with M tests, and because M is large, you should be more cautious, more conservative, about each individual test. As I mentioned, most people would not run a Bonferroni correction these days; instead they would use the Benjamini-Hochberg false discovery rate, a slightly more complex algorithm that is less conservative, so it retains more significant pathways and allows you to conduct a less stringent test. So
this we already discussed: because Bonferroni is so stringent, it misses real enrichments and leads to false negatives, so instead you can use a different algorithm that accepts a less stringent criterion, the false discovery rate, which applies a gentler correction and gives you more results in return, with the caveat that a larger share of them may be false positives. The FDR, or false discovery rate, is the expected proportion of the observed enrichments that are due to random chance. If your FDR cutoff is 0.05, you assume that no more than 5% of your reported results could be wrong; maybe none of them, but up to 5%. Compare that to Bonferroni, which guards against any single one of the observed enrichments being due to random chance. The typical FDR procedure is the Benjamini-Hochberg procedure, which is much more recent; that was the 1990s, not the 1930s. The FDR threshold is sometimes called a Q value, sometimes the adjusted p-value, and sometimes simply the FDR; these usually mean the same thing, and unfortunately people often mix terminologies even within their own papers, but what you need to know is that a correction for multiple testing was attempted. As a side note, I would be cautious about genomics papers that do not do multiple testing correction, because the moment you have multiple tests, you almost always need to correct for them. An audience question: by doing these corrections, aren't you increasing your likelihood of making a type II error?
Leading to false negatives? You might, for sure, but I think that is a lesser problem compared to having many false positives due to multiple testing. The follow-up: in my mind I'd rather have a longer list of things I could try to validate, since it would be easier to cross things off the list than to have something never make the list to begin with. Well, that depends on your list: each validation test might cost you thousands of dollars. Pathway analysis is notorious for garbage in, garbage out. If you doubt the value of false discovery correction, you could disable it, take a completely random list of genes, run it through a pathway analysis, and see what comes out; my bet is that something will come out, and you could construct a story around it, and that is a fairly dangerous practice. If the standard FDR doesn't work for you for some reason and you doubt that it gives you additional insight, there is a very crude empirical alternative: generate some random data, run exactly the same procedure as with the real data, and see what it delivers; if it delivers more than your real data, you're in trouble. An empirical correction using randomly generated data is always a good way to understand how many false positives your procedure actually delivers. Now let's walk through this Benjamini-Hochberg example; it's a little bit involved. When you run a pathway enrichment analysis, prior to correction you receive a p-value for each gene set or pathway you test, and say the nominal p-values here are ranked from the most significant one. What the Benjamini-Hochberg FDR correction does is first take these nominal p-values and multiply them by their relative ranking in the list: if we tested 53 pathways, we rank them from the most significant to the least significant and multiply each by its inverted rank, so 53 divided by 1, 53
divided by 2, and so forth, until 53 divided by 53. Each of those products becomes one step of the correction, and as you can see, this flattens the p-values into a more discrete distribution of adjusted values on the right. The procedure then identifies any adjusted p-value that is lower than the ones ranked above it and propagates it upward; you can see how 0.04 is propagated up to the pathways that initially had a stronger p-value from the unadjusted tests, and that becomes the cutoff, because we were seeking p-values that remain below 0.05 after correction. The nominal p-value of 0.031, indicated by the arrow, is then the least significant finding that still passes, and everything below it is considered non-significant; following this test, only the first four results would be considered significant. This is a gentler correction than Bonferroni, where you brutally multiply every p-value by the number of tests you conducted, and you can see that you retain more results from this analysis than you would from the corresponding Bonferroni correction. You never need to implement this by hand, because in statistical packages it is a single command. An audience question: you're saying only the first four are significant, but those four didn't all pass on their own, so does that mean that initiation of transcription is significant because of some other 0.04 somewhere that you're not showing us, or is that 0.04 applied to transcription?
Because it was propagated upward, and only those four remain significant. The follow-up: so transcriptional regulation and transcription-factor initiation are significant, but I'm not sure I understand, because their initial p-values were lower, yet after correcting by rank they were not the limiting ones; why is it done that way? Well, it's not my design: the bottom line is that the FDR was originally developed from large series of random samplings, observing how p-values behave under the null, and then an analytical procedure was derived to replicate that random sampling as closely as possible. With such a test you want it to run quickly; if you doubt that it works, you can run a permutation test instead, which is more expensive. The percentage is up to the researcher to decide: 5% is the most common significance cutoff, people also use 0.1 or 0.01, and this is something you determine in advance. You say, I will trust results that are significant at 0.05, on the assumption that up to 5% of those results could be wrong, and hopefully you did a power analysis before starting the experiment, asking how likely you are to replicate results at that significance level in a different analysis. Another question: where did the rank number come from? The rank? Usually the algorithm does this for you. You mean the 53 divided by 1, 53 divided by 2 in the very first column? That is based on the nominal tests: the pathway enrichment analysis considers each pathway in turn, computes a p-value for each, ranks the pathways from the most to the least significant, and then corrects each by its rank. So here are a few practical notes.
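As a rough sketch of the two corrections discussed above, here is a plain-Python version of both. The p-values are invented, and in practice you would call a library routine such as R's `p.adjust` or statsmodels' `multipletests` rather than implementing this yourself.

```python
def bonferroni(pvalues):
    # Each p-value is multiplied by the number of tests, capped at 1
    m = len(pvalues)
    return [min(p * m, 1.0) for p in pvalues]

def benjamini_hochberg(pvalues):
    # Multiply the rank-r p-value by m/r, then propagate the smallest
    # adjusted value upward so the result is monotone in the ranking
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [1.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):       # from least to most significant
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

nominal = [0.001, 0.008, 0.02, 0.04, 0.045, 0.3, 0.9]
n_bonf = sum(p < 0.05 for p in bonferroni(nominal))
n_bh = sum(p < 0.05 for p in benjamini_hochberg(nominal))
```

On these invented numbers only one pathway survives Bonferroni while three survive Benjamini-Hochberg, which is exactly the point made above: the FDR correction is gentler and retains more results.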
When you correct p-values, the strength of the correction always depends on how many tests you ran: if you tested 10,000 pathways, your correction will be more stringent. A good way to deal with this is to pre-define your search space in advance: if you don't need to test 10,000 pathways but can focus on a narrower set of a thousand, that leads to a less stringent correction and therefore perhaps more results. This is one reason why you may want to start with biological processes and molecular pathways rather than analyzing every available type of gene set, because otherwise you will face a more stringent multiple testing correction and perhaps more false negatives. The second issue is of course interpretation. So, fairly quickly: what else do we need to do in order to really interpret our pathway enrichment analysis results? I'd like to describe the enrichment map, which uses network-based visualization to display pathways, reduce redundancy, and thereby enable better interpretation. The Gene Ontology, and pathway databases generally, have a lot of redundancy in them. For one, pathways and biological processes are not always sharply defined, so they overlap; there is also the redundancy of multiple specificity levels in the Gene Ontology tree, where specific nodes and general nodes are all interlinked and cover very similar sets of genes; and when you mix different pathway databases, the same pathway may be described with different terminology in each database. As a result, the output very often looks like this: you have dozens or hundreds of pathways with statistically significant enrichment, but when you browse the first few at the top, they all seem to be saying the same thing with slight variations in specificity.
To overcome that, there are multiple techniques, but one of them is a visualization technique called the enrichment map, which clusters similar pathways together into little sub-networks. Here, every node, or circle, is one specific gene set or pathway, and it's connected to another gene set or pathway by a green edge if the two share a significant proportion of genes. The assumption is that if two gene sets contain more or less the same genes, they are also biologically fairly similar. You couple that principle with an automated network layout, which pulls similar pathways together so that they come to represent a biological theme rather than a pile of closely related pathways. Here's a motivating example. A few years back we were involved in a study of a pediatric and adult brain cancer called ependymoma. The primary way of diagnosing ependymoma is based on pathology and imaging, and a group in Germany collected a large amount of molecular data, transcriptomic and methylation data, in order to discover molecular subtypes from clustering of those data. They found nine subtypes of ependymoma, representing different patient age groups and different prognosis and disease severity. There were hundreds of genes up-regulated in each of those subtypes, so we sought to use pathway enrichment analysis and visualization to outline the pathways involved in these cancer subtypes. Here is the pathway enrichment map, where each group of nodes is a set of similar biological processes and pathways, and each color represents a particular subtype of the disease. Sometimes you see nodes with only one color, a pathway representative of one single subtype, while the multicolored nodes are pathways up-regulated in multiple subtypes of the disease.
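The edge criterion just described, connecting two gene sets when they share a large fraction of their genes, can be sketched with an overlap coefficient. The gene sets, gene names, and the 0.5 cutoff below are all invented for illustration; the actual EnrichmentMap app exposes tunable Jaccard and overlap-coefficient thresholds for this.

```python
def overlap(a, b):
    # Overlap coefficient: |A ∩ B| / min(|A|, |B|)
    return len(a & b) / min(len(a), len(b))

# Hypothetical gene sets with invented gene names
gene_sets = {
    "transcription initiation": {"G1", "G2", "G3", "G4"},
    "RNA pol II initiation":    {"G2", "G3", "G4", "G5"},
    "lipid metabolism":         {"G8", "G9"},
}

# Draw an edge between any two sets whose overlap exceeds the cutoff
names = sorted(gene_sets)
similar_pairs = [(x, y)
                 for i, x in enumerate(names) for y in names[i + 1:]
                 if overlap(gene_sets[x], gene_sets[y]) > 0.5]
```

Here the two transcription-related sets share three of four genes, so they are linked and an automated layout would pull them together into one biological theme, while the unrelated lipid set stays apart.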
This is a fairly compact overview, and much easier to understand than large spreadsheets of pathways for the individual tumor subtypes. This type of visualization obviously needs a little manual curation, but the input is constructed automatically using the Cytoscape EnrichmentMap app, which we will also practice in the next tutorial. Very briefly, what is Cytoscape? Cytoscape is an open-source platform, a Java program that runs on your computer, that enables visualization of biological networks. And what is a network? At its core, a network is a set of interactions, the relationships on the left. Perhaps A1 and A2 are genes; if they interact in some way, then conceptually this is just a set of pairs of genes. You can visualize that as a network, where the edges might carry different annotations or weights, or you could visualize it as a two-dimensional matrix or heat map. Computationally the two are equivalent, but the network representation has additional advantages, because we are very good at visual perception and can analyze networks by eye to understand the underlying trends. So what are the key ideas in network visualization? One is layout: if you just look at a network it could be a fuzzy hairball, but an automated layout may reveal its underlying structure in a meaningful way, and Cytoscape offers a large abundance of automated layouts that will help you interpret networks just by eye. The other aspect is that we have a wealth of information in omics experiments, and we can map it to different visual properties of the network. A very easy one: if there is an important gene in the network, you may want to give its node the largest size, or perhaps use a brighter color.
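As a toy illustration of that equivalence between a set of gene pairs and a matrix, here is the same tiny network in both forms, with invented gene names:

```python
# The network as a list of interacting pairs (an edge list)
edges = [("A1", "A2"), ("A1", "A3"), ("A2", "A4")]

# The same network as a symmetric adjacency matrix
nodes = sorted({g for pair in edges for g in pair})
index = {g: i for i, g in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for a, b in edges:
    matrix[index[a]][index[b]] = 1   # interaction present
    matrix[index[b]][index[a]] = 1   # undirected, so mirror it
```

Both forms carry exactly the same information; the edge list is what tools like Cytoscape typically import, while the matrix is what you would render as a heat map.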
Some of those examples are shown here. If you just load a network into Cytoscape it may look very hard to interpret, but if you assign colors to the genes of a particular pathway and then use an automated layout that pulls together genes with many connections among themselves and pushes apart genes with fewer connections, you can move from the image on the left to the image on the right, where nodes of the same color start to cluster together, perhaps representing a pathway or a protein complex. There is a wealth of visual attributes you can assign in Cytoscape: you can color essentially every part of the network, or use different fonts or line styles, and there is a lot of creativity in this process, but some of it can also be automated, as you will see in the upcoming enrichment map tutorial. So that concludes this part of the lecture.