 Alright, good afternoon everyone. I'm going to stay close to the microphone, but if you can't hear me, just wave at me and I'll be louder or get even closer to the microphone. My name is Jüri Reimand, I'm a principal investigator at the Ontario Institute for Cancer Research. It's my third year now. I haven't done this for a while, but I have worked on pathway and network analysis for a while, so this is what I'm going to tell you about today. So, we're still on. Alright, so the learning objectives of this module. You should be able to identify situations where pathway and network analysis is useful, and if you are working on genomic or any other high-throughput data, there will be plenty of situations where this is useful. You will learn about the main components of pathway enrichment analysis: gene lists and pathways. In our context, pathways are mostly sets of genes. Many biologists find this a little counter-intuitive, but it's easiest to analyse them as sets of genes. And then a boring but important part of pathway and network analysis is the different types of gene IDs; you can't get around them. And then you should also be able to understand how gene sets come about: where do they come from and what do they mean? So, hopefully many of you have been in this situation of interpreting gene lists: my new cool screen finally worked out, or I performed this massive next-generation sequencing experiment and I found a thousand genes. So, what do I do about those thousand genes? Many technologies these days allow you to extract those thousands of genes from experiments. Maybe they represent a differential gene expression analysis or a proteomic screen, or maybe something about epigenomics where you identify regions of interest in the genome. In any case, going through these genes one by one is way too much. So, we really need to focus on analytical techniques that allow us to analyse sets of genes quickly and efficiently. So, here's a small workflow. 
It looks like an old-school microarray platform. We use some sort of genomics or analysis techniques to rank or cluster or filter these genes, and we end up with a long list of candidates. What do we do next? We use tools that allow us to interpret those genes. And in order to interpret those genes, we really need to use the body of biological knowledge that has been accumulated over decades. That could be about specific biological processes or gene labels, gene annotations, different databases, information about how genes interact or how proteins interact in cells. And through those analysis tools, you may actually find out something really interesting about your experiment or a particular candidate gene of interest, and maybe you'll publish very soon. There's an alternative approach. You can take your list of 100 genes, go to PubMed and do this one by one, and you'll realise that there's a lot of research out there. You could read 100 papers about each of your genes, end up reading 10,000 papers and spend a lot of time. So, pathway and network analysis is designed to give you a shortcut, to maybe focus on a couple of very interesting genes instead of your hundred or a thousand. So, what is pathway and network analysis? This is one of the potential diagrams of how you can use it to get from genotype to phenotype. On the genotype end, you may have whole-genome sequencing data. You have all these hidden variations or copy number alterations or different strange things that happen in the genome. And on the other hand, you may have some observations. Some patients do better than others in a Kaplan-Meier plot, up in the right-hand corner, or there's a differential gene expression analysis. You associate those two sets of data using information from public databases. Maybe there have been lots of large-scale experiments conducted earlier, experiments on things such as protein-protein interactions. 
There are large databases out there where each gene has had certain labels assigned to it over decades of research. There's a lot of literature and there are experts. So, all of those sources contribute to what we know as pathways and networks. Those pathway diagrams are based on very long research, and maybe each one of those nodes and edges over here is someone's life's work. So, pathway and network analysis is any type of analysis that involves pathway and network information. Most commonly, this is applied to interpret lists of genes. Any contemporary paper will have some aspect of it, I would argue. The most popular type is pathway enrichment analysis, but many others are useful. Sometimes these other types are more complex and make more assumptions about your underlying data. It helps you gain mechanistic insight into large-scale, high-throughput data, which we also often refer to as omics data. So, it's a good shortcut, but sometimes overused. What is the difference between pathways and networks? Very often, these things are sort of mashed together into one concept. This is what I think about those things, and that's kind of personal; different people will tell you different things. On the one hand, we have an EGFR-centered pathway, which would be a very detailed diagram, down to the level of molecular reactions, where we know what each component does and how it activates or represses another component. Very often, these pathway diagrams also have information about directionality. So, perhaps EGFR is activating some downstream targets under certain conditions. On the other hand, the EGFR-centered network would be something that is probably derived from a high-throughput experiment rather than careful literature curation. We often don't know how these different genes or proteins interact with one another or what that interaction means biochemically, but we believe that there is a certain interaction. 
It could be a genetic interaction, a physical protein-protein interaction, or something that has been inferred by analyzing high-throughput data, often called a functional interaction. And then you'll see that this EGFR-centered interaction network is actually part of a much bigger network, a big hairball, so to say, which is often depicted in the literature but not really interpreted in great detail. What are the different types of pathway and network analysis? The first one, which we'll mostly focus on today, is called pathway enrichment analysis, or enrichment analysis of fixed gene sets. There's a large body of tools that do that type of analysis. One of the most commonly used tools is called GSEA, which we won't talk about today; instead we'll talk about another one called g:Profiler, for some authorship reasons. The main input is a list of genes that someone derived from a high-throughput experiment. The output would be the statement that this gene list is enriched in particular cellular processes, maybe cell cycle, apoptosis, some hallmarks of cancer, and so on. Another one would be de novo subnetwork construction or clustering, where you may have a list of candidate genes that you're interested in, and you try to draw a network between these candidate genes using existing network information. So you'll say, you know, 20% of my genes make up this big interaction network among themselves. The third type of analysis in this kind of classification is pathway-based modeling, where we use the pathway structure or network as a scaffold. Then we make hypotheses about which gene might interact with which other gene, which may actually activate or repress that gene in gene expression studies, and so on. You'll see that the first part of this classification is perhaps the simplest. It makes the fewest assumptions about the available data. All you need is a gene set and a gene list. 
And on the other end of the spectrum, you actually need to believe that the data structure, the underlying pathway or network, is very reliable. You may also need some sort of additional observations, such as gene expression or transcriptome values, and so on. So to summarize, if you think about, say, cancer genomics data, some mutations perhaps, in the first case you would ask what biological processes are altered by somatic mutations in my list of genes. In the second case, you would ask whether there are some new, unknown pathways altered by mutations in this cancer. Maybe there are some clinically relevant tumor subtypes that are representative of particular networks. And in the third case, you would ask how different pathway activities are altered in this particular patient. Maybe there are drug-targetable pathways in this patient, because we see that, you know, downstream genes of a particular pathway are always down-regulated. Maybe the pathway itself is down-regulated as well, or deactivated. So as I mentioned, in this lecture we will mostly talk about the first part, the gene set enrichment analysis. However, further lectures will touch upon the others as well. So why would you analyze pathways rather than single genes or single SNPs or single proteins or something along these lines? One good argument is that you increase your statistical power. Statistical power is the notion of how likely you are to recover similar results if you had a different data set, given the number of tests that you do. When you focus on one gene at a time, you will probably have to go through 20,000 tested genes if you're thinking about human protein-coding genes. If you have a SNP array from Affymetrix, then you may look at a million SNPs. Now, if you look at the pathway and network space, you're probably going to look at a few thousand different pathways. 
Therefore, you make fewer tests, and you will more likely find the same pathways again if you look at different data sets. So it's a statistical power issue, but it's also more reproducible. You combine your many genes into fewer pathways, and you may be able to do more reproducible research. Pathways are also way easier to interpret than genes, because when you look at a list of genes, it's like an alphabet soup of symbols and numbers and letters. On the other hand, if you look at pathway data, it's more like textbook biology. So cell cycle gets activated, maybe there's a developmental pathway that you learned about at university, and so on. You may also find clues about underlying mechanisms in your experiment. If you do a case-control analysis between healthy samples and disease samples, pathway and network analysis may tell you what's wrong with the disease samples or what pathway may be disabled. And you could also predict new roles for genes. So if you have some unknown genes in your set that seem to behave very similarly to known members of a pathway, then maybe you have found another, newly described member of that pathway. Before you go into pathway analysis, you have to think about a few different things. First of all, it's garbage in, garbage out. Your data had better be high quality, because pathway analysis is a type of analysis that may give you an answer even though your data is flawed. One example that comes to mind is that you perform, say, transcriptomic assays, you do next-generation sequencing of RNA, and in your samples an apoptotic pathway shows up. Why could that be? Maybe because you left the cells lying on your bench for a while. Things like that happen very easily, and then the pathway analysis will give you an answer because it detected regulation of apoptotic genes. 
Gene measurements need to be normalized beforehand, and depending on which pathway analysis you use, you may need a different type of normalization. RNA-seq requires different normalization compared to, say, microarrays. Background adjustment is also important. Sometimes you know that your candidate gene list can only contain one particular type of genes and nothing else, and then the pathway analysis needs to be adjusted to take that into account. I will talk about this a little later, but you may also need to make sure that the particular type of gene IDs that you use is compatible with the software that you use. So there are a lot of caveats, but these days many of the tools are really easy to use. So what is the general workflow of this type of analysis? First you collect your genomics data, omics data: proteins, RNA, single nucleotide variants and so on. Then you normalize, rank and score them. Oftentimes that step already happens at the core facility, so that's great. Then you generate a list of genes that are active candidates, and then you use these various pathway and network approaches to learn about cellular mechanisms, candidate genes, characteristic pathways and processes and so on. Within that green step there are many other steps that are applied. Statistics is one: we rely on statistical algorithms and multiple testing correction to identify pathways. Visualization plays a key role, because many times these pathways are very redundant with one another. You may have dozens of pathways highlighted by the statistical analysis, but they're actually all the same thing. Then you can drill down to understand the molecular mechanism, perhaps going back from the discovered pathways towards the genes that you found, and associate genes and pathways in a more detailed analysis. And then ideally you'll publish the model explaining the data, where you integrate some of the pathway and network analysis. 
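To make the multiple testing step mentioned in the workflow a bit more concrete, here is a minimal sketch in Python. The lecture does not say which correction is used, so the Benjamini-Hochberg procedure below is an assumption chosen because it is a common option; the pathway names and p-values are made-up illustration data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw enrichment p-values, one per tested pathway.
pathway_p = {
    "cell cycle": 0.0001,
    "apoptosis": 0.004,
    "metabolism": 0.03,
    "translation": 0.6,
}
names = list(pathway_p)
q = benjamini_hochberg([pathway_p[name] for name in names])

# Keep pathways whose adjusted p-value passes an (assumed) 0.05 cut-off.
significant = [name for name, q_value in zip(names, q) if q_value < 0.05]
print(significant)
```

With only a few thousand pathways instead of 20,000 genes or a million SNPs, the correction factor is smaller, which is one way to read the statistical power argument made earlier.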
I'll get to that, but to the key question: there's a biological reason and a technological reason. There's a lot of crosstalk between pathways, you know, the same gene can be part of multiple pathways, but also the way pathways are represented in databases is very redundant. We'll talk about Gene Ontology soon, but Gene Ontology is like a tree where the leaves are very, very specific, you know, even reactions, and the higher nodes are very broad processes like metabolism, and they are contained within each other. So what is pathway enrichment analysis statistically? This is essentially a Venn diagram with a little bit of statistics happening in the middle section. In each test we have a gene list from our experiment, which may be your differentially expressed genes or genes with a cancer mutation or something like that, and then there are genes annotated in databases, for instance every gene that's known to be involved in neurotransmitter signaling. Okay, and maybe your experiment was about drug sensitivity in brain cancer. So then you compare those two lists of genes with one another, and you can compare them statistically using something called Fisher's exact test, for example. That will tell you whether the overlap between the genes in your experiment and the neurotransmitter genes is way larger than expected according to a statistical metric. You test that across many, many pathways, and that is essentially pathway enrichment analysis. And if indeed neurotransmitter signaling comes out in drug-sensitive brain cancer cell lines, then that might give you a hypothesis that perhaps neurotransmitter signaling has something to do with drug resistance or drug delivery in that brain cancer. An example of a tool that performs this type of analysis is g:Profiler, which I developed during my PhD in Estonia a few years ago, and this is an example output of that particular tool. Maybe not. 
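The Venn-diagram comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not how g:Profiler is implemented: it computes the one-sided Fisher's exact test as a hypergeometric tail using only the standard library, and the gene names, pathway and background universe are hypothetical.

```python
from math import comb


def enrichment_p_value(gene_list, pathway, background):
    """One-sided Fisher's exact test for over-representation.

    Returns (overlap size, probability of seeing at least that many
    pathway genes in a random gene list of the same size).
    """
    background = set(background)
    gene_list = set(gene_list) & background   # ignore genes outside the universe
    pathway = set(pathway) & background
    N, K, n = len(background), len(pathway), len(gene_list)
    k = len(gene_list & pathway)              # middle section of the Venn diagram
    # Hypergeometric upper tail: P(overlap >= k).
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return k, tail / comb(N, n)


# Hypothetical universe of 100 genes, a 10-gene pathway, a 10-gene experiment list.
background = {f"gene{i}" for i in range(100)}
pathway = {f"gene{i}" for i in range(10)}
gene_list = {f"gene{i}" for i in range(5)} | {f"gene{i}" for i in range(50, 55)}

overlap, p = enrichment_p_value(gene_list, pathway, background)
print(overlap, p)
```

In a real analysis this test would be repeated for every pathway in the database, which is why the resulting p-values need multiple testing correction.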
All right, so that colored block over here is your input genes, the pathways go from top to bottom and are shown on the left edge of the screen, and there are various p-values over here saying how unexpected the presence of so many genes of these pathways in your candidate list was. Then there are various numerical values, and those colors tell you something as well, which I'll tell you in a bit. Now, this is really dark, I hope you can see, but actually one of the goals is to convince you that there's a lot of data on the screen that you don't necessarily want to see. If you have a rich dataset at hand, this is a typical result: you'll have hundreds of pathways that come out as enriched, and that's partly because they're very redundant. So you have a well-performing experiment, you have a lot of characteristic pathways coming out of your, say, case-control RNA-seq dataset, and then you'll get dozens to hundreds of pathways that you don't want to analyze one by one. This is where we use a network visualization called an enrichment map, where each node of this network, each colored circle, represents one pathway, and nodes are grouped together with similar pathways using these green edges. The motivation here is that you group together similar pathways if they share similar genes, and if you do that consistently and systematically, then those many redundant pathways will instead become subnetworks or network modules, and it becomes much easier to interpret. So instead of having 300 pathways you'll have maybe 20 different groups, and very often you can just look at these groups and give a three-word summary of what each one actually represents. [Audience question:] When would you use g:Profiler and when would you use GSEA? 
All right, I'll give you a short answer comparing g:Profiler and GSEA. Actually, there are probably dozens of tools that do different aspects of this research. The main difference between g:Profiler and GSEA is the type of gene list you provide as input. g:Profiler will work with a list of 10 genes, and it will work with a list of, say, a thousand or two thousand genes. GSEA always requires you to input a gene list that's equivalent to the size of the protein-coding genome, so 20,000 is the standard input for GSEA, and for some analyses that's perfectly fine. Say you have a transcriptomic data set where you have a gene expression value for every gene; then GSEA is appropriate because you have a value for every gene. In the g:Profiler context you would have pre-filtered that gene list to get the statistically significant differences, and only then do you go to g:Profiler. So GSEA is designed for gene expression data; it doesn't really work with, say, proteomics data. g:Profiler is more general, but it won't analyze the entire list of genes for you. So let's say you want to do a sort of list of genes. I'll repeat the question: what if there is a large gene list that doesn't cover everything, say 8,000 genes coming from an enhancer profiling experiment? I would just take a step back and ask, what do you want to find? What is characteristic of 8,000 genes, what is the label you want to put on it in a pathway context? It sounds like if it covers, say, 50 or 60 percent of annotated genes altogether, then statistically you are not going to get a very large number of meaningful enrichments. So if you don't have a good statistical measure for drawing a cut-off, I would attempt to rank that gene list according to, say, the strength of the enhancer signal and then take the top thousand genes. 
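Coming back to the enrichment map described before the questions, the grouping of redundant pathways by shared genes can be sketched as follows. This is a simplified illustration, not the Enrichment Map software itself: the Jaccard similarity, the 0.5 threshold, and the toy gene sets are all assumptions made for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Fraction of shared genes among all genes in two non-empty gene sets."""
    return len(a & b) / len(a | b)


def enrichment_map_groups(pathways, threshold=0.5):
    """Connect pathways that share many genes, then return the connected groups."""
    names = list(pathways)
    neighbours = {name: set() for name in names}
    for a, b in combinations(names, 2):
        if jaccard(pathways[a], pathways[b]) >= threshold:
            neighbours[a].add(b)   # an edge in the map
            neighbours[b].add(a)
    groups, seen = [], set()
    for start in names:
        if start in seen:
            continue
        # Depth-first search collects one network module.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical enriched pathways with heavily overlapping gene sets.
enriched = {
    "cell cycle": {"CDK1", "CCNB1", "CDC20", "PLK1"},
    "mitotic cell cycle": {"CDK1", "CCNB1", "CDC20", "AURKA"},
    "cell division": {"CDK1", "CCNB1", "PLK1", "AURKA"},
    "apoptosis": {"CASP3", "CASP9", "BAX"},
    "programmed cell death": {"CASP3", "CASP9", "BAX", "BCL2"},
}
groups = enrichment_map_groups(enriched)
print(groups)
```

Five redundant pathways collapse into two modules here, which mirrors the idea of going from hundreds of enriched pathways to a handful of groups that can each be summarized in a few words.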
All right, so where we stopped was, I suppose, this long list of dark green lines moving to a network. I wanted to show you an example of a recent analysis where we investigated tumor heterogeneity in a particular central nervous system tumor called ependymoma, which occurs in children as well as adults. In that paper, the researchers showed that there is no single class of ependymoma; there are about nine different subtypes, which have clinical characteristics and molecular characteristics and histological differences and so on. Our task in this analysis was to figure out what types of pathways, networks and functions are representative of each of those nine subtypes. We used a technique similar to g:Profiler to annotate the highly expressed gene sets for each tumor subtype, and we ended up with this colorful visualization using the enrichment map, where the different groups of network nodes highlighted here represent different pathways activated in each of the subtypes, and colors represent the subtypes. You can see that some areas of this complex map are annotated by multiple colors, so multiple subtypes had an enrichment in that pathway, and others are unicolor, so they were characteristic of only one subtype. So there's a lot of omics going into this figure, yet it's quite easy to read and understand, at least on a broad level. So, having seen this hopefully motivating example, let's go into the details a little bit. First is where gene lists come from, and this is what you actually know best, because you know your experiment best. Broadly speaking, we do all kinds of molecular profiling, and pathway enrichment analysis out of the box works best with data where each gene was supposed to have a signal. So you do a genome-wide screen of every gene, and then every gene will have a signal, and that's when pathway analysis works best. So the simplest type is just a gene
list that doesn't have any meaningful order to it. It's a list of genes, and this is where you apply a tool like g:Profiler. Then there is a second type of list, which could be a list of genes along with values, maybe fold-change values of cases relative to controls, and if you have a full list of those fold changes you may want to look at a tool like GSEA. Then you can do all kinds of ranking and clustering and other custom approaches to analyze that data. Maybe you do a clustering analysis of your data, and then each cluster becomes a gene list and you can analyze those gene lists separately. A whole different way of doing pathway enrichment analysis is to focus on networks as input. You could look at your favorite gene or protein and every other protein that interacts with that protein, and then do a pathway enrichment analysis on those proteins. It can become more complex. As we just discussed about enhancers, you may look at gene regulatory regions of the genome or microRNA target sets of genes and analyze pathways in those. These become somewhat special cases, especially the regions that are out there in the wider genome, because you will not always know which enhancer regulates which gene. There's a tool called GREAT, which I think will be talked about later, that will account for distal regulatory elements and do proper enrichment analysis of those. You can look at a genetic screen, for example a knockout library, see which essential genes come out, and analyze those in the pathway context. And then you can also look at association studies, GWAS studies, and look at single nucleotide variants and copy number variants. These further examples are often haunted by the fact that you don't know which genes those distal or non-coding variants regulate. And there are many other examples. I think it's worth mentioning right here that when you do a pathway enrichment analysis on an omics screen or an omics data set that
doesn't cover every protein or every gene in the genome, you need to be careful to set something that we call the background set, a background group of genes for the pathway enrichment analysis. An example of that would be a phosphoproteomic experiment, where proteomics people will know that not every protein will be phosphorylated. There are about 10,000 proteins, or maybe half of the proteins in the human proteome, that ever get phosphorylated. Therefore, when you do pathway enrichment analysis on that type of data, you have to provide those 10,000 as a so-called background set. That's important because otherwise every phosphorylation-related process will get artificially significant p-values, and it will make your interpretation difficult. What do gene lists mean? Gene lists come from the experiment, and they describe a particular aspect of your experiment. They may represent a complex or a pathway or physical interactions. Sometimes they represent genes with similar functions that are activated, maybe protein kinases when you treat cells with a kinase inhibitor. They could represent tissue specificity when you're comparing tumor samples with adjacent normal tissue, or they could represent chromosomal location in a copy number variant analysis, where a variant covers multiple genes. So, the biological questions that you ask when you do a pathway enrichment analysis: this is something that you should actually start your experimental design with. One very typical way of analyzing a gene list is to say, these are the characteristic biological processes active in that gene list. Another one is to perform a differential analysis, where you have cases and controls and you want to see the biological processes that are representative of cases relative to controls. You may be after finding a regulatory gene or an RNA that controls your genes of interest, or you may want to discover new gene function. These are questions that can be answered directly or indirectly using pathway enrichment
analysis. I'll go through this again quickly. Pathway enrichment analysis allows you to summarize and compare gene lists. Network analysis looks at interaction networks, which is a slightly different beast; it can be computationally more complex, and you make certain assumptions. And then there's a regulatory network analysis lecture coming up as well, where your goal is to find regulators of those genes: transcription factors, microRNAs, long non-coding RNAs and so on. Pathway enrichment analysis has two input components. The gene list is something that comes out of your data analysis, and pathways or gene sets are those that are part of public databases. Then there are various tools, such as AmiGO, GSEA and g:Profiler, that intersect gene lists with pathways and retrieve enriched pathways. Other components include gene identifiers, various gene annotations, and where to get them. So let's go over these gene and protein identifiers first. Identifiers are, ideally, unique, stable names or numbers that track database records. Now, it's interesting that many people refer to their genes by gene symbols, so a four-letter code and a number. It turns out that these actually change quite a lot; over the years maybe 10% of those names change, so we should really watch out. A social insurance number is an example of a number that usually doesn't change over time, and that's a good one, because you can always refer to a person through that number. But the name can change, and then it becomes more difficult. Further, some databases refer to their units of interest as genes, others as DNA regions, yet others as RNAs or proteins. So especially when you're trying to integrate data, this becomes a challenge, because genes don't uniquely link to proteins or RNA or DNA. So it's important to always recognize what types of identifiers you are working with and to start resolving those associations that are not one-to-one. So here's just a short
overview of different types of gene identifiers. I believe g:Profiler deals with hundreds of different types of gene identifiers. The good ones are those highlighted in red. For example, when you're working with genes, these Ensembl gene identifiers, ENSG followed by a long string of digits, are usually stable; they don't change over time, and neither do Entrez Gene identifiers. On the other hand, HUGO (HGNC) gene nomenclature symbols, BRCA2 is an example over here, sometimes change over time. It's kind of up to researchers to name their genes in publications and rename them, and often there are these nasty examples of things going wrong; I'll have another example here. Software tools sometimes recognize only a handful of types of identifiers, so they may fail to recognize certain identifiers, and that can be problematic, especially if you have a mixed list of different types. So the best way is to deal with that preemptively and try to organize your input gene list into one single type as soon as you can. That can be done with certain tools, but sometimes you won't be able to curate your list automatically 100%, and then you need to review some gene symbols that look weird. You need to avoid one-to-many mappings, cases where one symbol points to many different genes. There's a lot of ambiguity. For example p53, which is probably the most studied gene on earth, also has many names because it has been studied so much. So you had better use the official symbol, or even just one of those database identifiers that are claimed to never change. Don't use Excel, or if you do, be very careful about it. OCT4 is not October 4 in the context of genes, and there are some other examples with September and so on. If you have to use Excel, there's a way to paste your lists as text, and I'm surprised that this is not the default option; you just have to do it every time. So here's a nasty cautionary example where people
published a Nature paper maybe 15 years ago and then published a retraction because they had looked at the wrong gene symbol. So don't do that, huh. They claim in this sort of retraction note that they thought they were analyzing a particular gene, but it turned out to be a different gene entirely. I haven't really followed it, but yes, I mean, there are two different genes called has one, and then the symbol had changed, I think. It's worth looking into if you're really interested; here it just serves as an example of caution. So to practically address this, there are different tools. g:Convert, part of g:Profiler, is one tool where you can select your target type of gene ID and paste in a mixed set of IDs, and it will give you this comprehensive table of what maps to what, along with little descriptions that allow you to check whether it really makes sense. g:Convert is actually based on the Ensembl database, from which we automatically pull these BioMart tables of what identifier maps to what other identifier in order to provide these comprehensive mappings. To be honest, this is not perfect either, because besides official identifiers there's a jungle of aliases and deprecated gene symbols and things that don't really go into the system, and if they did, they would mess things up even more. On the right edge you'll see this example of different human gene identifiers that can be mapped, and this list goes on and on. Besides databases, they could also represent experimental platforms, such as microarray probe set identifiers, and many, many other things. Okay, if you're working on proteins and genes then it's a little easier, because you don't have to worry about alternative splicing or, well, protein isoforms. Then you can map everything to Entrez Gene IDs or official gene symbols. Watch out for official gene symbols, because they are sometimes out of date and they also sometimes change. If you really want 100% coverage you should
manually curate missing mappings using multiple databases. GeneCards is one, for example, where you can look up your gene names. And then remember to format your cells as text before you paste. So, a quick summary of what we've learned so far. Genes and their products have many different identifiers. Some tools handle that automatically, others don't, and others are restricted to certain types of IDs. When you do genomics, you often need to convert one type of ID to another, especially if you do an integrative genomics analysis of multiple different datasets. Use standard, commonly used IDs as soon as possible in order to avoid that chaos later on. The second component that we should talk about is pathways or gene sets or processes, and that can be difficult, or well, not difficult but diverse. The main source of those is Gene Ontology, as well as pathway databases such as Reactome. Yeah? Do you mean translating between the identifiers of that model organism, or between different organisms? To translate between different organisms you first need to find the homologous genes. There is a tool in g:Profiler, which I won't be able to cover, that will map from one species to another. But within a species there are other databases, obviously, so there's a mouse genomics database and a yeast genomics database, and they also deal with different symbols and different database identifiers, and the problems are similar. The g:Profiler tool set works with many different species, so hopefully it applies to those problems. [Another speaker:] Most of the things are identical, but there are some genes that function differently between species, so you have to think about the functionality, and also that the existence of a gene doesn't necessarily mean the existence of the protein. So just doing a simple gene-name-to-gene-name translation is, I'd argue, dodgy, dangerous and best avoided. Sorry, I'll be teaching you tomorrow. Sorry, Jüri. That's okay. Also, a single
gene in human may have multiple copies in another species. I think it's mostly vice versa, so a single gene in the worm may have a whole family of genes in human. Okay, so pathways and other gene function attributes. I have used the word pathways pretty liberally, but in this context it really means any set of genes with some sort of well-defined functional relationship, and all of these sets of genes are available in databases. In terms of pathways, our primary sources are the sets of genes corresponding to biological processes in Gene Ontology, and then various pathway databases. Reactome is a great resource for human pathways, for example, as is KEGG, and different species will have their species-specific pathway databases. Then there are other annotations that do not necessarily correspond to the classical definition of a pathway. For example, in Gene Ontology there's another branch of the tree called molecular function, which is more about biochemistry and how things react with one another, and the third ontology is the cellular component ontology, which is about organelles and cell parts. You could also interpret a gene set as a list of co-located genes, so genes that are all in a particular chromosome region or arm. You could group genes according to their disease association; a very common one is the Cancer Gene Census set of genes. You could look at DNA properties: for example, do the genes share common enhancers or common transcription factor binding sites, or do they share microRNA binding sites in their untranslated regions? And similarly for protein interactions or protein properties, such as whether they carry a particular protein domain. So the same type of simple statistical approach of association applies to many other things besides bona fide pathways. What is the Gene Ontology? The Gene Ontology is a structured dictionary. The technical term is a directed acyclic graph, which almost looks
like a tree, but it's not a tree, because there are additional connections in it. The root of the tree is something very, very general; one important root of Gene Ontology is biological process. The leaves of the tree are very specific processes, which may have only one known gene involved in them, and the intermediate branches represent more general or more specific parts of what we know about biology. It's important to know that there's one Gene Ontology, and it represents the biology of bacteria, the biology of humans, the biology of, I don't know, whales; everything is supposed to be part of that ontology. What then matters are the gene annotations: humans will have certain gene annotations, plants will have others, and bacteria yet others. So here's an example. This is one part of the Gene Ontology. The topmost term is biological process, then it goes further down to more specific terms: there's cellular process, there's cell death, and there's B cell apoptosis at the very bottom, and it goes on further. These are the different terms, and genes get annotated to them. The GO terms are connected by different types of relationships: one term could be a part of another term, or it could be a more specific representation of another term. That describes different levels of detail for gene function. The important technical note here, and the reason this is not a tree, is that terms can have more than one parent or child. So B cell apoptosis is a type of apoptosis, but it's also a type of B cell homeostasis. This is how we describe biology formally. What does GO cover? For pathway enrichment analysis, probably the most important part is the biological process; for example, the biological process shown here is cell division. Other parts of GO are cellular components, so here's a cell with its membranes, for
example, and then molecular functions, the detailed molecular functions such as glucose-6-phosphate isomerase activity. Where do GO terms come from? GO terms are added by human editors at the European Bioinformatics Institute as well as at gene annotation databases, and you'll see that this is a very large and detailed effort, because GO is supposed to represent all kinds of biology out there. There are also expert developments where major branches of GO are rearranged, deleted or created. It's important to note that this is not a static entity; it evolves quite rapidly over time. This graph is a little outdated, but you'll see that cellular component grew over 30 percent between 2012 and 2015. As we learn more about biology, especially in these days of high-throughput omics data, the annotations grow rapidly, but their underlying vocabulary grows rapidly as well. Okay, so annotations are the associations between a particular GO term and a gene of interest. These happen as papers get published and expert curators look at those papers and decide that the researchers found an association between the gene and the process. Genes will have multiple annotations because they belong to multiple processes, found in different studies over time. It's also important to notice that these annotations have various standards of quality. Some gene annotations happen automatically: an algorithm goes through large databases and assigns them. In other cases there's a team of experts reading the paper and deciding what that gene is doing. So not all annotations are born equal, and that's something you may want to pay attention to when you do practical analysis. So why do genes actually have multiple annotations? We had a question earlier about where all the redundancy comes from, and here's the technical reason for it. You see this entire tree that starts with
biological process, and at the bottom, at the most specific level, there's B cell apoptosis. There's Aurora kinase B, or AURKB, which is a particular kinase, and a researcher associated that kinase with B cell apoptosis. Now, according to the rules of this dictionary, associations are automatically assigned to every parent node of B cell apoptosis. So the Aurora kinase B annotation gets propagated all the way up to biological process, and the gene becomes associated with B cell homeostasis, apoptosis, programmed cell death and all these other terms. That's why, when you do a pathway enrichment analysis on a rich dataset, you get hundreds and hundreds of results. All right, I think we went through this already, but the key point is that you have to be aware of the origin of an annotation. For example, human will have fewer annotations with experimental evidence compared to mouse, where we have far more ways to study the organism, due to ethics and so on. Many of these mouse annotations are then propagated to human, because there's evidence from model organisms, and that is encoded in the evidence types shown in pathway analysis tools, which tell you how a gene got annotated to a process. Here's just a brief overview; you don't really need to know them all. There are evidence codes that tell you that a gene was associated with a process thanks to experimental evidence, all the way to knockout experiments and so on. And then there's lower-level evidence called IEA, inferred from electronic annotation, where it was just an algorithm matching up the genes and annotations based on homology, for example, or various other approaches. So you would probably trust the IEAs a little less than, say, functional experiments in cell lines. In g:Profiler, the colored boxes represent the evidence codes, and visually, the evidence codes that are blue and green are weaker evidence and
those in the dark red range are more functional, more mechanistic experiments. So when you perform this type of analysis you'll see this screen, and it will tell you whether a result is mostly red and mostly reliable, or, if it has a lot of blue, that it's probably derived from computational analysis. Gene Ontology and its annotations actually cover a surprisingly large number of genomes, and many of them are annotated through homologous genes. Besides human, we have information about all major eukaryotic model organisms as well as bacterial and parasite species, and that information is curated at species-specific databases, consortia and so on. As I mentioned earlier, coverage is highly variable. This is a chart that shows you how many gene annotations there are per species: on the far left, the largest is human, obviously, then mouse and rat, and at the other end there are yeast and E. coli. You can also see that there's a large number of experimental annotations but a far larger number of non-experimental ones, telling you that not all of that information is very high quality. And there's a long list of contributing databases that generate these gene annotations and also contribute to the Gene Ontology structure itself. The GO resources are freely available, so you're welcome to download either the tree structure as a file or any of the annotation files, and there's a very large community of bioinformaticians developing tools and approaches for these GO analysis pipelines. One of the first studies in my lab asked whether Gene Ontology annotations as such have a best-before date, and it turned out that they do.
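As a rough illustration of the propagation rule and the evidence filtering discussed above, here is a minimal Python sketch. It is not how any GO tool is actually implemented. The term names follow the example in the talk, but the exact wiring of the toy graph is simplified (the "homeostatic process" link is an assumption), and the evidence code attached to AURKB and the gene GENE_X are made up for illustration.

```python
# Toy fragment of the GO graph: each term maps to its direct parents.
# Simplified for illustration; "homeostatic process" is an assumed link.
PARENTS = {
    "B cell apoptosis": ["apoptosis", "B cell homeostasis"],
    "apoptosis": ["programmed cell death"],
    "programmed cell death": ["cell death"],
    "cell death": ["cellular process"],
    "cellular process": ["biological process"],
    "B cell homeostasis": ["homeostatic process"],
    "homeostatic process": ["biological process"],
    "biological process": [],
}

# (gene, term, evidence code). AURKB is the example from the talk; its
# evidence code and the GENE_X entry are hypothetical.
ANNOTATIONS = [
    ("AURKB", "B cell apoptosis", "IMP"),
    ("GENE_X", "apoptosis", "IEA"),
]


def ancestors(term):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def propagate(annotations, drop_electronic=False):
    """Map each term to the genes annotated to it directly or via a child term."""
    genes_by_term = {term: set() for term in PARENTS}
    for gene, term, evidence in annotations:
        if drop_electronic and evidence == "IEA":
            continue  # skip annotations inferred from electronic annotation
        for target in {term} | ancestors(term):
            genes_by_term[target].add(gene)
    return genes_by_term


for term, genes in propagate(ANNOTATIONS).items():
    print(term, sorted(genes))
```

Because B cell apoptosis has two parents, the single AURKB annotation ends up on every term in both branches, which is the redundancy described above. Calling `propagate(ANNOTATIONS, drop_electronic=True)` removes the electronically inferred GENE_X annotation from all terms.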
One of the most commonly used tools in recent years is called DAVID, and I'm sure many of you have heard of it. It was known that DAVID was out of date well before we published this paper. Basically, we went to PubMed and counted the different GO analysis tools and how frequently they were cited, and DAVID came up with by far the largest number of citations. In this plot we related the number of citations to the most recent update of each piece of software. It turned out that when DAVID was cited by thousands of papers in 2015, the information going into DAVID was from 2010, so five years old or even more at that point. We wondered how much effect that really has on the interpretability of the results. It is only fair to note that when we put our paper on bioRxiv, DAVID suddenly woke up and was updated very quickly. So I think DAVID's last update is no longer seven years old, it has been updated more recently, but whether you're using DAVID or not, you should still pay attention to when the software was last updated. Pardon? GSEA, that's a good point. For GSEA you need to provide your own gene sets. All these tools over here use their own copy of Gene Ontology on the web; when you run GSEA, you have to download a dataset and put it into GSEA, so your analysis will depend on how fresh the data you are using is. Right, there's a resource where you go and download them, and to be honest, I think the MSigDB resource you referred to is pretty frequently updated. Yeah, I'll just jump in about MSigDB: they're waiting for funding, so they're actually not as updated as they usually are.
All right, so there's a bit of a holding pattern until new funding comes along. And, Juri, I can go over this later, but MSigDB is a highly used resource, so it's unusual that it's actually falling into the same category, until they get the funding. So maybe it's one or two years behind, but it's definitely not lagging by five or ten years. So obviously this is not only because tool developers are lazy; it's also because tool developers run out of funding or the graduate student leaves. There are a lot of different reasons why tools are not updated, and that becomes a whole different discussion. Yes, but there's another person developing it, so that's fine. Right, so as a user you need to pay attention, and there's quantitative evidence for why. We took annotations that were at that point six years old, performed a pathway enrichment analysis of a particular set of genes from brain cancer, and asked how much information an outdated database is missing. That amounted to 75 percent: only one out of four pathways will be found if you use the earlier, outdated dataset, while if you use a very recent dataset, from 2018, you'll find 100 percent. For some of those it's very clear why they wouldn't come out of the earlier databases. Take druggable pathways: there were no drugs to target those pathways back in 2010, but now there are, so they show up in a pathway enrichment analysis. There are several reasons for all this. We have new omics technologies that update the Gene Ontology annotations very rapidly. There's also a lot more effort to collect them, because there's more data, and therefore it needs to be analyzed more often. And there's the technical reason that as the GO tree gets wider and wider, each gene will naturally accumulate more annotations, because of the rules that propagate annotations up to the top of the tree. More
about pathway databases. Pathway Commons is a resource developed in Toronto that has aggregated information from many different pathway databases, so it becomes a super-resource where you can look up sets of genes and see where they're annotated in the individual databases. As I mentioned before, this analysis is very general. It doesn't need to apply only to biological processes and molecular pathways; you can use all these other annotations, and there's a large number of them. That doesn't mean you should consider them all for every task, because you'll overwhelm yourself with results. The other reason is that if you test too many annotations at the same time, the false discovery rate correction becomes much more stringent and you may lose results. So when you look at these other types of annotations, we usually recommend that you look at pathways first and see what comes up, and then maybe selectively look at some of the other annotations, depending on what you're after. For a proteomics experiment it maybe doesn't make sense to look at transcription factors, for example, or vice versa, and if you don't really know much about microRNAs, then why consider them in the first round? Just try looking at, say, Reactome pathways and Gene Ontology biological processes first, and then dig deeper into the other annotations, whose quality is also variable. So what have we learned? Pathways and other gene attributes come from databases, and databases need to be up to date. Gene Ontology is one of the major resources. Gene Ontology itself is like a dictionary of biology, and this dictionary has structure. Gene annotations, or links to that dictionary, are contributed by many groups. Each gene will probably have multiple annotations, sometimes a large number, and those annotations differ in quality because sometimes they're human-curated and sometimes they're machine-generated. Some genomes are more annotated than others: human
researchers are privileged in that respect, while others may have more trouble, especially if it's a very exotic animal that no one besides you has ever sequenced. There are also representations of GO that are not as redundant: there is a version of GO called GO slim which can be used. Since annotation quality is variable, some tools allow you to filter annotations; g:Profiler, for example, allows you to filter out electronic annotations to get a more confident snapshot of the pathways that are part of your gene list. And many gene attributes are available. One example is the Ensembl database, which is updated, I think, every four months, and Ensembl BioMart allows you to download large tables that have all the information if you need it. So here's the workflow that we showed earlier. You collect some data, you perform statistical analysis on that data, you generate the gene list, and then you perform pathway enrichment analysis that allows you to learn about the mechanism, visualize, and publish. But it's often not that simple. First of all, the first layer, collecting data, can mean a myriad of things. Analyzing that data is also not a single step; it depends on what you're actually analyzing. But the fact that you generate a gene list is very often universal. When you do many of these omics experiments, you end up with a list of genes or proteins, or maybe genomic loci, but that also becomes a list of genes eventually. There are many different tools and approaches to analyze that data. I think a common rule of thumb is that a very simple analysis, such as a gene list analysis, also makes the fewest assumptions, while if you want to do something very complex, say modeling gene expression levels in a pathway, then you make more assumptions, because you have to know the pathway very well and your observational data also needs to be high quality. So the
things that we discussed today are pretty generally applicable, while you can always go more complex. Finally, if you're interested in pathway enrichment analysis of omics data, we recently pre-published on bioRxiv a very comprehensive protocol paper about how to perform this type of analysis, starting from a gene list or gene set all the way to visualization with EnrichmentMap. It covers both GSEA and g:Profiler; some of it is R code and some of it is a step-by-step, clickable manual. It's close to 100 pages, I believe, and it's currently in revision, so after I finish this lecture I'll go back to looking at the text. But it's already available, so you should definitely have a look if you're interested.
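To make the earlier point about multiple testing concrete, here is a small Python sketch of why testing many annotation types at once can cost you results. It uses the Benjamini-Hochberg procedure as one common way of controlling the false discovery rate; the talk does not say which correction a given tool applies, and all p-values below are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def count_significant(pvalues, alpha=0.05):
    """Count tests whose adjusted p-value falls below alpha."""
    return sum(q < alpha for q in benjamini_hochberg(pvalues))


# Hypothetical p-values for three pathways that look interesting.
interesting = [0.001, 0.008, 0.02]

# Pathways only: 10 gene sets tested in total.
pathways_only = interesting + [0.5] * 7

# Pathways plus many other annotation types: 200 gene sets tested.
everything = interesting + [0.5] * 197

print(count_significant(pathways_only))
print(count_significant(everything))
```

The three interesting p-values are identical in both runs; only the number of additional, uninformative tests changes. With these made-up numbers, two pathways pass at 0.05 in the focused analysis and none pass when 200 gene sets are tested together, which is the argument for looking at pathways first and adding other annotations selectively.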