So the first lecture is more about the background of pathway and network analysis and why you would want to do it. The learning objectives are the following. You want to identify the situations where pathway and network analysis would be useful, and then understand the two main components: first the gene sets, the functional knowledge about genes, and then the gene list, something that you have derived from your experiment. Bioinformatics is oftentimes about understanding gene identifiers, and gene identifiers are really important in pathway and network analysis. You want to make sure that you work on your gene of interest, not something whose identifier just sounds really similar. And then we also want to understand how annotated gene sets, such as biological processes or pathways, are derived from the literature and where we get them originally. I'm sure that in the age of omics every one of you has encountered this particular situation: you've done this new fancy screen, or maybe some next-generation sequencing, proteomics, what have you, and you produced a large list of genes or proteins. Maybe it's in the dozens, and then you're lucky and you can do some sort of literature research. But it could also be in the hundreds or thousands, and then there's no meaningful way of manually understanding hundreds or thousands of genes, unless you're really keen. As the typical experiment we use the simplest and most common case here, mRNA interpretation, but keep in mind that pathway and network analysis is applicable to many other omics data sets. So for example, you may have done a microarray experiment, or these days an RNA-seq experiment, profiled a lot of samples, and performed some initial statistical analysis, maybe ranking your genes or clustering your genes. You have a gene list, and then you also have a big question mark: what do you do next?
And then you really want to use some automated techniques in order to understand your gene list, to interpret your experiment through that gene list, and eventually to publish something interesting. There's a myriad of analytical tools and approaches that all go under the umbrella of pathway and network analysis and that essentially deal with that situation. So having your computational analysis of experimental data and your gene list at hand, you will use automated tools to analyze the existing knowledge about these genes, their pathways, networks and processes, in order to interpret the gene list, find out something about the mechanism, and tell something new to the scientific audience. The classical approach to this involves the PubMed database. You have your ranked gene list and you go through it one by one. Some of those genes you probably know because you worked on the problem before, and some of those genes are novel. You can totally go to the PubMed database and start going gene by gene, and then you will surely be buried in the scientific literature very soon, because each gene will have maybe dozens of papers, or maybe hundreds or thousands of papers if it's a more commonly studied gene. So you don't want to do that, and instead you want to use automated approaches. And then this is a large overview slide of how complicated the problem really is. As you all have access to these genome-wide or proteome-wide technologies, you can essentially query snapshots of cells, of everything that's going on, within certain limitations, and then you want to take advantage of decades of existing literature about these genes and proteins to be able to explain what you see in the data. The various parts that go into that analytical pipeline would be previous experiments and predictions, databases of biological knowledge, literature, and also exploration. So maybe you have good collaborators to talk to about your experiment.
Pathway and network analysis, very generally, is any analysis that involves pathway and network information. We'll try to define pathways and networks later on, but as you'll find, many people have their own definitions, so it's sort of open. The most popular type of this analysis is pathway enrichment analysis, but many others are useful. Enrichment analysis essentially deals less with the interactions between the genes, and it treats each pathway as a set of genes. That often causes a little bit of confusion, because when people think about pathways they always have this diagram of balls and arrows in their head. The simplest type of pathway enrichment analysis really deals with pathways as sets of genes. We'll go through that a little bit today, and tomorrow we have lectures where people talk more about the network aspect. So pathways and networks, how do you really compare them? This is one of the potential take-home messages or ways of interpreting pathway and network data. This diagram shows you two ways. It always centers on the EGFR protein, but then we can talk about the EGFR-related pathway or the EGFR-related network. One interpretation is the following. Pathways are collections of highly detailed information that have been collected over time in careful small-scale experiments, and interactions in pathways are often defined in a very specific molecular way. For example, protein A phosphorylates protein B under certain cellular conditions. This is pathway information. Pathway information is often more restricted, because not all of the relevant experiments have been done in such a careful way, but then again it's pretty high confidence. In contrast, network information is slightly different.
So here you see different types of edges between these circles. The circles are proteins, and the edges could be inhibiting edges or activating edges or maybe just physical interaction edges. The information in networks is often more vague, and often this little piece of a network is part of a much larger, much more complicated network. So you'll see that in networks you have way more information, but that information is less reliable. Oftentimes these networks are constructed from large-scale experiments, maybe proteomics or genomics, where every individual edge wouldn't be scrutinized as carefully. Okay, so types of pathway and network analysis. I'm sorry, some of this slide has been cut off, but in general these are the three major types that we can talk about. The simplest analysis is enrichment of gene sets. Instead of interpreting a particular pathway as a complicated set of interactions, we could just say we only care about the genes or proteins in that pathway, and let's see how many of these genes or proteins are present in our experiment. The second one is de novo subnetwork construction and discovery. As we have a gene list of interest, can we build new networks within that gene list that represent maybe some sort of biological knowledge? And then the third, most complicated area of tools over here is pathway-based modeling, in which case we would have a certain pathway as a scaffold and then we would use our experimental data to verify whether that scaffold holds. For example, can we build inhibiting and activating edges using the data that we have derived from another experiment and also the pathways that we know from public literature? Okay, so for example, in the context of cancer genome analysis, the applications of these three different techniques would be the following. If we just do pathway enrichment analysis, then we could ask what biological processes and pathways seem to be altered in this specific type of cancer.
Which cancer genes are involved in these pathways, and is there a statistically significant overrepresentation of these genes? In the second one: do we detect any de novo or novel pathways that are altered in cancer, and do these de novo pathways correlate with some sort of clinically relevant cancer subtypes? And then the third case, which uses the most information and also depends on the best quality of the cancer pathway data, asks whether there are pathway activities altered in a particular patient and whether there are drug-targetable pathways in this patient. So here we model the existing experimental information together with the pathway structure to make sure that the pathway and the experimental information are in agreement. In this lecture and the next one we really mostly talk about the simplest kind of pathway enrichment analysis, which just deals with gene sets. It is the simplest, but it's also the most broadly applicable and it has the best coverage of data, so probably when you're doing your practical analysis of your own data this is where you want to start. However, I totally encourage you to look at the other ways of analyzing pathway data as well. So a little bit of motivation: why do we want to analyze pathway data instead of just focusing on single genes? Single-gene analysis is also very important, but you may want to move to the pathway level for these reasons. For example, pathway analysis improves statistical power. When you analyze your data gene by gene or protein by protein, you likely analyze tens of thousands of units, right, or maybe if you do it proteome-wide it could be hundreds of thousands. However, when you look at pathways you can restrict your search space to a narrower set of maybe hundreds or thousands of pathways, and therefore you don't subject your data to so much multiple testing correction, and maybe you'll find some more interesting results that are near the threshold.
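The multiple-testing argument can be made concrete with a minimal sketch. It assumes a Bonferroni correction (only one of several possible corrections) and purely illustrative counts of 20,000 gene-level tests versus 2,000 pathway-level tests; neither number comes from a real data set, and the function name is made up for this example.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Illustrative assumption: 20,000 genes tested one by one.
gene_level_cutoff = bonferroni_threshold(0.05, 20_000)

# Illustrative assumption: 2,000 pathways tested instead.
pathway_level_cutoff = bonferroni_threshold(0.05, 2_000)

# With ten times fewer tests, the per-test cutoff is ten times less strict.
print(gene_level_cutoff, pathway_level_cutoff)
```

A result that sits just below the gene-level cutoff's reach may still pass at the pathway level, which is the "near the threshold" effect described above.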
Then it's more reproducible as well, because instead of looking at single genes across experiments, for example, you look at gene signatures across experiments, and by virtue of having more genes you're more likely to capture the same things even if your experiments are a little bit variable. Also, pathway information is usually easier to interpret, because instead of looking at an alphabet soup of gene symbols, which can be really confusing and boring sometimes, you look at familiar concepts from a biology textbook such as cell cycle or apoptosis or differentiation, and interpreting your data like that is better because you get to your hypothesis more quickly. Also, ideally, pathway enrichment analysis or any pathway-related approach will help you in identifying the biological mechanism. Perhaps you'll be able to identify different regulators of your genes, for instance, or find out the reason for a particular upregulation of a pathway by interpreting your initial experimental design. And we can also predict new roles for previously unknown genes; this is called the guilt by association principle. Genes never act alone, and if you see a cluster of famous genes in your data, perhaps you can associate some of the function of these famous genes with some other genes that have previously not been described that well. So if an unknown gene seems to hang out with many known genes, then maybe it's just a previously undescribed member of the pathway. But before you go into the pathway analysis, it's actually essential that you treat your experimental data really carefully, because in pathway analysis the principle of garbage in, garbage out holds very well. If your data comes with confounding factors or there are problems in the data, then you will receive apparently very interesting results from the pathway analysis that could actually be caused by experimental artifacts or data production artifacts.
So an example that my postdoctoral advisor likes to tell is that they were analyzing data from mouse tissue and they saw a signal from apoptosis coming up really strongly, which was interesting because apparently that was related to cancer genomics and, you know, it's one of the hallmarks of cancer. But what turned out instead was that some of the control samples had been left standing on the benchtop for a little longer and they had started to die. So the apoptosis signal was not really an effect of something cancerous going on in the cells but rather an effect of the experimental processing. Always keep something like that in mind, and anything you do with data normalization will actually contribute to pathway enrichment analysis downstream. Okay, so this is generally the main pipeline. You collect your genomics data or any other omics data. You carefully normalize and score and interpret it on a single-gene level, and then you generate the gene list. This gene list can be longer, it can be shorter. Oftentimes ranked gene lists are better than flat gene lists. And then you broadly learn about the underlying mechanism using pathway and network analysis. That involves statistical analysis, because you want to make sure that the signals that you're getting are statistically supported, but it also involves visualization, because we humans, as visual creatures, tend to learn very much from seeing images. Then we can also feed that pathway information back into our literature search to better understand what the genes might be doing, and then ideally you'll find out a new exciting model about your experiment and publish it in a high-impact journal. Right, so what is pathway enrichment analysis? It can actually be visualized with this really simple Venn diagram.
On the one hand you'll have your gene list from the experiment that you're working with; for example, these could be genes that were down-regulated in a drug-sensitive brain cancer cell line. On the other hand you have annotated genes from various databases, and then instead of one Venn diagram circle there are hundreds or potentially thousands of them, each one of them representing a particular facet of biology that people have curated over time. For example, that could be a gene set known as neurotransmitter signaling, so that would contain all the ion channels and the receptors and things. And then you run statistical tests, usually more than one, and you ask: does my gene list contain more genes of that ion channel family than we would expect by random chance? The standard test here is Fisher's exact test, although other tests are potentially useful in various scenarios. If that test returns a statistically significant p-value, then you have reason to believe that the genes in your list that fell into the neurotransmitter signaling family were not there because of random chance, but because they explain some sort of biology. And then you build a hypothesis that maybe drug sensitivity in brain cancer has something to do with reduced neurotransmitter signaling, and you proceed with checking the literature for whether something like that is already known. Many problems obviously emerge: these pathways are not of equal quality, and when you do many, many tests you're more likely to find something that merely looks important, but we'll talk about these things in the coming lectures. Okay, so practically speaking, g:Profiler is one of those tools that allows you to do pathway enrichment analysis. It accepts the gene list and then spits out these longer or shorter lists of different pathways.
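The test described above can be sketched in a few lines. This is a minimal illustration, not how g:Profiler or any particular tool implements it: it computes the one-sided Fisher's exact test p-value as the upper tail of the hypergeometric distribution, using only the Python standard library. The function name and the toy gene names (`g0` to `g19`) are made up for the example.

```python
from math import comb


def enrichment_pvalue(gene_list, gene_set, background):
    """One-sided Fisher's exact test: P(overlap >= observed) under random sampling."""
    background = set(background)
    gene_list = set(gene_list) & background   # genes outside the background are ignored
    gene_set = set(gene_set) & background
    N = len(background)                       # all genes that could have been detected
    K = len(gene_set)                         # genes annotated to the pathway
    n = len(gene_list)                        # genes in the experimental list
    k = len(gene_list & gene_set)             # observed overlap
    # Sum hypergeometric probabilities for overlaps of size k, k+1, ..., min(K, n).
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Toy data: 20 background genes, a 5-gene pathway, a 4-gene list sharing 3 genes.
background = [f"g{i}" for i in range(20)]
pathway = ["g0", "g1", "g2", "g3", "g4"]
my_list = ["g0", "g1", "g2", "g10"]

p = enrichment_pvalue(my_list, pathway, background)
print(p)  # 155/4845, about 0.032
```

In a real analysis this test is repeated for every gene set, which is why the resulting p-values need multiple testing correction.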
I have a conflict of interest because this is my PhD work, and there will be more slides about how to use it and then we'll have a tutorial as well. Basically you'll see that you paste your gene list up here somewhere and then as a result you receive the list of pathways. Pathways are often hierarchically related to each other. You'll get the information about the genes involved in these pathways, you get an enrichment p-value, and this little colorful grid tells you the type of evidence that supports each gene in that pathway. The problem with pathway analysis, and I mentioned hierarchy a little bit and I'll get back to that soon, is that anytime you have a very meaningful gene list with lots of rich information in it, you will have many pathways that come out of the statistical analysis. So this is just an example: lots of pathways here, and that list continues all the way to the floor and beyond. The problem is that they are not distinct, they're not independent. For example, here you'll see a lot of pathways of the same kind, information response. They're phrased slightly differently, but they generally represent the same underlying facet of biology. So how do we deal with that? Obviously visualization is one option to proceed with this analysis. This tool was developed in the Bader lab, where I did my postdoc, so I've used it many times and find it a very good, intuitive way of interpreting pathways. This is called an enrichment map. The input to this enrichment map is a list of significantly enriched pathways, so the statistical analysis has already been performed, and then they have been laid out in this network topology where each node or circle represents a pathway, and every time a pathway is connected to another pathway that means that they have many genes in common. So if two pathways share many genes, then they probably also share many biological features.
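The idea of connecting pathways that share many genes can be sketched as follows. This is a simplified illustration rather than the EnrichmentMap app's actual implementation: it assumes the Jaccard index as the similarity measure and an arbitrary cutoff of 0.25, and the three pathways and their gene members are toy data chosen for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared genes divided by all genes in either set."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def enrichment_map_edges(pathways, threshold=0.25):
    """Return (pathway_a, pathway_b, similarity) for every pair above the cutoff."""
    edges = []
    for name_a, name_b in combinations(sorted(pathways), 2):
        score = jaccard(pathways[name_a], pathways[name_b])
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


# Toy input: significantly enriched pathways and their member genes.
pathways = {
    "inflammatory response": {"IL6", "TNF", "IL1B", "CXCL8"},
    "response to cytokine": {"IL6", "TNF", "IL1B", "STAT3"},
    "cell cycle": {"CDK1", "CCNB1", "TP53"},
}

edges = enrichment_map_edges(pathways)
print(edges)  # the two overlapping pathways are linked; "cell cycle" stays separate
```

Pathways joined by edges form the clusters that are then read as a single theme instead of as separate rows in a long list.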
So that network analysis allows you to compress this very long list of sort of similar pathways into a network where all these similar pathways are grouped together into clusters in the network space, and that helps you to visualize and interpret the data in a much better way, because you don't go through the list, you go through these network modules. Just a little case study about that. Ependymoma is a pediatric and adult brain and nervous system cancer which is currently diagnosed and classified primarily through pathology, so no genomics has been involved so far. However, I was involved in a study where the researchers collected various kinds of omics data about ependymomas, including methylation and gene expression, and then they performed various kinds of clustering of that data to figure out that ependymoma is actually not a single disease but is comprised of nine different subtypes. Whether it's nine or eight or ten is arguable, but it shows strong heterogeneity within the tumor type, and the subtypes also have different clinical features, so one of the subtypes performs better than the others, so it's definitely helpful for diagnosis and clinical treatment. Now the task that we were involved in was basically to interpret the biology of these different subtypes in contrast to each other, and obviously we performed pathway enrichment analysis with an enrichment map. This figure comes from the paper that was published, and this enrichment map basically tells you what the common features and distinct features of these various ependymoma subtypes are. I'm showing it to you here because it's a good example of how to use enrichment maps to interpret complex data.
So instead of looking at one gene list, here we were looking at nine gene lists, and each one of those nine gene lists is represented with a different color. You'll see that, you know, the red subtype seems to have some associations with cell cycle, and the blue subtype has some associations with neurotransmitter signaling. There are specific signaling pathways that are representative of one or more subtypes of ependymoma over here, and so on, so this rich visual representation allows you to understand the heterogeneity of that particular type of brain cancer. What did you actually build? This is the enrichment map. Is it the same as the previous? Yeah. But I guess with different settings and stuff. Yes. So the trick here is that in Cytoscape, yeah, okay, so the question was how did I build this map, and the answer is that Cytoscape has an app called Enrichment Map, which will be part of one of the tutorials today, and the only major addition here is that Cytoscape has the ability to put various charts on nodes, and in this case it's a pie chart. If it's a single color it's a single-color pie chart, but you can see that sometimes this is a pie chart of multiple colors of equal sizes, and each color represents one cancer type or subtype. So it's very well doable. You just need to do a few extra tricks. So that is an example or case study of how to interpret gene expression and methylation and whatever other analysis using pathway and network approaches, especially pathway enrichment analysis. As I mentioned in the beginning, there are two major components of this pathway enrichment analysis. One of them is the gene list and the other one is a collection of gene sets representing various processes and pathways.
So first, where do gene lists come from? Actually, you guys will know better where gene lists come from, because I'm assuming that every one of you is here because you want to analyze your own data, so you are responsible for generating your own gene lists. Just a few examples. You do molecular profiling of some sort of experiment using omics technologies. Very commonly people do RNA sequencing of cases and controls, for example. You can do proteomic studies, for example using mass spectrometry, or BioID, which is very common in Toronto these days. In the first case you identify a gene list, and sometimes it's just a fixed gene list. You draw a threshold and say these are my interesting genes. I want to analyze these interesting genes compared to the rest of the genes. More sophisticated approaches also give you gene lists with values, and then you can rank them. My top number one gene has the strongest score and then the scores start decreasing. Actually, most of the time I would recommend using ranked gene lists in pathway analysis, because that gives more information to the analysis procedure and you'll be able to capture more subtle effects. And then you can also apply various kinds of ranking and clustering algorithms, and they might give you gene lists of different sizes. On the other hand, you can also study interactions. A classical case is protein-protein interactions. You're interested in a particular protein. You have done a screen to measure all of its interactions, and then you'll have these interactions as your search space for pathway analysis. In interaction studies, many times you won't have a clear-cut ranking. You may just have the interactors of the protein and the non-interactors of the protein. So in that case you may want to deal with just a flat list. Other ways to get gene lists are genetic screens, for example knockout libraries; CRISPR is the big new technology these days.
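The difference between a flat and a ranked gene list can be shown with a small sketch. The gene names, scores, and the cutoff of 2.0 are all made up for illustration; in practice the score might be a fold change or a signed test statistic from your own analysis.

```python
# Hypothetical per-gene scores from an upstream differential analysis.
scores = {"GENE_A": 5.2, "GENE_B": -4.1, "GENE_C": 0.3, "GENE_D": 2.7}


def flat_list(scores, cutoff):
    """Flat list: every gene whose absolute score passes a fixed threshold."""
    return sorted(gene for gene, score in scores.items() if abs(score) >= cutoff)


def ranked_list(scores):
    """Ranked list: all genes, ordered from the highest score to the lowest."""
    return sorted(scores, key=lambda gene: scores[gene], reverse=True)


interesting = flat_list(scores, cutoff=2.0)   # membership only, order carries no meaning
ranking = ranked_list(scores)                 # keeps every gene and its relative position

print(interesting)
print(ranking)
```

The flat list discards GENE_C entirely and treats the other three as equals, while the ranked list keeps the ordering that lets an analysis pick up weaker but consistent effects.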
So you can, for example, study the pathways involved in essential genes. You can do genome-wide association studies and look at single nucleotide variants. When you analyze cancer genomes you find driver genes and perhaps want to identify driver pathways that have many mutations, and so on and so forth. So gene lists come from a variety of places, and the beauty of pathway enrichment analysis is that most of the time the analysis is the same regardless of the omics data that goes in. What you have to care about is gene identifiers, for example. What do gene lists mean? This is also a question that every one of you will know better than I do for your particular experiment. But essentially, when you do an omics experiment you want your experimental results to reflect some sort of question or function that you're studying. So maybe you are interested in a particular set of genes. Maybe you're looking at genes that encode protein kinases. Maybe you're looking at a particular cell type or disease. What you have to care about sometimes is that when you restrict your search space very stringently, the pathway analysis needs to be tweaked somewhat. One example is that if your gene list will a priori only contain protein kinases, then you have to adjust your pathway analysis such that it treats the protein kinome as your background. That will come up in a few later lectures as well. But the take-home message here is that if you have a genome-wide analysis and genome-wide results, so any gene in your list could potentially be coming from anywhere in the genome, you're safe and the standard pathway analysis holds. However, if you restrict yourself to, say, 5,000 genes and the remaining 13,000 will never have a signal, then you have to worry about your pathway analysis a little, because the statistics would be very much biased towards your search space.
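The effect of the background can be illustrated with a short sketch that reuses the 5,000 and 13,000 figures from above. Everything else is an assumption for illustration: a pathway of 100 genes that all lie within the measurable 5,000, a gene list of 200, an observed overlap of 10, and a one-sided hypergeometric test as the enrichment statistic.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(overlap >= k) when n genes are drawn from N, of which K are in the pathway."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


pathway_size = 100   # assumed: every pathway gene is among the measurable genes
list_size = 200      # assumed size of the experimental gene list
overlap = 10         # assumed number of list genes annotated to the pathway

# Wrong background: the whole genome, although 13,000 genes could never give a signal.
p_genome = hypergeom_tail(5_000 + 13_000, pathway_size, list_size, overlap)

# Appropriate background: only the 5,000 genes that the experiment could detect.
p_restricted = hypergeom_tail(5_000, pathway_size, list_size, overlap)

print(p_genome, p_restricted)
```

Against the whole genome the expected overlap is only about one gene, so ten looks far more surprising than it does against the restricted background, where about four are expected by chance. That inflation is the bias towards the search space mentioned above.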
So the biological questions of pathway analysis: what do you actually want to accomplish with a gene list of interest? This had better be part of your experimental design, because you cannot always rescue a bad experimental design at the level of pathway analysis. Usually you want to summarize biological processes or other aspects of gene function in your list of interest. Sometimes you want to perform differential analysis: you have samples from diseased individuals and samples from healthy individuals, and you want to compare one versus the rest. Sometimes you have a time series and you want to figure out which genes seem to be upregulated towards the end of your time series. Other times you may want to find a controller for a process; for example, you're looking to find a transcription factor that acts as a master regulator of the genes that you discovered, or a microRNA. Sometimes you're interested in detecting whole new pathways, or at least finding new members of a previously known pathway, and through that you can discover new gene function using the guilt by association principle. And sometimes you want to correlate a disease phenotype with the pathways that you find and maybe prioritize candidate genes for further experimental validation. So, the biological answers that can come out of your analysis: in pathway enrichment analysis, you summarize the functions that you get and potentially compare the functions that you get from one experiment and another experiment, like I showed you in the ependymoma example. In network analysis you can predict gene function for new genes or maybe even find new pathways or functional modules. And in regulatory network analysis you're interested in how these genes came to be regulated, or whether there's a master regulator that controls these genes of interest.
So as I mentioned, there are these two major components that go into pathway enrichment analysis: the gene list that you have constructed from your previous analysis, and pathways that come from the scientific literature. And then there's not a black box but an orange box over here which will give you enriched pathways, and there are various tools that do it in different ways. I will talk about g:Profiler, but later on we'll talk about GSEA. These are somewhat different approaches with the same underlying goal. The things that you need to worry about are basically: how do you make sure that the gene identifiers are good, where do you get the pathway information, and how do you select pathway information for your analysis? So first about gene lists, and specifically gene and protein identifiers. Identifiers are ideally unique, stable names or numbers that help track database records. For example, you know from the real world that everyone has a social insurance number, and government agencies use these numbers to track us down. Genes similarly have various numbers and identifiers, but the main problem is that there are many different databases. Pardon? Yes! Well, tracking down, for example, did anyone receive tax forms recently? All right, so social insurance numbers are unique, and that's one of the primary numbers we need to worry about. The phone number is another way. The problem with genes is that there are many different databases, so genes will have many different names, and besides databases they will also have names that people have given them, and these names change over time, so that causes a lot of trouble. What you ideally want to do is select a gene identifier that is stable and will be stable for times to come, to be able to future-proof your analysis.
But that is kind of a problem because, you know, some databases commit to having stable identifiers and others don't, and if you only deal with things like gene symbols then these are potentially traps, because gene symbols will change. They're assigned by a committee, and the committee seems to change their mind all the time, so if you take a gene list from, say, five years ago, there's a good chance that, you know, five or ten percent of these genes have changed their names. In a good scenario you'll just get an error from the pathway database saying that this gene doesn't exist, but in a bad scenario you'll get a different gene. So here are some different types of gene identifiers commonly used for human and other species, and the red ones are recommended. Ensembl IDs are relatively stable and Entrez Gene IDs are stable, but symbols like p53, whose official symbol is TP53, are often not stable, because people find out a new function and assign some sort of short name to that gene. The problem with so many IDs is that somewhere there's a master table that connects them all, but not all pathway analysis software will have that master table, so pathway analysis software will maybe deal with only a handful of different gene identifier types and ignore the rest. And then you have different use cases. If you have one gene it's easy, you just search until you find the right identifier, but if you have a hundred or a thousand genes you don't really want to do that one by one, so in that case you want to use software that does it for you. But in most cases you still have to look at the remaining genes that didn't find an identifier and map them one by one if you want to achieve perfect coverage of a gene list. So when you map different identifiers, one of the main challenges is to map them correctly. Be aware of genes that have multiple identifiers of the same sort, because they could point to entirely different genes.
Here's an example of the problem with the most studied gene, p53, which is called TP53 or p53 or Trp53 or all these other things depending on the organism. Another aspect is that if you happen to work with spreadsheets, then Excel has this nice feature of converting anything that looks like a date to a date, and there are many genes, or at least a handful of genes, that look like dates. So OCT4 is a stem cell regulator that consistently gets converted into October 4, and there have been systematic studies in the recent literature that have pointed out that a very large fraction of high-impact publications have errors like that embedded in their supplementary tables. As I mentioned before, you'll probably have problems reaching 100% coverage when you have these large omics data sets, so when you do pathway enrichment analysis, pay attention to the error messages, and if there's only a small number of them, go back to common databases, cross-check these references, and pick the one that's the right one. Here's one embarrassing example of a Nature study from about 10 or 15 years ago where they had this whole big study about an interesting regulator, an interesting gene, only to find out later that they had confused the gene symbol and they were talking about an entirely different gene. So the nomenclature is inconsistent and it's difficult to work with, especially because some genes have historically been called something else and that something else is now assigned to a new gene, so it's a big mix.
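A quick check for this kind of damage can be sketched as follows. It assumes that the spreadsheet rendered the converted symbols in the day-month form (for example `4-Oct`); the exact rendering depends on locale and cell format, so the pattern is an assumption and may need adjusting. The function name is made up for this example.

```python
import re

# Assumed rendering of a date-converted symbol: one or two digits, a dash,
# and a three-letter English month abbreviation, such as "4-Oct" or "2-Sep".
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def find_date_like_entries(symbols):
    """Return entries of a gene list that look like dates rather than gene symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


# Toy column as it might look after a round trip through a spreadsheet.
column = ["TP53", "4-Oct", "EGFR", "2-Sep"]
suspicious = find_date_like_entries(column)
print(suspicious)
```

Entries flagged this way cannot be repaired automatically with certainty, so they are best traced back to the original file that was created before the spreadsheet touched it.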
Fortunately there are various systematic ID mapping services. One of them is incorporated in the g:Profiler set; it's called g:Convert. This essentially downloads the entire Ensembl BioMart database, picks up all the identifiers in that BioMart, and constructs these big master tables to be able to map a mixture of different IDs from various databases to any given single database, and there are probably hundreds of different identifier types across the species that are covered, so there's just a list over here. But regardless of this automated procedure, it will still fail to find some symbols, because people have been using various aliases over time. So in this case pay attention to the error messages, and if there are many, just go and double-check them, preferably in multiple different databases, until you find the right symbol. There's a helpful feature in g:Profiler that will tell you that it found ambiguous ID mappings, and it will give you a list of these ambiguous ID mappings and the potential choices, so you can tick the most likely choices and maybe ignore some of them, so at least you won't introduce additional errors into your pathway analysis pipeline. So, a few recommendations. When you have protein lists or gene lists and you don't really care about splice isoforms, map everything to Entrez Gene IDs, which are numbers, or to official gene symbols, when using a spreadsheet. Official gene symbols are a little bit more problematic, because evidence shows that they still change over time.
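The bookkeeping behind such a mapping step can be sketched with a toy master table. This is not g:Convert's implementation or API; the function, the alias names, and the target IDs (`ID_0001` and so on) are hypothetical placeholders chosen to show the three possible outcomes: a unique mapping, an ambiguous mapping, and no mapping.

```python
def map_identifiers(query_ids, master_table):
    """Split query IDs into uniquely mapped, ambiguous, and unmapped groups."""
    mapped, ambiguous, unmapped = {}, {}, []
    for query in query_ids:
        targets = master_table.get(query.upper(), [])
        if len(targets) == 1:
            mapped[query] = targets[0]
        elif targets:
            ambiguous[query] = targets   # needs a manual decision
        else:
            unmapped.append(query)       # needs a manual lookup in other databases
    return mapped, ambiguous, unmapped


# Hypothetical master table: upper-cased alias -> list of target IDs.
master_table = {
    "SYMBOL_A": ["ID_0001"],
    "OLD_NAME_A": ["ID_0001"],             # historical alias of the same gene
    "SHARED_ALIAS": ["ID_0002", "ID_0003"],  # alias used for two different genes
}

mapped, ambiguous, unmapped = map_identifiers(
    ["symbol_a", "OLD_NAME_A", "SHARED_ALIAS", "TYPO_GENE"], master_table
)
print(mapped, ambiguous, unmapped)
```

Reporting the ambiguous and unmapped groups separately, rather than silently dropping or guessing them, is what makes it possible to fix them by hand afterwards.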
If you do need 100% coverage, manually curate the missing IDs using multiple resources, you know, GeneCards, Ensembl, species-specific resources, the UCSC Genome Browser and so on. And be very careful with Excel spreadsheets because they tend to introduce particular types of errors. You can paste gene lists as text to force no conversion, and it's a good habit to do so.
Okay, so just a little summary of what we have learned. Genes and their products have many different identifiers, and unfortunately dealing with them is a big part of the job: as in most of bioinformatics, you often spend your time mapping different identifiers and merging tables. There are ID mapping services available, and you should always try to use the most commonly used identifiers for your gene lists.
Okay, so the second component of pathway enrichment analysis is obviously pathways. Pathways is a very broad concept and everyone has their own ideas about it. Here we mostly talk about gene sets, and these pathways are our gene sets. There are many resources for them; one of the primary ones is the Gene Ontology. There are also lots of pathway databases, for example Reactome, and every species will have its own pathway database, so there are a lot of resources to navigate. Pathways are just one aspect of the various gene information that you can gather. Most of the time we talk about pathways as biological processes or molecular pathways, but there are many other annotations that you may be interested in. For instance, the Gene Ontology, besides biological processes, also has information about molecular functions and cell locations. Depending on what you do, you may want to treat chromosome location as a feature of your genes, or disease associations from databases such as OMIM. DNA properties may be very interesting in the framework of pathway analysis, for example whether there are transcription factor binding sites or ENCODE peaks around the gene. Or you could look at protein properties as labels or annotations of genes: does the protein have a particular domain or a particular post-translational modification site? Interactions can also be used in the same framework. However, we mostly talk about pathway analysis in the context of biological processes and molecular functions.
So what is the Gene Ontology? The Gene Ontology is basically a dictionary. It's a dictionary of common biological phenomena, and importantly it's a hierarchical dictionary. At the top of the hierarchy there's something very general like biological process, and then it spreads out to more and more detailed biological processes towards the leaves of the tree, so to say. Examples of terms would be protein kinase, which is a particular protein with a particular function, or the process apoptosis, or the cell component membrane. The Gene Ontology is designed to be agnostic of species, so the dictionary will cover the biology of plants and unicellular organisms and humans and everything in between. An ontology is a formal system for describing knowledge, and in this case the knowledge is everything in biology.
This is a visualization of the GO structure. It's something computer scientists call a directed acyclic graph, which is like a tree, except that any node can have more than one parent. This is a good way of representing knowledge because we have more specific biological terms towards the bottom of the tree and very general things at the top. So here we have B cell apoptosis, which is a part of apoptosis, which is a kind of programmed cell death, which is a kind of cellular process at the top. This is how we represent knowledge that has been accumulated over decades of research, and then gene annotations will be associated to that GO structure.
The Gene Ontology covers three major branches. The most interesting for us in the context of pathway analysis is biological process, and then there's also molecular function and cellular component, and each of those three branches has thousands of terms. So here are the examples: cell division is a biological process, glucose-6-phosphate isomerase activity is a molecular function, and then there are various cellular components here like the inner membrane and outer membrane and so on. These are the types of keywords that have been built into the GO tree, and we can annotate genes to them, and these gene annotations are what give us access to pathway enrichment analysis.
Where do GO terms come from? They obviously come from big human efforts. GO terms are added by Gene Ontology editors. The headquarters of the Gene Ontology is at the European Bioinformatics Institute, where GO terms are maintained and additional information is created. Oftentimes that information also comes from species-specific research groups; as I mentioned, GO is supposed to be species-agnostic, so various research groups contribute to the GO tree. And this is a very alive and developing organism, so to say. You can see how just the vocabulary has been increasing over time, and this is really important because you don't want to use outdated resources. For instance, over the last three years, between 2012 and 2015, there was an increase of about a quarter in some areas of the GO tree. It also reflects how science is progressing: we publish more and more papers every year, there are more and more new technologies to explore life, and in particular we have all these omics technologies that give us more power to analyze many things at the same time. Okay, so the tree is only one part of GO, the pure dictionary. What is more interesting to us is how we use that dictionary to describe gene function.
This is called annotating genes with GO information. Genes are linked or associated with GO terms by trained curators at genome databases. These trained curators read a lot of literature, and every time there's a claim in a paper they evaluate it critically, and if the claim is solid they basically draw an arrow between a particular gene and a particular term in that vocabulary. This process is ongoing, and as you know the scientific literature is growing at a great pace, so more and more of these gene annotations emerge, and genes will have multiple annotations. One reason is that genes are rarely involved in doing just one thing; they often do many different things. The other reason is the hierarchy: because GO terms are structured hierarchically, any gene associated with a particular term is also associated with all the terms above it.
GO annotations also have various quality labels. For example, if a gene was associated with a process in a knockout experiment, that is stronger evidence than if it was associated in a data clustering experiment. So in the Gene Ontology there's something called evidence codes, which essentially give a quality label to how much the curators trusted the data.
As an example of the hierarchical annotation I just mentioned, aurora kinase B is known to be involved in B cell apoptosis. So the direct annotation the curators gave to that gene was an arrow between aurora kinase B and B cell apoptosis. But due to the nature of the GO hierarchy all these other arrows were added as well, because if a gene is part of that specific biological process, then by definition it is also part of all the more general parent processes.
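The way a single direct annotation spreads to every parent term can be sketched in a few lines. This is an illustration only: the graph is a toy fragment built from the term names in the example (real GO terms use accessions and typed relations), the second parent is invented to show that a node in a directed acyclic graph can have more than one parent, and `ancestors` and `propagate` are hypothetical helper names.

```python
# Toy child -> parents map based on the lecture example.
# "hypothetical second parent" is invented for illustration.
PARENTS = {
    "B cell apoptosis": ["apoptosis", "hypothetical second parent"],
    "apoptosis": ["programmed cell death"],
    "programmed cell death": ["cellular process"],
    "hypothetical second parent": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}

def ancestors(term, parents):
    """Collect every term reachable by following parent links upwards."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(direct, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    full = {}
    for gene, terms in direct.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full

# One curated arrow: aurora kinase B (AURKB) -> B cell apoptosis.
direct = {"AURKB": {"B cell apoptosis"}}
full = propagate(direct, PARENTS)
print(sorted(full["AURKB"]))
```

One curated arrow becomes six gene-term associations in this toy graph, which is the source of the redundancy you see in enrichment results.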
So this is why a pathway analysis will give you hugely redundant results if you have a rich gene list: besides B cell apoptosis showing up in your significance analysis, all these other terms will be tested for enrichment as well, and they will show up too. So this is where the redundancy starts to come in.
The other aspect is annotation quality. Oftentimes literature is curated by human curators, but obviously they have limitations; the teams are not as large as they should be. So other times electronic annotation occurs: annotations are given to genes through algorithmic means, and they are often not even validated one by one by curators because people have time limitations and so on. What you can do about it is pay attention: if your favorite gene is only supported by electronic annotation, be more careful about it. And if you're doing a large-scale analysis, you can also choose to opt out of these low-quality electronic annotations, which are actually quite prevalent among human genes, for example. So the key point is to be aware of how your gene was associated with a particular process; oftentimes you can even track it back to the original source or the paper.
Here are a couple of different evidence types. The one to pay attention to is inferred from electronic annotation, sometimes called IEA, and then there are various experimental codes that reflect what experiments the original researchers were doing. Maybe they were doing knockouts, or evolutionary studies, or looking at genetic interactions, so all of that is ideally captured in the Gene Ontology annotations. And sometimes it's just based on literature curation, where an author has stated something about a gene in a paper and there's no concrete experimental evidence for it.
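Opting out of electronic annotations amounts to filtering on the evidence code. The sketch below assumes annotations are available as simple (gene, term, evidence code) records; the records themselves, the gene names and the helper names `drop_electronic` and `electronic_only_genes` are hypothetical. IEA, IMP (inferred from mutant phenotype) and TAS (traceable author statement) are GO evidence codes.

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    gene: str
    term: str
    evidence: str  # GO evidence code such as "IEA", "IMP" or "TAS"

# Hypothetical records for illustration; not taken from a real annotation file.
ANNOTATIONS = [
    Annotation("GENE_A", "apoptosis", "IMP"),
    Annotation("GENE_A", "cell division", "IEA"),
    Annotation("GENE_B", "apoptosis", "IEA"),
    Annotation("GENE_C", "cell division", "TAS"),
]

def drop_electronic(annotations, excluded=frozenset({"IEA"})):
    """Keep only annotations whose evidence code is not in the excluded set."""
    return [a for a in annotations if a.evidence not in excluded]

def electronic_only_genes(annotations):
    """List genes whose annotations are all electronic (IEA only)."""
    codes_by_gene = {}
    for a in annotations:
        codes_by_gene.setdefault(a.gene, set()).add(a.evidence)
    return sorted(g for g, codes in codes_by_gene.items() if codes == {"IEA"})

curated = drop_electronic(ANNOTATIONS)
print(len(curated), electronic_only_genes(ANNOTATIONS))
```

The second helper answers the "favorite gene" question: a gene that appears there has no curated support at all in this table and deserves extra caution.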
So this is a landscape that you want to study in very specific cases where you want to know where your gene annotation is coming from. As I showed before, evidence codes are shown with this nice colorful legend in g:Profiler: the darker, redder colors usually represent stronger evidence from biology, while the lighter or bluer tones show that it was mostly computational evidence feeding into these annotations. As I mentioned earlier, the Gene Ontology is designed to be species-agnostic, and the gene annotations available for analysis will actually depend on how well a particular species is studied. The Gene Ontology also does a lot of cross-curation between species, so if a particular gene is strongly conserved and evidence only exists in worm, some of that evidence may be carried over to the human gene annotations, because maybe the gene is involved in a very core biological process, and that is reflected in the evidence codes. So in essence every species database would ideally contribute to the Gene Ontology, and its annotations would show up in the master Gene Ontology table, and there are always new species annotations in development.
What you may want to know is that species obviously differ in coverage. The human genome and proteome are studied the most and have the most annotations, followed by the common model organisms. But human annotations will have less direct experimental evidence and more computational evidence inferred from other species, because there are certain types of experiments that you can't do in human. So you can actually see how much of that evidence comes from computational inference in, say, human. Whenever you do a pathway analysis, it's quite likely that most of the information going into it has been derived from computational analysis rather than direct experiments.
Here's just a list of databases for the different species; you don't need to know it unless you're directly working with one of those species. And then GO software tools are very abundant. There are probably dozens of tools that only do the simplest kind of pathway enrichment analysis, and they have their different advantages and disadvantages. One of the main things you need to pay attention to is how frequently they get updated. We had a study last year where we wanted to know how frequently different tools are updated and how that potentially affects the interpretation of the results you get when you analyze your data using these tools. The question we asked was: do gene annotation databases have best-before dates? And it turns out that yes, they do. This figure uses the color scheme to show how recently a particular tool has been updated, and the y-axis shows how many citations that tool has. The big elephant in the room is called DAVID, which had 2,500 citations last year, yet all of those citations were based on data that was five years old, because at the time DAVID had last been updated in 2010. Everyone kept on using it, but they were missing out on a lot of recent discoveries.
So we really wanted to quantify how much people were missing out on in these 2,500 papers, and we performed various pathway enrichment analyses. In particular, this is an analysis of glioblastoma driver genes; glioblastoma is a fatal brain cancer. We found that when you compare out-of-date software from 2011 with software recently updated in 2016, the intersection is 20 percent. So 20 percent of the pathways that you find today you would also be able to find with an outdated tool, and about 75 percent are new pathways that you only find when you use up-to-date software.
And then about five percent are pathways that changed definitions or names, which were captured by the old tools but are no longer captured by the new ones. So the take-home here is that these 2,600 papers were already out of date when they were published, and that is kind of a problem. As a user of these tools you want to check when the data was last pulled in from the Gene Ontology. As a developer you want to pull in data as frequently as possible, and as a funder, like a grant agency, you need to create grants that support software maintenance. So I think this is quite important. Another message is that DAVID was updated after a Twitter storm, so at least we changed something, and maybe they'll keep on updating the software. There's a reason why DAVID is very popular: it's quite intuitive and a lot of people use it.
So of the other tools, I guess with the exception of [unclear] there, which one would you suggest? I mean, now DAVID is updated too, but which one would you suggest?
So PANTHER is a tool that's maintained by the Gene Ontology Consortium, so I think that is a good choice because they know what they're doing, and I think they update very frequently. The Gene Ontology itself, I think the vocabulary is updated daily, and the gene annotations depend on the species, but they're also updated very frequently, many times a year at least.
And in terms of how intuitive it is and so on, how does it compare to DAVID?
PANTHER is pretty good. I think you just paste in the gene list and it spits out the results.
So far I've mostly discussed the Gene Ontology as your primary source of pathway data. There are obviously other pathway resources; if you count all the various ones there are probably more than a hundred. Pathway Commons is a resource developed at the Bader Lab; it's an aggregate resource, or a meta-resource, of various different pathway databases.
Oh wow, it actually lists more than 500 different pathway databases, and you can download these meta-sets in very intuitive text-based formats where, with one click, you get all the interactions within a pathway, aggregated across these various pathway databases. I'll only briefly mention, and actually I already have, the various other features that you can use as gene labels to do a similar type of pathway enrichment analysis. Molecular functions and cell locations are probably relevant to many experiments, and then there are all the other things that you can study as well. Basically the trick is the same. You derive a list of genes that share a particular annotation, or maybe a chromosomal region, and then you test whether your list of genes from the experiment is statistically enriched for it. So the framework is almost always the same, but we generally recommend starting with pathways and biological processes, because it's very easy to overwhelm yourself if you throw a hundred different types of databases into your pathway enrichment analysis. You will have hundreds of thousands of results that are difficult to interpret. So start with the most intuitive ones first and then progress into the areas where you feel your experiment needs the most interpretation.
Okay, so what have we learned? Pathways mostly come from the Gene Ontology and dedicated pathway databases. The Gene Ontology in particular is a classification system and a dictionary for all biological concepts. Annotations are contributed by many groups. Genes will usually have more than one annotation because of the hierarchy, but also because of functional redundancy and multiplicity of function. Some genomes are far better annotated than others; especially if you happen to work on an exotic model organism, you may have trouble, or all your annotations may be coming from the closest well-studied model organism.
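Returning to the enrichment test mentioned above ("you test whether your list of genes is statistically enriched"): the lecture does not say which test is used, so the sketch below assumes one common choice, a one-sided hypergeometric test. The function name and the example counts are made up for illustration.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_in_set, n_in_list, n_overlap):
    """P(X >= n_overlap) when n_in_list genes are drawn without replacement
    from n_background genes, of which n_in_set belong to the gene set."""
    total = comb(n_background, n_in_list)
    upper = min(n_in_set, n_in_list)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_in_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 background genes, a gene set of 100,
# an experimental list of 200 genes, 10 of which fall in the gene set.
p = hypergeom_enrichment_p(20000, 100, 200, 10)
print(f"p = {p:.3g}")
```

In a real analysis this test is repeated for every gene set, so the resulting p-values need multiple-testing correction, and the choice of background gene set affects the result.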
Annotations come from manual sources, which are obviously stronger, and from electronic sources, which should sometimes be taken with a grain of salt. The Gene Ontology also has a version called GO slim, which I didn't really talk about. A GO slim is a slimmer or narrower version of the entire Gene Ontology that should cover the main concepts, and GO slims come in various flavors: you have a GO slim for yeast, a mammalian one, and so on. And then there are many additional attributes that you can potentially use for pathway analysis, and you can pull them from genome databases such as Ensembl, ENCODE, UCSC and so on.
So I showed you this diagram earlier: step one, collect data; step two, process data; step three, have a gene list; step four, learn about the underlying mechanism using pathway and network approaches. However, this is actually part of a far more complicated diagram, where in blue you have the various omics techniques and in orange the ways of analyzing these omics results. Then it really spreads out into various areas of pathway and network analysis: you can identify interesting pathways using these enrichment approaches, you can identify interesting networks by linking genes in your list to other genes through known interactions, and then you can drill down to the potential mechanisms, often using visualization techniques such as the Enrichment Map. So with that I'd like to conclude.