Right, so I guess during the past few days you've learned a lot about how to construct gene lists from your experimental data. The learning objectives of this module are the following. By the end of this module, you should be able to identify situations where you can use pathway enrichment analysis or pathway network analysis. You should be able to understand the main components of the simplest pathway analysis, these being gene lists from your experiments and pathways, which are also often represented as gene lists. One major hurdle, which is a bit boring but you have to do it nevertheless, is managing all the different gene IDs, names, symbols and so on. We'll have a quick look at that, and then we'll go through how the gene lists corresponding to pathways and networks are actually created and where they come from. So the premise is the following. You've just performed your new super cool screen that nobody has even heard of, or maybe you sequenced a lot of genomes, or you did some RNA sequencing to study the transcriptome, and that screen or experiment produces a hundred or a thousand genes. Your PI or your collaborators ask you: so what is interesting about these hundreds or thousands of genes, how do they fit our experiment, why should we care about them? Probably the first thing you want to do is go to the PubMed database to see the existing literature about those genes, which is fine if you have a couple of genes, but you don't want to do that systematically when you have hundreds or thousands of genes because it will take so much of your time. So generating gene lists depends on what type of data you're looking at. It may be proteomics data or sequencing data, and the techniques for extracting gene lists from those different sources of data are different.
However, some techniques can be the same, such as ranking, filtering or clustering. They all end up in lists of genes. And pathway and network analysis is all about how to go from these lists of genes back to the biological functions or processes that are involved. You can use a wide variety of publicly available analysis tools to perform pathway enrichment analysis, to understand the gene lists and to drill down to some biological mechanism, and maybe if you're lucky you find something new and exciting about the biology. Pathway enrichment analysis will help you find candidate genes to follow up on, explain some causes of disease, understand some biological functions and so on. And as I mentioned, you don't want to do this manually, because many genes have been studied very comprehensively and there can be hundreds of papers available for any particular gene. Instead of curating the literature manually, pathway enrichment analysis or network analysis provides a shortcut to analyzing gene function, because much of this information has been systematically annotated into databases, and you can access these databases using statistical means rather than going through papers one by one. Pathway analysis and network analysis may provide a missing link between genotype and phenotype, explaining phenotypes using genotype information. So on the one hand, you can collect information from genomes through, for example, whole genome sequencing or exome sequencing. On the other hand, you have some sort of phenotypic information. Maybe this is proteomics. Maybe this is patient survival curves from a clinical study. Maybe this is inheritance trees. Now pathways are the complex component in the middle, which synthesizes knowledge from previous experiments, multiple different databases, the literature and expert curation.
And you can predict the effects of variation in the genome on the pathways and try to explain your phenotypic observations. So a bit more formally, pathway and network analysis is generally any computational or statistical technique that takes advantage of network and pathway information from previous studies or previous literature. Most commonly it is applied to interpret lists of genes. The most popular type is pathway enrichment analysis, but there are many other, more complex types covered during this workshop. And it helps to gain mechanistic insight into omics data. Omics data can come from diverse sources, but generally, if it boils down to a list of genes, many pathway and network tools are available to understand those data. So there's a bit of a difference between pathways and networks. I'll try to give my perspective on it, and every other person you talk to will have their own understanding. My understanding is the following. Pathways and networks both include systems of genes or proteins or molecules, transcripts and so on. And they are systems in the sense that there are interactions between those components. So genes or proteins interact with one another. But there are also some differences between pathways and networks. Pathways are often small-scale systems that are fairly well described. They may be a consensus synthesis of many years of knowledge and research. The interactions in pathways are often biochemical reactions. So they're very detailed, and there's a lot of knowledge about them. They're collected into specific pathway databases. Maybe they contain dozens of genes and proteins, maybe hundreds, say a few hundred, but no more. Networks, on the other hand, are usually large-scale and noisier. They perhaps contain hidden information that's not well characterized in pathway databases. They are simplified abstractions of cellular logic.
It's often not that clear what the edges in the networks mean. Maybe they're physical interactions, maybe they're genetic interactions, maybe they're some sort of activating or inhibitory patterns. But they may contain stuff that we don't really know about yet. Networks are often derived from high-throughput experiments or maybe even from statistical integration of multiple types of data. So network data may be available for all genes in the genome, whereas pathway data is available for a much smaller fraction. So here's a busy slide trying to summarize multiple types of pathway and network analysis in a sort of simple way. There are many, many different techniques, and you can summarize these in perhaps three different categories. The first one, which is the focus of this seminar, is the enrichment of fixed gene sets. So this is shown here. This is basically the identification of pre-built pathways that are somehow either significantly enriched or significantly modified in your gene list of interest, and it is based on existing information about pathways from the literature. Now the second type of analysis is called de novo subnetwork construction and clustering. This still refers to your gene list of interest from your experiment, but instead of looking at predefined pathways, we look at large-scale predefined networks and then try to extract certain areas of the networks that connect your list of genes. And then the third, most complex type of analysis is called pathway or network-based modeling, in which we collect some pathway rules, of activation and inhibition for example, from pathway databases, and then we see whether our gene list is consistent with these rules or whether our gene list instead represents some new, unrecognized pathway rules. So I've been looking a lot into cancer genomics, and when you analyze cancer genomes, these different types of analysis try to answer the following questions.
So the first analysis, fixed gene set enrichment, tries to tell you what types of biological processes are represented or enriched or altered in this particular cancer type. The second type of analysis tries to figure out whether there are new pathways or new associations that are relevant to this particular cancer, and maybe there are some clinical subtypes where different pathways are activated or inhibited. And the third, most complex modeling, which evaluates those pathway rules, sees whether pathway activities are altered in a particular patient, and maybe there are some drug-targetable pathways available in this patient because we see that some well-known pathway rules are broken in that particular patient. So in this particular lecture we're mostly looking at the enrichment of fixed gene sets because this is the most widely applicable: you don't need a lot of detailed data for it and you can apply it to all kinds of different experimental data sets. So there are multiple benefits of analyzing data on the level of pathways rather than on the level of individual genes. The first is improved statistical power. For example, when you look at gene expression data you're probably focusing on, say, 20,000 human genes or maybe 6,000 yeast genes. When you do pathway enrichment analysis you can probably focus on, say, 2,000 pathways or maybe 500 pathways, which is a much smaller number, and therefore you do fewer statistical tests and the multiple testing correction you need to apply is less strict. So you will have a better chance of finding biologically significant results if they are present in your data. Fewer tests is generally good if you're doing high-throughput analysis. Also, pathway data may be more reproducible. For any particular sample you may be seeing some genes being up-regulated and some genes being down-regulated in the gene expression data set.
If you're looking at the level of pathways, then you're analyzing many components of the pathway at the same time. So if you encounter a new set of samples and the biology really involves that pathway, you may see the pathway over again. Maybe different components are altered, but it's still the same pathway, so that increases your reproducibility. You're basically doing a set of tests again for all the pathways when analyzing your restricted set of genes. Right? Every single gene. So it will increase power by that speed. Yeah. I do understand the question and I agree: in such a setup you apply multiple testing correction twice, first for all your genes to filter out your gene list and then for all the pathways. So in a sense, yes, in total you're testing more. But when you compare testing your individual genes, with your gene list being the final result, against your pathway list being your final result, then you still get better power. Yes. So where was I? I think the next point is about easier interpretation. This is more comfortable because when you're looking at pathway data you can use concepts from cell and molecular biology to explain your data. For example, you can say that in my experiment cell cycle genes are up-regulated, and that's much easier to understand than dealing with an alphabet soup of different symbols being up-regulated, down-regulated and modified. Pathway enrichment analysis also gets you a step closer to the mechanism, because by conducting that type of analysis you'll be able to identify processes that may be responsible for the changes you see in your experiment. And then you can also use the guilt-by-association or birds-of-a-feather principle to predict new functions for your genes.
For example, if you see a strong enrichment of a particular pathway in the gene list that you derived from an experiment, and there are a few other genes in the list that are not known to be associated with that pathway, you may be able to predict that these uncharacterized genes also have a role in that process or pathway. And that can point you to some pretty detailed experiments to do next to find out about the role of those genes. I'm sure that the analysis that comes before this has been covered well in the previous lectures, but I still need to emphasize its importance because pathway analysis is sensitive to the garbage in, garbage out principle. If you haven't performed a high-quality analysis of your input data, the experimental data, pathway analysis may reveal stuff that looks really interesting but is actually a technical or biological artifact of the way the data were processed. So it's important to normalize your data and perform background adjustment for microarrays or proper sequence alignment for RNA-seq. It's important to look at quality control and make sure that the samples you're analyzing are really the samples you think you're analyzing, and not the result of, say, sample mislabeling. You need to use specific statistical tests that depend on which type of data you're looking at, whether it's count data or normally distributed intensity data. Gene list size also matters. Personally, I find that there's a sweet spot from the low hundreds up to about a thousand. If your gene list is in the thousands, then maybe something's wrong and you need to look at the gene list in a different way. Very small lists, like a dozen or a few dozen genes, don't often result in high-quality pathway analysis. And gene IDs need to be compatible with the software.
This is a big problem because there are so many gene IDs and there are more and more coming out every day. Here's a quick overview of what pathway analysis looks like. First you collect your genomics data, you normalize, analyze, rank and filter it, and you generate the gene list. The gene list is the main input to pathway analysis, where you use statistical techniques to distinguish which pathways are important with respect to your gene list. You can use visualization techniques to summarize these pathways in a smart way. Then you can focus on particular pathways and try to drill down to the mechanism, link them back to the genes and understand what's known, but now you don't need to do hundreds of PubMed queries, maybe dozens or even fewer. And you can use pathway analysis to suggest follow-up experiments, and then hopefully you'll be able to publish this really well. It depends. So what is a reasonable size for a gene list? I would say if you have, say, 20 genes, then it's likely that the pathway analysis won't highlight a lot. If you have a large gene list, say hundreds to thousands, then you need to have a ranking for it. If you just have a gene list with 5,000 genes, it's probably not going to highlight anything in the pathway analysis unless it is ranked. So the first genes would be more important, the following ones a little less important, and then you use a prioritized, ranked list analysis. If you just compare one list to another, then probably the low hundreds would be around the sweet spot. So from a statistical viewpoint, the simplest type of pathway enrichment analysis is a comparison of two sets of genes, conveniently shown with this Venn diagram. On the one hand you have a list of genes from your experiment. For example, you're looking at brain cancer data and you're analyzing the genes that are down-regulated in drug-sensitive brain cancer cell lines.
And on the other hand, you have another list of genes, and potentially many lists of genes, that correspond to public knowledge about gene function. For example, that could be all the genes that are known to be involved in neurotransmitter signaling. And then you compare those two gene lists with a statistical test, such as Fisher's exact test or a test based on the hypergeometric distribution, to determine whether there are more neurotransmitter annotations among the genes in your gene list than would be expected by random chance. And then you choose a pre-defined cutoff. Very often this pre-defined cutoff is a p-value of 5%. And if you see that there are more neurotransmitter genes than expected, then you propose a hypothesis that maybe drug sensitivity in brain cancer has something to do with reduced neurotransmitter signaling. You perform this analysis over all your pathways and then apply multiple testing correction, and anything that has a significant p-value after this correction will be deemed significant and will be added to your pot of potential hypotheses. There are many tools that perform this type of basic analysis of gene lists. One of them is g:Profiler, which I developed during my PhD. This is a typical output of g:Profiler, where the enriched pathways are shown from top to bottom and your input gene list is shown from left to right. And then there's a certain amount of information being displayed. These are the pathways that were found. There are some numbers involved: how large the pathway is, how large the gene set is, and how many genes are in common. There's a p-value for each, and then there's a matrix of annotations showing which genes are associated with which pathway. I'll be talking about that tool a little more during the next lectures. So this is an excellent way of summarizing public literature, public databases and public knowledge in order to characterize gene function.
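The two-list comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how g:Profiler or any other tool is implemented: the gene names, the two toy pathways and the background of 100 genes are made up, the function names `hypergeom_tail` and `enrich` are hypothetical, and Bonferroni is used as one simple choice of multiple testing correction.

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(X >= k): chance of seeing at least k pathway genes in a list of n
    genes drawn at random from N background genes, K of which are in the pathway."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total

def enrich(gene_list, pathways, background):
    """Test every pathway against the gene list and return
    (pathway, shared gene count, Bonferroni-adjusted p), smallest p first."""
    background = set(background)
    genes = set(gene_list) & background
    raw = []
    for name, members in pathways.items():
        members = set(members) & background
        shared = genes & members
        p = hypergeom_tail(len(shared), len(members), len(genes), len(background))
        raw.append((name, len(shared), p))
    m = len(raw)  # number of pathways tested
    adjusted = [(name, k, min(1.0, p * m)) for name, k, p in raw]
    return sorted(adjusted, key=lambda r: r[2])

# Toy data (assumed for illustration): 100 background genes, two pathways.
background = {f"G{i}" for i in range(100)}
pathways = {
    "neurotransmitter signaling": {f"G{i}" for i in range(10)},
    "cell cycle": {f"G{i}" for i in range(40, 60)},
}
# A 10-gene experimental list sharing 5 genes with each pathway.
gene_list = [f"G{i}" for i in range(5)] + [f"G{i}" for i in range(50, 55)]

results = enrich(gene_list, pathways, background)
for name, k, p_adj in results:
    print(f"{name}: {k} shared genes, adjusted p = {p_adj:.3g}")
```

Both toy pathways share five genes with the list, but the smaller pathway gets the smaller p-value, because five hits out of ten annotated genes is much less likely by chance than five out of twenty.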
So instead of going through all these papers, you have a kind of gene set approach to storing pathway information. However, this can quickly grow overwhelming. You have this well-performed experiment that gives you all these neat pathways, and then you get this massive list of highly significant pathways that characterize your experimental gene list. And this is for multiple reasons. There's a lot of redundancy in gene function biologically, so genes do multiple things. And on the other hand, there's a lot of redundancy in how people store biological data in databases. Part of that is because biological functions are different but still quite similar, and genes are annotated to those functions in a redundant way. So instead of staring at these tables endlessly in, say, Excel spreadsheets, we can use visualization techniques that allow you to compress all these highly similar but somewhat different pathways into network-like maps. This technique is called the enrichment map. And this is a network of pathways, so to say. Each node in this network represents a gene set, that is, a pathway or a process. And these pathways or processes are linked to one another if they share large numbers of genes, so if they're similar in that particular way. When we apply network visualization techniques, we can group together those pathways or processes that share many genes. And then instead of looking at a list or a spreadsheet of all these pathways, we can look at the network and visually or manually identify the major functional themes that are present in your gene list. So this is a great technique to use after you've done your pathway enrichment analysis and you want to summarize the results. There's a bit of a longer motivating example on how to characterize the genetics of autism spectrum disorder. This was a paper where our lab collaborated in the analysis.
So autism, or autism spectrum disorder, ASD, is a highly heritable disorder. Monozygotic twins share a diagnosis in 60 to 90 percent of cases, depending on the stringency of the diagnosis. About 5 to 15 percent of cases are known to be single-gene disorders and chromosomal rearrangements. Yes? This one? What is the edge? Yes. So the edge means that one pathway and the other pathway share a large number of genes. Yes. So that pulls together the pathways that are somewhat similar because they involve similar genes. We will also go through the edge definitions later. You can set the way the edge is defined, whether it's 30 percent or 50 percent of genes shared, and the higher the percentage, the more granular the map will become. So if you want to have a map with fewer but larger modules, then you define your map with a lenient edge definition. If you use a stringent edge definition, you will have many different small functional themes that represent the biological pathways in your data. More of that will come during the last session of this tutorial. Any other questions on the previous slide? Yes. Exactly. So it's a network of pathways, or a network of gene sets. The larger nodes represent larger gene sets. Okay, I'll repeat all the questions. Right. So what was your question first? Right. The question is whether the way the networks are connected also represents redundancy. It does. When you see, for example, the big blue cluster here, it's quite tightly connected, and then there are these outliers forming almost another new cluster, which is not exactly related to the bigger one. That means that within here there's probably one big gene set that's shared among all these different pathways, and then it branches out to smaller, different pathways where fewer genes are shared. Does that answer?
It's a bit of magic because the moment you change the threshold of the edge definition, the map will visually change a lot. And it's quite likely, for example, that this cluster here will become its own cluster because the weak edges will be removed as you change the threshold. Right. So how to choose the threshold? In most cases, you try the default threshold first and then you see whether you are able to interpret the map right away. By interpreting, I mean you eyeball a cluster and see if you can assign a single functional theme to it. For example, can you say, this is the apoptosis cluster? If you cannot say that, then you probably need to fine-tune the way the edge weights are defined. For example, maybe it's partly an apoptosis cluster and partly a differentiation cluster, so you can't put a single label on it. Then you try to split them apart a little. Or in other cases, maybe there are so many different apoptosis clusters all over the place, and then you relax the threshold for the edges and they will all merge because they are all based on the same gene set. So is there a way to set different types of edges? Absolutely. But you need to tweak that a little bit more, and it's all done through the Cytoscape software, so you can add more edges, relabel the edges and do a lot of things, but it's like artistry. Any other questions? So, about the example of how to generate a really fancy enrichment map: I guess the motivation to study autism is that not a lot is known about the genetics. About 5 to 15% of the cases come from known gene alterations or copy number alterations, and then there's also so-called de novo copy number variation that's not present in the parents but has emerged in the children.
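To make the edge threshold discussed above concrete, here is a minimal sketch of how edges between gene sets could be computed. It is not the code of the enrichment map plug-in: the overlap coefficient is assumed as the similarity measure, the three toy gene sets and the function names `overlap_coefficient` and `build_edges` are made up for illustration, and the two thresholds are the 30 and 50 percent values mentioned in the answer above.

```python
from itertools import combinations

def overlap_coefficient(a, b):
    """Shared genes divided by the size of the smaller gene set."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def build_edges(gene_sets, threshold):
    """Connect two gene sets when their similarity reaches the threshold."""
    edges = []
    for n1, n2 in combinations(sorted(gene_sets), 2):
        score = overlap_coefficient(gene_sets[n1], gene_sets[n2])
        if score >= threshold:
            edges.append((n1, n2, score))
    return edges

# Toy enriched gene sets (assumed for illustration).
gene_sets = {
    "apoptosis": {"A", "B", "C", "D", "E", "F"},
    "regulation of apoptosis": {"A", "B", "C", "D", "E", "G"},
    "cell differentiation": {"E", "F", "H", "I", "J", "K"},
}

lenient = build_edges(gene_sets, 0.3)    # more edges, fewer but larger clusters
stringent = build_edges(gene_sets, 0.5)  # fewer edges, smaller and tighter clusters

for label, edges in (("30% threshold", lenient), ("50% threshold", stringent)):
    print(label)
    for n1, n2, score in edges:
        print(f"  {n1} -- {n2} ({score:.2f})")
```

With the lenient threshold the differentiation set stays attached to apoptosis through a weak edge, which is the mixed cluster situation described above; raising the threshold removes that weak edge and leaves only the two apoptosis sets connected.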
And in this particular study about 2,000 individuals of European origin, about 900 cases and 1,100 controls, were profiled for copy number variations using a SNP chip from Illumina. The study produced high-quality rare copy number variations with a high validation rate, and it turned out that on average a person had two copy number variations with a median size of about 280, and a good fraction of ASD individuals had at least one de novo copy number variation. The top 10 genes were already quite convincing because they were known to be related to autism spectrum disorders. And here's a pretty nice looking enrichment map with a lot of detailed information, and it nicely highlights all the different aspects of, for example, nervous system development that are potentially altered by all these copy number variations. This is a great leap from the standard enrichment map that you can produce, because there are all these additional visual features that reflect aspects of the data. Yes, but it's not the default output of the plug-in; many, many other things have been added, for example nodes of different shapes and sizes. We can zoom in on a cluster, and this gives you an idea of what the enrichment map is supposed to do. On the one hand you call out a major functional theme, such as central nervous system development, which represents this cluster, but you can also highlight individual members of this cluster in order to say, well, these are the specific pathways and networks that are part of this greater functional theme. So again, instead of looking at the table of pathway output, you visualize it as a network, and then you can choose which information to highlight and how to summarize it and represent it as a figure. Do we look at the size of the nodes or the edges? So all of that information is important. So the question was what you should look at when you look at the enrichment map.
In the default enrichment map, the size of the node says how many genes in your experiment are related to that pathway, and the weight of the edge tells you how many genes are shared between the two pathways that are connected. So that information already tells you something. On top of that, you may be able to say which genes specifically are part of the pathway that's shown as a node, but for that you need to do a little bit of manual lookup in Cytoscape. As for whether there is a score by default that allows you to rank the pathways in the enrichment map and see which ones are on top, I think the brighter colors by default indicate the more highly enriched pathways in the map. Any other questions? Okay, I'm moving on. So where do gene lists come from? I'm sure that you guys have an even better idea than I do, because everyone is working on their own experiments and generating their own gene lists. It may seem that it's very different and difficult to analyze gene lists from different platforms, data sets and techniques, but there are all these statistical approaches, say ranking, clustering and network analysis, that all produce lists of genes. For the simplest type of pathway enrichment analysis you just work with a list of genes. No strings attached, a plain list of genes. However, many biological experiments naturally give you ordered lists. For example, when you analyze gene expression data, some genes have a higher fold change relative to the control. They can be ranked first. You can also quantify these gene lists and attach a number to each gene, and there are advanced pathway analysis techniques that take these numbers into account and perform a quantitative analysis. And whatever type of data set you have, you may be able to use some of the more well-known statistical techniques that are broadly applicable.
For example, ranking, filtering, clustering and principal component analysis can all produce some types of meaningful lists from your data. You can also look at pre-existing data sets such as protein-protein interaction networks, microRNA-target gene interactions and transcription factor binding sites. All these data sets provide meaningful ways to create lists and analyze them in a pathway context. You can study a genetic screen, such as a knockout library of all yeast gene mutants, and plenty of information can be extracted from genome-wide association studies. So what do gene lists mean? This is probably one of the major topics of this lecture. A gene list could mean a biological system that's been altered or affected by your experiment, or maybe a biological process that's affected in a group of diseased individuals versus controls. A gene list could mean genes with similar functions, such as transcription factors or kinases, that all become activated or deactivated due to some condition. A gene list could mean genes found in a particular cellular location, such as all nuclear proteins. Or it could mean all the genes that are co-located in a particular chromosomal region, maybe due to a copy number variation. And then there are all the biological questions that you want to answer when analyzing things with pathways. First of all, you want to understand what you want to achieve with your biological experiment. And this is hopefully even before you go and do the experiment. Perhaps you want to summarize the biological processes that are apparent in your data, or maybe you want to perform differential analysis: what pathways are there in diseased individuals that are not there in healthy controls? Maybe you want to find a controller for a process, and that's also part of pathway analysis. Maybe a microRNA, maybe a transcription factor.
You can use the birds-of-a-feather principle to assign functions to genes and perform validation experiments, or you may want to prioritize genes for those. In the first lecture, I will cover pathway enrichment analysis, just the comparison of gene sets. But further on, we can also do network analysis to predict gene function or maybe find new interactors for a specific list of genes, or regulatory network analysis where we find transcription factor binding sites, for example. So as I mentioned, there are two major inputs to pathway enrichment analysis. One of them is the list of genes and the other is a group of gene sets corresponding to pathways. The first topic will be dealing with gene identifiers. There are so many different gene identifiers and it's often kind of overwhelming to deal with them. An ideal identifier is a unique, stable identifier that links to a particular gene or protein of interest. By unique, I mean that it hasn't been used for a different gene or protein at a different time. Gene and protein information is stored in many databases, and many databases focus on their particular aspects. For example, UniProt is clearly a protein database and they don't really provide information about genes. And since there are many databases, each with its own type of ID, it becomes difficult to convert between them in general. And it's important to understand what the different databases are about. For example, Entrez Gene doesn't really store the sequence of a gene. It just provides a pointer to a sequence database. And as these pointers change, you can also see how things change, and protein or gene IDs are not stable over time. So here are a few common identifiers that are used. It's a long list, but believe me, the full list is much longer. There are certain IDs that belong to genes, others that focus on RNAs, and others yet that focus on proteins.
And then there are all these species-specific databases that have their own IDs, for example the human genome database, or mouse, or rat and so on. And then there are yet other types of IDs that relate to the experimental platform, for example Affymetrix or Illumina. The problem is that most software tools only support a limited list of IDs and others are not supported. And then you need to map your gene list or ID list to a standard type of ID. The main uses of ID mapping are finding your favorite genes and locating other resources that are available for analyzing genes. You need to translate things, even in the biological sense. Sometimes you need to translate from microarray probe set IDs to protein IDs in order to perform interaction network analysis. And when you merge data from different sources, you need a common reference. The problem with many IDs is that the mappings are one-to-many. So one ID could correspond to many actual proteins. There are ambiguities, often for historical reasons. For example, TP53, a well-known cancer gene, has all these different symbols in different organisms and on different platforms. So in most cases you need to use one standard symbol that's widely acknowledged. Then, if you're using Excel, there's a notorious case where the stem cell regulator OCT4 is converted to the date October 4, which you don't really want. So you need to copy and paste these lists carefully. And there are always problems reaching 100% coverage because things change over time. Here's a cautionary example where a particular Nature paper was retracted because they thought they were analyzing one gene, but it turned out to be another gene that was regulated by a microRNA in their experiment. So be really careful about those types of situations. Here's an ID mapping service provided by the g:Profiler tool. It's pretty straightforward. It's based on the Ensembl BioMart database, which gives us all these indexes for converting between IDs.
All you need to do is insert the gene list, which can be a mixed set of different IDs, and select the type of output that you wish. There's a long list of them. This one is for human. And you'll get back a table. When you go to Ensembl BioMart, you can actually download all these comprehensive translation tables. It's really useful. So the question is about losing genes while converting between rat and human. If you were using that particular toolset, then I wonder whether you'd want to... Well, if you're doing ortholog conversions, then I assume it's expected that some genes are lost. But I would say I need to understand more about the problem, so maybe we can talk about it offline. Yeah, so for example, you can have a gene list coming from your microarrays, and some of the identifiers won't be mappable by any tool at all, because they will either be missing or there will be some aliases that are ambiguous. In order to resolve those missing or ambiguous aliases, you may need to do some manual work. For example, check that alias in multiple gene databases and make sure that you understand which gene it's actually pointing to. So as in the example with the Nature paper, it has happened that exactly the same symbol was used for completely different genes on different chromosomes with different functions and so on. So this work will probably never be done precisely by any computational tool. OK, so to deal with that ambiguous mapping, in g:Profiler there's now a dialog that allows you to manually check, if there's ambiguity, which gene to proceed with. So here are a couple of recommendations on how to deal with these IDs. One of them is to focus on a particular well-acknowledged set of IDs, such as Entrez Gene IDs or official gene symbols. And when you really want 100% coverage, use a spreadsheet, but make sure that the copy and paste works well so you don't get October 4. 
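The checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how g:Profiler or any other tool works internally: the alias table is hand-made (the ALIAS1, GENEA and GENEB entries are hypothetical, while P53 and OCT4 are real aliases of TP53 and POU5F1), and the function name map_gene_list is invented for this example. It upper-cases the input, flags values that look like spreadsheet dates, and separates unique, ambiguous and unmapped identifiers.

```python
import re
from collections import defaultdict

# Illustrative alias table: (alias, official symbol).
# P53 -> TP53 and OCT4 -> POU5F1 are real aliases; ALIAS1, GENEA and GENEB
# are hypothetical and only demonstrate an ambiguous alias.
ALIAS_ROWS = [
    ("TP53", "TP53"),
    ("P53", "TP53"),
    ("POU5F1", "POU5F1"),
    ("OCT4", "POU5F1"),
    ("ALIAS1", "GENEA"),
    ("ALIAS1", "GENEB"),
]

# Values such as "4-Oct" or "1-Sep" are what a spreadsheet typically shows
# after turning a symbol like OCT4 or SEPT1 into a date.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$",
    re.IGNORECASE,
)


def build_index(rows):
    """Map each upper-cased alias to the set of symbols it may refer to."""
    index = defaultdict(set)
    for alias, symbol in rows:
        index[alias.upper()].add(symbol)
    return index


def map_gene_list(ids, rows=ALIAS_ROWS):
    """Split input IDs into mapped, ambiguous, unmapped and date-mangled."""
    index = build_index(rows)
    mapped, ambiguous, unmapped, mangled = {}, {}, [], []
    for raw in ids:
        key = raw.strip().upper()
        if DATE_LIKE.match(key):
            mangled.append(raw)        # needs to be recovered from the raw data
            continue
        targets = sorted(index.get(key, ()))
        if len(targets) == 1:
            mapped[raw] = targets[0]
        elif targets:
            ambiguous[raw] = targets   # needs a manual decision
        else:
            unmapped.append(raw)       # needs a lookup in other databases
    return mapped, ambiguous, unmapped, mangled
```

Calling map_gene_list(["p53", "Oct4", "4-Oct", "ALIAS1", "FOO"]) returns the two mapped symbols, the ambiguous alias with both candidates, the unmapped "FOO", and the date-like "4-Oct" in separate containers, so each group can be followed up manually.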
Beyond that, you may need to look at multiple databases, such as GeneCards, Ensembl, and species-specific databases, in order to make sure that your missing symbols are mapped correctly. And I mentioned the Excel thing already. So to summarize this part quickly, many IDs have been invented for different genes and proteins and molecules and so on. And when you do any kind of genomics analysis, you may need to convert from one to another repeatedly. This is a task that all bioinformaticians sort of despise, but it's required, and the software tools will not always do all the work for you. Use common IDs, such as official gene symbols, when you do your mapping. So meanwhile, I heard that there was a question about defining a pathway. In the context of this lecture, I would like to define a pathway as a list of genes that has been previously annotated to a particular function. So in this case, a pathway doesn't involve the interactions between the genes. We're talking about a set of genes, for example "these are all neurotransmitter genes, so let's call that the neurotransmitter pathway". In further lectures, people will talk about how to consider pathways when there are interactions within them, so that one pathway member regulates another pathway member. In this particular setting, gene set enrichment analysis, a pathway is a list of genes that's associated with a common function. So the main purpose of analyzing pathways is that we don't need to go back to the literature to understand the primary research that was performed to understand gene function. Instead, we can go to a pathway resource where all that information has been stored, gene by gene, function by function, and then analyze it using computational techniques to better highlight the biology in our experiment. Pathways are mostly collected into several resources. The most famous one is called the Gene Ontology, which we will mostly focus on during this presentation. 
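To make the "pathway as a gene list" definition concrete, here is a minimal Python sketch. The pathway names, the gene names (GENE1, GENE2, ...) and the function overlap_table are all hypothetical and chosen for illustration. It only counts how many genes from an experimental list fall into each gene set; it is not a statistical enrichment test.

```python
# Hypothetical gene sets: membership is illustrative, not curated.
PATHWAYS = {
    "neurotransmitter pathway": {"GENE1", "GENE2", "GENE3", "GENE4"},
    "cell cycle": {"GENE5", "GENE6", "GENE7"},
}


def overlap_table(gene_list, pathways):
    """Return (pathway, shared count, pathway size, shared genes) rows,
    ordered by the number of shared genes, largest first."""
    genes = set(gene_list)
    rows = []
    for name, members in pathways.items():
        shared = genes & members  # genes present in both lists
        rows.append((name, len(shared), len(members), sorted(shared)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


if __name__ == "__main__":
    experiment = ["GENE1", "GENE2", "GENE5", "GENE9"]
    for row in overlap_table(experiment, PATHWAYS):
        print(row)
```

Because a pathway here is just a set, no interaction information is needed; the comparison reduces to set intersection.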
Gene Ontology has biological processes, cell components, and molecular functions stored in it. There are many other pathway databases, about 500, that store information about pathways. For example, Reactome and KEGG are commonly used pathway databases for multiple species. And then there are all these other annotations, such as chromosomal positions, protein domains, disease associations, and so on. All of that can be used systematically in order to perform gene set enrichment analysis. In the most common cases, people look at the Gene Ontology and perhaps a few pathway databases. So the Gene Ontology is like a dictionary. It's like a book that you open up and see definitions of different terms. It's a dictionary for biological phrases, such as protein kinase, which is a type of protein, or apoptosis, which is a type of process, or membrane, which is a type of cell component. So Gene Ontology is a dictionary of terms and their definitions. It's a formal system for describing knowledge. And it's also a hierarchy, so there's a system for how those different terms are arranged relative to one another. This is an example of the hierarchy. For example, we can talk about B cell apoptosis, which is a type of apoptosis, which is a type of cell death, which is a biological process. So all these terms or ideas are arranged in the ontology in a hierarchical manner, using relationships such as "part of" or "is a". And it describes multiple levels of detail of processes and pathways. This is where the redundancy comes in. All these terms can have multiple parents or multiple children. So it's a structured way of representing knowledge. Gene Ontology covers three major trees, or three major structures, one of them representing cell components, another molecular functions, and the third biological processes. The last one is usually the most convenient for analyzing your experimental lists for biological pathways and processes. So where do GO terms come from? 
They are added by editors at the European Bioinformatics Institute in collaboration with multiple groups who are annotating genome data. Terms are added by request, and experts help with major developments. And here's a quick graph showing that GO is a live and evolving structure of knowledge. For example, between now and a few years back, it has grown 16%, so 16% more terms were added to the structure. Another part of GO is the annotations, where genes are linked or associated with GO terms as scientific knowledge improves. These are known as gene annotations or gene associations. Importantly, every gene can have multiple annotations. Some of those are created through manual curation by experts, and others are created automatically. So you can already see that there's a difference in quality depending on how these annotations are created. Here's an example of how genes and proteins get annotated to the tree. This is important to understand when you look at pathway enrichment analysis, because you see a lot of overlapping processes coming out. For example, there's a kinase called Aurora kinase B that is known to be part of B cell apoptosis. In the process, it is automatically added to all the parent terms of that process. So it's not only associated with B cell apoptosis, it's associated with apoptosis in general, as well as cell death, as well as death, as well as biological process. So the annotation sort of grows up the tree. Annotation sources? Yeah? Yes. So the question is whether this hierarchical annotation up to the top of the tree is also used in pathway enrichment analysis. And yes, that's correct. This is where the redundancy comes from. This is why you always see not one cell cycle process enriched in your gene set, but dozens, because all these terms at the bottom of the hierarchy are basically the same thing with subtle differences. 
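The upward propagation just described can be sketched as follows. This is a toy illustration, assuming a hand-written parent table that mirrors the lecture's example chain; the names PARENTS, ancestors and propagate are invented here, and real GO data would be loaded from the ontology and annotation files instead.

```python
# Toy hierarchy following the lecture's example: child term -> parent terms.
# A term may have several parents, so the values are lists.
PARENTS = {
    "B cell apoptosis": ["apoptosis"],
    "apoptosis": ["cell death"],
    "cell death": ["death"],
    "death": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_annotations, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


if __name__ == "__main__":
    # AURKB (Aurora kinase B) with one direct annotation, as in the lecture.
    print(propagate({"AURKB": {"B cell apoptosis"}}, PARENTS))
```

One direct annotation turns into five term memberships, which is why closely related terms tend to appear together in enrichment results.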
And the genes get annotated up the tree, so the terms become more and more general, but also more and more redundant. I'm sorry, I didn't get that. You can. Yes, you can find out which GO term is associated primarily. In most cases, you need to go to the original table to look it up. So the question is how large a GO category can be. It can be a single gene, if you manage to convince the GO editor that it's really novel and it has to be there and so on. And then in the future, more genes will be added. In the pathway enrichment step, you can actually set filters on how large the gene sets you want to look at are, and that helps you fine-tune the results. So annotation sources are diverse. They're basically based on what types of experiments were used in order to link that gene to that function. Manual annotations are curated by scientists. But the manual annotations are unfortunately a minority, because it takes a lot of time to curate the primary literature. Instead, the GO team also uses reviewed computational analysis, where they run algorithms and validate that the algorithms are fine. But there's also a fraction of GO annotations that are made purely electronically, by downloading data from other databases or by transferring gene functions from different species. That is the part of GO that is less curated, less taken care of, and also probably a little bit more noisy. So when conducting pathway enrichment analysis or any other pathway analysis, you should be aware of the type of annotations that were used in order to link genes and pathways. That also helps you validate and prioritize your findings. There is an indicator in some tools, including g:Profiler, which we will cover, of what type of evidence was given in the annotation process. GO itself maintains all these evidence types that you can see here. Here are some experimental evidence types, such as mutant phenotype. 
Here are some computational evidence types, such as sequence similarity. Then there are author statements, such as author X said that gene X is related to process Y, and so on. And then there's the electronic annotation type. In g:Profiler, we use colors to indicate which type of evidence was used. The redder tones represent stronger experimental evidence, and the bluer tones represent electronic evidence. So you can quickly see what type of evidence was used in order to infer the pathways for the gene list. GO covers a lot of species, and the information about each species originates from species-specific databases. There's a lot of information about human and a lot of information about model organisms. There are some bacterial and parasite species, from TIGR and so on. And then the Ensembl database takes good care of annotating further species for which genome sequences are available, to get even more data from GO. Pardon? So TP53 in rat is TRP53, I believe. So yes. No. So the question is whether, when you do species-specific analysis, you have to convert manually between IDs from human to other species. No, you don't. But you may have to take care that you're not giving the wrong input yourself. At least in g:Profiler, all the IDs are mapped. Right. The question is about IDs and capital letters. I would say many tools convert the input IDs to capital letters anyway, so you shouldn't worry about capital letters. Yes, you're right. Yes. So the question is what you do when you have several names for the same gene. You should choose a standard one. For human, I'm sure there's a list of gene names that are considered standard, the nomenclature, and you should always refer to that. Yes. Right. So maybe we can continue the discussion in one of the next sessions, because I really need to finish this deck before you go on a break in 15 minutes. Yes. 
There will be a practical part where you can test these gene ID tools. So to get back to the point that a lot of that information is experimental, but much, much more is still computational, here's a chart that shows, for different species, how many annotations are available and what fraction of them are actually experimental and what fraction are derived by computational means. You can see that most species, well, all of them actually, have a majority of the information coming from computational predictions, which is fine, because there are certain types of experiments that you can never do in human but can infer from mouse, for example. There are many contributing databases. There are species-specific databases, and then there are large-scale databases such as Ensembl that try to incorporate information from many species. And there's a type of gene ontology called GO Slim that attempts to trim the entire GO tree for certain uses where there's clearly too much information. For example, if you have this complex tree of different terms, it's very difficult to draw a pie chart of the things represented in your gene set, but it's more feasible when you look at the GO Slim set, where the vast majority of detail has been pruned in order to provide a simplified version of the annotations. There are many software tools that allow you to analyze GO. GO itself and all the associated tools are publicly available for people to use without restrictions, and many other groups have created gene ontology analysis tools for various tasks. So, to access GO, there are many tools. Some of them are developed by the GO Consortium. Here's one called QuickGO. You can look up your terms of interest in the structure. You can visualize the tree. You can look at the different associations that are part of the tree, and you can look up individual proteins and their functions in the tree. 
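The GO Slim idea mentioned above, rolling detailed terms up to a short list of broad categories so that they can be counted or drawn as a pie chart, can be sketched like this. Everything in the example is an assumption made for illustration: the small hierarchy, the two-term slim set, the gene names and the function names. Actual GO Slim subsets are published by the GO Consortium.

```python
from collections import Counter

# Toy hierarchy (child term -> parent terms) and a toy "slim" subset.
PARENTS = {
    "B cell apoptosis": ["apoptosis"],
    "apoptosis": ["cell death"],
    "cell death": ["biological process"],
    "mitotic cell cycle": ["cell cycle"],
    "cell cycle": ["biological process"],
    "biological process": [],
}
SLIM = {"cell death", "cell cycle"}


def slim_terms(term, parents, slim):
    """Walk upward from a term and return the nearest slim terms."""
    found = set()
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        if current in slim:
            found.add(current)
            continue  # stop climbing once a slim term is reached
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return found


def slim_counts(annotations, parents, slim):
    """Count genes per slim category; each gene counts once per category."""
    counts = Counter()
    for gene, terms in annotations.items():
        gene_slims = set()
        for term in terms:
            gene_slims |= slim_terms(term, parents, slim)
        counts.update(gene_slims)
    return counts


if __name__ == "__main__":
    annotations = {
        "GENE1": {"B cell apoptosis"},
        "GENE2": {"apoptosis", "mitotic cell cycle"},
        "GENE3": {"biological process"},  # too general: maps to no slim term
    }
    print(slim_counts(annotations, PARENTS, SLIM))
```

The resulting counts are small enough to plot directly, at the cost of discarding the detail carried by the specific terms.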
GO is not the only ontology that uses a systematic representation of biological knowledge. Here's a cell type ontology that represents different types of cells and their organization. And besides GO, you have all these pathway databases. There are more than 500 of them, so going through them one by one is not optimal. There's a database called Pathway Commons, which is a meta-database of databases, and it merges the major pathway databases in a smart way. And besides pathways, you can use any type of gene annotation in order to perform pathway enrichment analysis. For example, you can use protein domains or transcription factor binding sites or chromosomal positions, depending on your application and what you're interested in. There are several ways of obtaining these attributes. Ensembl BioMart is a really convenient way of extracting information from Ensembl. You can get similar types of features from Entrez Gene, and if you're looking at a particular model organism, you may find that the model organism database is your best source of knowledge. Here's a quick overview of how to browse Ensembl BioMart. It has a pretty neat user interface, and it's updated every three months or so. Here you first select the genome that you're interested in and the set of annotations for that genome. You can select from a large variety of filters in order to determine which genes you're interested in, and then select the attributes you wish to download. They could be all types of attributes, including protein-level features or regulatory features or sequences or whatever. To summarize the second part, the pathway analysis inputs: pathways and other gene attributes are available in different databases. Gene Ontology is one of the major ones that people use, but there are also many databases for pathways. Some of them are species-specific. Gene Ontology has wide coverage of different species. 
The Gene Ontology is a classification system and dictionary for biological concepts. So there are two parts. There's the dictionary, so the biological terms, and then there are the annotations, the associations of these terms with different genes. Keep in mind that every gene can have multiple annotations, and it will have multiple annotations due to the hierarchical rule of more general and more specific terms. Some genomes are better annotated than others, and some annotations are of better quality than others. And for particular purposes, you may look at the GO Slim data set, which provides a simplified structure of GO. Here's a small overview of the pathway enrichment analysis workflow, which starts with raw data, goes to a gene list, looks at pathways in a statistical way, and then looks at pathways in a visual way. You can choose some of those pathways to drill down to the mechanism, bring them back to the genes, and then perhaps perform validation experiments and publish an experimental paper. However, nothing is ever as simple as a diagram like that makes it seem. There's way more to it. You can analyze pathways just by gene set analysis, but you can also look at interactions within pathways. Instead of small-scale pathways or gene sets with a common function, you can look at large-scale interaction networks, such as protein-protein interaction networks covering all proteins. And there's more to it, so a lot of this will be repeated over the coming lectures and a lot of new information will be added as well. So maybe we can continue all the discussions that we had, and I'm a bit early.