So my name is Veronique and I work at the University of Toronto in Gary Bader's lab, in the field called pathway and network analysis. I'm a biologist and I work for people at OICR, Cancer Stem Cell, in Toronto, who are also biologists. So the tools that I'm going to show are very accessible to everyone. This lecture is like a transition from the lecture that you had this morning with Daniel, where you got a gene list from your variants, to the lecture that is going to be presented this afternoon, where we are also going to use pathway and network analysis.

The learning objectives of the module are to understand the basic concepts of pathway and network analysis, to be able to recognize different gene identifiers and gene attributes, and to understand how simple enrichment analysis tools work. And we are going to visualize our results using Cytoscape.

So when do you want to interpret a gene list? Usually it's when you've done a large genomics analysis and, like Daniel showed you this morning, you got your gene list. It could come from your variants, it could also come from gene expression data, and you have quite a lot of genes. You want to know what is interesting about these genes. And one question that we are going to answer in pathway and network analysis is: are these genes enriched in known pathways?

So first we should define pathways. When we do functional analysis, pathway and network analysis, these are mainly signaling pathways and metabolic pathways. But very often we are also interested in drug associations or disease associations. And sometimes I will talk about a gene set. So what is a gene set? We say that it contains all the genes in a given pathway or in a part of a pathway. So I will use pathway and gene set interchangeably.

So what are the advantages of pathway and network analysis? First, it saves time compared to the traditional approach.
The traditional approach would be to look at your genes one by one and to do searches, like PubMed searches, to know what these genes are and to try to make associations between them. But it's really time consuming. And the other thing is that if you do it manually, you don't have time to look at all the genes, so you are sometimes going to focus on your favorite gene: oh, your favorite gene is in the list. So it's biased. The bioinformatics approach saves time and is less biased because you look at everything in your list.

I also think it's an intuitive way of analyzing your results. At least for me, because I'm a biologist. Usually we have a good knowledge of the cell, and you can imagine the cell, the nucleus and the membrane. You have a receptor, the signal arrives at your receptor and is passed on, and you have all these signaling pathways in your mind. Pathway and network analysis will draw a map of the pathways that you are going to analyze, so it's much more intuitive. You will have your gene list and your pathways, which come from a pathway database. You will find the overlap between your gene list and your pathways, and we will see it as a network.

Another advantage of this kind of analysis is that it can enable you to find overlap at the pathway level. I don't know if you have had this problem. I get this problem quite often. You do two types of genomics analysis, like RNA-seq, ChIP-seq, whatever. You have your hits. You compare your hits and you don't find any direct overlap. There are no genes in common between the hits of your first analysis and the hits of your second analysis. But you cannot say there is no overlap until you have done the pathway and network analysis, because you may have overlap at the pathway level. And this is one of these examples. You do CNV analysis and you find mutations in patients.
Patient one has mutation one, patient two has mutation two, patient three has mutation three. So they are all different, but they hit the same pathway, and all these patients are diabetic. So the overlap is at the pathway level. And usually when you find an overlap at the pathway level, you should be very happy because that's what you are looking for.

So now I'm going to show you an example of the use of pathway and network analysis. It's from this paper, I don't know if you saw it yesterday or the day before, on the genomic and transcriptomic architecture of breast cancer. They used a large cohort of breast cancer patients, and they also had normal samples. They did copy number variants and, at the same time, gene expression. The copy number variants were on Affymetrix SNP arrays and the gene expression on Illumina. And they did this eQTL analysis, expression quantitative trait loci, to find associations between the CNVs and the genes that were dysregulated at the expression level. They found what they call trans-acting aberration hotspots. They defined a hotspot by more than 30 mRNAs with dysregulated expression associated with one CNV. Okay, so in a subgroup of patients, they found a CNV and they defined a gene list, say about 100 genes, that were also dysregulated in the patients that had the CNV. So now they want to interpret the gene list, and they use pathway and network analysis.

So one figure, this is the output. This is the CNV and the association with the expression level, the eQTL result. Here, this is the whole genome. You have two loci, TRA and TRG, that have a deletion: deletion here, deletion here. And here, on the horizontal axis, you have a few genes, maybe 100 genes, that have dysregulated expression associated with these two deletions. They took the gene list, they did pathway and network analysis, and this is the output.
So this is a map. Each node, each circle of the map, is a pathway. The lines are edges, and they represent genes that overlap between pathways. When you look at the titles of these pathways, they are all related to T-cell function. And the CNVs, the deletions, were in the T-cell receptor, so part of the immune response. So the CNVs were corresponding to mature T-cells, because when T-cells become more differentiated, they rearrange their T-cell receptor. And here it's also related to T-cells. So what does it mean? In this subgroup of patients, they had infiltrating T-cells in the breast tumors. And the prognosis of these patients is good. So they have better survival than the other groups. And they think it's because the T-cells give an immunological response. So the T-cells fight the cancer.

Another figure. This is the association matrix here. They have a deletion at chromosome 5, and they have 100 genes associated with this deletion. And what does it mean? They took these 100 genes and did pathway and network analysis. Here is the map. So once again, one circle is a pathway, one line is an edge, and you have the title of the pathway. And you see it seems all related to cell cycle. So in this subgroup of patients that have the chromosome 5 deletion, they have dysregulated mRNAs, and this correlates with cell cycle. And when you look at the survival index, the prognosis and all this information, you will find that this subgroup does indeed have a higher mitotic index. So their cells are known to cycle more.

And this is the same gene list visualized using the Reactome FI plugin that you are going to see this afternoon. Now each node is a gene. We have a gene-gene network, and you have functions associated with this network. So for example, the genes here in purple all correspond to telomerase, the genes in yellow to cell cycle. So these are two different views: one is a pathway-pathway network and one is a gene-gene network.
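
The pathway-pathway map described above, where each node is a pathway and an edge means shared genes, can be sketched in a few lines. This is a minimal illustration, not the method used in the paper: the function name `pathway_overlap_network`, the `min_shared` threshold and the toy gene labels are all assumptions made for the example.

```python
from itertools import combinations

def pathway_overlap_network(gene_sets, min_shared=1):
    """Return edges (pathway_a, pathway_b, n_shared) for pathways sharing genes.

    Hypothetical helper: nodes are pathway names, and an edge is drawn
    when two pathways share at least `min_shared` genes.
    """
    edges = []
    for a, b in combinations(sorted(gene_sets), 2):
        shared = gene_sets[a] & gene_sets[b]
        if len(shared) >= min_shared:
            edges.append((a, b, len(shared)))
    return edges

# Toy gene sets with made-up gene labels (not real annotations).
toy_sets = {
    "T cell activation": {"g1", "g2", "g3", "g4"},
    "T cell receptor signaling": {"g2", "g3", "g5"},
    "cell cycle": {"g6", "g7"},
}

edges = pathway_overlap_network(toy_sets)
for a, b, n in edges:
    print(f"{a} -- {b} (shared genes: {n})")
```

In this toy input the two T-cell pathways are linked because they share two genes, while the cell cycle pathway stays unconnected, which is the kind of grouping you read off the map.
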
So what do we need to do pathway enrichment analysis? First, we need our gene list. Second, we need gene attributes, which come from a pathway database. And then we are going to use a tool to find the overlap between your gene list and these gene attributes.

So some recommendations before you start a pathway and network analysis. Try to clean your data as much as possible. Because if you input true positives, then you are going to be confident about the results of your analysis. It's garbage in, garbage out. If you are not confident about your input, if you don't have your true hits in your gene list, then you will have less confidence in the output.

Yeah?

So it depends on the tools you are going to use. For some very simple tools, you just put in the gene names. Some other tools, and they are very interesting tools, let you rank your genes using the confidence. So this is something you can look at when you choose a tool. If you can have a value associated with each gene, like a confidence score, and you can rank this list, and your list is large, then you can choose these tools.

The confidence score of the sample?

It depends on your list. It depends on your experiments. If you have gene expression data, it could be the p-value, it could be a score, the number of hits per gene, the number of mutations per gene, so it's really case by case.

So yeah, your gene list size is important too. For simple enrichment tools, like DAVID, if you know it, I would say 50 to 500 genes is optimal. If you have few genes, 10 to 50, you can still do it, but choose other kinds of tools, like gene function prediction tools. And if you have a large gene list, more than 500 genes, then try to rank your list and try to use tools that use this ranking. And make sure that your gene IDs are compatible with the software. So yeah, where do gene lists come from?
The pathway and network analysis concept can be very general, so it's case by case. It could be gene expression data, it could be protein interaction data, genetic screens, association studies. And because gene lists come from different sources, they are very different, so it's important that you know what question you want to answer, and that your experimental design has been done correctly to answer that question. Then you choose the right tool to answer your question. You can summarize the biological processes, find differences, find a controller for a process like a transcription factor or a microRNA, or find new pathways. So really think about your experimental design and think about the question you want to answer.

So we have a gene list, and we need gene identifiers so that the pathway database recognizes your genes. Maybe gene identifiers are a basic concept, but if you work with a large amount of data, then you have to be very careful. Identifiers are unique, stable names for a gene, like an Entrez Gene ID or RefSeq. But we have many, many databases that store information, so we have many, many gene IDs. You also need to recognize that these identifiers don't store the same information. If it's a protein database, then the ID is for a protein, and if it's an ID for a gene sequence, it stores the gene sequence record. So it's important to recognize the right identifier.

These are the common identifiers. Maybe you know some of them. I would recommend using the most common ones for the pathway analysis tools, like Ensembl, Entrez Gene and RefSeq. The two I use most often are the Entrez Gene ID and the official gene symbol. And one tool that I like is GeneCards.
So if you have a gene name and you want to access some of the common identifiers very rapidly, you can use GeneCards, and you also have all the names and much more information about this gene.

My favorite identifier is the Entrez Gene ID. It's a numerical value, so it's very easy to manipulate. It's also stable, so even if the gene hasn't been studied and doesn't have a gene symbol, it has an Entrez Gene ID. So you don't have to update your list all the time if you use Entrez Gene IDs. And yeah, Entrez Gene is a database, a retrieval system, and it has many, many connections between all these different databases. So with an Entrez Gene ID, if you scroll down, you have links to the other databases. Another one is RefSeq. If you scroll down this page, you are going to find the RefSeq identifiers, like NM for mRNA and NP, P for protein.

Sometimes, for example, we get our data from Illumina and we have Illumina probe IDs, and we need to convert from one type of identifier to another because we want to have the identifier that our tool is going to use. So we have web tools that we can use to convert from one type to the other, and we are going to see this later.

When we have many, many genes and we manipulate all these data, we have to be careful. There are some ID challenges. So gene names: that's why we prefer to use Entrez Gene IDs and not gene names, because sometimes there is some ambiguity. There can be many gene names for one gene. So be careful and try to use the official gene symbol. If you use Excel, you are also going to have trouble if you use gene names, because many gene symbols, like OCT4 or SEPT4, are going to be converted to dates. And if you have thousands of rows, you may not notice it. So be careful if you use Excel. And it's very, very difficult to obtain 100% coverage. So you are going to have missing values.
So if you really need 100% coverage, try to use different sources and try to add the missing annotations manually. It could be due to this problem, but sometimes you have Entrez Gene IDs that don't have a gene symbol and things like that. The databases do not overlap at 100%. So that's why you almost always have a few missing values. And depending on the tool that you use and the version that you use, you may have these missing values.

So if you use Excel, this is the example of all these genes that were converted to dates. Open Excel first, then open your file. You will get the text import wizard. Select the column with your gene symbols and set the column as text.

And this is another example of why you have to be careful with gene names. It's a paper that had to be retracted. They were working on HES1. But there is another gene in the database that has the same name. It's quite old, but still. Two genes named HES1. So the researcher thought he was working on Hairy and enhancer of split 1, but he was working on the other one, a human homolog also called HES1. So this paper was wrong. Yeah. Okay. It's quite old. I think now the databases are a little bit better and more standardized, but yeah. They had a nice paper, but on the wrong gene.

And this is a nice and easy tool that we are going to use in the lab, the Synergizer, to convert from one identifier to another. You can also use BioMart, and we will also use BioMart.

So our recommendations: map everything to Entrez Gene IDs. If 100% coverage is needed, then try to manually add the missing annotations. Be careful of Excel auto-conversions. And what have we learned? Genes and their products and attributes have many identifiers. Genomics often requires conversions of IDs from one type to the other.
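
The two checks recommended above, spotting gene symbols that Excel turned into dates and measuring how many symbols map to an Entrez Gene ID, can be sketched as follows. This is only an illustration under stated assumptions: the date pattern covers the `4-Sep` style and is not exhaustive, the helper names are hypothetical, and the small lookup table stands in for a real conversion tool.

```python
import re

# Assumed pattern for symbols mangled by Excel, e.g. SEPT4 -> "4-Sep".
# Other date formats exist; this is not an exhaustive check.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)

def find_date_like(symbols):
    """Return the entries that look like dates instead of gene symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]

def mapping_coverage(symbols, symbol_to_entrez):
    """Map symbols to IDs and report what is missing (hypothetical helper)."""
    mapped = {s: symbol_to_entrez[s] for s in symbols if s in symbol_to_entrez}
    missing = [s for s in symbols if s not in symbol_to_entrez]
    coverage = len(mapped) / len(symbols) if symbols else 0.0
    return mapped, missing, coverage

# Illustrative input: two valid symbols and two Excel-damaged entries.
symbols = ["TP53", "4-Sep", "BRCA1", "1-Mar"]
lookup = {"TP53": 7157, "BRCA1": 672}  # tiny stand-in for a real ID mapping

suspicious = find_date_like(symbols)
mapped, missing, coverage = mapping_coverage(symbols, lookup)
print("Possible Excel dates:", suspicious)
print(f"Coverage: {coverage:.0%}, missing: {missing}")
```

A check like this is cheap to run on thousands of rows, which is exactly the situation where you would not notice the conversion by eye.
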
There are tools that exist to do these conversions, and yeah, use common IDs like Entrez Gene ID or RefSeq.

Okay. So now gene attributes. Gene attributes come from the pathway databases, which store all the functional annotations. When we speak about pathway and network analysis, we are more interested in functional annotation. But we may be interested in other features like chromosome position, disease association, DNA properties, protein properties. All these features you can find in a genome browser, and the functional annotation you find in the pathway databases.

There are a few pathway databases. Generally you know Gene Ontology, and there are also KEGG, Reactome and BioCarta. I'm going to talk about Gene Ontology, and this afternoon you are going to hear about Reactome, BioCarta and KEGG.

So what is Gene Ontology? It's the largest database. It's updated very regularly. It covers many organisms. And it's freely available. It covers three major aspects of gene function: cellular component, molecular function, biological process. Okay. So plasma membrane would be a cellular component. This enzymatic reaction would be a molecular function. And cell division would be a biological process. When you do pathway and network analysis, normally you care about molecular function and biological process.

GO is like a dictionary. It contains terms, and the terms in the database are related to each other. It's a hierarchy. At the top of the hierarchy you have the more general terms, and at the bottom you have the more specific terms. And you have two kinds of relationships, either 'is a' or 'part of'. Like this red one is 'part of' and this one is 'is a'. So here at the bottom you have the term B cell apoptosis. B cell apoptosis is part of B cell homeostasis. But it's a type of apoptosis, which is a type of programmed cell death, which is a type of cell death. And we call these parents and children.
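
The hierarchy just described can be written down as a small graph and walked upward to collect every more general term. This is a minimal sketch: the term names and the two relationship types are the ones from the example above, written as plain strings rather than real GO identifiers, and the `ancestors` helper is hypothetical.

```python
# Parent links from the B cell apoptosis example: term -> [(parent, relationship)].
# Term names are used as spoken in the lecture, not as GO accession numbers.
PARENTS = {
    "B cell apoptosis": [
        ("apoptosis", "is_a"),
        ("B cell homeostasis", "part_of"),
    ],
    "apoptosis": [("programmed cell death", "is_a")],
    "programmed cell death": [("cell death", "is_a")],
}

def ancestors(term, parents=PARENTS):
    """Collect all terms reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent, _relationship in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("B cell apoptosis")))
```

Because a term can have more than one parent, the walk follows both the 'is a' branch and the 'part of' branch, so a gene annotated to B cell apoptosis is implicitly annotated to all four more general terms.
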
A child term can have multiple parents, and a parent can have multiple children.

Now GO is going to associate a gene with GO terms, and you can have multiple associations. This is an example of how GO associates information with this gene, PERK1. It's manually curated. This is a paper about this gene that describes the function of PERK1. So for the GO term association, receptor-like kinase would be the molecular function, integral membrane protein would be the cellular component, and wound response would be the biological process. This is manual curation. You can also have electronic annotations that are not from papers but from bioinformatics predictions.

So you want to know how the associations were created. This gene was associated with this GO term by IC, inferred by curator. This is what we call evidence types. And this gene, PSMD4, was associated with this GO term by TAS, traceable author statement. This is the kind I have presented here. And you have IEA, inferred from electronic annotation, which comes from predictions. So depending on your case, sometimes the tools offer the choice, and you can remove the IEA annotations if you don't feel confident about them.

When you have a GO term and you want more information about it, there are different tools, and one of these tools is QuickGO. It's a very simple web tool. You can enter your GO term number and you get a lot of information, like the term information, ancestor chart, ancestor table and child terms. We are going to use it during the lab.

The other databases are KEGG, I think you know it, BioCarta, Reactome, and Ingenuity, which is commercial, so this is the only one that is not freely available. And Pathway Commons, which groups all these databases together, is going to be presented this afternoon. And this I've mentioned already: all these other attributes, you can find them using a genome browser.
Ensembl BioMart is also a web tool that is really easy to use to retrieve all these gene attributes. You have your large gene list, you can copy and paste it and retrieve these attributes. We are going to use it during the lab.

So what have we learned? Gene attributes define the functions and characteristics of a gene. Many gene attributes are stored in databases like GO, KEGG and Reactome. And many gene attributes are available from Ensembl and Entrez Gene. And this is just for your information: different URLs and sources of attributes.

OK, now we have our gene list with the right identifiers. We have our functional annotation. What we want to find is the overlap between the two, and for this we are going to use enrichment tools. There are many tools that exist, and we can divide these tools into three categories. The first one is functional pathway analysis. The second is class scoring. And the third one is pathway topology.

The first one is the simplest, like DAVID, if you know it. It's ideal if you have a gene list of, I would say, 50 to 500 genes. You just have the gene names. You don't have any values associated with your genes.

The second one is class scoring. If you have a larger gene list and you are able to have scores associated with these genes and you can rank your genes, then you can use these tools. One example is GSEA.

And the third one is pathway topology. Pathway topology uses the functional annotation, but in addition to that, it uses the relationships between your genes to build the network and to score the significance of your results. So let's say in your gene list you have 10 genes that are from a given pathway, pathway A. But among these 10 genes, five are inhibitors of the other five. So it means you have five genes that activate and five genes that inhibit the ones that activate.
That makes less sense than if you just had 10 genes that all go in the same direction to activate your pathway. So pathway topology uses this information to build the network. You are going to see an example this afternoon with the Reactome FI plugin, so you are going to see that in much more detail this afternoon.

So what is gene set enrichment analysis? You break down the cellular functions into different gene sets. You are going to find the overlap between your gene list and these gene sets, and you want to see if this overlap is significant or not. So does it occur just by random chance or not?

In general, what the tools are going to do is calculate the overlap. So here, this is my pathway A. And here, these are my 100 genes that are significant. Let's say I have 30 genes that overlap between the significant genes and the pathway. Is this overlap larger than expected by chance? How can I find out? I select 100 genes randomly out of the genome, because my list has 100 genes. If I select 100 genes randomly, what are the chances that I get a 30% overlap? And I'm going to do it many, many times to be able to build the significance score.

The simple tools, the first category that I showed you, usually use Fisher's exact test. So I have here five genes, four black and one red. The background population here is my genome. In my genome, I have 4,500 red and 500 black. What is the probability, what is the chance, of getting four black and one red in my gene list? The null hypothesis is that my list is a random sample from the population. If I reject the null hypothesis, it means, well, it's not by chance: I have more black genes than expected in my list.

So first, Fisher's exact test is going to build the null distribution. This is my genome, my gene universe. I take five genes, or balls, randomly. What is the probability to take five red?
57%, because we have more red than black. What is the probability to take four red and one black? 35%, and so on. And the probability to take four black genes and one red is very low, because you don't have that many black genes in the genome. Let's say that the black genes are one particular pathway, and this is apoptosis. So then your cutoff would be this value. And the p-value of Fisher's exact test is going to be the sum of the probabilities that are equal to or less than my cutoff. So the p-value is 0.001, which is less than 0.05. So you can be confident that it's not expected by chance. Having four genes in my list that belong to the apoptosis pathway is not by chance.

All these tools use this kind of concept. We usually test for the over-representation of a pathway in our gene list. You can also test for the under-representation of a particular pathway, but it's very rare. For this kind of test, you need to choose your background population. Normally, if you use a genome-wide experiment, you don't need to set the background. But if you work with an array that is not representative of the whole genome, then you need to set your background population.

So we have calculated this p-value for one particular gene set. We need to do it for all the pathways that we have in our databases, so we test many, many pathways. Normally the output of an enrichment or pathway analysis is a table with all the gene sets we have tested and their p-values, ranked from the most significant to the least significant. And because we are testing so many pathways, we need to correct for multiple hypothesis testing. That's why, when you look at the output of an enrichment analysis, after the p-value column you usually have the FDR. The FDR is the false discovery rate. It corrects for multiple hypothesis testing.
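
The red and black genes example can be worked through numerically. This is a minimal sketch, not how any particular tool implements it: it computes the one-sided probability of drawing at least the observed number of pathway genes (the over-representation form of Fisher's exact test), and then adjusts a list of p-values with the Benjamini-Hochberg procedure, which is assumed here as the FDR correction. The function names and the example p-values are made up for illustration, and the computed numbers may differ slightly from the rounded values read off the slide.

```python
from math import comb

def hypergeom_enrichment_p(overlap, list_size, set_size, universe):
    """P(at least `overlap` pathway genes) when drawing `list_size` genes
    at random from `universe` genes, `set_size` of which are in the pathway."""
    max_k = min(list_size, set_size)
    total = comb(universe, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, list_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Genome of 5,000 genes: 500 black (the pathway) and 4,500 red.
# Observed list: 5 genes, 4 of them black.
p = hypergeom_enrichment_p(overlap=4, list_size=5, set_size=500, universe=5000)
print(f"p-value for 4 or more black genes out of 5: {p:.2e}")

# Made-up p-values for four gene sets, to show the adjustment.
raw = [0.01, 0.04, 0.03, 0.005]
print("adjusted:", benjamini_hochberg(raw))
```

In practice one such p-value is computed for every gene set in the database, and the adjusted values are what you would read in the FDR column of the output table.
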
The FDR is the expected proportion of the observed enrichments that are due to random chance. So if you use an FDR of 15%, it means that about 15% of the gene sets you call significant at that threshold are expected to be false positives.

What is the best cutoff for the p-value to say that it's not random chance?

Yeah, usually people say 0.05. It's arbitrary. It's always better to rank from the most significant to the least significant, because 0.05 is a convention, it's theoretical. But usually it's 0.05. And usually the FDR is calculated. The most common method is the Benjamini-Hochberg correction, and the result is often called the q-value.

So what have we learned? The typical output of an enrichment analysis is a table. The minimum information that you will have is the pathway names, the number of overlapping genes between your gene list and each pathway, the number of genes in the pathway, like apoptosis has 500 genes, the associated p-value and the corrected or adjusted p-value. This is usually the output. It's not very clear. It's difficult to interpret. And there are many, many pathways that are related to each other because they have genes in common. So that's why we use network visualization to show the results as a network, and we use the Cytoscape software to do that.

Cytoscape is open source software used to visualize complex networks. There are a lot of apps, which we call plugins, for different tools. So first you have to download Cytoscape, and then you have to download the plugins you want. Or you can create your own plugin. The major advantage of a network is that it enables you to represent relationships. So you are going to be able to represent the relationships between pathways if you do a pathway network, or the relationships between genes if you do a gene-gene network.

So two basics. First, you need to understand the concepts of nodes and edges. Again, there are two possible networks. In a gene-gene network, each node, each circle, is a gene.
So this gene is related to the other one by an edge. If you have a pathway network, each circle or node is a pathway, related to the others by edges. In the gene-gene network, the association could be, OK, we know that they physically interact. Or you can have an arrow to say this gene is an activator of that one. In the pathway network, it could be the number of genes that overlap between the two pathways.

The second thing you need to know about is automatic network layout. If you don't have any layout, you get something like this, a hairball. You don't see anything. So you take the output of your enrichment result and you draw a network without a layout, and it looks like this. You cannot make any conclusions. So you need to apply a layout, and Cytoscape has different automatic layouts. I think the most common is the force-directed layout. Nodes repel each other, and edges pull. So if the nodes are very connected to each other, then the edges act like springs, and the nodes are going to be close to each other, like a cluster. But because nodes repel each other, they will not overlap. So you can see each one of these nodes.

And OK, I hope you all installed Cytoscape. This is just a basic introduction, in case you haven't done the tutorial. When you open Cytoscape, you have three parts. The first one is the control panel. The second is the data panel. And the third one is the results panel. Each time you create a network, you can save your session and open it later.

To navigate through the network: let's say we have a large network and you don't know how to navigate through it. You go in the control panel to Network, and you click on that. You have a blue square. You can click on this square and move it around to navigate through the network. Then the layout. You go to the menu, Layout, Cytoscape Layouts, and you can choose the one you want. You can play and add a lot of visual features.
You go to the control panel, VizMapper, and you have all these choices. You can modify the shape of the nodes, or modify the node size, and many other things. And you can make this beautiful network. Nodes are in different colors. Edges are also in different colors. It could also be the thickness of the edges. If you want to prepare a figure for publication, then you can use these visual features. But it also helps you to define clusters and make things more interpretable.

So what have we learned? Networks are useful for seeing relationships in large data sets. It's important to understand what nodes and edges mean. Automatic layout is required to visualize the networks. Visual attributes enable multiple types of data to be shown at once.

I'm just going to show you two examples of Cytoscape plugins. The first one is BiNGO. It does an enrichment analysis, like DAVID, using Fisher's exact test, exactly as I showed you, all within Cytoscape. It uses the Gene Ontology database, GO. So the output is a tree that follows the GO hierarchy, with the general terms at the top of the tree and the more specific terms at the bottom. And the color indicates whether the enrichment is significant or not.

Another interesting plugin is for clustering. Sometimes you have a large network and you need to cluster it. This one is MCODE. MCODE enables you to cluster your network, and you see all these genes that have been clustered in your network.

At the end of my slides, I have a few Cytoscape tips and tricks. I put them in the lecture for you so you can read them after the workshop. So I won't read them now. They cover networks, this one is the root graph, network views, sessions, logging, memory and the Cytoscape directory.

This is an active community. You can go to this website and you are going to find many tutorials and a list of all the plugins that are available.
And it's a community that is growing. And that's it. And now we are going to go to the lab.
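
The force-directed layout idea described earlier, where nodes repel each other and edges pull like springs, can be sketched in plain Python. This is a toy version written for illustration only: it is not Cytoscape's implementation, and the function name, the constants (`k`, the starting temperature, the iteration count) and the six-node example network are all assumptions.

```python
import math
import random

def force_directed_layout(nodes, edges, iterations=200, seed=0):
    """Toy spring layout: all node pairs repel, connected pairs attract."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))  # preferred edge length (assumed constant)
    for step in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion: every pair of nodes pushes apart.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += dx / d * f
                disp[a][1] += dy / d * f
                disp[b][0] -= dx / d * f
                disp[b][1] -= dy / d * f
        # Attraction: each edge acts like a spring pulling its ends together.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[a][0] -= dx / d * f
            disp[a][1] -= dy / d * f
            disp[b][0] += dx / d * f
            disp[b][1] += dy / d * f
        # Move each node, capped by a temperature that cools toward zero.
        temp = 0.1 * (1 - step / iterations)
        for n in nodes:
            d = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            move = min(d, temp)
            pos[n][0] += disp[n][0] / d * move
            pos[n][1] += disp[n][1] / d * move
    return pos

def distance(pos, a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

# Two small clusters with no edges between them.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F")]
layout = force_directed_layout(nodes, edges)
for name, (x, y) in layout.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

With this input, the connected nodes should end up close together and the two groups should drift apart, which is the cluster effect you look for when you apply a layout to an enrichment network.
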