By the end of the lecture and the lab, my objectives are that you will be able to work with gene identifiers and gene attributes, that you understand how simple enrichment tools work, and that you become familiar with the Cytoscape software.
The starting point of a pathway and network analysis is a gene list. It can be a medium-sized gene list or a very large one, and it can come from proteomics or genomics experiments. What you want to do is interpret your gene list: you want to know what is interesting about it in an automatic and unbiased way. For example, you want to know if your gene list is enriched in some known pathways. So first you find the overlap between your gene list and the gene annotations that are stored in pathway databases. For pathway enrichment analysis, we use prior knowledge about the genes that is stored in pathway databases. This information is collected regularly, and we say that the databases are curated. And we visualize the results as a map because it is a more intuitive way to interpret the results.
Now, what do we mean by pathway? As biologists, we are mostly interested in signalling pathways and metabolic pathways. A signalling pathway would be, for example, a receptor at the surface of the membrane that is activated, followed by a downstream signalling cascade. A metabolic pathway would be more like a series of reactions, for example the biosynthesis of a vitamin. But for enrichment tools we can also use databases that have broader content, like drug or disease associations, and this can be very useful.
Another term that you may hear a lot during this lecture is gene set. What is the definition of a gene set? A gene set is a group of genes. So, it's more specific than a pathway. So, a pathway would be...
For example, you can have a big pathway that is divided into sub-pathways, and each sub-pathway could be a gene set.
So, what are the advantages of doing a pathway and network analysis? First, it saves time compared to the traditional approach, which would be to read about new genes one by one. That takes a lot of time, and it is also biased, because you end up looking only at your favorite genes. Pathway and network analysis is also an intuitive way of analyzing the results. We all know that within a cell we have a lot of signalling pathways and metabolic pathways that are interconnected. When we create a network, it's like drawing the map that we have in our mind, and on this map we put our data, so it's much easier to interpret the results.
The other advantage is that we can add different layers of information, so we can overlap different information at the pathway level. For example, you have RNA-seq data and you have mutation data. You take the top hits of your RNA-seq data and the top hits of your mutation data, but they are not the same genes, so you don't find a direct overlap. That doesn't mean you don't have an overlap at the pathway level: maybe these genes belong to the same pathway. Here is an example. Patient one has mutation one, patient two has mutation two, and patient three has mutation three. You cannot find the same mutation in different samples; the mutations are mutually exclusive. So when you directly compare the patients, nothing is similar. But all the genes carrying these mutations belong to the same pathway, and all these patients are diabetic, for example.
This is a first example with real data, using mutation data and pathway and network analysis. It is a Nature paper from 2012. They had this chromosome 5 deletion that they found in breast cancer samples.
They looked at the genes within this chromosome 5 deletion, extracted the gene list, and built an enrichment map, which is this network. On this network, one circle is a pathway. What they found is that the chromosome 5 deletion genes are enriched in genes implicated in the cell cycle. This is the same data, but now it's not what we call an enrichment map; it's a network created by the Reactome FI plug-in that you are going to see this afternoon. Now each circle is a gene, and the different colors are the functional modules associated with this network.
Now, another example that we are going to see in the lab practical. It is the same data that we used before with Daniel. It comes from this paper that gathered glioblastoma samples; glioblastoma is a brain cancer. One major result of the paper is that they found 71 significantly mutated genes, one of them being EGFR, the Epidermal Growth Factor Receptor. Another major result is that some of the mutations were found to be mutually exclusive. So, can we add all these different layers of information to one network? All this information was available in separate tables in the paper. Can we add it all to the network? This is the network that was created with GeneMANIA. The layers of information are the frequency of mutations, shown as the size of the node; the shape, which shows whether the mutations are mutually exclusive with EGFR; and the pink color, which marks the most significant pathway in this data set. We also see the physical interactions and the genetic interactions. When we add these layers all together, I think we can see that one set of genes attracts our curiosity: they seem to cooperate or to work together. If I had to work on that project, I would not only focus on EGFR, but probably also on the other related genes like PIK3R1 or PIK3CA. I think it is obvious that this cluster of genes works together.
A last example of how to overlap two layers of information. This network is a pathway network: each circle represents a pathway that was enriched in one subtype of glioblastoma called proneural. The yellow triangles are the 71 significantly mutated genes. Now we can see where the mutated genes are located in the pathway map. So we have mutated genes belonging to pathways that are significantly enriched in this subtype of cancer.
So, what are the three elements of pathway and network analysis? First, we need our gene list. We need gene attributes, which are the functional annotations of the genes stored in pathway databases. And we need enrichment tools that calculate the overlap between the gene list and the pathway that we are testing. These tools also tell us whether this overlap is significant or not.
Let's talk about the gene list first, with some recommendations before you start a pathway analysis. Try to get a list that is as clean as possible: normalize, adjust for background, remove outliers. As we sometimes say, garbage in, garbage out. Try to get as large a majority of true positives as possible in the gene list that you are going to use for pathway and network analysis. The gene list size matters too, and it can help you choose the right tool. If you have a very small gene list, you can use a function prediction tool. If you have a medium-sized list, like 50 to 500 genes, you can use simple enrichment tools. If you have a very large gene list, more than 1,000 genes, try to rank your genes by significance and use tools that take this ranking as input. And make sure that your gene IDs are compatible with the software that you are going to use.
So, where do gene lists come from? They can come from multiple sources, and enrichment analysis can work with any of these gene lists.
You have to know that these tools were first developed for gene expression data. Now new tools are being developed for more specific cases; for example, if you have methylation data, you can use the tool GREAT. Before applying pathway and network analysis, it's important that you have a clear idea of the question you want to answer. You want to answer one biological question, and you make sure that the experimental design used to generate your gene list makes it possible to answer the question that is important to you. That will then help you to choose the right tool and to interpret your results correctly.
So, gene identifiers. Gene identifiers are ideally unique, stable names or numbers that represent your genes. They come from multiple databases, and these databases store slightly different kinds of information: gene databases store gene sequences, protein databases store protein sequences. This means that one gene can have multiple IDs. And because these databases are slightly different, they don't completely overlap. Here is a list of common identifiers so you know what they look like. In red are the ones that are recommended, and marked with the three pink smileys are the ones that I use on a daily basis: Ensembl gene IDs, Entrez Gene IDs and official gene symbols.
GeneCards is a tool that I like to use. GeneCards is a gene database. When I want to retrieve an Entrez Gene ID very rapidly, I just enter the name of the gene and I get a lot of gene identifiers and, if I scroll down, much more information. We are going to use this tool in the lab later.
So, the Entrez Gene ID. This is my favorite gene identifier. Why? Because it's stable. It's a numerical value, so it's very easy to manipulate. And genes that are, let's say, new or not yet annotated already have an Entrez Gene ID, even if they don't have an official gene symbol.
So, if you use Entrez Gene IDs, you don't have to update your annotation all the time.
Okay, so we have seen that one gene can have multiple IDs. That can be an issue, and you have to be sure that your software uses the identifier that you have in your gene list. You may need to convert from one identifier to another, and that can be a challenge, so you have to be careful when you do this conversion. I think the most frequent mistake happens when we use Excel, because Excel automatically changes some gene names into dates. Okay? So a gene name like OCT4 is going to be changed into the date October 4. And, yeah, I see that every day in my work. Here on the slide, on the right side you have the correct symbols, and on the left side you have the same gene names transformed into dates. And if you don't do anything, if you save the file again, they will just be numbers. The problem is that your enrichment tool is not going to recognize these symbols, so it's just going to ignore them. There is an option in Excel to format your column as text. If you format your column as text first, and then you copy and paste your gene symbols, they should not be changed. But I know it's difficult. So I think the best advice that I can give you is to look at your gene list. If it's not too big, check whether you have dates. And if you have dates, then look back at my slide and change them to the correct names.
Fortunately, some tools have been developed that help us convert from one gene identifier to another. One of them is g:Convert, which we are going to use in the lab and which is really easy to use. I would also like to mention Ensembl BioMart. It's not as easy to use, but it has a lot of features: you can copy and paste your gene list and retrieve any gene attributes you like, such as gene identifiers, but also sequences and orthologs.
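For a list that is too long to check by eye, the Excel date problem described above can be screened automatically. This is only a minimal sketch: it assumes the damaged symbols appear in the day-month form that Excel typically displays (for example "4-Oct" or "2-Sep"), and the function name and the example list are made up for illustration. It flags suspicious entries; it does not try to guess the original symbol.

```python
import re

# Month abbreviations as Excel typically displays them in a converted date such as "4-Oct".
# Assumption: the spreadsheet shows dates in this day-month form; other locales may differ.
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
_DATE_LIKE = re.compile(rf"^\d{{1,2}}-({_MONTHS})$", re.IGNORECASE)


def find_date_like_symbols(symbols):
    """Return the entries of a gene list that look like Excel-converted dates."""
    return [s for s in symbols if _DATE_LIKE.match(s.strip())]


# Hypothetical gene list pasted from a spreadsheet.
gene_list = ["EGFR", "2-Sep", "PIK3CA", "1-Mar", "TP53", "4-Oct"]
suspicious = find_date_like_symbols(gene_list)
print(suspicious)  # ['2-Sep', '1-Mar', '4-Oct']
```

Entries that have already been saved as plain numbers cannot be detected this way, because a bare number could just as well be a valid numeric gene ID.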
So, our recommendation is to map everything to Entrez Gene IDs using a spreadsheet. If you need 100% coverage, then you need to manually curate your missing annotations, and be careful with Excel auto-conversions. The last tip is that when you have your table of results, try to keep at least two common identifiers for each gene, or even more. Then if one tool uses one identifier and another tool uses another one, you have all your data ready.
So, what have we learned? That genes and their products have many identifiers, that bioinformatics often requires conversion of IDs from one type to another, and that ID mapping services are available. And please use standard, commonly used identifiers.
Okay, so now we are going to start the second part, which covers gene attributes and pathway databases. When we do pathway and network analysis, we are mostly interested in functional annotation: the biological process, molecular function, and cellular location. This information is stored in different pathway databases, like the Gene Ontology, KEGG, or Reactome. This morning I'm going to describe the Gene Ontology, and I think you are going to see Reactome this afternoon. The Gene Ontology (GO) is the largest database. It's updated regularly, and it covers many organisms. Many enrichment tools use just GO as the reference database. GO is divided into three major parts: cellular component, molecular function, and biological process. The term plasma membrane would be a cellular component. The isomerase activity that transforms glucose into fructose would be a molecular function. And cell division would be a biological process. What is helpful to understand is that the GO database is organized as a hierarchy. You have more general terms at the top of the hierarchy and more specific terms at the bottom.
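To make the idea of a hierarchy concrete, here is a small sketch of how an annotation to a specific term also implies its more general terms. The parent links below are a simplified toy structure written for this example, not an export of the real GO graph, and the function names are hypothetical. A term is allowed to have more than one parent.

```python
# Toy parent links (child -> list of parents). Simplified for illustration only;
# these edges are NOT taken from the real Gene Ontology.
PARENTS = {
    "biological_process": [],
    "cellular process": ["biological_process"],
    "cell cycle process": ["cellular process"],
    "cell division": ["cellular process"],
    "mitotic cytokinesis": ["cell division", "cell cycle process"],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def terms_for_annotation(term, parents=PARENTS):
    """A gene annotated to `term` is implicitly annotated to all its ancestors too."""
    return {term} | ancestors(term, parents)


print(sorted(terms_for_annotation("mitotic cytokinesis")))
```

In this toy graph, one specific annotation expands to five terms, which is why a single gene usually comes back with several GO terms, and why general terms near the top collect large gene sets.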
So, here, you have large gene sets at the top and small specific terms at the bottom. What you also need to know is that one term can have one or more parent terms and one or more child terms. So when you take one gene and you want to retrieve the GO terms associated with this gene, you retrieve multiple GO terms, because of these relationships between the terms.
Genes are linked, or associated, with GO terms by curators. These associations are called GO annotations. This is an example of manual curation focused on the PERK1 gene, showing how these GO annotations were created. Receptor-like kinase would be a molecular function associated with the PERK1 gene. Integral membrane protein would be the cellular component associated with this gene. And wound response would be the biological process associated with this gene. So, already three GO terms for one gene. There are different ways to get these annotations, not only manual curation, and you can find out which one was used by looking at the evidence codes, here IC, IDA, TAS, IEA. The one I just showed you is probably a TAS, a traceable author statement.
If you run an enrichment analysis and you have one GO term that is very significant and you want to know more about that term, you can use tools to get more information. QuickGO is one of these tools: you can get the child terms, the parent terms, and the definition of the term. A closely related tool is AmiGO. This slide is just a reminder that we're focusing on functional annotation, but many other gene attributes are available, like chromosome position, disease association, and protein properties. All these other gene attributes are available from Ensembl BioMart.
Are there differences between QuickGO and AmiGO? I couldn't find any differences. And they link to each other: when you're on QuickGO, you can access AmiGO, and when you're on AmiGO, you can access QuickGO.
Myself, I couldn't see any differences, but there must be some; otherwise, why have two tools? But I couldn't see any.
So, what have we learned? Gene attributes define the functions and characteristics of a gene. Many gene attributes are available in databases. I've just presented the Gene Ontology database, but there are other ones like KEGG and Reactome.
Now we go to the third part, which is the enrichment tools. We have our gene list on one side and our functional annotation on the other. What the enrichment tools do is find the overlap between the gene list and the pathways that we are testing, and tell us whether this overlap is significant or not. Many enrichment tools exist, and they can be classified into three groups. The first one is over-representation analysis. The second one is functional class scoring. And the third one is pathway topology.
The first one is the simplest kind of tool, and it is the one that I'm going to explain later. It's really good for mid-sized gene lists, maybe 100 to 500 genes, or up to 1,000 genes. One famous tool was DAVID. I don't know if you have heard about DAVID. You can still use it and it's a very good tool, but the pathway database underlying the tool is not up to date anymore. That's why we don't recommend DAVID anymore and we recommend g:Profiler now. Jüri is the developer of g:Profiler, so if you have questions during the lab, feel free to ask him.
The second class of tools is functional class scoring. I would suggest this kind of tool for larger lists, when you can rank your genes from the most significant to the least significant. One very popular tool is GSEA, Gene Set Enrichment Analysis.
The third class is the pathway topology tools. Reactome FI is one example that you are going to see this afternoon. This one is more complicated.
It takes into account the relationships between the genes. For example, you have a pathway, and 10 genes overlap between your gene list and this pathway, and all 10 genes are activators on the same activating branch of the pathway. This is going to get a better score than another case where you also have 10 genes, but 5 are activators and 5 are inhibitors. So it takes into account the topology of the network in addition to the size of the overlap.
Some tips for over-representation tools. Again, if you have a small gene list, you can use a tool like GeneMANIA. If you have a small to medium-sized list, you can use a tool like g:Profiler. If you have more than that, try to order your query; there is an option in g:Profiler to do that. And if you have a large gene list, try to rank your gene list and use GSEA or a Wilcoxon rank test.
So, what is gene set enrichment analysis? The first step is to break down the cellular functions into gene sets. Here are the four gene sets that I'm going to test: nuclear pore, ribosome, cell cycle and p53 signalling. I'm going to find the overlap between my gene list and these four gene sets, that is, how many genes are in common. And then I'm going to see whether this overlap is significant or not.
Here is a very general, conceptual slide that applies to many enrichment tools and that explains how the significance of the results is calculated. Let's say I have a gene list that contains 200 genes, and my pathway, let's say apoptosis, contains 100 genes. The overlap between my gene list and the pathway is 20 genes. That's the first result: I have 20 genes that overlap. Is this overlap larger than expected by chance? How do I assess that? I can do some randomization. There are two ways to do the randomization. First, you can randomize your gene list.
You replace your 200 genes with random genes, and you do it 2,000 times. How many times, when you take 200 random genes, do you get an overlap of 20 or more? You can also do it the other way around: you take your pathway that contains 100 genes and randomize this pathway, 2,000 times. How many times, if I take 100 genes randomly, do I have 20 or more genes that overlap with my gene list? From this randomization, you get a p-value that tests the significance. If you have a low p-value, close to zero, it means the result is probably not due to chance: an overlap of 20 is very unlikely to occur by random chance, so it's highly significant.
Many of these enrichment tools, like DAVID or g:Profiler, use Fisher's exact test. Let's say your gene list contains four black genes and one red gene. Your background population contains 500 black genes, which would be the pathway that you are testing, and 4,500 red genes. First, there is the null hypothesis: the list is a random sample from the population. The alternative hypothesis is that you have more black genes in your list than expected. The first step of Fisher's exact test is to calculate the null distribution. With 4,500 red genes and 500 black genes in your background population, what is the probability of obtaining five red genes by random chance? That probability is 57%. What is the probability of getting four red genes and one black gene? That probability is a little bit lower; it would be 35%. Now, this is our case: we have four black genes and one red gene. What is the probability of getting that? The probability of getting that by chance is very low, very close to zero. The p-value that you obtain is the probability at the cutoff, which is four black genes and one red gene, plus the probabilities of the outcomes that are more extreme than our cutoff. So, if it's close to zero, it's highly significant.
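The black and red gene example can be worked through with the hypergeometric distribution, which is the null distribution behind the one-sided Fisher's exact test. The sketch below uses only the Python standard library; the function names are hypothetical, and the numbers are the ones from the example (a background of 500 black and 4,500 red genes, a list of 5 genes with 4 black). It is an illustration of the calculation, not the code that any particular enrichment tool runs.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """Probability of exactly k pathway (black) genes in a random list of `draws` genes."""
    return (
        comb(successes, k)
        * comb(population - successes, draws - k)
        / comb(population, draws)
    )


def enrichment_p_value(k, population, successes, draws):
    """One-sided p-value: probability of k or more pathway genes by chance."""
    upper = min(successes, draws)
    return sum(hypergeom_pmf(i, population, successes, draws) for i in range(k, upper + 1))


POPULATION = 5000   # 500 black + 4,500 red genes in the background
PATHWAY = 500       # black genes, i.e. the pathway being tested
LIST_SIZE = 5       # genes in our list

p_no_black = hypergeom_pmf(0, POPULATION, PATHWAY, LIST_SIZE)      # five red genes
p_one_black = hypergeom_pmf(1, POPULATION, PATHWAY, LIST_SIZE)     # four red, one black
p_value = enrichment_p_value(4, POPULATION, PATHWAY, LIST_SIZE)    # four or five black

print(f"P(0 black)  = {p_no_black:.3f}")
print(f"P(1 black)  = {p_one_black:.3f}")
print(f"P(>=4 black) = {p_value:.6f}")
```

With these exact counts the first two probabilities come out near, but not exactly at, the rounded 57% and 35% quoted from the slide (roughly 0.59 and 0.33), and the p-value for four or more black genes is well below 0.001.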
We don't think it can happen by chance only.
That's a good question: should we differentiate up-regulated genes and down-regulated genes when we do enrichment analysis? It depends on your input. For Fisher's exact test, you cannot differentiate. You put in your 200 genes, and it will not tell you whether they are up or down because you don't give it that information. What you could do if you use these tools is to separate the up-regulated genes from the down-regulated genes yourself, if you want to see whether the up-regulated genes are enriched in some pathways and the down-regulated genes are enriched in some pathways. This is what I usually do. You can also use tools like GSEA, where you rank all your data from the most up-regulated to the most down-regulated. Then you have two types of results: a positive score for the up-regulated genes and a negative score for the down-regulated genes, so you can separate the two.
No, it's up to you to decide, if you want. I think it's easier to interpret if you analyze your up-regulated and your down-regulated genes separately. Then you see: okay, these are the pathways enriched in my up-regulated genes, and in my second list, these are the pathways enriched in my down-regulated genes. But the output of a pathway analysis really only tells you that there is a modification of the pathway. Because we don't know whether these genes are activators or inhibitors of the pathway, we cannot conclude much about whether the pathway is up-regulated, activated or inhibited. You know something is going on with the pathway, but as a second step, after your enrichment analysis, you really need to look at the genes that were in the overlap and check whether they are activators or inhibitors. The third class of tools that I showed you tries to answer that question, but not the simple tools that I'm going to mention.
Sorry, I don't understand how these numbers are...
500 black genes and 45 red genes, and then we have 5 red and 45 black.
So, that's a second slide. So, the five genes... The background population would be your whole genome, so let's say the total number of genes that you have, and the black genes would be the pathway that you are testing. Just as an example, let's say you have 4,500 genes that are red in your population and only 500 that are black. That would be the cell cycle pathway, for example. And in your gene list, coming from your data, you have four genes that belong to this cell cycle pathway, so black genes, and one gene that comes from other pathways. Now we want to answer the question: is it possible to get four black genes out of five just by random chance? We know that we have many more red genes than black genes in our population, so we can already figure out that it should not be black genes only. If you pick five genes randomly, what is the chance of getting four black genes and one red gene, given this universe of genes? Very low. So your p-value calculated by Fisher's exact test is going to be close to zero.
No, I just had different numbers here. I had 5 random 45 black, instead of 5 random black. This type of distribution seems to be combined and seems to assume that the gene size is equal to that.
So, that's, yeah. The idea is that you're going to do the same thing if you're good. So, say...
So, it depends. Let's say you have your mutation data and everything is ready, you have your gene list, and the simple question you want to answer is whether the genes in your gene list belong to the same pathway. Then, yes, you can use this enrichment tool to give you additional information.
I'm not sure, but I think if your gene list is 10 genes and you have 8 black and 2 red, I think the p-value will change.
It's the same proportion.
Yes, we have four numbers that are taken into account: the universe size, that is the total number of genes in the universe, your gene list size, and the numbers of black and red genes. So, yes, that would change.
Does it make sense that it changes? Like, is it biologically significant?
Yeah, there are pros and cons. I think this is really a simple method. Actually, in g:Profiler there is also one option that I think is very interesting, where you can order the query. I think Jüri can talk about that later. It's the same p-value, but if your gene is ranked number one by significance it's going to be more important than a gene that is ranked number 10. So I think it will differentiate these kinds of cases, if you like.
You said that it does iterations, right? It re-samples, it does this over and over and over.
In this case, Fisher's exact test calculates the null distribution first, based on an equation. You do one or the other: you either compute the p-value analytically or you do the re-sampling. The re-sampling that I showed is more general. Okay? There are different ways to do it. Fisher's exact test calculates the null distribution, the hypergeometric distribution. This is the way this test works.
So you understand that for Fisher's exact test the size of the background population changes the results. In most cases you are going to do a whole-genome experiment, so your background is going to be all genes in the genome. But in certain cases you can have a custom chip where you just have, for example, kinases or immune genes. In this case the background is going to be smaller and you need to input your background list. Many of these tools have an option to upload the background, and that's the reason: the background is taken into consideration to calculate the p-value.
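The re-sampling approach and the effect of the background can be sketched together. The code below redraws a random gene list 2,000 times, as in the earlier example with a 200-gene list, a 100-gene pathway and an observed overlap of 20, and counts how often the random overlap is at least as large as the observed one. The background sizes (20,000 genes for a whole genome, 400 genes for a small custom chip that contains the whole pathway) are assumptions chosen for illustration, and the function name is hypothetical.

```python
import random


def resampling_p_value(list_size, pathway_size, observed_overlap,
                       background_size, n_iter=2000, seed=0):
    """Estimate how often a random gene list overlaps the pathway as much as observed.

    Genes are represented by integers 0..background_size-1, and the first
    `pathway_size` of them are treated as the pathway members.
    """
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    background = range(background_size)
    hits = 0
    for _ in range(n_iter):
        sample = rng.sample(background, list_size)
        overlap = sum(1 for gene in sample if gene < pathway_size)
        if overlap >= observed_overlap:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_iter + 1)


# Assumed whole-genome background of 20,000 genes.
p_genome = resampling_p_value(200, 100, 20, background_size=20_000)
# Assumed custom chip with only 400 genes, all 100 pathway genes included.
p_chip = resampling_p_value(200, 100, 20, background_size=400)

print(f"whole-genome background: p ~ {p_genome:.4f}")
print(f"custom-chip background:  p ~ {p_chip:.4f}")
```

Against the large background an overlap of 20 essentially never occurs by chance, while against the small chip background half of the genes are drawn and an overlap of 20 is entirely expected. The same observed overlap therefore gives very different p-values, which is why the correct background list has to be supplied.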
What is the bigger question here? What is the signal that we're trying to calculate? Is it that the pathway is involved, is activated in this patient, or is perturbed in this patient?
Perturbed would be more accurate than activated because, as I said, you really need to look at the function of the genes that overlap between your gene list and the pathway that you are testing to see whether they are activators or inhibitors, and you may have a mix of activators and inhibitors. What it is answering is this: you have 100 genes, 100 mutated genes, and maybe, I don't know, 30 of them belong to the cell cycle, okay, or to another pathway that you don't know about. It would have been very difficult to get that information just by looking at all these genes that you don't know. So that's the simple question these tools are trying to answer: are many of the genes in your gene list known to be related to each other in any of the databases that you are testing? You can even test a drug database: are they targets of the same drug, for example? Are they known to be associated with the same disease? Or are they known to be associated with autophagy? That's really the question.
So, it's the overlap.
Yeah, so it's more that we know the association. If we trust the pathway database, we know the association: we know that gene A is in the cell cycle. But Fisher's exact test is trying to give you a p-value to get the confidence. Can we trust these results? Because we have many false positives in this kind of result, we want to separate the most significant results from the ones that are less significant. And the way to do it is, as I said, to try to randomize: can we get that by chance only? So it's really to get the significance of the results, not the overlap. Because the overlap, we know. For sure, these 20 genes belong to the cell cycle. But are my data clean? Can we trust my data?
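The overlap step itself is just a set intersection between the gene list and each gene set in the database. Here is a small sketch with made-up inputs: the gene sets are tiny toy versions written for this example, not an export from any pathway database, and the function name is hypothetical. It only counts shared genes; the significance of each overlap has to be assessed separately.

```python
# Hypothetical gene list, e.g. significantly mutated genes from an experiment.
gene_list = {"EGFR", "PIK3CA", "PIK3R1", "PTEN", "TP53", "RB1", "NF1"}

# Toy gene sets for illustration only; real databases hold thousands of larger sets.
gene_sets = {
    "PI3K signalling (toy)": {"PIK3CA", "PIK3R1", "PTEN", "AKT1", "MTOR"},
    "Cell cycle (toy)": {"RB1", "CDK4", "CDKN2A", "TP53", "CCND1"},
    "Ribosome (toy)": {"RPL5", "RPS6", "RPL11"},
}


def overlap_table(gene_list, gene_sets):
    """Return (gene set name, overlap size, gene set size, shared genes), largest overlap first."""
    rows = []
    for name, members in gene_sets.items():
        shared = sorted(gene_list & members)
        rows.append((name, len(shared), len(members), shared))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


table = overlap_table(gene_list, gene_sets)
for name, n_shared, n_members, shared in table:
    print(f"{name}: {n_shared}/{n_members} {shared}")
```

Sorting by raw overlap size is only for display here; in a real analysis the rows would be ordered by the p-value of each overlap.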
Let's say we're carrying over from yesterday's Surab's single nucleotide variants. How much confidence did we have if there were only a few tumor reads that had that mutation? And now that has become our gene list of mutations. But this is adding a layer of confidence. I'm just struggling with the confusion that I made. Suppose we found out that yes, this is a significant result, and so these mutated genes are part of the pathway.
Yeah, so that's the beginning of everything. I think this is only telling you, given a list of 100 genes, any list of 100 genes, what is the probability that 10 of them will be in the cell cycle pathway. The example is 100 genes, but you can have 1,000 or 2,000 genes, and you cannot do it manually. So you need to do it automatically using these tools, and you need to rank all these results by p-value, with the most significant ones at the top. Because sometimes you have many, many significant results, many pathways that come up significant, so you want to distinguish the more likely true positives from the false positives. And then that's the beginning of everything. You think, oh, the top pathway is autophagy or cell cycle, so now I need to look at the genes and understand how it works. But that you need to do manually, or you need to go back to the wet lab.
Sorry, so the data set that we're using, we're using from that database as our background what the mutated genes are. In our sample we look at the Fisher's exact test, but the p-value that we're deriving is based on the previous data, how commonly these genes are mutated in different experiments, and then it gives me the test. Is that what you're saying? For example, if I have 20 apoptosis genes like BCL2 and whatever.
But then this data set, will it look at the prior probability based on the previous data that we have and then give me a p-value? Say I have 10 genes mutated in apoptosis, for example, 10 out of all the apoptosis genes, of which there are about 100, and I'm choosing them from the data. How do I know whether, in my data, MDM2 is more important in my data set?
So, it will not answer that question. You can add different layers of information when you create the network, but the simple enrichment tool will not answer that question. It really only uses the database that you are testing, the pathway database; it will just take the pathways.
Okay, it will give me just the pathways. How does the p-value come from that?
So maybe we can take the example of the lab practical. They have these 291 glioblastoma samples from different patients. They did whole-exome sequencing, and they looked for mutated genes, and for the genes that are more likely to be the driver genes. They ended up with 71 genes, so these are the 71 significantly mutated genes in glioblastoma. That's their data, that's the omics experiment. All the pre-processing has been done, and we think that in glioblastoma we have 71 genes that are more likely to play a role in the development of the disease.
So that's not enrichment.
Yeah, that's just running a significantly-mutated-gene software. They ran MutSigCV, basically, and then they came up with the list of 71 genes that are significantly mutated. That's the data; that's not the pathway analysis. That's all the pre-processing that you may have seen during the workshop. Now you end up with this gene list and you want to interpret it. You can look at the genes one by one and try to understand their function, or you can try to do the analysis at once to understand the relationships between the genes, and the 
results in this case would be that a few genes are part of the same PI3K pathway, and you can find that in 5 minutes instead of looking gene by gene. And this is the only information that simple tools like this one or g:Profiler will give you if you are testing these pathway databases.
Isn't the point of having the database to give us a high-confidence set of genes that are in pathways already? It's not like we don't trust it. This sounds like a test for membership, so the test of significance gives high confidence that this is part of this pathway. But if our database is already this curated resource, shouldn't we be able to just look up the genes? Like, I have a mutated gene X, what pathways is it annotated in, in my database of choice?
You can do gene by gene if you want. So there are two different questions: one is to relate one gene to its known function, and the other is to analyze a group of genes.
So now a summary of the different steps of the enrichment analysis. First, the overlap is tested with each gene set present in the pathway database. Here we show the example with one gene set, but actually we can test many, many gene sets, more than 3,000 gene sets. Then we need to order the results by the enrichment p-value, and for the most significant gene set you want the lowest p-value. And then, because we have tested so many gene sets, we need to adjust for multiple hypothesis testing. This is what we call the FDR, and we usually use the Benjamini-Hochberg method. So the FDR corrects for multiple hypothesis testing. If you have an FDR of 15%, it means that about 15% of the results you call significant at that threshold are expected to be false positives, so about 85% are expected to be true positives. And why do statisticians want to do that? They say that because we are testing so many pathways at the same time, we are increasing the chance of making a wrong decision. So they think that the raw p-values overstate the significance. So if we have a p-value like 0.0001, we use the Benjamini-Hochberg method to correct that, and maybe now our
adjusted p-value is going to be larger, so less significant.
So a typical output of an enrichment analysis is a table that contains at least the names of the pathways or gene sets that we are testing, the number of overlapping genes, the number of genes in the original pathway, the p-value that tests the significance of this overlap, and the adjusted p-value. The typical output is a long table with many, many gene sets, row by row, and it's usually very difficult to interpret. It's difficult to interpret because the gene sets contain genes that are redundant, so these gene sets are not completely independent. Maybe your gene set 1, which is very significant, has 50% of its genes in common with your gene set number 10. So that's why we use network visualization.
So we are going to our last part of the lecture, that is the introduction to network visualization using Cytoscape. Cytoscape is an open source software for visualizing complex networks, and there are lots of apps available. The advantage of networks is that they can represent relationships between the data, so you can discover new relationships, and you can also visualize multiple data types together. Two important words are nodes and edges. Nodes are the circles, and edges are the lines that connect the nodes. If you do pathway enrichment analysis, you have two types of network. One I will call a gene-gene network, which is actually more like a protein-protein network, where one circle represents a gene and the line represents the association between two genes. The other kind of network is a pathway network, where one node represents a pathway and the line between two nodes represents the genes that overlap between these two pathways.
Another important piece of information is the network layout. If we don't apply any layout, it will look like a hairball. The most common layout in Cytoscape is the force-directed layout. In the force-directed layout, the nodes
are repelling each other, but if they are connected by edges, the edges pull the nodes together like springs. So you have two forces, repelling and pulling. It was a bit difficult to understand, so you can see it as videos. If you are interested, you can go to these links at the workshop and see the videos of the force-directed layout. And here are a few snapshots. That would be step one: it's a little like a hairball, and you see that some nodes are overlapping. Then we apply the force-directed layout, and the nodes repel each other, and you see that the distance between the nodes is increasing. But these ones here, for example, are highly connected by edges, so they kind of form a cluster. Then we go to step three, and we see now that it's spread out. And here I would say this is the end result. There is no more overlap between the nodes, so for us it's better to visualize, but the nodes that are highly interconnected still form these clusters. So that's what we call the force-directed layout.
So, before we start the lab, an introduction to Cytoscape. When you open Cytoscape you have three different panels: on your left you have the control panel, at the bottom you have the table panel, and at the right you have the result panel. You can save your session, so when you have done your network you can save your session to reload it later. You can save the network image in any format you want, PDF, PNG. Another feature is being able to navigate through the network. You have your network in this window, you go to the control panel, you click on Network, and you will see a blue square, and you can move this blue square to navigate through the network. Force-directed layout is the most popular layout, but you can try different ones, like the circle layout. You can play with colors and you can play with shapes, and you can end up with a beautiful network.
So what have we learned? Networks are useful for seeing relationships in large
datasets, and they are useful for integrating several datasets and data types together. It's important to understand what the nodes and edges mean. Automatic layout is required to visualize the networks, and visual attributes enable multiple types of data to be shown at once. So it's useful to see the relationships between the genes.
And there are many apps available for Cytoscape, more than 200. You can go to the Cytoscape App Store and click on the icons to see a description of the apps. They have different categories: some apps are involved in pathway analysis, some in gene expression analysis, or literature mining, or pathway comparison. So depending on what you want to do, maybe there is an app available for you. And it's software that is in constant development, with an active community and more than 5,000 downloads per month.
Slides are from Gary, Quaid, Lincoln, Daniel, and myself. And we are ready to start the lab practical.
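Going into the lab, the ranking and multiple-testing steps from the enrichment summary can be sketched as follows. This is a plain Benjamini-Hochberg adjustment written out by hand, not the code of any particular tool; the function name, the gene set names and the p-values are all made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical enrichment results: gene set name -> raw p-value.
results = {
    "cell cycle": 0.0001,
    "PI3K signalling": 0.004,
    "apoptosis": 0.03,
    "autophagy": 0.2,
}
names = list(results)
fdr = benjamini_hochberg([results[name] for name in names])

# Rank the gene sets with the most significant one at the top.
for name, q in sorted(zip(names, fdr), key=lambda pair: pair[1]):
    print(f"{name}\tp={results[name]:.4g}\tadjusted p={q:.4g}")
```

The adjusted p-value is never smaller than the raw one, which is why a result that looks very significant before correction can look weaker afterwards.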