Okay, now we're going to talk about something completely different. So these are my slides. We are talking about pathway and network analysis, and I'm going to cover basically four things: an introduction to pathway and network analysis; the sources of pathways and networks; an overview of enrichment analysis, which is probably the most important part of this talk; and then I'm going to show you a very small and cute example of how we analyze large-scale cancer genomic data.

So the first question is: why do we need pathway analysis? The straight answer is that it helps us reduce the data size of our high-throughput studies. A very simple example: we just sequenced about 250 genomes (whole exomes) in esophageal adenocarcinoma, and it turned out that about 15,000 genes carry at least one mutation in those 250 patients. How can we generate any hypothesis based on 15,000 genes? It's pretty much impossible. That's where pathway analysis helps. I don't think you should submit all 15,000 genes to a pathway enrichment analysis; you probably want to reduce that number by selecting genes that are mutated in five or more samples, or by applying other kinds of filtering. But that's why we need it.

The second point is a consequence of the first: it increases our statistical power. We are not analyzing 15,000 entities, but maybe a handful, 10 or 15 pathways.

Next point: a gene very rarely operates on its own. Almost every gene in the human genome has activators and inhibitors, and its transcriptional activation is regulated by something, so we cannot look at a gene as a single, separate entity. As an example, and I'll show it in my slides, you probably know that the majority of those genes are mutated at a very low level: for most of them, 3% of patients carry the mutation, or 2% and lower.
It's a very small number of genes that are mutated at a level of 30%, like p53. So we have to make sense out of these genes, and of course pathway analysis helps us generate a meaningful hypothesis. It's hard to generate a hypothesis if you have zinc finger 418 mutated, but if you see that the EGFR signaling pathway is enriched, that gives you a way to think about the next step in your analysis.

To do a pathway analysis you need three things. The first is a biological hypothesis; this is optional, but it's very helpful. The second is the actual list of genes or proteins that came out of your experiment; we will define it later. And the third is the sources of pathways and networks.

So, the biological question or hypothesis: it's what you actually want to accomplish with your list. I have participated in a number of workshops, and sometimes I asked this question: what is your biological hypothesis? And people said something like, "I want to identify all genes that are upregulated in my microarray." This is not a biological hypothesis. It should be something like, "I believe that EGFR signaling is altered after I stimulate my cells with a drug." So what can you accomplish? Summarize the biological processes in your study. Perform differential analysis: which pathways are different between samples, or between treated and untreated cell lines. Find a controller: if something happened, you want to find the transcription factor that's responsible for all these changes. Maybe identify new pathways: maybe you can find something new, publish it, and send an email to Reactome saying, "I made this discovery," and we can catalog a new pathway. Discover new gene functions, and find any kind of correlation with your disease or clinical data.

The second thing we need for pathway analysis is a gene list. So where does it come from? The first source is the most obvious.
It's from your high-throughput studies: sequencing, gene profiling, and so on. The second one, as Junjun showed you just now, is public data portals like ICGC, TCGA, COSMIC, and so on. The third one is less obvious: manual or automated literature reviews. If you want to describe any kind of process, you might read a hundred publications and find a number of genes associated with that process. For example, PubTator is a biocuration tool that helps you do it in an automated way. It's very cool software.

And the last thing we need to do pathway analysis is pathway and network information. In this lecture I'm going to cover GO, the Gene Ontology, pathway databases, and a little bit about network databases.

So what is the Gene Ontology? Any questions, guys? Okay, so what is an ontology in general? An ontology is a data model that represents knowledge as a set of concepts. For example, what is a berry? A strawberry, a blueberry, a blackberry, and so on are berries; but at the same time a berry is a food, and at the same time a berry is a plant, and so on. Like this you can create a very complicated set of terms. The Gene Ontology Consortium is trying to do the same thing with biological terms, or biological entities. So it's a dictionary: it deals with biological phrases or terms like protein kinase, apoptosis, and membrane, and tries to establish those connections, like "a berry is a food." It's not static; it's constantly updated. And it doesn't only cover the human genome; it covers all kinds of genomes, and the very ambitious task is to synchronize these terms across all genomes. It's publicly available for free: you can go there, download it, use it. It's for you.

So what does GO cover? There are three major sections, and when Junjun did his enrichment analysis you probably saw them already: cellular component, molecular function, and biological process.
Cellular component basically covers the localization of the genes you're talking about. Molecular function is more about biochemical reactions: molecular interactions between ligand and receptor, catalytic reactions, and so on. And biological process is things like cell division. For my research, biological process is the most useful one, but of course you can run an enrichment analysis against all of them and just use the information you gain.

So what is the structure? The structure, as I said, is very complicated, and in the majority of enrichment tests you're not going to see it or use it. You just need to keep in mind that this is what the Gene Ontology is about. For example, I don't think you can see it here, but you should be able to see it in your printouts: cell death is a parent of programmed cell death, programmed cell death is a parent of apoptosis, and apoptosis is a parent of B cell apoptosis. All of these belong to biological process, one of the sections I showed here. So it describes multiple levels of detail of gene function, and different terms can have different numbers of parents and children. So this is the structure of the Gene Ontology. Questions?

Now I'm going to talk about pathways. It's pretty similar, but it's not the same. In your enrichment analysis you won't see any difference, but there are conceptual differences between those two resources.
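The parent and child relationships between GO terms described above can be sketched in a few lines of Python. This is only a toy illustration: `PARENTS` and `ancestors` are hypothetical names, and the five terms are the ones named on the slide, not a real GO download (the real ontology has more intermediate terms, and a term can have several parents).

```python
# Toy fragment of the GO biological-process hierarchy named in the talk.
# Each term maps to its direct parents; a real GO term may have several.
PARENTS = {
    "B cell apoptosis": ["apoptosis"],
    "apoptosis": ["programmed cell death"],
    "programmed cell death": ["cell death"],
    "cell death": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents.get(current, []))
    return seen


# A gene annotated to a specific term is implicitly annotated to all of
# that term's ancestors, which is why broad terms collect many genes.
print(ancestors("B cell apoptosis"))
```

A gene annotated to "B cell apoptosis" therefore also counts toward "apoptosis", "programmed cell death", and so on when an enrichment test is run against GO.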
So, pathway databases. The advantages: they're usually highly curated. You can see a biological view of the process you're interested in, so cause and effect are captured: if you stimulate EGFR, what is going to happen at the very end? And most of those resources have very nice pictures that are easy to understand and can even be used in publications or in teaching. The disadvantages: very sparse coverage of the genome. Why? Because everybody wants to study p53 and nobody wants to study zinc finger 418. That's why we have a lot of papers about the first gene and probably no papers about the second. So it's your fault, and mine. And different databases disagree on the boundaries of pathways. That's not funny; that's our problem. Basically, if you download the p53 signaling pathway from KEGG, Panther, and Reactome, they will overlap, but not one hundred percent. Every curator, unfortunately, has his own representation of what p53 signaling is about.

Today I'm going to talk about Reactome. It's one of the pathway databases, or what we call knowledge bases, and right now it's the richest pathway database that exists in the world; I'm talking about human. First, it's hand curated, and we have very rigorous curation standards: every reaction you see in Reactome is traceable to the primary literature. Right now we have close to 2,000 human pathways covering close to 9,000 proteins, and it's open access. Why is that important? Because the second biggest database is KEGG, and unfortunately about three years ago it became commercial. We can still use it for free, but to download the data you have to pay for a license. For a big institution like mine that's not a big deal, but if you're working in a small lab, it might be a problem.

So, just several screenshots of how Reactome is organized. I just randomly selected one of the pathways.
It's the G1/S DNA damage checkpoint, whatever that means. All pathways are organized in a hierarchical way. So this is our pathway of interest; it belongs to cell cycle checkpoints, and cell cycle checkpoints in turn belongs to cell cycle. So it's pretty similar to what we saw in the Gene Ontology.

Then we have this very nice representation of the actual interactions. The bluish ones are protein complexes and the greenish ones are the proteins themselves. It captures all post-translational modifications, if there are any in this case, and it has a lot of small molecules like ATP and ADP. So this is the representation of a reaction: for example, p53, after it's phosphorylated, activates the expression of CDKN1A. Below each pathway we have a short description of the pathway with some reference links. And this is my favorite feature: it was adopted from the Human Protein Atlas, and it shows you the expression of your genes of interest across different human tissues. We see that p53 expression is pretty evenly distributed, whereas CDKN1A goes up and down in different tissues. You are also able to download this data in different formats and use it wherever you want.

And the last point is networks. So, pathways versus networks: what is the difference? This is the beginning of EGFR signaling. It's basically the same thing, only this is the pathway and this is the network. The differences are pretty obvious. The pathway is very detailed: for example, after EGF interacts with EGFR, EGFR dimerizes, and then there is phosphorylation, and this is activated by this protein. In the network, the value of each gene is equal. There are no inhibitors.
There are no activators; they are all pretty much the same, and they look the same. In the pathway we are capturing biochemical reactions; in the network the view is basically very abstract. In some networks you can have directed or undirected interactions, but in most cases they just capture the interaction. Pathways are pretty small scale; networks are huge. This network is a part of this huge network over here in the corner. And all pathways are retrieved from the literature, whereas networks come from the literature and omics data, so they are, let's say, less reliable.

So what kinds of networks do we have? As I said, networks can be generated automatically, through curation, or both. If a network is generated automatically, it definitely covers a significantly higher portion of the genome than, for example, Reactome, and the relationships between genes are more tentative than in pathways. Here is a list of the most popular sources of curated networks, for example BioGRID, IntAct, and MINT, and our favorite one is the Reactome Functional Interaction network. It's based on curated data and machine learning, and right now it consists of 11,000 human genes and 180,000 interactions. This is how it looks, and it's only 5% of the network: these little dots are proteins, and these little lines are the interactions between proteins.

Network analysis in a very simplistic form looks like this.
This is the network, and this is the list of genes that were upregulated or downregulated in your study, any kind of study. You take that list of genes and project it onto the network, and then the software looks for interactions between your genes. In most cases it finds a big cluster of genes; some of the genes are not connected, and some are connected with each other. Optionally you can add a subset of so-called linkers. Linkers are genes that do not belong to your list but help connect all the genes on your list into one big subnetwork. This subnetwork we usually call a disease-specific subnetwork: a breast cancer or prostate cancer specific subnetwork, and so on. Then you remove the background and you play with this subnetwork. What can you do? You can run a clustering analysis, or you can calculate some properties of the network, like node degree, betweenness, and closeness. Basically we would need another lecture to cover graph analysis.

So, the take-away message: there are GO, pathway, and network ways to analyze your gene list, and the best way is to do all three.

Now we are switching to the enrichment test. This is the very usual output from an enrichment test. There are a number of software tools, some of them online, that do it; I'm just using a screenshot from the ICGC portal. So what do we see here? I submitted a gene list that consisted of 130 genes; it doesn't really matter in this case what those 130 genes are about. Here we see the list of Reactome pathways that are enriched. On this side, highlighted in blue, are the pathway IDs that lead you to the Reactome website and the actual representation of the pathway, and then the number of genes that belong to the pathway. Let's choose something meaningful: Signaling by ERBB4 has 294 genes, and eight of those genes belong to my list. So I'm skipping this part.
Junjun covered it in his lecture. What is important for us is the p-value and the adjusted p-value. So the question is how this p-value is calculated. To calculate the p-value we need an enrichment test, and for an enrichment test we need three things: your list of genes, the pathways (Reactome), and the so-called background list, which is basically the list of all genes that were tested in your experiment. Then there is a black box where the calculations go on, and the output looks like the screenshot I showed you on my previous slide.

In the majority of cases, enrichment testing is done using the so-called hypergeometric test. So how does it work? It's pretty simple. Let's assume we created a microarray that consists of a thousand human genes. You know, in the old days there were a lot of those customized arrays that did not cover the whole genome, as they do nowadays, but just a particular subset of genes. This is our background list; that's exactly what I was talking about. Out of those 1,000 genes, 100 belong to some pathway X, EGFR signaling for example. We did some kind of assay and found that five genes are significant: they are highly mutated, or upregulated, or downregulated, whatever. And of those five genes, three belong to EGFR signaling. So is it significant? To calculate that, we use this formula. But don't be afraid; you're probably not going to see this formula again in your life.
This formula consists of three combination functions, and I'll try to explain what it's doing. The values for the variables are shown here. N equals 1,000: it's the total size of our microarray. M is the number of genes that belong to the EGFR signaling pathway, 100. n is the number of genes that we selected for analysis, five. And k is the number of genes from our list that belong to EGFR signaling, three. Four numbers, very important, and we put all four into this formula.

What is this formula calculating? For example, this term is the number of combinations, the number of ways to select n out of big N: how many ways we can select those five genes out of 1,000. The same thing here: how many ways exist to select three genes out of the 100 EGFR-related genes. And it's pretty similar over here with N minus M and n minus k. N minus M is 900: the genes on my array that do not belong to EGFR signaling. And n minus k is two: the genes I selected that also do not belong to EGFR signaling. Calculating this formula gives you the p-value, which, as you see, is pretty significant.

I told you about the background list, all the genes that were tested. As you see in the formula, the 1,000 plays a significant role. What does that mean? If you're using a microarray that covers all genes, you don't have to worry about it: it's going to be, what, 25,000, however many genes we have in the human genome; it depends on the array. If you're using a customized array, this number will be smaller. Or, for example, nowadays a lot of people are doing targeted sequencing: they select a gene panel of, let's say, a hundred or two hundred genes and sequence only those genes deeply. So when doing pathway analysis, you should keep that in mind.
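The worked example above (N = 1,000, M = 100, n = 5, k = 3) can be checked with a short Python sketch. The formula on the slide is not reproduced in the transcript; the code assumes it is the standard hypergeometric probability, which matches the three combination terms described. The function names are hypothetical. Enrichment tools usually report the chance of seeing k or more hits rather than exactly k, so both are shown.

```python
from math import comb


def hypergeom_pmf(k, N, M, n):
    """Probability of exactly k pathway genes among n selected genes.

    Assumed form of the slide's formula:
        C(M, k) * C(N - M, n - k) / C(N, n)
    N: background size, M: pathway genes in the background,
    n: genes selected, k: selected genes that are in the pathway.
    """
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


def enrichment_p_value(k, N, M, n):
    """One-sided p-value: probability of k or more pathway genes."""
    upper = min(M, n)
    return sum(hypergeom_pmf(i, N, M, n) for i in range(k, upper + 1))


# Numbers from the talk: 1,000-gene array, 100 EGFR signaling genes,
# 5 significant genes, 3 of them in EGFR signaling.
print(hypergeom_pmf(3, N=1000, M=100, n=5))
print(enrichment_p_value(3, N=1000, M=100, n=5))

# Same hit list against a whole-genome background of 25,000 genes:
# the background size changes the answer, as the talk points out.
print(enrichment_p_value(3, N=25000, M=100, n=5))
```

With the talk's numbers, exactly three hits has a probability of roughly 0.008, and three or more is only slightly larger. Swapping in a larger background makes the same overlap look far more significant, which is why the background list has to match what was actually tested.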
So the hypergeometric test is a very, very useful tool, and it's not only applicable to pathway analysis; it can be useful in many different ways. And to do it, you don't need to use that complicated formula. You just need to Google "hypergeometric test" or "hypergeometric test calculator." That's what I found, and it's basically the same thing. Population size: 1,000. The letters might be different, so be careful about that. Number of successes in the population: 100, the number of EGFR signaling genes. Sample size: the genes that we selected as significant. Number of successes in the sample: three. If you press the Calculate button, you're going to get approximately the same probability.

But this is not the end of the story. The p-value alone should not drive your decision on which pathway to select for further validation. Why? Because there are a lot of pathways, so we should always think about multiple test correction. What does that mean? If we randomly draw five genes out of our array many times, then about once in every 100,000 draws we're going to select a subset of genes that all belong to EGFR signaling, and that's not what we want, right? It means that we have to penalize. I just mentioned that there are 2,000 pathways in Reactome alone. There are some resources that do enrichment analysis not only against Reactome but against all pathway databases together, and altogether that could be 10,000 pathways. So some of the pathways that came up as significant on your list might be due just to random chance.

There are different ways to fix this problem. At this point I had four or five additional slides, but Michelle cut them and said that I'm talking too much and just need to introduce the concept of multiple test correction. There are different ways to do it. The most famous, or most widely used:
It's Bonferroni correction and FDR, the false discovery rate; you've probably all heard about them. I prefer FDR, because Bonferroni might be a little too strict, and sometimes it just washes out information that might be useful. But what is FDR? FDR is the false discovery rate, and in most cases people set it at a level of 0.1. What does that mean? It means that 10% of your enriched pathways could be due to random chance. For example, if 10 pathways are significant and the FDR is 0.1, then one out of 10 will be due to random chance, but nine out of 10 are true positives. Perfect. If you increase the FDR and set it at a level of, let's say, 20%, which is easier to calculate, you're probably going to get 15 enriched pathways, which kind of looks better. But then the false discovery rate is 0.2 multiplied by 15: you're going to have three pathways that are false positives, which is not what you want. So you decide; it's a philosophical question. But I really don't like papers where the FDR is 30% or higher. That's nonsense.

So, the take-away message: the hypergeometric test is a powerful tool. Don't forget multiple test correction. And keep in mind the number of genes or proteins in your total population; it might influence your final result.

And now the fun part: pathway analysis of large-scale cancer genomic data sets. I mentioned this already. This is a very, very usual graph for cancer genomics. We took 52 pancreatic cancers and found 200 recurrently mutated genes, and we plotted them here. On this scale we have the number of patients, here is the percentage of patients, and these are all the genes. This is called the long tail of cancer genes. There is KRAS, a very interesting gene that is mutated in about 95% of pancreatic cancer patients; p53 is at about 30%; and there are a few others. But the rest of them are just very low. So how do we make sense out of this information?
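To make the multiple test correction from the enrichment part of the talk a bit more concrete, here is a minimal Python sketch of Bonferroni and of the Benjamini-Hochberg procedure. Benjamini-Hochberg is one common way to compute FDR-adjusted p-values; the talk does not say which FDR method the portals use, so that choice is an assumption. The function names are hypothetical and the raw p-values are made up for illustration.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR-adjusted p-values (Benjamini-Hochberg step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw enrichment p-values for five pathways.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
cutoff = 0.1  # the FDR level mentioned in the talk

print(bonferroni(raw))
print(benjamini_hochberg(raw))
print(sum(p <= cutoff for p in bonferroni(raw)), "pass after Bonferroni")
print(sum(p <= cutoff for p in benjamini_hochberg(raw)), "pass after FDR")
```

On these made-up numbers, four pathways pass the 0.1 cutoff after Benjamini-Hochberg but only two after Bonferroni, which is the "too strict" behavior mentioned in the talk.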
To do this we put together this workflow of pathway and network analysis. You generate your list of genes and run your enrichment analysis. You can use the ICGC portal, the Reactome portal, or g:Profiler; g:Profiler is actually very advanced, and it takes your background list into account as well. You can browse through your significant pathways in Reactome or whatever tool you are using. Then you build the protein interaction subnetwork the way I showed you in one of my slides, run a clustering algorithm, and then run enrichment analysis on each module, each cluster, separately. Drill down to understand the molecular mechanism, validate your model, usually in the wet lab, and publish a manuscript.

The majority of those steps can be done using the Reactome Functional Interaction network Cytoscape plugin. Who has heard about Cytoscape before? Yes? It should be your best friend for this type of analysis.

And this is how it looks, and I think it's beautiful. This is the pancreatic cancer specific subnetwork: the blue dots are genes, and the lines are interactions between genes. Genes are clustered into modules; basically, genes that interact heavily with each other are clustered together. The bigger the node, the higher the mutation rate the gene has. Then each module was annotated with a pathway enrichment tool. For example, module 5 is enriched in axon guidance; actually, this pathway comes up in every single cancer I have analyzed for a while. Then we have, for example, a module for translation. Some of them have several annotations: Hedgehog and TGF-beta signaling. Of course ERBB is here, ERBB signaling, and p53 signaling; this big node is p53. That's the way it goes.

Take-away message number three: try different tools.
Don't be shy. Then there is the issue of non-relevant enriched pathways. Reactome is a huge resource, and it's trying to cover different areas of human biology. If you are analyzing some drug X, and you stimulated your cell lines with this drug, ran your pathway analysis, and got tuberculosis or HPV infection or something like that, it's most likely not really relevant to your study. So don't publish it; exclude it. Be careful, but exclude it. In the ideal world, of course, you should create your own list of pathways that you are interested in: you take those 1,800-plus Reactome pathways and select, before seeing your gene list, those that might be relevant to your study. But it's a lot of work, and in most cases you will need some very basic programming skills. But just think about it.

If no significant pathways were detected and you have excluded all possible mistakes, don't be disappointed. Maybe it's something new that we've never seen before.

And just an additional message: all lectures are available here, and you can get a lot more information about pathway and network analysis, including Cytoscape and our plugin for Cytoscape. Questions?