OK, so I'm skipping the slides. Pathway and network analysis. So what are we going to cover today? It's an introduction to pathway and network analysis, sources of pathways and networks, and the enrichment test that is running behind the pathway analysis. And I'm going to show you a very small and very cute example of how to do a large-scale cancer genomic data analysis. OK, let's go.

So the first question is, why do we need pathway analysis? Any ideas? So the first thing is, pathway analysis helps us to reduce data from the hundreds of genes that you're getting from your screenings, or sometimes thousands of genes, to a very handy number of pathways. And that helps you to increase the statistical power of your test. The second reason is that genes or proteins very rarely operate on their own. They always operate in groups, in pathways, in connections. So if one gene is mutated in one patient and its neighbor is mutated in a second patient, the outcome could be the same. So that helps us to find the meaning of the long tail of rare cancer mutations. You probably all know that in a particular type of cancer there is only a very small number of genes that are mutated at a high rate, like p53, or KRAS in pancreatic cancer. The rest of the genes are mutated at a very low rate, like 3% of the population, 5% of the population. We call it the long tail. So if we do pathway analysis, or even better network analysis, we can make sense out of those low mutation rates. And it helps us to generate biologically meaningful hypotheses. It's significantly easier to generate a hypothesis based on the EGFR signaling pathway than on the zinc finger 418 gene that is upregulated in your, whatever, shRNA screening. Yes. So those are three different reasons. There may be more, but that's what I came up with.

So what do we need for pathway analysis? Three different things. The first one is a biological question or hypothesis that you probably have.
This thing is optional, but helpful. The second one is the list of altered genes or proteins that you got from your assay. And the last one is a source of pathways or networks that could be publicly available, that's good, or commercially available, also not bad.

Let's start from the first thing, the biological question or hypothesis. As I said, it's optional. But it's always good to keep in mind that it helps. So before you start your screening, whatever it is, based on the literature, on publications, on your previous experience, on a suggestion from your supervisor, you can say, OK, I might expect that p53 signaling is activated after I treat my cell lines with this particular drug. In this case, you basically need to test only one hypothesis. You don't need to run enrichment tests across all pathways, and there could be thousands of those. You just have one hypothesis.

So why do we need a biological hypothesis? Or what kind of biological hypothesis? Hopefully, it's a part of your experimental plan. You might want to summarize the biological processes or other aspects of gene function. Or, if you're doing any kind of expression profiling, you might ask what pathways are different between samples with mutations and without mutations, or between treated and untreated cell lines. You might want to find a controller of the process. For example, if so many genes are upregulated, there's probably some transcription factor, TF stands for transcription factor, that is actually responsible for this process. You might want to discover a new gene function, or even a new pathway. Who knows? And then you can send an email to Reactome and say, hey, guys, I have a new pathway. Let's create it. Or, and this is a little bit more advanced and also covered in our big workshops, you might want to find a correlation with disease or clinical data attributes.

So where does the gene list, the second component of our enrichment test, come from?
The first one is pretty obvious and straightforward. It's your screening, sequencing, RNA-seq, DNA-seq, whatever. The second one, as Vincent just showed you, could be a set of genes from public data portals, like ICGC, TCGA, or COSMIC. For example, genes with high-impact mutations in breast cancer, in women with recurrent disease. Fine. And the last one is a gene list that is manually or automatically curated from the literature. For example, you have a rare cancer and there are only so many publications on this subject. You went through all the publications and you collected all the genes that were mentioned in them. There's nothing wrong with that. You can go to a pathway portal, run your enrichment analysis, and discover which pathways are enriched in these genes. That's also fine.

So the next point of the presentation is where we're actually getting those pathways and networks. I'm going to stop briefly on Gene Ontology, pathway databases, and network databases.

So, Gene Ontology. What is an ontology in general? It's basically a data model that represents knowledge as a set of concepts. For example, what is a berry? A berry is a strawberry, a blueberry, a raspberry, and so on. But besides that, a berry is also food. Besides that, a berry is also a plant, and so on. So using this idea, we can create a very large network or hierarchical structure of all these concepts. And the same thing is going on with Gene Ontology. They're trying to create a structure of biological phrases or terms like protein kinases, apoptosis, membranes, and so on. It's all available at the www.geneontology.org website. The Gene Ontology website is not static. It's constantly updated. And the Gene Ontology Consortium has a lot of very ambitious targets. One of those is to synchronize all these biological terms across different species, at least the major species.
And Gene Ontology is publicly and freely available. You can download any terms, any descriptions, for free.

So what is Gene Ontology covering? There are three different components. Cellular component is basically where your reaction or whatever is happening. Molecular function is something like a ligand interacting with a receptor, or something basic like this. And biological processes are things like cell division, apoptosis, and so on. From my experience, the last one is the most useful in cancer research.

And just a couple of words on the Gene Ontology structure. It has a very complicated structure. Gene Ontology is basically the root of this structure. Then we have biological process. It's one of those, sorry, it is located here. Then cellular process is one of the children of biological process. Physiological process is one of the children of cellular process, and so on. So finally, we have B cell apoptosis. That's probably the smallest child, and it is a child of apoptosis. And apoptosis is a child of programmed cell death. So there could be a gene that belongs to programmed cell death, but not to apoptosis. So it's very complicated. But at your stage, it's good to know that every term is also described by a set of genes. And that's how you're running your enrichment test.

The second part is the pathway databases. So what are the advantages of pathway databases? Usually they are curated, especially Reactome. We have a very nice biochemical view of the biological processes. So the cause and the effect are captured very well. I'm going to show you one of the examples. And we have nice cartoon drawings that present those interactions. But there are also disadvantages. The first one is the sparse coverage of the genome. And it's your fault, because everybody wants to work on p53 signaling. Nobody wants to work on zinc finger 314.
And of course, in the first case, it's easier to publish a paper and get funding than in the second one. That's why the zinc finger never comes up in a pathway analysis. But p53 is probably in 20% of pathways. And the second one is, in this case, our fault. The different databases disagree on the boundaries of the pathways. So if you download p53 signaling from KEGG, Reactome, PANTHER, and some others, they will all be different. They are all built around p53, yes. But there are a lot of discrepancies, I would say.

So just an introduction to Reactome. Reactome is based at OICR, and we have a group of people who are taking care of it. Not only OICR, but one of the hubs is OICR. Reactome contains only hand-curated pathways. And we have very rigorous curation standards. Every reaction in Reactome is actually traceable to the primary literature. As of October 2015, there were almost 2,000 human pathways covering almost 9,000 proteins. That's exactly what I'm talking about. We are far, far away from covering the whole genome. And Reactome is open access, free. In comparison, for example, to KEGG, which several years ago went commercial.

So what's the structure of Reactome? I randomly chose the G1/S DNA damage checkpoint pathway. On the right side of the pathway viewer, you have a hierarchical representation of the pathways. What we have here at the very top is cell cycle. Then cell cycle checkpoints. And then this G1/S DNA damage checkpoint. If you click on the third line, you'll see a very nice view of the pathway. Basically, this kind of viewer was ported later on into ICGC, and Winston showed you one of the examples. So it's a very rich representation of what's going on in the G1/S DNA damage checkpoint. The green ones are proteins or post-translationally modified proteins. And the blue ones are protein complexes.
We have some small molecules that also participate, ATP, for example, here. And if you click on one of the reactions, you see this reaction is highlighted in yellow, and there is a small description of what this reaction is all about. In this case, I think it's p53 binding to the promoter of CDKN1A. Under each pathway, we have a text description of the pathway. And this is my favorite feature. We downloaded information from the Human Protein Atlas, so we have the expression of every protein that participates in this pathway across the different human tissues. And you see, in the case of p53, the expression is pretty even across all tissues. But the second protein, CDKN1A, is kind of jumping up and down. In esophagus, there is high expression. In heart, there's low expression. Don't ask me why, but you, as biologists, can probably address these questions. And you can also download this information in different kinds of formats. That's what Reactome looks like. Very rich. A lot of work is invested in this project. It's for you, for biologists, for biochemists. Use it.

So now we're going to switch to the networks. Pathways versus networks. What is the difference? This is what a pathway looks like. In this case, it's EGFR signaling. The EGF receptor is interacting with its ligand. It's a complex, the blue color. Then it's dimerizing. And then, with the help of this kinase, it's getting phosphorylated. So it's a very nice view, easily human-readable. Usually, pathways are very detailed. As you see, a lot of information is captured, so the biochemical reactions are here. It's usually very small scale. And it's concentrated information from the literature.

Networks are significantly more simplified. You don't see the inputs and outputs that you see in the pathways. All proteins have equal value. No small molecules are involved here. We remove them.
Then, while pathways are very small scale, networks are large scale. We don't have an EGFR signaling network. Networks are usually huge. And you can extract from the network a subnetwork that is specific for your disease or specific for a particular pathway. Do you feel the difference? And while the majority of pathways are based on the literature, networks are created based on the literature and usually on omics data as well, using machine learning approaches. That helps us to increase the genome coverage.

So what kind of network databases exist? As I said, they are built automatically or via curation, with more extensive coverage of the genome. And here I put four examples of the different kinds of functional networks that exist online. BioGRID, and don't be afraid of this number. It's not only the human genome. It's different kinds of genomes. IntAct, MINT, and our favorite, the FI network. FI stands for functional interaction network, and it was built based on Reactome. We have different versions. The latest version contains about 11,000 proteins, which is much bigger coverage than just Reactome. And 880 interactions. And the network looks like this. And this is only 5% of the network. We have a lot of these little nodes, which are proteins, and the interactions between the proteins. Some of the proteins are clustered together, like here. And some of them are just scattered around the network.

So how does network analysis work? Let's say this is our functional interaction network. And we have a lot of genes that are upregulated, downregulated, or more or less unchanged. So let's say upregulated and downregulated. We're projecting our genes onto the network. In the next step, the plugin is looking for the interactions between those genes that are on our list.
And in the next step, we might look for so-called linkers, genes that are not on our list but that might help us to create a larger network based on our genes and the linkers together. And if we remove the background network, that's what we are getting. And we can retrieve a lot of useful information from this network.

So, the take-home message: there are three ways to analyze your data. It's GO-, pathway-, and network-based analysis. Use all three. They're all different, and they're all going to provide you with different information.

So let's talk about enrichment analysis. It's basically the test that is running behind your pathway enrichment analysis. You already saw this in the previous presentation. It's the output from the enrichment test, the list of the pathways. And what we care about here is the p-value and the adjusted p-value. So how do we get those p-values?

For the enrichment test, you need three different items. It's almost the same thing as what you need for pathway analysis. First, your gene list. One of the questions in Liverpool was whether the gene list should be normalized or not. The enrichment analysis is not doing any normalization, so everything should already be done before. That's why I put here "normalized gene list". The second one is basically your database, Reactome, KEGG, whatever. And the last one is very important, and I'll explain why. It's the background list, all genes that were tested in your set. Then we have a black box, a big black box, the enrichment test, and the output, the enrichment table. The same as what we saw on the previous slide: the list of the pathways, p-values, and some kind of adjustment, FDR, for example.

So how does the enrichment test work? It's usually done by the hypergeometric test. Let's assume we have a bucket of 1,000 genes. It could be, for example, your shRNA screen.
Or, for example, in the old days, a microarray contained not all genes in the human genome, only a particular set of genes, 1,000, yes. Or, for example, you created a panel of genes for targeted sequencing, which is very popular nowadays, 1,000 genes. Of those, 100 belong to, say, the EGFR signaling pathway. You did your analysis, your sequencing, and you found out that five genes are significant. And of those, three genes belong to the EGFR signaling pathway. Is it significant or not? Is your list of five genes enriched in EGFR signaling or not?

To test that, you have to state the null hypothesis and the alternative hypothesis. They always exist. The null is that there is no enrichment. The alternative is yes, my list is enriched in EGFR signaling. To calculate this, you use this very complicated formula. And I don't think you will ever see it again. So just look at it and forget about it, yes. And here I put the variables that are used in this formula. And the p-value is pretty significant, yes.

There are three very important points that I want to stop on here. The first one: the background list, those 1,000 genes, is a part of this formula. It plays a very important role. So if you are not testing all genes, if you are not doing whole-exome sequencing, if you are doing targeted sequencing, keep it in mind. You have to use the number of genes that were tested in your assay. And the highly sophisticated class of enrichment tools, like g:Profiler, actually gives you the option to upload the list of genes that were tested. The second point: this part will be done automatically, so you don't really need to see it. And the last one: as I said, if you have a hypothesis, in this case EGFR signaling, and you are doing only one enrichment test, this p-value is actually the end point of your journey. But as I said, in Reactome, we have almost 2,000 pathways.
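
The bucket example above can be checked directly. This is a minimal sketch in Python using only the standard library, not the code of any particular enrichment tool. The function name `hypergeom_enrichment_p` and its argument names are made up for illustration. It assumes the usual one-sided test: the p-value is the probability of drawing at least the observed number of pathway genes when sampling without replacement from the background.

```python
from math import comb


def hypergeom_enrichment_p(background, in_pathway, drawn, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    background: number of genes actually tested in the assay (capital N)
    in_pathway: how many background genes belong to the pathway
    drawn:      size of the significant gene list
    overlap:    how many significant genes belong to the pathway
    """
    total = comb(background, drawn)  # all possible gene lists of this size
    upper = min(in_pathway, drawn)   # the overlap cannot exceed either set
    tail = sum(
        comb(in_pathway, i) * comb(background - in_pathway, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Numbers from the talk: 1,000 tested genes, 100 in the pathway,
# 5 significant genes, 3 of them in the pathway.
p_value = hypergeom_enrichment_p(1000, 100, 5, 3)
print(f"p = {p_value:.4f}")
```

With these numbers the p-value comes out a little below 0.01. Changing only the first argument changes the result, which is why the background has to be the set of genes that were really tested and not the whole genome by default.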
And if you want to test all 2,000 pathways, then this p-value is not the end of your journey, unfortunately. You should always think about multiple test correction. So what will happen? Let's say we have the same bucket. We have the same 100 genes that belong to EGFR signaling. And if we draw five genes from this bucket one, two, three, four, 100,000 times, sooner or later we will draw five genes that all belong to EGFR signaling. And the p-value will be very significant. Hooray, it works. But is it a true result? I mean, that's what we want, but is it true? So that's why, if you're testing across a lot of pathways, you have to use FDR, Bonferroni correction, q-value, adjusted p-value, or whatever your pathway enrichment tool is providing. Yes? They are all slightly different. Some of them are stricter, some of them are less strict. It doesn't matter. They all basically do the same job.

My favorite is FDR. It's the false discovery rate. And what it gives you is the expected proportion of the observed enrichments that are due to random chance. I'll try to explain what that means. Let's say you ran your test. You did a correction. And you chose the pathways with FDR less than 25%. Let's say 20 pathways were significant. Of those, five are due to random chance. If you choose a 10% threshold, of course, fewer pathways are significant. Let's say only 10 are significant. Of those, only one is due to random chance, and nine are good. And the stricter you go, the better it is. So in the case of 5%, let's say five pathways are significant. Of those, only 0.25 are due to random chance. So what threshold to use? It's up to you. But I really hate publications that are using something like a 25% error rate. It's not appropriate.

So what else do I want to mention here? Sorry. The hypergeometric test. It's a very, very, very useful test.
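
To make the multiple test correction discussed above concrete, here is a small sketch in plain Python. The talk does not say which FDR procedure the tools use. This example assumes the Benjamini-Hochberg procedure, which is one common way to turn raw p-values into FDR-adjusted ones. The function name, the pathway names, and the p-values are all hypothetical and only serve as an illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that the adjusted values
    # never increase as the raw p-values get smaller.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from four enrichment tests (made-up numbers).
raw = {
    "pathway_A": 0.01,
    "pathway_B": 0.04,
    "pathway_C": 0.03,
    "pathway_D": 0.005,
}
fdr = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

threshold = 0.05  # the cutoff is the analyst's choice
significant = [name for name, q in fdr.items() if q <= threshold]
print(fdr)
print(significant)
```

A pathway is kept when its adjusted value is at or below the chosen threshold. With a 5% cutoff, roughly 5% of the pathways that pass are expected to be there by chance, which is the reading of FDR given in the talk.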
And if you google "hypergeometric test calculator", you can find a number of those online. In this case, I just put in the numbers that we used in our previous exercise. And we basically got the p-value that I showed on my slide. So it's not only about pathways. It's not only about GO terms. You can use it in your life, in your research, in your science, basically daily. For example, 1,000 people are attending the CCR conference. Of those, 200 people are PhD students. 50 people are attending this workshop. Of those, 25 are PhD students. Are PhD students enriched in the audience of this workshop? If you put these four numbers into this online tool, you will calculate the p-value and you will know the answer. Got it? OK, now let's have fun.

OK, take-home message. Sorry. Let's wait with the fun. So the hypergeometric test is a very powerful statistical tool. Use it. Don't forget multiple test correction, FDR, and keep in mind capital N, the size of your population, the number of genes that you submitted for testing.

The last one. I promised to show you one of the examples of large-scale cancer genome data analysis. So this is the whole-exome sequencing of 52 pancreatic cancers. More than 200 genes were recurrently mutated. The first gene is KRAS, and it was mutated in about 95% of samples. The second one is p53, mutated in about 50% of samples. And there is another one, one of those channel genes that I really don't know anything about. But the rest of the genes were mutated at a very, very low level, like 6%, 5%, and then the tail was getting smaller and smaller and smaller. So what are we going to do with that? Basically, we can collect those genes and run the pathway analysis. There is no problem with that. So here, we created a so-called pipeline for how to analyze this kind of data.
So you're generating your list of genes, and you're running your enrichment analysis using g:Profiler, ICGC, Reactome, GO tools. There are a number of tools available online. Then you can browse the significant pathways in Reactome, trying to make sense out of what you see. You can build an interaction subnetwork, the way I showed you before. You can run a clustering algorithm. I'll show you the output of that. And you can run the enrichment analysis on each cluster, on each module, individually. Drill down to understand the molecular mechanism, validate your module, usually in the wet lab, and submit the manuscript. And all these steps can be done using the so-called Reactome functional interaction network Cytoscape plugin.

Who has ever used Cytoscape before? One, two, three, four, five. OK, five out of 50. Cytoscape is a framework. It's not a primary tool. It's a framework for creating different kinds of plugins. It's extremely useful for cancer genomics. So please, either attend one of our workshops or get your hands on it by yourself, but it's highly recommended.

And, well, that's what we are getting out of Cytoscape. We took all the genes that were mutated in those 52 pancreatic cancer exomes, projected them onto the functional interaction network, created a so-called pancreatic cancer subnetwork, and ran clustering. And then we annotated each cluster separately. For example, here is a cluster of genes, and this cluster or module is enriched in p53 signaling. The size of the node is proportional to the number of samples in which the gene was found mutated, so p53 here, you see, is pretty huge. The rest of the nodes are pretty small. Or, let's say, Rho GTPase signaling here, the blue cluster, and so on.

So the last take-home message: try different tools. And yes, the issue of non-relevant enriched pathways. Reactome is not only about cancer research.
Yes, it's trying to cover many, many aspects of human life. So if you are, for example, analyzing your cancer cell lines treated with a drug or something like that, and you see that tuberculosis is enriched among the pathways, or some kind of HPV infection, I mean, don't publish it. It's not a relevant pathway. In an ideal world, before doing the enrichment, you would download all pathways from Reactome, go through them, choose only those that are relevant to your study, and then run the enrichment. But it's a very laborious process. So it's much easier to parse your output than the input. And if no significant pathways were detected and you've ruled out all possible mistakes, please don't get disappointed. Maybe your pathway hasn't been curated yet. Think about it. Maybe it's something new that you discovered. Yes.

And please, all the lectures on pathway and network analysis you can find online. You can download them. Put them on your iPhone. You know, when you're on a bus going to work, just listen, look, study, use them at least. And if you have questions, email us. That's it.