Welcome to the MOOC course on Introduction to Proteogenomics. Having understood how mutations in a given gene, or specifically at phosphosites, can alter expression of the gene and affect signaling pathways, we will now learn from Dr. Karsten Krug how a gene list can be transformed into pathways. He will talk about how one can interpret the role of differentially expressed genes in clinical conditions, as compared to healthy individuals, by analyzing the pathways they affect. He will also talk about various pathway databases that can be used to analyze pathways, such as MSigDB, WikiPathways, KEGG and others, and about two basic ways to perform pathway enrichment. So, let us now welcome Dr. Karsten Krug to talk in more detail about how one can use various tools to transform a gene list into pathways and make sense of the data obtained using various omics technologies.

So, in this lecture we want to talk about how we can transform gene lists into pathways. If you perform your experiment, you compare your wild type and knockout, and you perform the statistical test that Mani was talking about, you end up with long lists of differentially expressed proteins, phosphosites or genes, and possibly a high proportion of those are false positives, meaning they are actually not differentially expressed in your sample. This makes it very hard to interpret these results biologically. So, this is just an example here. If you look at the luminal and basal breast cancer subtypes that we looked at yesterday in the hands-on session, we see that more than 1000 proteins are upregulated in basal, or differentially regulated between basal and luminal. But what we actually want to know is: what are the biological pathways that drive this kind of separation between luminal and basal?
So, in order to do that, there are many different ways to perform pathway analysis, and it all starts with a pathway database. How do we represent a pathway in a computer? In the simplest case, a pathway is a group of genes that are members of a pathway or share some common biological process. So, basically, a pathway is a list of genes, of gene symbols. There are many different resources for pathway databases, like MSigDB developed at the Broad, WikiPathways, and NetPath, which has been developed here and by Akhilesh Pandey's group. There is the Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and there are even more. If you click on these links, they will take you directly to the websites of these databases.

I just want to briefly point out WikiPathways, which is a very promising resource for curated pathways. It has been developed over the last couple of years, but now it really starts to take off. It is like a Wikipedia for pathways. Maybe you, in your particular lab, are interested in one specific pathway; then you are actually the expert to do the curation of that pathway. This resource should enable you as a researcher to help the community by providing highly accurate, curated pathways. We are also using that resource a lot, and they also have a curator of the week and things like that, so if you are really contributing a lot to that resource, you might end up on their webpage.

So, ok, there are pathway databases. Now we want to do pathway enrichment, and there are basically two different ways to approach that problem. One is based on some sort of statistical test.
Here I have just pointed out Fisher's test, where you test for over- or under-representation of pathways in a list of differentially expressed genes between two conditions, let us say tumor and normal. What that means is that you have to define this list of differentially expressed proteins beforehand, before you do your pathway analysis. Let us say you compare basal and luminal: you do your statistical test, a two-sample t-test, and you take everything that is differentially expressed at 1 percent FDR. That is the input to your pathway analysis.

The other approach is the so-called gene set enrichment analysis, which was introduced in 2005 and is highly cited. It is a very convenient way to look for small but coordinated changes in your samples. Let us say a protein in a pathway does not pass the threshold for being statistically significant on its own. But if you observe many different members of the same pathway that might not change a lot, but all change in the same direction, this increases the evidence that your pathway might be enriched. The main difference compared to the first approach is that here we are looking at all measurements. We do not filter anything beforehand; we look at all measurements at once.

So, briefly about Fisher's exact test; I am sure that many of you are aware of it. It is a test for the significance of contingency tables. There is a little story that Mani likes to tell here. The test was developed by R. A. Fisher to address claims by a good friend of his, who was called Miss Bristol, and who insisted that she was able to tell whether the milk or the tea was poured first into a cup.
And so, in order to prove her wrong, he conducted a little experiment with eight cups: he would pour the tea first into four of the cups and the milk first into the other four. Then he counted the number of successes of the lady and filled out this contingency table, and from that you can calculate Fisher's exact p-value by enumerating all possible outcomes. So, in this case, he proved her wrong.

And you can do the same with pathways. Let us say you want to compare your pathway with your list of differentially expressed genes. On the x-axis you have your gene list, and you ask whether a gene is in your list or not; on the y-axis you have the pathway, meaning whether the gene is in your pathway or not. Then you can fill out this matrix. You always compare against a background, so the total N should sum up, for example, to all genes in the human genome; depending on which database you are using, this number might be slightly different. In this case, we have roughly 19,000 genes.

And here is a little example. This is an arbitrary, theoretical example; it does not have any biological meaning. You have your pathway, and everything that is highlighted in bold here overlaps with your gene list, meaning 6 of your gene list members are on this pathway, and so on and so forth. Your background in this case is a universe of 20 genes. Again, this is just an example of how to fill out this contingency table. And then, again using R, you can calculate Fisher's p-value. In this case, it is not significant whatsoever. So, this is something you do for every pathway that you want to test.
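The contingency-table example above can be sketched in a few lines of Python (the lecture itself uses R). This is a minimal sketch of a one-sided Fisher's exact test for over-representation, computed by enumerating the hypergeometric tail. Only the universe of 20 genes and the overlap of 6 come from the lecture; the pathway size of 10 and the gene list size of 12 are hypothetical values chosen for illustration, and the function names are made up for this sketch.

```python
from math import comb

def fisher_enrichment_p(overlap, pathway_size, list_size, universe_size):
    """One-sided Fisher's exact p-value for over-representation.

    Probability of seeing at least `overlap` pathway members in a gene
    list of `list_size` genes drawn from `universe_size` background genes.
    """
    non_pathway = universe_size - pathway_size
    max_overlap = min(pathway_size, list_size)
    # Sum the hypergeometric probabilities of all tables that are at
    # least as extreme as the observed one.
    tail = sum(
        comb(pathway_size, k) * comb(non_pathway, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / comb(universe_size, list_size)

def contingency_table(overlap, pathway_size, list_size, universe_size):
    """2x2 table: rows = in pathway / not, columns = in gene list / not."""
    in_pathway_not_list = pathway_size - overlap
    in_list_not_pathway = list_size - overlap
    neither = universe_size - pathway_size - in_list_not_pathway
    return [[overlap, in_pathway_not_list], [in_list_not_pathway, neither]]

# From the lecture: universe of 20 genes, 6 gene-list members on the pathway.
# Hypothetical (not given in the lecture): pathway size 10, gene list size 12.
table = contingency_table(6, 10, 12, 20)
p_value = fisher_enrichment_p(6, 10, 12, 20)
print(table)
print(f"p = {p_value:.3f}")
```

With these assumed sizes the p-value is far above 0.05, in line with the "not significant" outcome of the slide example. In practice the same function would be called once per pathway, and the resulting p-values would then need correction for multiple testing.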
One very convenient and very powerful tool is called DAVID, which was published several years back, but is very powerful and very easy to use. As I just described, you can paste in your list of differential genes, tell the software whether this is your gene list of interest or your background list, and then perform the sort of test that I just described. For every pathway, you will get a p-value, enrichment scores and things like that. It is very convenient to use because you just go into the results of your statistical tests in Excel, filter your significant genes, and paste them in here.

So, now we want to talk about gene set enrichment analysis, or GSEA. This is also something we want to try during the hands-on sessions, so I hope that we make that work. As I already mentioned, here you take into account all genes; you do not have to filter before you do the analysis. Pathways are also called gene sets. With the introduction of GSEA, the Broad also came up with a collection of gene sets, each of which is, again, just a collection of genes that might refer to a pathway, a shared biological process, and so on and so forth. In general, a gene set is nothing else than a pathway. There is a large collection of those that you can find in the Molecular Signatures Database, or MSigDB. If you go to this website, you will find this kind of overview. These are the different categories of gene sets that you can find in MSigDB, and here I have highlighted what I guess are the most commonly used gene sets in MSigDB. One is the so-called Hallmark gene set collection, which is a very small database.
These are just 50 signatures, or 50 pathways, but they are highly curated and represent very important and common cancer hallmark pathways. The other one is the so-called C2 CP, where CP stands for canonical pathways. That is a collection of gene sets that have been derived from other pathway databases like KEGG, Reactome, and so on and so forth. Canonical means that these are what we believe to be the actual pathways. And there are others that might or might not be of interest. For example, if you are used to Gene Ontology terms, you can also look into the category C5, and so on and so forth.

So, the general principle of GSEA is shown on this slide here. This is actually Figure 1, I believe, from the original publication back in 2005. Again, you start with a data matrix; we have seen this kind of data format a couple of times in our workshop. You measure features, in this case genes, in your rows, and you measure these features across a set of samples. You always want to compare at least two phenotypes, two conditions; in this case, phenotype A and phenotype B, let us say one is tumor and one is normal. You then rank this gene list according to how differential the genes are. For example, you do your two-sample t-test and rank by p-value, so the most significantly differential genes are on top, and the further down you go, the less significant they become. So, you have a ranked list of genes, and then you take your pathway of interest, just one pathway again, and look where in the ranked list of genes the members of this pathway fall. All of these horizontal bars are locations of members of this gene set in my actual data.
And here, in the upper part of this plot, you see that there are many more horizontal bars than down here. So, just visually, you can see that there is an enrichment of this pathway among genes that are differential between A and B. That is the principle.

So, how do you calculate that? They calculate a so-called enrichment score, or ES. We have now transposed this matrix, so we are looking at the ranks on the horizontal axis here, from high to low. We calculate the enrichment score by walking down this list. Whenever we observe a member of the pathway in our data set, we increase a running-sum statistic, and if we do not observe a member, we decrease it. This builds up this kind of mountain plot. At some point you do not observe enough members of this pathway anymore, meaning the running-sum statistic stops increasing and constantly decreases. Then there are different ways to quantify this enrichment score: one is to take the maximum deviation here, or you can also calculate the area under the curve, and things like that. So, there are different nuances to this type of analysis, but that is the general principle.

Just to mention: this is one enrichment score. In order to calculate the p-value for that enrichment score, you would repeat this analysis 1000 times; you would do 1000 permutations in which you permute your class labels. You shuffle your class labels and repeat the analysis to get a background distribution, and from that distribution you can calculate an empirical p-value for your observed enrichment score. All right.
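The running-sum walk and the label permutation described above can be sketched as follows. This is a simplified illustration and not the published GSEA implementation: the running sum is unweighted (every member counts equally), the enrichment score is the maximum deviation from zero, and genes are ranked by the difference of class means instead of a t-test. The toy data matrix, the gene names and all function names are hypothetical.

```python
import random
from statistics import mean

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum statistic; returns the maximum deviation from 0."""
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up on a pathway member, step down otherwise.
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def rank_genes(data, labels):
    """Rank genes from high to low by mean(class A) - mean(class B)."""
    def score(values):
        a = [v for v, lab in zip(values, labels) if lab == "A"]
        b = [v for v, lab in zip(values, labels) if lab == "B"]
        return mean(a) - mean(b)
    return sorted(data, key=lambda gene: score(data[gene]), reverse=True)

def permutation_p_value(data, labels, gene_set, n_perm=1000, seed=0):
    """Empirical one-sided p-value obtained by shuffling the class labels."""
    rng = random.Random(seed)
    observed = enrichment_score(rank_genes(data, labels), gene_set)
    shuffled = list(labels)
    at_least_as_high = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        permuted = enrichment_score(rank_genes(data, shuffled), gene_set)
        if permuted >= observed:
            at_least_as_high += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return observed, (at_least_as_high + 1) / (n_perm + 1)

# Hypothetical toy matrix: 8 genes (rows) x 6 samples (3 x A, 3 x B).
labels = ["A", "A", "A", "B", "B", "B"]
data = {
    "g1": [5, 6, 5, 1, 2, 1], "g2": [6, 5, 6, 2, 1, 2],
    "g3": [5, 5, 6, 1, 1, 2], "g4": [3, 1, 2, 2, 3, 1],
    "g5": [1, 3, 2, 3, 1, 2], "g6": [2, 2, 1, 1, 2, 3],
    "g7": [2, 3, 3, 3, 2, 2], "g8": [1, 1, 2, 2, 2, 1],
}
pathway = {"g1", "g2", "g3"}

observed_es, p_value = permutation_p_value(data, labels, pathway)
print(f"ES = {observed_es:.2f}, empirical p = {p_value:.3f}")
```

With only three replicates per class there are just 20 distinct ways to assign the labels, so the empirical p-value in this toy example cannot become very small. This is one reason why the approach needs a sufficient number of replicates in each phenotype.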
So, that is the general principle of GSEA: you compare two phenotypes, and you have measured a sufficient number of replicates, like biological replicates, in your phenotypes A and B. Another approach to this kind of gene set enrichment analysis is the so-called signature projection method, or single-sample GSEA (ssGSEA), which works on a single-class data set. There you do not necessarily want to compare two conditions, or you might not have any replicates; you just have a single data vector. This data vector can be anything. It can be expression values in a single condition: you have one experiment, you measured the proteome at one time point, let us say, and you want to look at high-abundant and low-abundant proteins in general. Or you can look at RNA-protein correlation coefficients; your ranking would then go from genes that are highly correlated between proteome and RNA space down to low-correlated or negatively correlated genes, and so on and so forth. So, that is the principle.

Basically, if you have a data matrix, let us say a gene-centric matrix, you project each column into pathway space. We will do that during the hands-on session, and I hope that this will become a bit clearer, but that is why it is called a signature projection method. You start with your genes or proteins, you apply this method, and in the end you have the same kind of matrix, but instead of looking at genes you are looking at pathways. With this matrix you can again do all kinds of statistical analysis, like supervised or unsupervised marker-selection type of analysis, and so on and so forth.

I hope you have learned how pathways provide information on combinations of correlated genes that form a network to make a system functional. You also learned that WikiPathways is like Wikipedia for pathways.
We also learned about the hypergeometric test, which is closely related to Fisher's exact test, where one compares clinical conditions with healthy individuals and tests pathways for over-representation among the differentially expressed genes. The other pathway enrichment approach, gene set enrichment analysis or GSEA, includes all the proteins in the study, without filtering, when mapping them to a pathway. The next lecture will be the continuation of pathway enrichment by Dr. Krug. Thank you.