So, for the next approximately 15 minutes, we're going to learn about the statistical tests that are at the basis of pathway enrichment analysis and, in general, of finding over-represented pathways in gene lists. And Vernick can help me if I miss a point, since you've also given this lecture. So, the basic objectives of this module are to learn about the statistics behind pathway enrichment analysis; to be able to select the appropriate enrichment test out of the multiple that are available; to understand what the concept of a background gene list is when running Fisher's exact test, also known as the hypergeometric test; to be able to compute a minimum hypergeometric test on a ranked list, so we'll get into exactly the difference between a ranked list and a regular gene list; to be able to determine when you need to do multiple testing correction and what type of multiple testing correction you want to use; and to be able to select whether to use a Bonferroni-corrected p-value or a false discovery rate. And then also to explain in plain language how you calculate each correction. Okay, so this lecture will cover an introduction to enrichment analysis. We'll focus on the hypergeometric test, as I said, also known as Fisher's exact test, or vice versa. And also the different type of statistics used for a ranked list: the ones in the GSEA software, the Gene Set Enrichment Analysis software, and what's called the minimum hypergeometric test, although it has different names. And then we'll also cover multiple testing correction, including Bonferroni correction and the false discovery rate computation using the Benjamini-Hochberg procedure. Okay, so this lecture is going to be fairly focused on statistics. And then after lunch, officially we have in the schedule that you can do a lab, and for the lab we've scheduled an hour and a half to try out all of these things.
And that's where you can use your own gene list, you can use other gene lists that are available, and then you can actually try all of these things out. If we happen to end earlier with this lecture, then we'll just make more lab time before lunch, and we can get started on it, okay? Okay, so we've talked a lot about gene lists this morning, and one of the major types of gene lists, as I mentioned multiple times, comes from gene expression analysis. And the sort of typical example is that you identify genes that are differentially expressed in your condition compared to control. And in the old days, people basically would set an arbitrary threshold to, people might still do this now, but like when gene expressions analysis started out, the only way that people analyzed their genes or defined a gene list was to set a threshold on the expression level change. So in gene expression analysis, you have a set of experiments that you've run on your condition of interest, and a set of experiments that you've run on your control, so you have multiple replicates ideally. And then you can use a statistical test to identify whether a gene expression level is differential, significantly differentially, or let's say the gene is significantly differentially expressed between the two good conditions, your condition and the controls. So usually the way that that works is you have, because you've done multiple replicates, you have a set of numbers associated with gene A in condition A, and a set of numbers associated with gene A in your controls. And you can compare, if you look at the distribution of those numbers, you might see a normally distributed set of numbers and another normally distributed set of numbers. And it's not easy to use a whiteboard here, but you can, many of you are probably familiar with a t-test, and you can, the t-test is a statistic that sort of measures how significantly different two distributions are from each other, assuming normal distributed data. 
And so usually you see these two plots, like normally distributed curves, and if they're on top of each other, the gene is not differentially expressed. And the farther they are from each other, the more differentially expressed it is, and you get a stronger and stronger p-value as a result of a standard test like the t-test. There are lots of different tests like the t-test for different types of statistical assumptions, like non-normally distributed data. But the idea is that people used something like a t-test, and then they had a threshold: they said, we will take a gene to be differentially expressed if its t-test p-value is less than 0.05. And in addition, they added another criterion of expression level change. So sometimes, especially if you have a lot of replicates, you can get genes that are differentially expressed statistically, but the numbers are not very far from each other. Their expression levels might still be quite close. And so people said, oh well, we also want the expression levels to differ two-fold in condition versus control. So in the condition the gene needs to be at least twice as highly expressed, or twice as low, over-expressed or under-expressed, compared to control. Now, can anyone tell me why it might not be a good idea to use cut-offs like that? So the issue is that this expression change cutoff especially is arbitrary. You can't write in a paper, "we chose two-fold expression because," and then have a biological explanation of why two-fold expression is better than one-fold expression or 1.9 or something.
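As a minimal sketch of the classic thresholding approach just described (a t-test p-value cutoff plus a fold-change cutoff), the Python below uses SciPy's `ttest_ind` on made-up replicate values. The gene names, the expression numbers and the helper `classic_threshold_call` are hypothetical and only for illustration; the 0.05 and two-fold cutoffs are the ones mentioned in the lecture.

```python
import numpy as np
from scipy import stats

# Hypothetical replicate measurements: (control replicates, condition replicates).
expression = {
    "gene_a": ([10.0, 11.0, 9.0, 10.5], [20.0, 22.0, 19.0, 21.0]),
    "gene_b": ([10.0, 10.2, 9.8, 10.1], [19.0, 19.2, 18.9, 19.1]),
    "gene_c": ([10.0, 14.0, 6.0, 10.0], [11.0, 9.0, 13.0, 10.0]),
}


def classic_threshold_call(control, condition, p_cutoff=0.05, fold_cutoff=2.0):
    """Apply a t-test p-value cutoff and a fold-change cutoff to one gene."""
    control = np.asarray(control, dtype=float)
    condition = np.asarray(condition, dtype=float)
    # Two-sample t-test, which assumes roughly normally distributed values.
    _, p_value = stats.ttest_ind(condition, control)
    # Fold change as the ratio of mean expression levels.
    fold_change = condition.mean() / control.mean()
    passes_fold = fold_change >= fold_cutoff or fold_change <= 1.0 / fold_cutoff
    return {
        "p_value": float(p_value),
        "fold_change": float(fold_change),
        "selected": bool(p_value < p_cutoff and passes_fold),
    }


results = {
    gene: classic_threshold_call(control, condition)
    for gene, (control, condition) in expression.items()
}
for gene, call in results.items():
    print(gene, call)
```

In this toy data, `gene_b` has a very small p-value but a fold change of about 1.9, so the two-fold cutoff drops it, which is the kind of arbitrariness being discussed.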
So there's no way to really know what the right expression level change is, unless you had a really bimodal distribution in your data or something, with no data between, you know, one-fold and eight-fold, and you could say maybe we could just split them, because the data actually says that there are two different things. But in general it's pretty arbitrary, and so thousands and thousands of papers were published, especially in the beginning of the history of the use of genomics for measuring gene expression, that used this threshold. It was basically just copying what other people did, and for the first people who did it, it was just a convenient thing to do it that way, because they didn't have statistics that would help them avoid that. So sometimes it is possible to generate data from a genomics experiment that gives you a gene list where you know exactly what genes should be on that list. I gave some examples this morning. For instance, a protein interaction screen. You identify a set of proteins that bind to your protein of interest. It's just those proteins. There are no extra proteins that could be found. Those are just the ones that are identified. Or any kind of similar test where you're looking at interactions, like molecular interactions. Often the experiments will tell you exactly what's interacting. I mean, there might be some confidence measures that you can tweak to get slightly different answers, but those won't be on the level of the interaction. They'll be on the confidence of identifying the protein or something like that. For gene expression data, though, it's not really natural to just arbitrarily define a cutoff in the expression fold change. So ideally what you'd want to do is not do that. And what you're left with, if you don't do that, is a list of all of the genes and their differential expression level.
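A minimal sketch of keeping every gene with its differential expression score instead of cutting the list, assuming a small made-up expression matrix with three replicates per class. The gene names and values are hypothetical, and the t-statistic is only one possible score to sort by (a log ratio would work in the same way).

```python
import numpy as np
from scipy import stats

genes = ["gene_a", "gene_b", "gene_c", "gene_d"]

# Hypothetical expression matrices: one row per gene, one column per replicate.
control = np.array([
    [10.0, 11.0, 9.0],
    [50.0, 52.0, 48.0],
    [30.0, 29.0, 31.0],
    [8.0, 9.0, 10.0],
])
condition = np.array([
    [20.0, 22.0, 21.0],
    [25.0, 24.0, 26.0],
    [30.0, 31.0, 29.0],
    [10.0, 11.0, 9.0],
])

# Per-gene t-statistic (positive means higher in the condition) and log2 ratio.
t_stats, p_values = stats.ttest_ind(condition, control, axis=1)
log2_fold_change = np.log2(condition.mean(axis=1) / control.mean(axis=1))

# Sort from most up-regulated to most down-regulated; no gene is discarded.
order = np.argsort(-t_stats)
ranked = [(genes[i], float(log2_fold_change[i]), float(t_stats[i])) for i in order]

for name, log_ratio, t_stat in ranked:
    print(f"{name}\tlog2 ratio={log_ratio:+.2f}\tt={t_stat:+.2f}")
```

Every gene stays in the output, including the ones with essentially no change in the middle, so nothing depends on a cutoff value.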
And that ends up being a really big list; for an RNA-seq experiment, it's a lot of genes, the whole genome. And sometimes you can get a whole bunch of other genes on there. So you've got all the protein-coding genes. You might get lincs, long non-coding RNAs, and other types of RNAs. You probably won't get short RNAs, because there are different RNA-seq procedures for that, although some of the new RNA-seq technologies are, I think, combining everything so you get all of the different RNA molecules. But the natural way of representing that data is just this big long list, and you can rank the list by the differential expression. So you can compute the differential expression, and if you have replicates you can also compute a p-value. Technically, it's okay to filter on the p-value. You can say, I'm going to take only genes that are differentially expressed with a t-test p-value of less than, you know, 0.05. But you actually don't need to do that. And if you don't do that and you work with the whole data, you can make use of all of the information. The statistical tests have actually been around for a long time, but over time people learned how to use them effectively in genomics. There are statistical tests that allow you to use all the data without setting any thresholds. And that's a lot easier to write about in a paper: we just took the data and we analyzed it. You don't add an extra step, like, we took the data, we filtered it in these different ways, and then we analyzed it, because people could always come back and say, well, if you change this number, are you going to get different results? So that's the idea of a ranked list. So this lecture really talks about two things.
Statistical tests that are based on gene lists, which is really everything I was talking about this morning. And then an additional concept, which is this concept of the ranked list, where you have all of your data that comes out of the experiment, and usually, for an RNA-seq expression experiment, that's the whole genome. You know, 20,000 genes in the human genome, for instance. Okay, so just keep in mind that there are two types of gene list that we're talking about: the gene list and the ranked gene list. A gene list is just a set of genes, and the ranked gene list is ranked according to some value, some score, like differential expression. Okay, so enrichment analysis of a gene list is very much of the type that I introduced this morning. It answers the question: are there any gene sets surprisingly enriched or depleted in my gene list? And the typical statistical test that's used is Fisher's exact test, also known as the hypergeometric test. For anyone who knows about statistics, you could also use the chi-square test. The chi-square test is the test that we actually learn about in high school and undergrad as the test for looking at differences in the distributions of two categorical variables. And Fisher's exact test is similar. The reason why we use Fisher's exact test is that the chi-square test actually doesn't work well with small numbers. But it was the one that everybody used for many years because it was easier to calculate. Before computers, Fisher's exact test was hard to calculate because it has factorials in the equation, so you have to work with big numbers. But computers calculate it very quickly now, and so it pretty much should be used all the time in that particular case. And this hypergeometric idea is that it follows the hypergeometric distribution, which is a distribution related to categorical, you know, discrete values.
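A minimal sketch, assuming SciPy is available, of why the two names are used interchangeably: the one-sided Fisher's exact test on a two-by-two table gives the same p-value as the upper tail of the hypergeometric distribution. The counts (a background of 5,000 genes, a 500-gene pathway, a 5-gene list with 4 pathway genes) are illustrative assumptions, and the variable names are hypothetical. The last lines print the chi-square expected counts, to show the small-number situation in which the chi-square approximation is usually considered unreliable.

```python
from scipy import stats

# Illustrative counts (assumptions for this sketch).
background_size = 5000  # all genes the experiment could detect
pathway_size = 500      # genes in the pathway gene set
list_size = 5           # genes in my gene list
overlap = 4             # gene-list genes that are in the pathway

# Rows: in list / not in list. Columns: in pathway / not in pathway.
table = [
    [overlap, list_size - overlap],
    [pathway_size - overlap,
     background_size - pathway_size - (list_size - overlap)],
]

# One-sided Fisher's exact test: is the overlap larger than expected?
_, fisher_p = stats.fisher_exact(table, alternative="greater")

# Same question as a hypergeometric tail: P(overlap >= observed).
hypergeom_p = stats.hypergeom.sf(overlap - 1, background_size,
                                 pathway_size, list_size)

# Expected counts under independence, as used by the chi-square test.
_, _, _, expected = stats.chi2_contingency(table)

print(f"Fisher's exact p-value:  {fisher_p:.2e}")
print(f"Hypergeometric p-value:  {hypergeom_p:.2e}")
print(f"Smallest expected count: {expected.min():.2f}")
```

With these counts the smallest expected cell count is 0.5, well under the usual rule of thumb of about 5, which is the small-numbers case where the exact test is the safer choice.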
One of the discrete value distributions. So okay and then the so keep in mind that with the gene list you're wondering if there are gene sets surprising like another set that's enriched or depleted in my set. Just a note about nomenclature so in this course we've chosen to use in this workshop we've chosen to use the term gene list to be the list of genes that you're interested in. And gene set is the pathways or other things that are set we have a database of sets. We just separated those because it's they're actually both sets but if we say gene set it's hard to talk about them if we don't kind of separate them. So it's not official terminology although a lot of people kind of use that but gene list is just thinking about the list of your list of genes and gene sets are you know pathways a pathway gene set or any other kind of gene set. Okay the rank list answers the types of statistical tests on the rank list answer the questions the question are there any gene sets ranked surprisingly high or surprisingly low in my rank list of genes. And we'll talk about that but in more detail but basically you're looking for if you have a set of genes that are differentially expressed. Say you have 20,000 genes in the human genome and they're ranked by over expression to under expression in the middle of the list it's equal expression between cases and controls so there's no differential expression in the middle. 
If I have a pathway gene set that I want to see if it's enriched in my list, I look at the genes in my big ranked list that are in that set. Say I look at all the cell cycle genes in my big long ranked list. If they're spread randomly everywhere, that probably means that the cell cycle is not relevant for my experiment. However, if the cell cycle genes are all at the top of the list, like all the highly expressed genes are cell cycle genes and nothing else, it's just cell cycle all at the top, then it's like, wow, that's very unusual. By random chance I'd just expect the genes of a pathway to be spread out all over the list, and now I have this really non-random pattern of everything bunched up at the top of the list. And anything that I say about the top of the list, you can also think about for the bottom of the list. So everything bunched up at the top is as statistically significant as everything bunched up at the bottom. You can also have things that are bunched up partly at the top and partly at the bottom; those are actually just treated separately, so we think of the top of the list as one thing and the bottom of the list as another thing. There could be questions about the interpretation of things when you see that kind of pattern, but let's just keep it simple, just to explain the statistical test.
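To make the "spread out versus bunched up at the top" picture concrete, here is a minimal sketch that uses the Wilcoxon rank-sum (Mann-Whitney U) test from SciPy as a stand-in rank-based test. It is not the minimum hypergeometric test or GSEA; it only illustrates the intuition. The list length of 20,000, the pathway size of 100 and the helper `rank_sum_p_value` are assumptions for the example.

```python
import numpy as np
from scipy import stats

n_genes = 20000      # length of the ranked list (rank 1 = most up-regulated)
pathway_size = 100   # hypothetical pathway gene set size
all_ranks = np.arange(1, n_genes + 1)


def rank_sum_p_value(pathway_ranks):
    """One-sided test: do pathway genes sit nearer the top than other genes?"""
    in_pathway = np.isin(all_ranks, pathway_ranks)
    result = stats.mannwhitneyu(all_ranks[in_pathway],
                                all_ranks[~in_pathway],
                                alternative="less")
    return result.pvalue


# Pathway genes spread evenly over the whole list (every 200th position).
spread_out = np.arange(100, n_genes + 1, 200)
# Pathway genes all bunched up at the very top of the list.
bunched_top = np.arange(1, pathway_size + 1)

p_spread = rank_sum_p_value(spread_out)
p_top = rank_sum_p_value(bunched_top)
print(f"spread over the list: p = {p_spread:.3f}")
print(f"bunched at the top:   p = {p_top:.3e}")
```

Testing the bottom of the list is the same idea with `alternative="greater"`, in line with treating the top and the bottom as separate questions.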
The statistical test is looking for a set of genes that's bunched up at the top of the list or at the bottom of the list, okay? The tests that we're going to cover are the minimum hypergeometric test and GSEA; they're actually two different tests, and there's a bunch of others, but we don't cover them. Okay, so let's start with this first one, the gene list. So again using the gene expression experiment, this is the general idea of the enrichment test, which I talked about this morning. We have our omics data set, and it generates a list of some sort; it could be a gene list that is discrete, or it could be a ranked gene list. We run our enrichment test, we have our gene set databases, and I like calling these pathway gene sets, and then we get enriched pathways as a result, or a set of enriched gene sets. So here's the spindle pathway and the apoptosis pathway, and they get some kind of score. This is the enrichment score; it's the value that comes out of whatever statistic you choose for this, and when we go through this in the lab you'll see that there are actually multiple scores computed, so we'll talk about those. Okay, so getting back to the basics of gene list enrichment analysis: given a gene list, like these genes, which happen to be yeast genes, we want to look for Gene Ontology annotations, or take any annotations, and we want to know if they're surprisingly enriched in the gene list. And as we discussed this morning, we want to know where the gene list comes from, and as we're talking about now, we want to assess how surprising this pattern is, and then also, what we'll talk about now, how to correct for repeating the tests, because if you keep repeating the test forever, you're always going to get the answer that you want, right? So you have to correct for that somehow. Okay, so the standard design for
generating the gene list is what we call a two-class design. This is what I've always mentioned, cases versus controls, condition versus controls, but you can just generally think about it as class one and class two; it doesn't have to be controls. For instance, for ependymoma we had type A and type B, and so we compared one type versus the other. And then you can generate different statistics that help you rank your list, like the t-test I mentioned. You could just look at the ratio of expression values, frequently called the fold change; the grammar doesn't totally make sense in that term, but it's basically the ratio of expression levels in two conditions. You can look at the log of that ratio, and that is frequently done. The t-test is used if you have continuous data, and for microarrays people also like to use this thing called Significance Analysis of Microarrays. These days people are using RNA-seq, and RNA-seq is a bit different from microarrays because you're basically counting the number of transcripts that you see in the RNA-seq experiment. So when you do your experiment, you're basically sequencing transcripts, but you don't sequence all the transcripts; you sample the transcripts, and you're using a DNA sequencing technology to read them. DNA sequencing technologies these days only read short reads, up to like 100 or 150 base pairs. Ideally you want to sequence every transcript, and you want to do the whole thing, but we don't have technology that does that these days for every transcript, so we're forced to work with these short reads. And then you take the short reads and you align them to the genome, and you use a genome reference that exists; for human, people usually use the GENCODE definition of where genes start and end. And then, you know, as these reads build up on the
transcript, they basically get counted, and so you get counts, like one transcript count, or a hundred, or a thousand. So you're left with these counts, and often you can have zeros. The particular statistics relating to that have been modeled, and you can't really use a t-test, so people use different types of statistical tests. A popular example is in a package called edgeR, and I'm just again forgetting the name of the statistical test that they use there, but they model the data with a distribution that's not normal. I can look it up if anyone's interested. Has anyone used other types of differential statistics for their RNA-seq data? You had a question? Yeah? "Okay, so when you're calculating a t-test, or whatever this other test is, are you talking about biological replicates?" You're talking about replicates, yeah. So sorry, I should mention that this expression matrix here is a set of measurements, a set of RNA-seq data points, for one class, and a set of RNA-seq data points for another class. If you just have one RNA-seq for your class one and one RNA-seq for your class two, you can't compute statistics, because all statistics are based on some estimate of variance, basically, which helps you understand the expectation. So if you can't estimate variance, because you only have one or two, and with two it's hard to estimate variance, then you can only do something like a ratio of numbers. And you'll have to be aware that sometimes that ratio will give you very high numbers even though it's not really significant; it's not something you want to look at. So for instance, with a big number versus a small number, you'll have problems with that. Statistics help you deal with that, so it's best to have replicates, but
especially in the beginning of any technology, replicates are usually expensive, so people didn't include replicates. In the beginning of microarrays nobody had replicates, then everybody had replicates. With RNA-seq, in the beginning they didn't have replicates; now everybody has replicates. Single-cell RNA-seq has basically no replicates right now, but next year there might be replicates. So it's just a matter of cost. So it is possible to compute your differential statistic without replicates, using a ratio or some other thing, like a difference, whatever you actually want to make up. As long as it's assessing all the genes at the same time, uniformly, then you can rank them by that and you can do the analysis. And one of the nice things about pathway analysis is that it's not necessarily sensitive to statistical problems you have with measurements of individual genes, because you're looking for a pattern of many genes. If all the genes are going up in the same direction, or down in the same direction, that's a pattern that's more difficult to achieve randomly. So there are some benefits of pathway analysis for dealing with noise in the data, but you do have to be aware that if you don't have replicates, you're going to have a problem with noisier data, basically. Does that make sense? Did I answer your question? Another question? "So, like the Wilcoxon-Mann-Whitney rank-sum test, is that what you're talking about?"
I think that's probably the one. So that's a rank test that's similar to these rank tests that I mentioned for ranked lists. So yes, in general we recommend using a ranked statistic if you have a ranked gene list. Basically the main recommendation is to avoid arbitrary cutoffs if you can, and the use of a rank-based test, which I'll explain in more detail, will avoid you having to make a threshold. And in particular the threshold is here. So each of these columns here, which you can't really see, is a different experiment. So you have replicates here, a bunch of replicates for blue and a bunch of replicates for red, and you can compute the differential expression between blue and red. Genes at the top of this list are more expressed in red than blue, and genes at the bottom of the list are more expressed in blue than red, and so this is the ranked list that we like to use. Now, the threshold is basically like setting a line here and saying everything above this threshold is up in red, and everything below is down in blue, and I'm going to ignore everything in the middle. So we recommend, if you can avoid it, not making this threshold. If you do make this threshold, you can use the gene list statistics; if you don't make this threshold, use the ranked statistics. We prefer the rank statistics, and what you mentioned, I think, is a type of rank statistic. Okay, does that make sense? Yeah? "Going back to the question of replicates: if you have a small population of cells, you might have replicates in terms of the biological experiment, but then you have to pool whatever you have to get the data that you use. So would that take care of the replicates problem?" So the question is, does pooling help with replicates? It doesn't help ultimately, but it does help kind of create
an average of your data. So if you pool different samples, and there are definitely many people who have used that approach, it helps kind of smooth things over, right? And smoothing things over will avoid problems where we have random fluctuations that cause big changes in things like ratios. There are a couple of additional things to be said about replicates that you reminded me of. One is that replicates are more important the more variance you have in your data; the noisier your data are, the more you need replicates. So if you have data that's very clean, you need fewer replicates, and you usually don't know that. To actually figure out how many replicates you need, you have to do a power analysis, which people have talked about, but usually power analyses come with a lot of assumptions, and the best way of doing it is actually to do an experiment to figure out what the variance is, and then you can do a power analysis. Most people don't do power analyses for gene expression data, because we usually don't know the variance. It's possible to do that, and there are people who do this, but it ends up being an estimate. But there are a few general guidelines. Certain types of data are known to be less variable than others. So cell lines are less variable than patient populations, right? Especially if you're taking something from different times, different ages, different ethnic groups, and people eating different things, you'll get more variability compared to lab-controlled experiments. So things like that you can kind of get a sense for. Or if you have clonal populations of whatever organism you're studying, you expect that to be less variable than wild-type, field-based populations. So that can help a little bit. Okay, any other questions? Okay, so time course design. I need to make sure that I cover
all the statistics, so just quickly, here's another example of an experimental design for gene lists. So here we have a time course: we have different expression profiles measured at different times, and we cluster them, and you might find that certain genes follow certain patterns, and each cluster defines a list. So this is another way of creating a list. We didn't create a threshold; we found genes that are similar to each other, and that creates a list. Okay, so getting to the actual statistics. We have a gene list enrichment test, where we've somehow defined a list, and this is not going to be a ranked list. So we have our list of genes that are upregulated, just for purposes of discussion. And this is the kind of old way of doing it that I mentioned, but although we're still using this example of gene expression, you can still think about it being valuable when you cluster, or have other ways of defining a list, where you really want a list. So here, just to explain a few concepts: here's the threshold that we applied in this case to define our gene list, and this whole rectangle is the background. It's all the genes that are, in this example, on a microarray, but it's basically the list of any gene that you can hope to recover in your experiment. Some experiments don't find certain types of genes; those are not part of the background. So the background, sometimes called the universe, is all the genes that your experiment could possibly identify. So just keep that in mind; that's an important concept. And then we have our gene set database, or our pathway gene set database. As I mentioned before, the statistical test looks for overlap of the gene set with the gene list, but it's more than that: it also considers the overlap with the background. So basically we have a few numbers: the size of
the gene set the size of our gene list and the expected and also the overlap of of the gene set to our gene list and the overlap of the gene set to the to the universe of all possible genes we could think about so on an RNA-seq experiment it's like universes like all the genes in the genome so you know this is similar to the example I mentioned before where we have where we have you know the cell cycle half of our genes are cell cycle but only five percent of the genome is cell cycle so we have much more enrichment of cell cycle genes so the actual way that the the test is carried out is this hyper geometric test considers those four numbers the overlap the size of each gene set your list and the set and the overlap of the your gene set with a pathway sorry your gene list with a pathway that's why I like to use pathways because then it gets away from this idea of gene set so our gene list is overlapped with the pathway there's some number of common things and the pathway is overlapped with the whole genome or the the background and the output of the enrichment test is a p-value so the p-value assesses the probability that the overlap is at least at least as large as observed by random sampling the universe so if I just take my gene list which say it has 100 genes and I randomly pick 100 genes from the genome thousands or millions of times what's the the p-value assesses is supposed to assess likelihood that I get the amount of overlap that I see by chance and and it's at least as large as observed so you could have that amount of overlap or more overlap so that's does that make sense what the p-value means okay so general recipe for gene list enrichment test is to define your gene list and your background list often the background is like the whole genome unless you're working with an experiment that doesn't can't capture the whole genome select your gene sets to test for enrichment that's the pathways so find define some pathways previously I recommended 
Gene Ontology biological process, and then run the enrichment tests and correct for multiple testing if necessary. Usually it is necessary, because each of these tests just does one pathway at a time. So say I'm looking at Gene Ontology biological process and there are a thousand biological process terms that I'm considering. I run this test once for each of those terms, and each time I run it I get a p-value, so I get a thousand p-values, and the pathways with significant p-values are the ones that are enriched. Now, one of the problems with that is that because I've done a thousand tests, there's a chance that I could get some good overlap by chance, because the more I do these tests, the more likely it is that I could get that. So you have to correct for that, as I mentioned, and I'll talk about it in more detail. And then interpret your enrichments, which we'll talk about more this afternoon, and then publish, yay. Okay, so the possible problems that I mentioned with the gene list idea are that there's no natural value for the threshold (there could be a natural value, in which case you're welcome to use it), which could lead to different results at different threshold settings, and you could potentially lose statistical power due to thresholding, because you might have weak effects that combine to make a strong effect. So maybe none of the genes in your pathway are really strongly differentially expressed, but they're all differentially expressed a little bit, and all in the same direction, and that is a signal that's statistically significant, and it's unlikely that that would occur by chance. And so you'd miss that if you thresholded it. So one of the advantages of the ranked list idea is that it can take all the information that you have, and that's a good advantage. It takes the weak signals as well as the strong signals. Okay, so the ranked list idea. So again we take this experimental type, so we
have multiple replicates in class A and class B, and we computed our differential statistic to get a ranked list. Again, each row here is a gene; genes at the top are upregulated in red, and genes at the bottom are upregulated in, say, the blue class. And then we use one of these two statistical methods to compute the p-value, and you don't need to choose a threshold. So that's the idea I've been hammering. Okay, so the recipe for a ranked list is slightly different: step one is you rank your genes, and then after that it's the same. Select your pathways, run the enrichment test, correct for multiple testing, interpret your enrichments, and finish. Okay, so the theory component. The hypergeometric test is used for calculating enrichment p-values for gene lists; GSEA and the minimum hypergeometric test are for computing enrichment p-values for ranked lists. The multiple testing corrections that we'll cover are Bonferroni and Benjamini-Hochberg FDR. There are others, but these are the two that you frequently use. Okay, so the hypergeometric test, also known as Fisher's exact test. The null hypothesis, in statistical terminology, is that this gene list is a random sample from the population. The alternative hypothesis that we're trying to test for is that there are more black genes, in this case, than expected. So let's say we have a bunch of genes, and there are 500 black genes in the genome and 4,500 red genes in the genome, and my list has one red gene and four black genes. Now, given this, is that statistically significant? So the hypergeometric test models the probability of seeing zero black genes in the list, one black gene in the list, two black genes in the list, three, etc., up to five, right? So getting five out of five is very unlikely; getting
zero is more likely, because there are many more red genes than black genes. So if we have mostly red genes in this bucket and we're just picking genes, we're most likely to get red genes, right? And sometimes we'll get black genes, according to this ratio. Okay, so the answer to this question of how likely it is that we get four is the sum of the probabilities of four and five. Remember, I said it measures the overlap, how many black genes there are, and black genes are like genes in a pathway, like cell cycle. So we got four out of five black genes here, and the probability that we get that is basically the combination of seeing four out of five or five out of five, and so that's where the "at least" comes from. So the probability of seeing this is 4.6 times 10 to the negative 4. This is basically looked up, or computed automatically, from the equation that is behind Fisher's exact test, and that's the p-value. The whole distribution here is the null distribution, and this is what's assumed by this test. You can also think about it as the two-by-two contingency table, so you compute Fisher's exact test from the two-by-two contingency table. So again you have these four numbers: genes that are in the pathway versus not in the pathway, and in your list versus not in your list, and these numbers go into computing it. You can look up the formula online. It's not that complicated, but we're not presenting it here. Okay, so just a few details. The test for under-enrichment of black is the test for over-enrichment of red, so you can switch those. You need to choose your background population appropriately. I talked about that. Again, it's the list of genes that you could possibly identify in your experiment, and you need to
consider when your experiment is very biased, and not include the genes that it can't see in the background. And you can't do all of your tests with all the pathways in one go. The test for enrichment of more than one independent type is different; you have to apply Fisher's exact test separately for each type. Okay, so other enrichment tests you could consider are the binomial or the chi-squared, I briefly mentioned that. And then the ranked list is what we're going to get into, and the Wilcoxon rank-sum is mentioned there, so is the Mann-Whitney U test, Kolmogorov-Smirnov, etc. Okay, so I covered gene lists and we went over Fisher's exact test and how it works. Now, ranked gene lists. Okay, so one test for a ranked gene list is the minimum hypergeometric test. This is very simple. You take your ranked gene list and you set a threshold. I said don't set the threshold, but in this case you actually set a threshold, and I'll tell you why in a sec. So then you compute Fisher's exact test, and then you change the threshold to make it a little bit more permissive, and then you compute Fisher's exact test again, and you change it again. You actually go through all the thresholds. So in that case you are actually trying all the thresholds, and you see if any of those thresholds gives an enrichment, and if at least one does, you say it's enriched. And you have to correct for multiple testing within that, because you're doing lots of tests. So that's sometimes used. The g:Profiler tool that we cover in the class uses this idea of the minimum hypergeometric test. I'll go through the advantages and disadvantages in a sec. Okay, the second idea is the GSEA test, and the GSEA test uses a particular type of statistics, which I'll explain. So remember the rank-based statistics, like when there's a bunch of pathway genes at the top of the list, all bunched up at the
top, compared to randomly spread across the list. Okay, so the way that it computes this is it goes down the ranked list one by one, and it asks: is this gene, the top gene in the list, in the pathway, yes or no? If it's in the pathway, the score goes up. And actually, if it's not in the pathway, the score goes down. That's what these red bars mean. So each vertical line here represents a gene. This whole long thing is a ranked gene list; they're ranked by differential expression. If it's red, then it means that the gene is part of the pathway. And we just go through one by one: the score goes down if the gene is not part of the pathway, and it goes up if the gene is part of the pathway. So if everything's bunched up at the top, the score will go up quite a lot. If everything's spread out over the list, the score will go up and down, but there won't really be any peaks. So basically what the score does is it takes this maximum peak here and it says that's the enrichment score. Now, you have to do some more statistics to get a p-value from it, because this doesn't define a p-value. It's just a way of scoring pathways that are bunched up at the top of the list. Okay, so going from an enrichment score to a p-value, there's two ways that you can do this. The main way that works for GSEA is that you have to compute what's called an empirical p-value. An empirical p-value is a p-value that you don't compute with a simple statistical test. You have to compute it by going back to the basics of statistics and doing permutations. So you do random sampling, and you do enough random sampling that you can estimate, like, if I do 2,000 random samplings, how often do I see this enrichment score at this height by chance, right? And if
you never see it in 2,000 random samplings, it's very significant, because you can't even see it once if you do 2,000 or 5,000 random samplings. If you see it in half of the results, well, it's not very significant, because half the time you do a random sampling you see an enrichment score that high. Okay, so that's the idea. So you compute an empirical p-value for each gene set, and you generate a null distribution from the randomized data, and there's a few different ways that you can choose to do this. So this basically just says what I explained: you look at your real enrichment score to see how often a score that high appears in all these enrichment scores from random sampling. I'll tell you how to do the random sampling in a sec. Okay, so in this case the real enrichment score was seen in 4 out of 2,000 random samplings, so the p-value is 0.002, so it's 4 divided by 2,000, and that's it. That's how you compute an empirical p-value. Okay, so there's different ways you can do this, which we'll talk about, but one of the things you can choose is the number of times you do your sampling. The more you do this, the longer it's going to take to compute. That's the basic idea. So if you set that number really big and your computer is not that fast, you might be waiting hours. The bigger you set that, the more compute time it takes, because it has to do these random samplings, and it's doing the whole process 2,000 times in that case. Okay, so you don't have to do this for the minimum hypergeometric method, because the minimum hypergeometric method can just rely on a multiple testing correction, which I'll talk about. Sorry, one more thing that is not on these slides but is important to mention is that there's another way. Veronique, do you cover this permutation-based thing in the lab? Okay, so just briefly, given that you cover gene
set permutation. Okay, so in GSEA there's two ways of doing permutation. One is you can randomly select gene sets, to kind of randomly create pathways, and you repeat with random generations of pathways. Another way is, if you have cases and controls, you can randomly split up the cases and controls, and that mixes up your data so that the differential expression will be computed based on random assignments to the classes. In the lab we'll typically use the gene set permutations, the pathway permutations. It's a bit complicated to explain this, but originally GSEA was made for microarray data, and it had all the statistics inside of it to compute the differential expression, and so it could split up the class labels and recompute the differential expression and then redo everything. For RNA-seq it doesn't actually have that, and I don't know of a tool that makes available all the RNA-seq statistics. So RNA-seq statistics are typically handled in an R package like edgeR or one of the others, and then we load the ranked list into GeneMANIA and into GSEA. So then, if you wanted to do the permutation based on class labels, you have to do that somehow outside of GSEA, and that requires coding in R, which, if anyone's interested, we can cover, but we can talk about it offline. So don't worry about that too much. It'll be mentioned again in the lab, and you'll get more into it in the lab. Okay, so here's an example of what the GSEA plots look like. So this is a plot of a highly enriched gene set. In GSEA you can actually see these plots, and so you can see, oh yes, it's really enriched. Here's one that is not really enriched; it kind of goes up and down. And here's one that's depleted. So you can see how these red and blue colors correspond to these patterns here. Okay, multiple testing correction. That's the last topic here.
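Before the multiple testing material, the two kinds of enrichment test described so far can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind g:Profiler or GSEA: the function names, the toy gene names, and the unweighted up/down step sizes are all made up for this sketch, and the real GSEA score weights each step by the ranking statistic. The first function sums the hypergeometric tail for the 500 black / 4,500 red gene example; the other two compute a running-sum enrichment score and an empirical p-value by gene set permutation.

```python
import random
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(at least k pathway genes in a list of n genes drawn without
    replacement from N background genes, K of which are in the pathway)."""
    top = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, top + 1)) / comb(N, n)


def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list; step up on pathway genes, down otherwise.
    The enrichment score is the maximum peak of the running sum.
    Assumption: equal-sized (unweighted) steps, a simplification of GSEA."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, peak = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        peak = max(peak, running)
    return peak


def empirical_p_value(ranked_genes, gene_set, n_perm=2000, seed=0):
    """Gene set permutation: draw random gene sets of the same size and count
    how often their score is at least as high as the real one."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = len(gene_set & set(ranked_genes))
    as_high = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if enrichment_score(ranked_genes, random_set) >= observed:
            as_high += 1
    return observed, as_high / n_perm


# Lecture example: 500 black and 4,500 red genes; the list has 4 black out of 5.
p_list = hypergeom_tail(k=4, n=5, K=500, N=5000)

# Toy ranked list (hypothetical gene names) with a pathway bunched at the top.
ranked = [f"gene{i}" for i in range(200)]
bunched = set(ranked[:10])
es, p_ranked = empirical_p_value(ranked, bunched, n_perm=500)
print(f"gene list p = {p_list:.1e}, ES = {es:.2f}, empirical p = {p_ranked}")
```

The empirical p-value here is simply the count divided by the number of random samplings, as in the 4 out of 2,000 example, so raising `n_perm` gives a finer estimate at the cost of more compute time.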
Okay, so I mentioned that you have to correct for multiple testing, and I kind of mentioned why, so this is just an example of why. So, how to win the p-value lottery, part one. So take that example I had with four black balls and one red ball out of this set. If I keep on making random draws, sometimes I get four reds, sometimes I get five reds, sometimes I get four reds, and then, you know, 7,834 draws later, oops, I get one that's exactly as enriched, that has as few red balls as this. And we can expect a random draw with the observed enrichment once every one divided by the p-value draws. So if the p-value is 0.05, one divided by 0.05, that's how many draws you expect to make before you get a random draw that looks like what you have. So if you have a really, really good p-value, it's always going to be good, because if it's 10 to the minus 100, that's a really huge number of draws that you'd expect to make randomly before you get that pattern. But if it's 0.05, well, it's 20, right? So that's not that many, and if you're doing 5,000 pathways, then you expect to get pathways with a p-value of 0.05 by chance quite a few times, so you have to correct for that. So this is just the example with actual pathways. In this case you do the draw: this is testing for one pathway, this is testing for the next pathway, this is testing for the next pathway. That's what this slide is explaining, that it's not that you're doing the same test over and over again. You're doing one pathway, then the next pathway, then the next pathway. So it's not exactly the same as the simple example that I mentioned, because the pathways are different sizes, there's different numbers of red and black balls, so it's not easy to figure that out. So there are two major tests that people use: Bonferroni
test, how many people have heard of the Bonferroni test or used it? Okay. And Benjamini FDR, or FDR test? Okay, so fewer people with that. So Bonferroni is the simplest correction for multiple tests. You basically multiply the p-value by the number of tests. Very simple; that's why more people have heard of it. The actual definition of it is that the corrected p-value is going to be greater than or equal to the probability that one or more of the observed enrichments could be due to random draws. So the jargon for this in statistics is that it controls the family-wise error rate. So that actually ends up very stringent, because it says that out of all of our tests, we expect none of them to be present due to random draws. In practice it's very stringent, and so it can get rid of real enrichments, and often one is willing to accept a less stringent condition, like, I'm okay with five percent false positives, I'll deal with it, with the benefit of being able to get more signal out of my data. Okay, and so this FDR is a gentler way of doing it. So the false discovery rate is the expected proportion of the observed enrichments that are due to random chance. So again, this is like the percentage of false positives that I expect to see. And compare this to Bonferroni, which controls the probability that any of the enrichments is due to random chance. Okay, so Benjamini and Hochberg are the names of the people who developed this FDR correction method, and the result of this correction method is also often called the q-value, which you might see in various tools. So they convert the p-value to a q-value; the q-value is the multiple-test-corrected version of the p-value. Okay, so here's an example of how it actually works. So
you sort the p-values of all the tests in increasing order, so the best p-values are at the top, and then you basically multiply the p-value by the number of tests divided by the rank of the p-value. So 53 divided by 1: 1 means that this p-value is number one in the list, 2 means number two in the list, 3 means number three in the list. So basically you get a gentler correction as you go down the list, so in the end this gets just multiplied by one. So Benjamini and Hochberg created this idea, and the theoretical statistics behind it are what I mentioned. How it's actually computed is fairly simple. Okay, so the q-value, or the FDR, that corresponds to a p-value (or nominal p-value, as it's sometimes called) is the smallest adjusted p-value assigned to p-values with the same or larger ranks. I've never found this definition to be as useful as just thinking about this in terms of the expected false positive rate. But here's an example. So this p-value, 0.0031, meets the threshold of FDR less than 0.05, so it's the highest-ranking p-value for which the q-value is below the desired significance threshold. So this is, I guess, explaining how to use the q-value. You compute your q-values and you choose a threshold, like I'm okay with 5% false positives or I'm okay with 10% false positives, and you cut the q-values at that point, and then the corresponding p-value is the worst p-value for which you expect to have that many false positives. So after that, the p-values are going to be in false positive territory according to your threshold. Okay, so the stringency of multiple testing correction depends on the number of tests that you do. So the more tests that you do, the more
sensitive the test has to be, or the stronger the signal in your data has to be. And so one way of making this less stringent is by reducing the number of tests. So this is one of the reasons why I recommend starting a pathway enrichment analysis with just pathways, instead of incorporating everything in Gene Ontology and everything that you could incorporate, because the more things you incorporate, the worse your multiple testing correction is going to be. You're going to have to multiply your p-values by bigger numbers in general. So it's just an important thing to be aware of. Oops, okay, this is a PowerPoint issue. So to summarize, and we'll go over tools that use this in the lab, so just to give you a preview: basically everything we talk about in this class is freely available. g:Profiler uses this minimum hypergeometric test for ranked lists, where you compute the normal hypergeometric at different thresholds. GSEA uses this enrichment score followed by empirical p-values, and then multiple testing correction is applied to that. Bonferroni is a very stringent multiple testing correction, and FDR is a more forgiving one that controls the proportion of false positives. So this is the end of this lecture. One thing that's important to know in pathway enrichment analysis is that pathways are not necessarily independent. So you could have two pathways that are related to each other, right? They have a lot of genes in common. Like, I'm looking for an example, a bunch of pathways are related to control of chromatin state; they're likely going to include the same genes in some way, or they're going to have overlaps, especially in Gene Ontology, where we have these
terms where you can go all the way up the hierarchy and assign them all to a gene, and all of those terms are by definition overlapping with each other. And the basis of this multiple testing correction is that the tests are independent. But when we test for enrichment of a pathway, and then we have a very similar pathway and we test for enrichment of that, and then another similar pathway and we test for enrichment of that, the tests are not completely independent. They're partially independent, because it's not like we're running the exact same test over and over again with the same pathway. It's different pathways, there's different numbers of different genes, but they're overlapping. In general, none of the standard pathway enrichment analysis methods model that overlap between the gene sets. There are some that have been published that try to do it, but they're not widely used, because there might be other issues with using them. In general, the problem of how to automatically correct for multiple testing in that type of scenario, where you have gene sets, pathways, that are similar, is not fully solved statistically. And so as a result, the multiple-testing-corrected values that come out of pathway enrichment analysis are not exactly correct statistically, because they don't meet the assumptions of the correction. So because there's no perfect solution right now, we use a reasonable, approximate solution, which is visualizing and exploring the results and grouping pathways that are similar, so you can identify major themes, and then you can study the p-values associated with those themes to see what the distribution of p-values is across all the similar pathways. That'll become clear later when Veronique talks
about the enrichment map this afternoon.
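The two corrections summarized in this lecture can also be written out in a few lines. The sketch below is only an illustration of the arithmetic described above: the function names and the example p-values are made up, and real analyses would normally rely on the correction built into the enrichment tool. Bonferroni multiplies each p-value by the number of tests (capped at 1). Benjamini-Hochberg multiplies each p-value by the number of tests divided by its rank, then takes the smallest adjusted value among the same or larger ranks.

```python
def bonferroni(pvalues):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Return q-values in the same order as the input p-values."""
    m = len(pvalues)
    # Indices of the p-values sorted from best (smallest) to worst.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    smallest_so_far = 1.0
    # Walk from the worst rank to the best, keeping the smallest adjusted
    # value seen among the same or larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        smallest_so_far = min(smallest_so_far, pvalues[i] * m / rank)
        qvalues[i] = smallest_so_far
    return qvalues


# Made-up p-values for five hypothetical pathways.
pvals = [0.001, 0.012, 0.020, 0.040, 0.300]
bonf = bonferroni(pvals)
qvals = benjamini_hochberg(pvals)
for p, b, q in zip(pvals, bonf, qvals):
    print(f"p = {p:.3f}  Bonferroni = {b:.3f}  BH q = {q:.3f}")
```

With these made-up numbers, the top-ranked p-value gets the same correction under both methods (multiplied by the number of tests), while lower-ranked p-values are corrected more gently by Benjamini-Hochberg than by Bonferroni.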