Hi everyone, my name is Quaid. I'm a computational biology researcher; I've had my lab for about 10 years now. My training is in machine learning and computer science, and also in biology and statistics, and we do a lot of methodological development. I developed the algorithm I'm going to talk to you about tomorrow, the GeneMANIA inference algorithm, and Gary and I collectively developed its interface. I've been working in this field for a long time, and I've been teaching this class for about 6 years. Great. So what I'm going to teach you about today is finding overrepresented pathways in gene lists, or gene-set enrichment analysis. What I'll be trying to convey are the concepts: we're obviously not going to go deeply into the math, but I want you to understand what the concepts are and what the statistics are trying to do. The reason this is important is that there are tools you can use to do these enrichment analyses, but they change all the time. Sometimes the tools themselves aren't perfect, and sometimes interpreting their output requires you to understand how the calculations were done in the first place, and the issues that can appear when you use these tools. So let's jump straight into it. You have a list of learning objectives here. I feel a little uncomfortable because I usually like to point at things, and we have two screens here, so I can't really do that. Oh good, okay, I have a mouse. All right. So we're going to talk about two different ways to do enrichment analysis.
One uses just a gene list. Gary's talked about various ways of coming up with gene lists, and I'll mention a couple of other analyses you might use to come up with one. The other way is analyzing a ranked list of genes: if you have some score you can assign to the genes so that you can rank them from top to bottom, say from up-regulated to down-regulated, there's a different series of tests you can run on that ranked list. In some cases these give you more information, and they also have more statistical power, so they're more sensitive and can detect weaker signals. That's what these first two learning objectives are: the gene-list case and the ranked-list case. Then, you're not just going to test enrichment against one pathway. I'm going to say "gene sets" rather than "pathways" because I want to be more general: as Gary said, you might be testing things like chromosomal position, and that's not necessarily a pathway, right? When you test for enrichment of gene sets, you're usually testing dozens or hundreds at the same time, and when you do these p-value calculations you have to correct for that fact. I'll explain and justify during this talk why you need that correction; the way you do it is called a multiple test correction, and we're going to talk about the two major multiple test corrections. These are the only two you need to know; they're the ones everybody uses. So you'll be able to select between these two types of corrections: one is called a Bonferroni correction and the other is a false discovery rate correction. And I'm going to explain to you in very plain language how to calculate these things.
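As a minimal sketch (my own illustration, not from the lecture slides), here is how the two corrections can be computed in plain Python. The p-values are made-up illustrative numbers; in practice a library routine such as `statsmodels.stats.multitest.multipletests` does this for you:

```python
def bonferroni(pvals):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def bh_fdr(pvals):
    """Benjamini-Hochberg FDR: adjusted p = p * m / rank, with a running
    minimum taken from the largest p-value down so values stay monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                      # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.27]            # illustrative only
print(bonferroni(pvals))
print(bh_fdr(pvals))
```

Note how the FDR correction is gentler than Bonferroni: Bonferroni scales every p-value by the full number of tests, while BH scales each one by the number of tests divided by its rank.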
These corrections are actually really easy to calculate, and I'm covering them because in some of the tools you'll be using, especially if you're working with a non-model organism, the correction might not be built in, and you might have to do it yourself, okay? All right, and in going through the learning objectives we've also gone through the outline. The two rank-based tests I'm going to tell you about are GSEA and the minimum hypergeometric test, in case you've heard of either of those; maybe this will stimulate your knowledge so you know what to look for. As I said, there are two types of enrichment analysis, and I'll repeat this again. The first uses a gene list: are any gene sets (wherever you see "gene set", you can substitute "pathway" in your mind) surprisingly enriched or depleted in the gene list? There's one statistical test for this, called Fisher's exact test; people sometimes also call it the hypergeometric test. It's the only thing you ever need to learn here. People have used other statistical tests for this type of analysis, but that was in the old days when computers were slow and you had to approximate the test. Now you can do the exact test, and there's no reason to use approximations anymore because computers are fast enough. Then, when you go to a ranked list, say differential expression, up-regulated or down-regulated in response to your perturbation, the question you ask is slightly different: are any gene sets (remember, that means pathways) ranked surprisingly high or low in my ranked list of genes? Do they occur at the top or the bottom? Are these pathways, in general, up-regulated or down-regulated?
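Going back to the gene-list case for a moment: the Fisher's exact test just mentioned depends on only four numbers, and the one-sided (enrichment) p-value is just a sum over the hypergeometric tail. Here is a minimal stdlib sketch; the numbers are illustrative and the function names are my own, and in practice you'd call a library routine such as `scipy.stats.fisher_exact`:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k successes when drawing n items without
    replacement from a population of N items containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def enrichment_pvalue(overlap, background_size, geneset_size, genelist_size):
    """One-sided p-value: probability of an overlap at least this large
    under random sampling from the background."""
    max_overlap = min(geneset_size, genelist_size)
    return sum(hypergeom_pmf(k, background_size, geneset_size, genelist_size)
               for k in range(overlap, max_overlap + 1))

# Illustrative numbers: a background of 2,000 genes, a 40-gene pathway,
# a 100-gene list, and an overlap of 8 genes (expected overlap is only 2).
p = enrichment_pvalue(8, 2000, 40, 100)
print(p)
```

The four arguments are exactly the four counts the test depends on: background size, gene-set size, gene-list size, and the overlap.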
Okay. Now, for the ranked-list question there are a lot of statistical tests, and they're all basically doing similar things; there are just small technical differences between them. These are the kinds of things statisticians argue about, but at the level we're working at we don't need to understand those differences in great detail, because they essentially do the same thing.

So here's the enrichment test. We always like to use expression as the example, because everybody comes from a background where they know about microarrays; it's the common language when people think about genomics, even though almost no one uses microarrays anymore. So you have some experiment, in this case a microarray experiment: you apply some perturbation, and you get a gene expression table in response to it. You use that gene expression table to define a gene list, the set of genes you're interested in, the ones that responded in a surprising way to whatever you did. Now the question is: what can I say about this list of genes? The way you answer that is you take a database of gene sets (these pathways), compare each to your gene list, look at the overlap, and come up with an enrichment table, which has the name of each pathway or gene set along with a p-value associated with that enrichment.

So here are the details. Given the list (here I just chose five genes), are there any gene sets or annotations that are surprisingly enriched in this set of five genes compared to some background that you're going to define? How you define the background is important, and we'll discuss that in a minute, but whenever you ask about enrichment, you're asking about enrichment against a background: is this gene list surprisingly enriched for genes involved in ribosomal biogenesis compared to the background of, for example, all genes expressed under these conditions? So whenever you do this enrichment test, think in the back of your mind: "compared to the background of…" what? The details, then, are: where did the gene list come from (I want to take a few minutes on that); how to assess "surprisingly" (that's the statistics, those are the p-values); and how to correct for repeating the test, which we'll cover in the second half of my presentation.

First, the two-class design. Imagine case versus control, or wild type versus mutant: normal versus perturbed. You look at, in this case, differential expression, but it could be differential methylation, differences measured at the proteomic level, or any number of other ways of measuring differences between genes under two conditions. Then you rank the genes by some differential statistic: is the mean expression level under the second condition higher than under the first? How many fold higher? Maybe you'll only consider something up-regulated if its p-value is significant. There are various ways of coming up with these gene lists, and we'll suppose you've made that decision yourself, because it varies a lot with how you're measuring expression and what your questions are. Once you have the ranking, you can threshold on the differential statistic to define a set of genes that are up-regulated in response to your perturbation, or down-regulated, or you can combine the two and say these are the genes that are differentially regulated. Those are all valid gene lists.

Another way of coming up with a gene list is a time-course design. Say you have differentiation of stem cells or something, and you measure expression (or whatever you're measuring) at different time points. Then for each gene you have a path through the time course: you see its expression go up, go down, maybe come back up again. One way to turn this into gene lists is to cluster the genes, using a clustering algorithm like k-means or k-medoids; there are various methods like this in the literature, and if you look up clustering of gene expression you'll find many of them. Each one of the clusters can then become a gene list. [Audience: would you do this on the fold change?] Yes, I'd use the log fold change. The reason is that with raw fold change, going up doesn't look the same as going down: when you go up 10-fold you go from, say, 1 to 10, but when you go down 10-fold you go from 1 to 1/10, which is pretty close to 1; 10 is a lot further from 1 than 1/10 is. With log fold change you go from +1 (the log base 10 of 10) down to -1, so going up and going down are symmetric. Does that make sense?
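That symmetry is easy to see with a couple of made-up expression values (a quick illustration of mine, not from the slides):

```python
from math import log10

baseline = 1.0
up_10x, down_10x = 10.0, 0.1

# Raw fold changes are asymmetric around 1:
print(up_10x / baseline)    # 10.0
print(down_10x / baseline)  # 0.1

# Log fold changes are symmetric around 0:
print(log10(up_10x / baseline))    # 1.0
print(log10(down_10x / baseline))  # -1.0
```

A distance-based clustering algorithm sees 10 as far from 1 but 0.1 as close to 1; on the log scale, +1 and -1 are equally far from 0, so up- and down-regulation are treated alike.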
That's why you use log fold change. Okay, so under a time-course design, each of these clusters becomes a separate gene list, and you can ask for enrichment within a cluster; here an obvious background would be the genes in the other clusters. All right, so now we've defined our gene list, say the genes up-regulated above some fold-change or significance threshold, and we've defined our background, and now we're going to do the gene-list enrichment test. Here the background is all genes on the array, and maybe this is the point where I should stop and say a few words about how to define your background, because defining your background is really important. In the old days, when microarrays were first invented, they were expensive, so there were microarrays that contained spots only for genes expressed in the immune system. If I took a random subset of genes from that microarray and asked whether they're enriched for immune function, well, guess what: they're enriched for immune function, because that's how the background was selected in the first place. So when you're asking whether there's a surprising enrichment for genes of a certain function, you need to know what your background is, and the enrichment has to be discernible above that background. We'll return to this issue throughout my presentation, and certainly ask me questions about it. When you do a proteomic assay, for example, maybe you want to consider as background the proteins you suspect you'd actually be able to detect, because they're at high enough abundance to appear in tandem mass spec. Precisely: the way to define the background is to ask yourself, what are all the genes that could have appeared on the gene list, given what I'm measuring? So in your case,
yes: if it doesn't have a phospho-site, it's not going to show up, right? And it also has to be expressed. Great. If this isn't clear, we're going to return to the issue, and you can keep asking me about it, or ask me offline, and I'll see if I have any good ideas about how to help you define a background. Okay, so now you take your list and your background and you compare them to the gene set, and here we've drawn the little Venn diagram to show the overlap. Notice there can be parts of the gene set that are in neither your list nor your background; unless your background covers every gene in the genome, some annotated genes won't be in either one. But then you want to ask: given the overlap between the gene set and the gene list, is this overlap large? How do we assess that? We assess it with a p-value. And what does the p-value represent here? Let me say this carefully: the p-value is the probability that you would see an overlap as big as the one you did, or larger, just by random sampling. So if, for example, instead of the gene set I just randomly selected the same number of genes from the background, what's the probability that the overlap with your gene list would be at least as large? That's what the p-value represents. A p-value of 5% means that only 5% of the time, sampling randomly from your background, would I see an overlap at least this large. That's a way of quantifying "surprisingly". [Audience question about what "overlap" means.] Sure, it's going to come up later in the talk, but I'll tell you now: it's the number of genes
that are both in the gene set and on your gene list. For example, if there are 20 genes in your pathway that are in your background set, and you see 10 of them on your gene list, is that surprisingly big given the size of your gene list and the size of the gene set, or not? That's what we mean by overlap here. All right, so how do you do this? You define your gene list and your background list, you select the gene sets to test for enrichment, you run the enrichment test, correct for multiple testing, interpret the enrichment, and publish. It's pretty straightforward; the publishing might be a little less straightforward.

Okay, so if what you have is a gene list, you're basically done once I tell you what the test is and how to do the multiple test correction. There aren't many more decisions, because there's just the one test; you have to decide what the threshold is, but after that it's pretty clear what you should be doing. If instead you use a ranked list, where you have some way of scoring genes, a differential expression statistic, for example, or how strongly you believe a gene should be part of the gene list, then there are tests that look at the whole list at once and ask whether there's enrichment near the top or near the bottom. The advantage of these tests is that you don't have to choose a threshold, and thresholds can be somewhat arbitrary. If you choose the threshold wrong, you might lose sensitivity, meaning there could be a signal there that you can't detect because of your choice. Not having a threshold also removes that arbitrariness from your analysis: if different thresholds gave you different results, that would be a bit problematic, whereas if you don't choose a threshold, you aren't making an arbitrary choice in order to get your analysis to work out. In general, avoiding arbitrary choices is the way to go when you're doing analysis.

Okay, so here the idea is basically the same, except that instead of having a gene list, you rank the genes in some way. Once you have that ranking, you apply one of the ranked-list tests; the two we'll talk about are GSEA and the minimum hypergeometric test. Then it's exactly the same: you assess p-values, and you correct them if you need to. The only difference is that you rank your genes instead of choosing a threshold. Okay, so that was the overview; any questions before we get into the statistics?

Now, getting back to your question about what the overlap means. Here we have a gene list with five genes on it, and a background population of 5,000 genes: 500 of them are black and 4,500 are red. We want to ask: are there more black balls on this gene list than we would expect if we were randomly sampling from the background? If I just pulled out five balls at random, how often would I see four black balls or more? That percentage is exactly what the p-value is. Now, whenever you mention p-values, people ask what the null hypothesis is, and you do have to tell people what random sampling means here, what question you're answering, because you could be asking a different question. For example, you could ask: do I see more black balls than I would expect, or fewer, under random sampling from this population? Here we're asking the one-sided question, do I see more black balls than I would expect; asking "more or fewer" would be a two-tailed test. That's what it means when people ask what the null hypothesis is. And the test to assess this p-value is sometimes called
the hypergeometric test; sometimes it's called Fisher's exact test. They're the same test. So what do you do? You use the hypergeometric distribution to compute the probability of getting zero black balls out of five, one black ball out of five, and so on, all the way up to five black balls out of five. Then you sum the probabilities for four and five, because the question is four or more black balls. This function depends on the size of the background population, the number of black genes in it, the size of your gene list, and the number of black genes on your gene list, which is the overlap; it depends on just those four numbers. You can look the function up on Wikipedia if you want; you don't really need to, but it's called the hypergeometric distribution, which is why this is called the hypergeometric test. You sum over four and five because it's four black balls or more, and that sum is the p-value; the full set of probabilities is the null distribution. The vast majority of the time, the tools do this for you, all under the hood, so you don't need to know the details. But sometimes, if you're working with a non-model organism, as I said before, you have to do it yourself, and to do it yourself you have to come up with what's called a two-by-two contingency table. What's that? It's just a way to give whatever tool is computing the p-value the numbers it needs: how many genes are in the gene set and on the gene list, how many are in the gene set but not on the gene list, and so on. You lay out all four combinations of being in or out of the gene set and on or off the gene list, and the tool can extract from that table the numbers it needs to compute the hypergeometric p-value. Okay. Now, a couple of details. So far I've been saying you want to test for enrichment of black balls, and what I mean by
black balls is presence in the gene set. To test for under-enrichment (depletion) of black balls, you can instead test for over-enrichment of red balls: you make it a depletion test by treating the complement as the gene set. You also need to choose the background population appropriately. The question is: what are the genes that could have shown up on the gene list? That matters because you're using the background to compute the probability of randomly sampling a gene list with this degree of overlap with the gene set, this many shared black balls. So all the genes that could have shown up on the gene list should be in your background, but you do want to exclude genes that could never have shown up, for example because you're not even measuring their expression. [Audience: should the background be the negative controls?] Well, I wouldn't use the negative controls, because you won't necessarily get a lot of hits on your negative controls, and you want a sizable set for your background. If you're doing proteomics, one option is that people maintain databases of all the proteins that have ever been detected to interact with other proteins, so you know they're amenable to your assay. The negative controls aren't the best background because they don't contain many genes, and, importantly, your positives don't show up on your negative-control list, but you actually want your positives in the background set, because they're among the genes you could have selected. A negative-control list is really a way of identifying the frequent flyers, the proteins that show up under all circumstances because they're either sticky or very abundant. What you're building with a background is not a list of frequent flyers, and you're not looking to remove them either; you're looking to say: okay, this is the set that could have shown up, because they're cytoplasmic, or they're expressed in the cell type I'm doing this interaction assay in, or they're at sufficiently high abundance that I could detect them by proteomics even if they're not so abundant that I detect them in every proteomic assay I do. Yes, you have to make a decision about what the background is. The other thing you can do is be careful about how you interpret your results. If you're having a lot of difficulty defining your background, the best thing is to think hard about what it should be; but you can also say, okay, I've run this test on my immune-focused chip and I see enrichment for immune function in a very broad way, and I'm not going to over-interpret that, given the background I have. You have to be aware of this problem, and there are various ways of trying to address it. My suggestion is to control it: do a good job of defining your background, but also do a good job of interpreting your enrichments so that you don't over-interpret them, knowing the kind of background you're sampling from. More questions? Okay.

All right, like I said, you will sometimes see other enrichment tests; people have used the binomial or chi-squared tests in place of Fisher's exact test. These are approximations to Fisher's exact test, and they were used because, basically, in the old days statisticians didn't have computers; they had to do everything with tables, writing things down. The reason Fisher's is computationally heavy is that you have to sum over the entire tail. In the example I gave, you just
had to sum the four-black-ball and five-black-ball cases, but usually you're summing from something like 100 to 900, doing a lot of fairly complicated computations; that's why those other tests were sometimes used to approximate what you can do with Fisher's exact test. With the ranked list, you have to make a choice among, what is it, five major tests that people tend to use. We're teaching you the first two. The last two here, the Wilcoxon rank-sum test and the Mann-Whitney U test, are actually the same test: they were independently discovered and given different names, and later people realized they were identical. Does everybody know what a t-test is? The Mann-Whitney U test works like this: you take your ranked list, and instead of the value you would feed to a t-test, you use the rank; you replace the differential expression value with its rank, and then you ask whether the average rank assigned to the gene set is significantly higher, or significantly lower, than the background. That's the Wilcoxon, or Mann-Whitney U, test. Now, the tests we're going to teach you are along the lines of the KS test, the Kolmogorov-Smirnov test. That's a great name; people can say "Smirnov" now because of the vodka, but "Kolmogorov" is a bit hard, so people just call it the KS test. The tests we'll cover are variations of the simple KS test, and I'll show you how the variation works in a second. The reason we're teaching these, and the reason people like them better, is that the t-test-like approaches have a hard time with some kinds of data. Say the gene set is both up-regulated and down-regulated. If you imagine the differential test statistic, there'd be a single bell curve, which would be the
background, but if you look at your gene set, there are two bell curves; it's called a bimodal distribution, because some of the genes are up-regulated and some are down-regulated. The two distributions can have the same mean, because for the bimodal one the mean sits between the two modes and for the background it sits at the center. The Wilcoxon / Mann-Whitney U test can't detect that difference, but the KS test can. We're not teaching the plain KS test itself but variations of it; still, if you're talking to a statistician, you can say "KS test" and they'll know what you're talking about. So what's the minimum hypergeometric test? It was introduced in this paper here, if you want to look it up, and the idea is really simple. I just taught you how, given one gene list, to compute enrichment and assign a p-value to it. What you can do instead is run that hypergeometric test at every possible threshold you could use to define a gene list from your ranked list, take the minimum p-value, and then ask whether that minimum is surprisingly small. That's the minimum hypergeometric test. And it turns out this is essentially equivalent to the GSEA test, which I'm assuming you've heard of, because it's a very widely known way of doing enrichment analysis. Here's how it ends up working. You have a gene set, and you have a ranked list; these black lines represent genes from your background set on the ranked list, ranked from top to bottom, and the dark red lines indicate where the gene-set genes show up in the ranked list. The question is: are the gene-set genes, the red lines, near the top or near the bottom of the list, or is their distribution in this list
more or less random? To assess that, you keep a running total of something called the enrichment score. If you look at the GSEA literature, they do call this an enrichment score; to make the connection with the minimum hypergeometric test, you can also think of the enrichment score as the negative log p-value of the hypergeometric test at that threshold. The way the score works is: every time you come to a gene that's not in your gene set, not in your pathway, you step down, and every time you come to a gene that is in your pathway, you step up. The alignment in this figure isn't perfect, but it's meant to be: every time you hit a gene-set gene you go up a bit, and then a whole run of non-set genes takes you down. So it's a running total of the number of genes in your gene set, with the genes not in your gene set subtracted away. What happens is that as the score climbs, you're seeing higher and higher enrichment of gene-set genes in the list up to that point. Okay, so in this setting we no longer have a gene list; we have a way of ranking the genes, say from most up-regulated to most down-regulated. Normally, to define a gene list, you'd just choose a threshold and say everything above this line is my gene list. Now we just have a ranking, so there are many different thresholds you could choose, and your gene list would be everything from the start of the list down to that threshold. What the ES score is doing is measuring enrichment as you step down the list, implicitly trying every threshold. When you find genes that aren't in your gene set, the enrichment score goes
down; when you find a gene that is in the gene set, it goes up. So it goes down, down, down, then up, up, up, and maybe it peaks here, and that peak defines your most enriched gene list. That's exactly what you do: you find the point at which the enrichment score is largest. In the GSEA literature they call the genes up to that point the leading-edge subset; it's the best possible gene list you could come up with, and you've discovered it just by stepping down the ranked list until you find the threshold that gives the best enrichment. The ES score at that point is the ES score for the ranked list; it's the maximum ES score. Now the question becomes: is this maximum ES score surprisingly large or not? You can imagine that even for a randomly ordered list of genes, there's some point at which you get the maximum possible enrichment, and because you get to choose where to cut, that maximum will still be positive; even with a random list you can pick what looks like a pretty good cutoff. So you have to make sure that your maximum ES score is surprisingly large compared to random orderings, random permutations, of your genes. How do you do that? In GSEA, and also for the minimum hypergeometric p-value, you compute the max ES score under random orderings: you take your ranked gene list, randomly reorder it, recompute the maximum ES score, reorder it again, recompute, and do this again and again. Remember, the p-value is the proportion of the time you would see an enrichment score at least this large under random reorderings, and you compute it just by counting. Let's say I'm going to do it 100,000
Question: what's the background list here? The background list is actually this whole ranked list — the background is already included. I've used two different meanings of "background": sometimes I've said you have a gene list and then a background. What I should say is that the whole list is the background: the genes that end up in the gene list and the genes that don't, together. When you rank your genes, you're ranking your gene list and the background set at the same time, so the background is inside your ranked list. In this example I've used 401 genes, so it's actually a pretty small background drawn from a genome of roughly 20,000 genes. Does that make sense?

And just to be complete: for the minimum hypergeometric p-value you have another option — you can use a multiple test correction across thresholds on the ES score — but we're not going to talk about that.

Like I said, the way you compute the p-value here is empirical: you sample by randomly reordering your list, computing the maximum ES score each time, building up the null distribution, and then counting the number of times you get an ES score at least as large as the one from the real ordering. That's your empirical p-value — say 4 out of 2,000 reorderings gave an ES score at least as good as my real value.

There's one technical issue I have to be very clear on: when you compute this p-value, you have to add one to the count of reorderings with a score at least as large — you include your real ordering as one of the orderings you've seen. Otherwise you can get p-values of zero: it can happen that your random reorderings never reach your real ES score because it's that high, and a p-value of zero isn't meaningful, because there is at least one ordering — the real one you observed — that gives an ES score at least that large. So you add one, and what that means is the p-value you report is always at least one over the number of random reorderings. If you need a p-value of one in a thousand, you have to do a thousand reorderings; if you need one in ten thousand, you have to do ten thousand reorderings.

The reason I make this point is that once we correct for multiple tests, we do need very highly significant p-values — p-values of that order. For that reason, when people use these permutation-based approaches like GSEA, you never use the most stringent multiple test correction; you almost always use the less stringent one. Don't use what's called the Bonferroni correction — use what's called the false discovery rate correction. You'll hear about both in a moment.

Here are some more GSEA examples. This one shows enrichment at the top of the list: the shaded region is the ranked gene list, the red lines mark where the genes of a set enriched in the treated condition fall, and the blue lines mark a gene set depleted in the treated condition. Remember, if your genes are enriched at the bottom of the list, the enrichment score plot goes to a minimum instead of a maximum — so you look for either the maximum or the minimum ES score. And this last panel is what the enrichment score plot can look like when your gene set is not enriched.
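The permutation recipe just described — reshuffle, recompute the maximum ES score, count, and add one — could look like the sketch below, using the same simplified scoring as before; `n_perm`, the seed, and the toy genes are all illustrative:

```python
import random

def max_es(ranked, gene_set):
    # Simplified running enrichment score: +1/hits per hit, -1/misses per miss
    hits = sum(g in gene_set for g in ranked)
    up, down = 1.0 / hits, 1.0 / (len(ranked) - hits)
    best = running = 0.0
    for g in ranked:
        running += up if g in gene_set else -down
        best = max(best, running)
    return best

def empirical_pvalue(ranked, gene_set, n_perm=999, seed=0):
    """Proportion of random orderings whose max ES is at least the observed
    one, with +1 in numerator and denominator so the real ordering counts as
    one seen and the p-value can never be exactly zero."""
    observed = max_es(ranked, gene_set)
    rng = random.Random(seed)
    perm = list(ranked)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if max_es(perm, gene_set) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # smallest reachable p is 1/(n_perm + 1)

genes = [f"g{i}" for i in range(20)]  # set genes g0..g2 sit at the very top
p = empirical_pvalue(genes, gene_set={"g0", "g1", "g2"}, n_perm=999)
```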
Question: how many permutations should you run? I think what you're referring to is the number of random permutations you do. You should do as many random permutations as you can — and you're not cheating by doing more; you're just better approximating the p-value. If you do a thousand random permutations, the lowest p-value you can get is 1 in 1,000, 10 to the minus 3. That's a bound on your p-value: the true p-value may actually be smaller, but you just can't resolve it because you haven't done enough permutations. With 10,000 permutations the smallest p-value you can get is 10 to the minus 4 — again a bound. More permutations simply estimate your p-value more accurately.

Those are different things, though. The background set — the other genes you add — comes up when you threshold to define a gene list: you have a gene list and then you add in the background, the genes not on the gene list, and that defines the size of your background set. You're right that with a smaller background you won't reach the same significance as with a bigger one, but that's not related to the random reshuffling. What I was just talking about — the number of times you randomly reorder the list — is essentially independent of the size of the list itself. Not completely independent: if your list has 50 elements there are only 50 factorial orderings, but 50 factorial is an enormous number. So yes, there are fewer ways to order a 50-gene list than a 100-gene list, but you're never going to enumerate all 50 factorial orderings anyway. Now I see what your connection was — does that make sense?

Question: so the fold change is what you sort on? Yes, the fold change is what you use to sort, and then the rank is basically the row number in Excel.

Question: can you remove genes that aren't expressed before testing? You can't do that, because you'd be reducing the size of your background — you can't choose your background based on the outcome of your experiment. So I don't think you can remove the stuff that's not expressed, unless the question you're asking is strictly up-regulated versus down-regulated. In that case you could probably remove the stuff in the middle, as long as your up-regulated and down-regulated sets are treated approximately the same way. I think that would be okay. More questions?

So why do we have to talk about multiple test corrections? Because there's a great way to win the p-value lottery. Want a very significant p-value? Just keep taking random draws from the background population. If my p-value says I expect 1 in every 10,000 random draws to have this much overlap with the gene set, then if I do 10,000 random draws, I expect one of them, on average, to show that much enrichment. You can win the p-value lottery just by redrawing — that part is obvious. What's less obvious is that when you test different annotations or different gene sets, you are implicitly redrawing from this distribution. You're cheating, in a way. For example, if my p-value is 0.05 — only 5% of random draws have an enrichment at least this large — and I test 1,000 different gene sets or annotations for a given gene list, I expect 50 of them to come out with a p-value below 0.05, because 5% of 1,000 is 50. You have to correct for that.
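You can see the lottery directly with a quick simulation: under the null hypothesis, p-values are uniform between 0 and 1, so 1,000 null tests yield roughly 50 "hits" at p < 0.05 by chance alone (the seed below is arbitrary):

```python
import random

rng = random.Random(123)
# 1,000 null tests: each p-value is uniform on [0, 1] when nothing is enriched
null_pvalues = [rng.random() for _ in range(1000)]
false_hits = sum(p < 0.05 for p in null_pvalues)
print(false_hits)  # roughly 50 of the 1,000 pass p < 0.05 by chance alone
```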
The reason this is important is that when you're looking at annotations, you could be testing thousands of them. You have some gene list, you want to know what's going on in it, and as Gary said, there are lots of databases and lots of different things you could look at. So when you say "these 50 annotations are enriched in my gene list," you have to make sure you're seeing those 50 because they really are enriched, not just because you ran 5,000 tests. Multiple test corrections are ways of correcting for this fact.

The easiest one in the world is the Bonferroni correction. How does it work? You take your original p-value — sometimes called the nominal p-value — and multiply it by the number of tests you did, which is the number of gene sets you're looking at. So if your p-value for a gene set is 0.05 and you've done a thousand tests, you multiply 0.05 by 1,000 and get 50. Your corrected p-value is now 50, which is a bit weird — more on that in a second.

Question: is m the number of genes? No, it's the number of tests — the number of gene sets you're looking at. If you ask "which GO category is enriched in my gene list?" and there are around 10,000 GO categories, you're really asking 10,000 questions, so m is 10,000. If instead you use something like GO Slim, which Gary talked about — a summary of the GO annotations — then you're only asking which of those roughly 100 GO Slim categories is enriched, and m is 100. You multiply the original p-value by 100 in that case to get the corrected p-value. Does that make sense?

Now, a p-value has to be between 0 and 1 — everyone knows that.
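The Bonferroni arithmetic is one line; most tools also cap the product at 1, since anything above 1 is just an uninformative bound (a minimal sketch):

```python
def bonferroni(p_nominal, m_tests):
    """Multiply the nominal p-value by the number of tests; cap at 1.0
    because the product is only an upper bound on the true p-value."""
    return min(1.0, p_nominal * m_tests)

bonferroni(0.05, 1000)  # 0.05 * 1000 = 50, reported as the cap, 1.0
bonferroni(1e-5, 1000)  # 0.01 -- still significant after correction
```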
So what happens when you get a corrected p-value of 50? It goes back to what I said before: the p-value you compute is a bound — an upper bound — on your actual p-value. A corrected p-value of 50 says the true corrected p-value is 50 or less. We also know it's 1 or less, but the calculation only gives you the bound. The fact that you can get these huge values tells you that the Bonferroni correction is extremely conservative: it makes no assumptions about how the categories might be related to one another or anything like that. It's the most conservative correction, and it controls what's called the family-wise error rate.

What does that mean? Remember, if you do a thousand tests at p ≤ 0.05, about 50 of them pass by random chance on average, because 50 is 5% of a thousand. When a Bonferroni-corrected p-value is 0.05, that says the probability that any one of your tests is due to random chance is less than 5% — so you'd expect zero of the thousand, on average, to be due to random chance. That's the family-wise error rate: the probability of one or more false positives across all the tests.

This is a problem, as I said, because being so conservative means lots of false negatives: there are things genuinely enriched in your list that you won't be able to detect, because the Bonferroni correction took a p-value that would normally be significant and made it insignificant. Especially in genomic analyses, where you're testing lots of gene sets that are connected to each other in some way, people are willing — in fact, almost always choose — to accept a less stringent criterion, and that's called computing the false discovery rate, or the q-value.

The q-value makes a very different promise about the false positives you'd expect among your enrichment tests. The false discovery rate is a promise about the expected proportion of your observed, reported enrichments that are due to random chance. What do I mean by that? Say I do a thousand tests — I look at a thousand different gene sets — and I report 100 of them as enriched at a false discovery rate of 0.05. That means only 5% of those 100, on average, will be false positives: 95% of my reported enrichments are real and 5% are wrong. I don't know which 5%; if I knew, I'd just take them out. But on average only 5% are wrong.

That's very different from the Bonferroni correction. With a Bonferroni-corrected p-value of 0.05 on 100 reported enrichments, I'm saying there's only a 5% chance that even one of them is wrong. With a false discovery rate of 0.05, I expect 5% of the 100 — about 5 on average — to be wrong. And if I just used a nominal p-value of 0.05 with no correction over 1,000 tests and reported 100 of them, I'd actually expect 50 to be wrong, because 50 is 5% of 1,000. What FDR threshold to use depends on your community: a lot of communities use 10% FDR, 5% is good too, but it varies a lot from community to community.

Now I'll tell you how to compute the false discovery rate, having just compared it to the Bonferroni correction. There are a variety of techniques to compute — or rather estimate — it; the most common one by far is the Benjamini-Hochberg procedure, which is what I'll describe, and the FDR-adjusted value is often called the q-value instead of the p-value. The reason I'm showing you this — like the Bonferroni correction, which is easy to do in Excel — is that this too is something you can do in Excel if you have to, because the tools won't necessarily do it for you, or won't necessarily do it right.

So how do you compute it? Here are all the gene sets I've tested. This is a cartoon — I made the numbers up, but in a way that illustrates the procedure. Say these are the categories in my GO Slim: transcription regulation is the most enriched, and translation is not enriched at all. I've taken all the categories and their nominal p-values — the p-values computed with, say, Fisher's exact test or one of the ranked-list tests — and sorted them from smallest to largest. (The slide says "decreasing order"; that should be taken out and replaced with increasing order — smallest to largest.)

Then you compute what's called the adjusted p-value: take the nominal p-value, multiply it by the number of tests you've done — 53 here — and divide by the rank in the sorted list. The first one gets multiplied by 53 divided by 1, the second by 53 divided by 2, the third by 53 divided by 3, the fourth by 53 divided by 4, and so forth.

Let me make a couple of comments. First, the adjusted p-value at the top of the list is just the original p-value times 53 — that's exactly the Bonferroni correction. As you go down the list the correction gets weaker, and at the bottom you're multiplying by 53 and dividing by 53, which is 1 — no correction at all. So from the top of the list to the bottom, the correction applied to get the adjusted p-value keeps weakening. Second, even though the nominal p-values go from smallest to largest, the adjusted p-value can bounce up and down — above 0.5, below 0.5, up to 1, back down — precisely because the correction keeps getting weaker.

Now the q-value can be computed from this list. The q-value at a given rank is the smallest adjusted p-value at that rank or below it in the list — at that rank or any rank further down. You see here, for this row, the smallest adjusted p-value at or below it is 0.04 — you'll have to take my word that everything further down is above 0.04 — so the q-value is 0.04, and likewise 0.04, 0.04 for the rows just above, since 0.04 is still the smallest adjusted value at or below them. Then we reach this row, where the smallest adjusted p-value at or below it is 0.99, and that's its q-value. And now you're done: that's the Benjamini-Hochberg procedure (often called the step-up procedure, for the bottom-to-top pass), and you've computed the FDR. It's very easy to do in Excel — the hardest part is figuring out how to get a value like the 0.456 here filled down across all the rows, and you can figure that out. People then choose their FDR threshold based on these q-values.
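The whole procedure — sort ascending, multiply by m over the rank, then take the running minimum from the bottom up — fits in a short function; this is a sketch, and the input p-values below are invented:

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg: the q-value at each rank is the smallest adjusted
    p-value (nominal p * m / rank) at that rank or below in the sorted list."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # increasing order
    adjusted = [pvalues[i] * m / (rank + 1) for rank, i in enumerate(order)]
    running_min = 1.0
    qvalues = [0.0] * m
    for rank in range(m - 1, -1, -1):        # walk up from the bottom
        running_min = min(running_min, adjusted[rank])
        qvalues[order[rank]] = running_min   # map back to the original order
    return qvalues

qs = bh_qvalues([0.01, 0.04, 0.03, 0.005])
# qs == [0.02, 0.04, 0.04, 0.02] up to floating-point rounding
```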
If you're going to threshold at an FDR of 0.05, there's a nominal p-value associated with that threshold, and you can just look it up in your table. One comment: even though the adjusted p-value bounces up and down, the q-value — because it's the smallest adjusted p-value at that rank or below — always increases as you go down the list. So you can always threshold it and be sure you get everything that passes.

Question: how do you rank proteins in proteomics data? I can give you some suggestions — I don't do proteomics myself, but I know what some of my colleagues do. It's really experiment-specific: how you come up with the rank depends not only on the instrument and its measurement properties but also on the question you're asking — differential regulation, or up-regulation versus down-regulation. In proteomics, sometimes people just count peptides, or count the number of unique peptides that show up; there are various ways of doing the quantification. You'll have to argue with proteomics people about how to rank those lists, and this is partly why people like coming up with gene lists rather than ranked lists of genes. But you had an answer to that?

Audience member: We're actually using a classical AP-MS approach with a bait, with a control — for example GFP — and we put our results into a website developed here in Toronto. It gives you a probability — not a p-value, but it works very well for us.

Response: The p-value really gives you a way of measuring enrichment, and as we've seen when defining ranked lists, you can use it simply as a score — the smaller the p-value, the higher the enrichment. You can then use that score in various ways: as a way of establishing a threshold, or of establishing a false discovery rate. So you can be pretty flexible: even though what the tool gives you is a probability, you can use it to rank. The other nice thing about the work Anne-Claude Gingras does is that she also comes up with "frequent flyers" — proteins that show up a lot and are broadly expressed in the cell type — and you can use those to define backgrounds for proteomics. Questions? All right, folks, you've done well.

So we've talked about two ways of doing the multiple test correction. If you use the Bonferroni correction while testing 10,000 gene annotation categories, then to reach significance at 0.05 you need a nominal p-value of about 5 in a million (0.05 divided by 10,000) — a very small p-value. How do you deal with that problem? You can use the false discovery rate, but the FDR won't always rescue you, because the correction you need still depends on the number of tests you do, so it can still be too stringent. The other way to approach the problem is to be careful about what questions you ask of your data. Instead of testing all 10,000 GO categories, just test GO Slim: instead of asking 10,000 questions you ask only 100, so you correct for 100 tests instead of 10,000, and the p-value you need for significance becomes around 10 to the minus 3 or 10 to the minus 4 instead of 10 to the minus 5 or 10 to the minus 6.

There are various ways of choosing which tests to run, but what you can't do is choose your tests after you've seen the data. You have to choose your tests before you've seen the data. That's obvious when I say it, but there are ways of implicitly violating it that are less obvious. A question I get a lot is: "I've got my data, I have my gene list — can I just remove all the GO categories that don't have any hits in my gene list?" No — you can't use your data to choose your tests. Basically what you'd be saying is "the overlap between my gene list and this gene set is 0," and the p-value for zero overlap is 1, because any random sample has an overlap of 0 or more. So you'd be filtering out all the p-values you've implicitly computed by looking for no overlap between the gene set and your gene list. You can't do that.

What you can do comes after you define your background set — the genes you think your gene list could have come from. You can say: I'm going to remove all GO categories that appear neither in my gene list nor in my background set. That's fair, because you could never get enrichment for those anyway — you'd be choosing 0 balls from a bin that contains 0 balls. So you can remove gene ontology categories that don't overlap your background set, or that have only a small overlap with it. (When I say background set here, I mean the gene list plus the rest of the background together.) If only a handful of genes in your entire background fall into a gene ontology category — if the gene set is small — you can't get very significant p-values, so you won't be able to detect enrichment for small gene sets anyway. People also sometimes remove gene sets that are too large, but let me take a step back to removing the small gene sets, because that is such a valuable thing to do.

In the gene ontology Gary described, the categories are hierarchical: if you annotate a gene in a low-level category, that annotation propagates up to all the higher-level categories. Eye development is a type of development, so any gene annotated to eye development is also annotated to development. At the lowest level of the hierarchy there are tens of thousands of tiny categories, and you are not going to be able to detect enrichment in those tiny categories. So if you set a threshold and say "if there are fewer than 10 genes from this category in my background list, I'm not even going to test it," that removes something like 80 to 90% of the gene ontology categories just like that — a great thing to do. g:Profiler, one of the tools you'll be using in the lab, lets you select cutoffs like "don't test any gene set with fewer than 10 genes in my background list," and this is what we generally do in my lab.

Going back to what I was saying: sometimes categories are so broad they're uninformative. "Your gene list is enriched for development" — what does that really tell you? Not much. So some people also set an upper limit on the size of the gene sets they test. I used to recommend that; now I don't think you should, because there are so few of these very large gene sets that it doesn't make much difference — you might as well include them in your analysis. And the reason they can still be useful is something Gary is going to talk about later: when you do these enrichment analyses, you often get dozens of enriched gene sets that are related to each other because of the overlap between GO categories — development and eye development come up enriched at the same time.
Having all this redundancy among the enriched categories helps you when you do something called enrichment maps, which group related categories together and define the sort of higher-level processes that are enriched. So my recommendation is: either use something like GO Slim, or remove categories that don't overlap much with your background set. I like 10 as the minimum; some people use 5, some use 3 — 5 might be good too. Really, you should make these decisions based on what are called power calculations: you can compute the most significant p-value you could possibly expect given a gene set of, say, size 5, and then decide quantitatively based on the most significant p-value you could expect to detect at a given level of enrichment. That's the principled way to make these decisions — or you can make a somewhat arbitrary choice, like the number 5, because it appeals to you.

Questions? So, to summarize: we talked about statistical tests of two types. One considers just a gene list, and that test is Fisher's exact test — that's really all you need to know there. The other considers a ranked list, and there's more to know: you need to decide how to rank your genes, and then choose the type of test. The ranked-list tests come in basically two flavors. There's the minimum hypergeometric test and the GSEA test — these are Kolmogorov-Smirnov-style tests that look for maximum enrichment; you can think of them as choosing the most enriched gene list from the ranking. And there's another type that's essentially a t-test applied to ranks, which I didn't talk about. Then, once you've done your statistical testing and computed nominal p-values, if you're testing more than one hypothesis — more than one gene set — you have to do a multiple test correction. You can do the Bonferroni, which controls the probability of at least one false positive and just multiplies by the number of tests, or the FDR or q-value, which controls the expected proportion of false positives and typically uses the Benjamini-Hochberg procedure I taught you. So those are the learning objectives — hopefully we've satisfied all of them.