Okay, I think a few stragglers will come in after we start, but we're five minutes into it. So my name's Quaid Morris, as I mentioned earlier, and I'm going to be talking to you today about finding overrepresented gene functions. That's the second part of this module. I'm going to be talking to you primarily about the theory of how to do this, and Daniele is going to come at 2:45, sorry, at 3 o'clock, and he's going to talk to you a bit more about the practice. He's going to introduce some tools to do an over-enrichment analysis or over-representation analysis. These are interchangeable terms. I'll probably use over-enrichment analysis, but other people say over-representation analysis. Okay. And so, as Gary nicely pointed out this morning, there are many ways to generate gene lists, and because I come from a gene microarray background, these are the two motivating examples that I chose, but of course you have a lot of other ways of generating lists of genes or lists of proteins. And so here, one way is simply by clustering. Now people are doing this a lot less, but the idea here is that, for example, in this case you have some time course gene expression data, and this is cell cycle gene expression data. And if you find a grouping of genes based on some clustering algorithm, that gives you an obvious gene list to use. Or you can do some sort of pull-down or some sort of over-expression study, maybe do a knock-down, maybe do a phenotypic screen. And in this case you have a way of scoring genes, either by the ratio or by some sort of measure of enrichment, that you can use to define this gene list by simply thresholding it. The other sort of thing that we're going to talk about today is not thresholding the gene list at all, and simply using this score that you come up with as a way of ranking genes.
And there are a lot of types of over-enrichment analysis that you can do on gene rankings instead of sets of genes. And then I have my own list of ways of coming up with gene lists, but as Gary went over those before, we're going to skip right over to sort of the summary of what an over-representation analysis looks like. Okay, and so before you do an over-representation analysis, obviously you need a gene list. And so here I just have a list of five yeast genes that we're going to use as a motivating example. And as I said before, you have the list, you have the scores, and then you have to have a set of gene annotations or attributes that you want to test for enrichment in your gene list. And so in the tools that Daniele is going to be describing, those attributes have already been pre-collected for you, so you can just choose them by selecting them from a list. Okay, and then the question that you're asking is, are any of these gene annotations or attributes surprisingly enriched in the gene list? Right, and the way in which you answer that question is by assessing the statistical significance of enrichment of that annotation in your gene list according to some background distribution, or your background set. And it's very important how you define that background set. I'm going to say a few words about that later on. And so what this means, essentially, is just calculating P-values. Now, often when you're doing this type of over-representation analysis, you're actually testing for a whole bunch of different things at once. You have a gene list, you're trying to figure out what's going on in the list, and you're going to be testing for enrichment of a whole bunch of different gene functions or a whole bunch of different gene attributes. So when calculating your P-values, you have to take that into consideration. That's called multiple testing, and you have to do a multiple testing correction.
And so as I mentioned before, I'm talking about the theory part of this and Daniele is going to be talking about the practice part of this. And so what I'm going to go over, really, is the standard ways that these P-values are assessed and how you do this correction for multiple testing. Now, a lot of the tools that you're going to be using already do all of this for you. But I think it's important to know where these P-values come from and what exactly they mean, so that you don't misreport things, or so that when you look at other people's papers, you can understand what it is that they did exactly. Okay. And so the first thing I'm going to talk about, sort of the bread and butter of this whole thing, is called Fisher's exact test. Whenever you see an over-enrichment analysis and you see a P-value, that's the P-value that's almost always reported. And then I'm going to talk about two different ways of correcting for multiple testing. There are two different ways that people use. One's called a Bonferroni correction, and the other one's an FDR-type correction. And I'll explain what those are in a little bit more detail. And then I'm going to briefly go over some standard tests for over-enrichment analysis that are based, like I said before, on gene rankings or gene scores. And Daniele is going to talk about one particular one of those tests, the GSEA test. But that test is actually very similar to something that I'm going to talk to you about. Okay. Just as a review, to go over what exactly a P-value is: I'm assuming that you have the background in statistics, that you've seen P-values before. But just to be precise about what we're going to be talking about for the next hour, I'm going to tell you. Okay. So the first thing you do to calculate a P-value is calculate some test statistic using the data.
So in this case, in the enrichment analysis, the test statistic that you're calculating is the number of things in your gene list that have this annotation, right? And then the P-value is the probability, or a bound on the probability, of seeing that value of the test statistic, or one that's more extreme, under your null hypothesis. And your null hypothesis just describes what you would expect if you were generating your gene list randomly, right? Okay. So intuitively, and this isn't precisely what it is, but intuitively, the P-value is the probability of a false positive enrichment. What it actually is, is the probability of the observation of the test statistic, or one more extreme, under the null hypothesis. But everybody thinks of it as just the false positive probability, and that's a fair way to think about it. The reason it's defined in this kind of complicated way is that the statisticians who came up with this didn't want to make any assumptions at all about the distribution that you're actually sampling from when you generate this gene list, because they don't know anything about the underlying process. What they can model, though, is the underlying process of what's going on if you are just randomly selecting gene lists, right? And so they're saying things about your probability of randomly generating this gene list based on your assumptions. Okay. I would appreciate it if you have any questions, just stop me right in the middle of my talk. I'm very happy to stop and answer questions as we go through. Okay. So now I'm going to talk about Fisher's exact test. To perform Fisher's exact test, you need two things. You have your gene list, and then you have to count the number of annotations. So in this case, we have five yeast genes, and these four, I think, are annotated with cell cycle. Okay, so four of the five have been annotated with cell cycle.
Now to perform Fisher's exact test, we also have to define the background population. In the tools that you use, sometimes the background population is defined for you, and sometimes you need to define it yourself. And what I mean by defining the background population is that you have to provide the longer gene list that these genes were selected from. So for example, if you did a microarray study, your background population would be all the genes that are on your microarray, including the genes that you picked. Okay. And this is actually very important, and I'll give you an example where you can go wrong by not defining the right background population. Okay, and then the question that Fisher's exact test is asking is: what is the probability of finding four or more black genes, so we're just going to call these annotated genes black genes, in a random sample of five genes from the given background population? That's our null hypothesis. The null hypothesis is we put our hands into this little bin and we pull out five genes at random, right? And so the P-value you get is the probability that you would observe something like this if you were just doing this randomly. Okay, and the way that you calculate it is you calculate the probability of seeing a certain number of black balls out of five given this background population, and then you just sum up the probabilities of four or five balls, and this gives you the P-value here. Okay? And that's it. That's the basis of over-enrichment analysis, the calculation of this P-value. Now, you don't have to do this yourself. This is called a hypergeometric P-value, because the distribution that you're plotting here is something called a hypergeometric distribution. So you might see this called a hypergeometric test, but usually people know it as Fisher's exact test.
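As an aside, the hypergeometric sum just described is easy to compute directly. Here is a minimal sketch in Python; the background numbers (50 annotated genes out of 500) are made up for illustration and are not from the lecture's slide:

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes without replacement from a
    background of N genes, K of which carry the annotation.
    This is the one-sided Fisher's exact test p-value for enrichment."""
    total = comb(N, n)
    # Sum the hypergeometric probabilities for k, k+1, ..., min(n, K)
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(n, K) + 1)) / total

# The lecture's toy setup: 5 genes drawn, 4 of them annotated "cell cycle"
p = hypergeom_pvalue(k=4, n=5, K=50, N=500)
print(p)
```

Library routines such as `scipy.stats.fisher_exact` or `scipy.stats.hypergeom.sf` do the same calculation; the explicit sum just makes visible exactly which tail probabilities are being added.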
Okay, all right, and then this is, oh, this is the null distribution, and there's the P-value. So that's it. Now you know, that's over-enrichment analysis. It's pretty straightforward. So there are a couple of important details. Right now I've just talked about how to test for over-enrichment. If, for example, you wanted to test for under-enrichment, you just change the color of the balls, right? And so instead of asking, do I see more cell cycle genes than I would expect, you can ask the question, do I see fewer cell cycle genes than I would expect, right? And so you count the number of non-cell-cycle genes that you see and run Fisher's exact test on that. Okay, and then there's what I stressed before, that you need to choose an appropriate background population. So this is a good time to give this example. For example, with early data that I worked with, people were using a microarray that just contained genes from the immune system, right? And so if you pull a gene list out of that microarray, even if it's a random gene list, that gene list is going to be very highly enriched for genes with immune function, right? And so if you define the background population as the entire, in this case, mouse genome, you're going to incorrectly assign very high enrichment P-values to immune function, but that's because everything that you could have possibly pulled off this array is an immune function gene. Does that make sense? Okay, and then to test for enrichment of more than one independent type of annotation, so instead of having red and black balls, you have balls that are squares or spheres, you just apply Fisher's exact test separately for each type, and more on this later.
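The under-enrichment flip is literally the same calculation on the other tail of the hypergeometric distribution. A small sketch; the numbers here (half of a 500-gene background annotated) are invented for illustration:

```python
from math import comb

def hypergeom_under_pvalue(k, n, K, N):
    """P(X <= k): probability of seeing k or fewer annotated genes in a
    random draw of n genes from N, where K genes carry the annotation.
    This is the one-sided test for under-enrichment (depletion)."""
    total = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(0, k + 1)) / total

# E.g. seeing 0 annotated genes out of 5 drawn, when half the background is annotated
print(hypergeom_under_pvalue(k=0, n=5, K=250, N=500))
```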
Okay, so Fisher's exact test is used for enrichment in gene lists for a single type of annotation. The P-value for Fisher's exact test, we've already gone over this, and this P-value depends both on the size of your gene list and the size of the background population, as well as the number of genes with the queried annotation in your gene list and the number in the background population. Okay, so you need four numbers to do this calculation. Okay, so as I alluded to earlier, most of the time you're not doing just one test, you're doing a series of tests, like a thousand tests, and you have to correct for these thousands of tests when you calculate the P-values, right? And there are two types of corrections that people do. Well, first of all, I'll tell you why you need to do this correction. Secondly, I'll tell you about controlling the family-wise error rate, and people generally call this a Bonferroni correction, but I'll tell you what that is actually correcting for. And also controlling the false discovery rate. There's pretty much one way that people standardly use to control the family-wise error rate. There are multiple ways of controlling the false discovery rate. I'm going to tell you about one that's very popular, called the Benjamini-Hochberg procedure. Okay, so the reason you need to correct for multiple testing is that there are ways in which you can win what I call the P-value lottery by simply repeating the test over and over again, right? So if you're sampling from a background population and you want to ask what the probability is of getting a random draw from that background population with four black balls and one red ball, and that probability is, say, like one in 10,000, well, you can get a sample from that background population that has four black balls by simply sampling over and over again. So doing more and more draws from the same background population, right?
And so, for example, and I'm getting a little confused here because my next slide looks like my first slide. For example, say you have a one in 10,000 chance of seeing something. You calculate one over the P-value, and one over one in 10,000 is 10,000, so if I did 10,000 draws from this background population, I would expect on average at least one of those draws to have an observed enrichment, if my P-value is one in 10,000, right? So if you just draw over and over again, your expectation gets really high, and you need to correct for this fact. And that's essentially what you're doing when you're trying out different annotations. Now, if you try out different annotations, you're changing the number of genes that are annotated in the background population, but you're essentially running this computation over and over and over again, right? So the P-values you report are no longer real P-values; they no longer represent your estimate of your false positives, your probability of having at least one false positive. And so, for example, if you're using the Gene Ontology website, and I have older statistics here, as Gary reported this morning it's actually 32,000 terms now, then if you're looking for enrichment in any one of these Gene Ontology categories, you're going to be running your test tens of thousands of times. All right, okay. So you can correct for this in one of two ways. One is controlling the family-wise error rate, and that controls the probability that any single test is a false positive. So if you're going to report a lot of enrichments, and you report a P-value, and you say that P-value controls the family-wise error rate, what you're saying is that it bounds the probability that any one of these positive enrichments I'm reporting is due to the background distribution. So you can imagine it's a very stringent test.
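The "P-value lottery" arithmetic can be written down directly; a small sketch, with numbers mirroring the one-in-10,000 example:

```python
p = 1e-4       # chance that a single random draw looks this enriched
m = 10_000     # number of independent tests (draws)

expected_false_positives = m * p      # on average, one draw will "win"
p_at_least_one = 1 - (1 - p) ** m     # chance of at least one winner
print(expected_false_positives, round(p_at_least_one, 3))
```

Even though any single test is a one-in-10,000 event, across 10,000 tests the chance of at least one false positive is about 63%.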
The other thing that you can report is the false discovery rate, and this controls the proportion of positive tests that are wrong. So if you report 500 positives at a false discovery rate of 10%, then you're saying that, on average, I expect no more than 10% of these positives to be wrong. Okay, and so who's seen the Bonferroni correction before? Okay, good, all right. This is new material. So the way that you take your initial P-values and change them to take into consideration the fact that you've done this multiple testing is you simply multiply the P-values that you originally calculated using Fisher's exact test by the number of tests that you've done, right? And this gives you a bound on the probability that any one of your tests is a false positive, okay? Now, remember, I said this is a bound, so it means it's greater than or equal to. So for example, if the best P-value you get is 1 in 100 and you've done 10,000 tests, once you do the Bonferroni correction, the so-called corrected P-value you get is going to be 100, right? And you shouldn't be scared about that. That's just because it's a bound on the probability. So the probability is actually maybe a little bit closer to one in that case. Okay, and so that's all you do. It's very easy to do this correction. You see this correction a lot in people's papers because it's so easy to do. You just count the number of tests that you've done. The problem is it's very stringent, right? As I alluded to before. So it can wash away real enrichments. And usually, when you're doing these types of functional genomic analyses, you expect a lot of enrichment in your gene list for a lot of the annotations you test, a lot of the attributes you test. And so in those cases, a lot of people are more willing to accept a more permissive way of correcting these P-values. And in that case, you can correct using what's called a false discovery rate.
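The Bonferroni correction itself is a one-liner; a sketch, where the cap at 1.0 reflects the "it's just a bound" point, since a probability can't exceed one:

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests; cap at 1.0,
    since the corrected value is only a bound on a probability."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

# The lecture's example: best p-value 1/100, but 10,000 tests done,
# so the corrected value saturates at 1.0
print(bonferroni([0.01] + [0.5] * 9_999)[0])
```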
And as I mentioned before, what that means is a bound on the proportion of the things that you report that are wrong, right? Which is different than the probability that any one of them is wrong, right? So for example, if you report 500 positive examples at 10%, that would say on average 50 of them are wrong. Okay, so the procedure that people typically use to do this is something called the step-up procedure, the Benjamini-Hochberg procedure, or BH FDR, right? And it's not as simple as just multiplying the P-values by something. What you have to do is take all the P-values of all the tests that you've done and rank them from highest to lowest, right? And remember, the lowest P-value is what you want, right? Because that's sort of the false positive probability, right? So each P-value has a rank associated with it, and then you also have to assign a level of significance that you want. So people tend to use 0.05 or 0.01 per P-value, right? In this case, for the false discovery rate, you can use those same numbers. Sometimes people use a more permissive threshold, like 10%, okay? So you take your level of significance and you start from the top of the list, which is the least significant P-value, going down to the bottom of the list, and for each one of these you calculate what I'm just going to call a Q-value here. And the Q-value is your initial level of significance, 0.05 in this case, multiplied by some number, right? So at the top of the list, that number is one, and as you go down the list, that number is the number of tests you did minus the rank plus one, divided by m. So at the bottom of the list, it's the level of significance times one over m, the Bonferroni correction, right? So as you go down this list, calculating these Q-values, you do this test at your level of significance. So you start from the top, go to the bottom, and the first time that your Q-value here, sorry.
Let me start again with this slide. Okay. Here are your P-values. What's confusing about this slide, to me, in presenting it, is that there are two ways you can do this, right? You can multiply your P-values by something, or you can divide your level of significance by something. And I've done both on this slide. So let me start again. Okay, so here are your P-values. These are your Q-values. And your Q-values, in this case, are a changing level of significance that you're comparing against, right? So before, when I was telling you about the Bonferroni correction, what I was doing was multiplying this P-value by one over this multiple right here, and testing it against your level of significance, 0.05, right? Now I've reversed those things around, because I think it's easier to think about it this way when you're doing the step-down procedure, but I confused myself. So hopefully I haven't confused you completely yet. So let's forget everything that's come before and start over. Okay. I did work this out later. All right. Okay, here are your P-values. You take your P-values and you sort them from highest to lowest, right? The stuff at the bottom is the most significant. And you know that with the P-values, you're going to be testing against a level of significance, and that level of significance is typically 0.05. And that's going to be represented by alpha here. So say we do want to test at a level of significance of 0.05, right? Okay, so now what happens is that the level of significance that you test against depends upon where the P-value ranks in the sorted list, right? So for the highest P-value, you test against your original level of significance, which is 0.05. For the last P-value in the list, your most significant P-value, the smallest P-value that you calculated in any of your tests, the level of significance that you test against is 0.05 times one over the number of tests you did. That's like one over a hundred here, right?
And this is equivalent to doing a Bonferroni correction, except for the Bonferroni correction, you'd multiply this number by one over a hundred and then test that number against 0.05, okay? And that's where the confusion came from. So with the step-down procedure, what you do is you start at the top and you go down, and you stop the first time that the P-value is less than the Q-value, right? And this Q-value represents an increasing level of significance, so it gets harder and harder to pass the Q-value, right? But once you are able to pass that Q-value, by having the P-value be less than it, then you stop. And then if you threshold at that P-value, say 0.04, and call any P-value of 0.04 or below a significant test, you're controlling your false discovery rate at 5%. Okay, so that makes sense. Okay, it makes sense to some people. Does everyone get what a false discovery rate is? Okay, and everyone gets that there's a step-down procedure to control it, called the Benjamini-Hochberg procedure. Okay, and everyone also gets that the false discovery rate that you calculate depends upon how many positive tests you say there are, right? Which makes sense, because the false discovery rate is the proportion of positive tests that are false positives. Does that make sense? Okay, and then the last thing that is important to get is that a false discovery rate of, say, 5% doesn't necessarily correspond to a P-value of 0.05. The P-value threshold that corresponds to a given false discovery rate threshold depends on all the other P-values that you calculate when you do this enrichment analysis. Okay, so what people will say, which you'll see sometimes in papers, is: we controlled the false discovery rate at 10%, and we used this P-value threshold to get that.
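The scan just described can be sketched in a few lines. This version walks the ascending list and keeps the largest p-value that beats its threshold, which is equivalent to stopping the first time a p-value passes on the way down the descending list; the example p-values are invented:

```python
def bh_threshold(pvals, alpha=0.05):
    """Benjamini-Hochberg: for the i-th smallest p-value (1-based, m tests),
    keep the largest p_(i) with p_(i) <= alpha * i / m. Calling everything
    at or below that p-value significant controls the FDR at alpha."""
    m = len(pvals)
    cutoff = 0.0
    for i, p in enumerate(sorted(pvals), start=1):
        if p <= alpha * i / m:
            cutoff = p
    return cutoff

pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.21, 0.6]
print(bh_threshold(pvals, alpha=0.05))
```

Here only the first two p-values pass, so thresholding at 0.008 controls the FDR at 5%; note that 0.039 fails even though it is below 0.05, which is exactly the point that an FDR of 5% does not correspond to a p-value of 0.05.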
And this procedure, the Benjamini-Hochberg procedure, is the one that they use to calculate what P-value threshold you would have to use to control the false discovery rate at a certain percent. Often people use 10%, yeah. I don't know why, we just see that a lot. So it's not less than, so this means, okay: once you find the threshold, everything below it is significant. Even the last one, yeah, yeah, yeah. If you get all the way down and nothing passes the threshold, you have no significant tests. A lot, yeah. I'm sorry to say that it happens quite a lot. Yeah. Exactly. Okay, so let me repeat the question. Sorry, let me repeat the question for the class. Okay, we want to control our false discovery rate at 5%, yeah? No, no. So what I said is, when people control a 10% false discovery rate, that's their level of significance. That corresponds to the level of significance for a P-value. So if you want to control the false discovery rate at 5%, you use 0.05 here, okay. But what that means is, if you do 10 tests, if you were to control what's called the family-wise error rate, you'd do the Bonferroni correction, which means that instead of testing against 0.05, you test against 0.05 divided by 10, right? Which is equivalent to taking your P-value, multiplying it by 10, and then testing against 0.05. Now, to figure out how to control the false discovery rate at 5%, you have to calculate these Q-values for every single one of your P-values. So you take each P-value, you rank it, and then you have a Q-value that decreases, and if at any point the P-value is less than this decreasing Q-value, that's the threshold that you use for significance, and that will control your false discovery rate at 5%. There's no easy formula; you have to sort all your P-values, yeah. But they're different things, right?
The family-wise error rate tells you the probability that any one of your positives, the tests that are significant, is a false positive. So the way that I typically like to see it reported in papers is: we controlled the false discovery rate at X% using this P-value threshold, which we calculated using the Benjamini-Hochberg step-down procedure. And then the individual tests. So what people sometimes do when they report a false discovery rate is they report the Q-value. And the Q-value that they report is slightly different from this. It's a little bit confusing, but there's a transformation you can do that says: okay, I've got this P-value, and I deem this test to be significant, because the P-value is below the threshold, but now what I'm going to report is the smallest level of significance at which this test would still be significant. There's a very easy way to calculate that, which I could tell you about after class or in the break, but that's usually what people report. And they call that the Q-value, yeah. So are you obligated to test all possible ontology terms? Because, yeah, it's easy to say, okay, well, I've got my top 10 ontology terms that I'm really interested in, and I'm going to test those and hopefully come up with fewer false positives that way. But of course, to be intellectually rigorous, you would then have to stop if you find nothing, right? If you find nothing at significance among those 10 ontology terms. So is there any way, other than sheer force of will, to resist the temptation to move further down and search for additional ontology terms at which significance would be achieved? So to answer your first question, I mean, one way that you can do this is to decide ahead of time what you're going to test. Right. Right.
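The "easy calculation" of a Q-value deferred to the break has a standard form, the Benjamini-Hochberg adjusted p-value, which may or may not be exactly the transformation the speaker had in mind: the Q-value of a test is the smallest FDR level at which that test would still be called significant. A sketch, with invented p-values:

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values ('Q-values' in the loose sense
    used here): for the i-th smallest p-value (1-based, m tests),
    q_i = min over j >= i of p_(j) * m / j, capped at 1. A test is
    significant at FDR level alpha exactly when its q-value is <= alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, carrying the running minimum
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q

print(bh_qvalues([0.039, 0.001, 0.21, 0.008]))
```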
And if, ahead of time, you say, I'm not going to test the whole Gene Ontology, you could do that. Right. You say, I'm only going to test the GO slim. This is something that Gary talked about this morning, where instead of 25,000 categories, you're looking at often fewer than 100 categories for a given organism. The other thing you can do is a kind of power analysis. You can say, look, I'm not going to test any Gene Ontology category that has fewer than N annotations. And as long as you're making these decisions before you look at the data, then you're in good shape. Right. You can also make these decisions based on the size of your gene list, for example. Right. As to whether you can look at 10, decide that you're not happy with the result, and then look at 100, that's something I'm going to have to think about a little bit, and it's going to take me about five minutes of thinking. So let's talk about it after the break. Sure. Yeah. Oh, one more comment. That strategy would probably be very useful if you use the Bonferroni correction, because that's so conservative. But with the FDR, if you have sets, because usually these sets in enrichment analysis are overlapping with each other, if you keep adding stuff that's quite redundant to what you have already found, and the proportion of true positives stays about the same, I don't think there should be such a huge penalty to just expanding the number of sets you're testing, if you use the Benjamini-Hochberg correction. So you're saying, if you expand your sets? Yeah, like instead of testing ten, you test all of them. And then you have about the same, I mean, unless you really dilute what should come out. Right. But say you don't really know that. Even if you test a lot of them, the FDR correction shouldn't be punitive just because you have done so many tests.
Because you have about the same fraction of true positives. Right. If you test a few of these, that's my own idea about it. So what Daniele is saying is, when you look at the Gene Ontology, there's a lot of overlap among the categories. Right. And so if you just test a random subset, or you test a much larger set of the categories in the Gene Ontology, those tests are all going to be basically testing the same annotations over and over again. So by expanding the number of annotations you're testing, you're not actually going to be changing very much, because the same proportion are going to end up being significant. I don't completely agree with that, but you know, that's your experience, and you have a lot more experience doing that type of thing than I do. But certainly what I find is that doing this kind of power calculation ahead of time is a very good thing to do. If you can say, look, I'm not going to look at any Gene Ontology category with fewer than 30 annotations, you actually cut out a lot of the Gene Ontology categories. I apologize, but how is that a power calculation exactly? Like, can you take that selection and actually convert it into, you know, now having excluded everything with fewer than 30 annotations, I now have 95% power to detect an XYZ effect? It seems more like a threshold than a true power calculation. That's right. I mean, I don't mean it's a true power calculation. I mean, it's inspired by the idea that if you have something with fewer than 30 annotations, you're not going to actually be able to detect any enrichment. Yeah. So the number of annotations that you can test totally depends on the size of your gene list, obviously, and on the power. If you have 100 genes in your gene list, you're not going to test 30,000 annotations. That doesn't make much sense, because randomly one of them will fit. Right. Yeah.
So how do you calculate how many annotations you should be testing for a given number of genes in your list? I guess that's one of my questions. Well, I'm not going to talk about that. You can do that calculation if you want. Yeah. OK. So any more questions? OK, great. All right. So: we need to correct the p-values for the fact that we're doing multiple tests. There are two types of corrections: the Bonferroni correction, which controls the family-wise error rate, and the Benjamini-Hochberg procedure, which controls the FDR, and that's the expected proportion of hits that are due to random chance. And you can control the stringency of that number. You can improve your chances of getting significant hits by carefully choosing which annotation categories to test, and that's because both of these corrections depend upon the number of tests that you do. So if you have a good way of filtering tests before you look at your gene list, you should use it. OK. And so the last thing I'm going to talk about is statistical tests for over-enrichment analysis that are based on gene rankings or gene scores. And so I have somewhat limited time, so I'm going to skip the section that says, why can't I use the t-test? It should be pretty obvious once I start explaining these other tests that I'm going to talk about. So, yeah. Oh, I'm actually going to talk about it. Oh, do I? OK, so I've got until 2:45? OK, great. Good. I'll talk about the t-test then. I'm afraid I'm not going to be able to do a power calculation at the board, without any notes in front of me, in five minutes. But if anyone could do that, I'd be very happy to see it. I don't know anyone who could. So why can't you use the t-test? OK, well, let's just take a step back. So what I've been talking about up to now is what you do when you have a gene list, and I talked about doing multiple test correction.
You still have to do multiple test corrections if you're looking at gene rankings instead of a gene list. But sometimes you can assign values to every gene that you're interested in; say you're looking at the degree of over-expression when you knock down a particular microRNA, or whatever. In that case, you might want to make use of the fact that you're actually able to measure some sort of quantitative or semi-quantitative value and associate it with each one of your genes or proteins. And often that type of information can give you a lot more than just choosing an arbitrary threshold to call a gene list. I'm going to call those things scores, so I can talk about them in a very abstract way. And there's two things you can do with scores. You can use the scores to rank the genes from highest to lowest. Or you can look at the scores themselves and ask: is the distribution of scores different for genes that have the annotation I'm interested in versus those that don't? Now, everybody should know that if you're trying to compare the distributions of scores for two different types of genes, the standard thing to use is a t-test. In this case, almost all the time you can't use a t-test, because it makes assumptions that aren't going to be true of these gene scores. So there are actually two standard tests, which have been around for a very long time, for dealing with this case. One I call the Wilcoxon-Mann-Whitney test. There are actually four names for this test: the others are the Mann-Whitney-Wilcoxon, the Wilcoxon rank-sum, and the Mann-Whitney U statistic, so you can imagine what happened there. The other test is the Kolmogorov-Smirnov test. And here is the Wilcoxon-Mann-Whitney, or WMW. 
This is basically just a robust t-test; that's what it comes down to. The Kolmogorov-Smirnov test tests for arbitrary differences between gene score distributions. And what Daniela is going to talk about is the GSEA test, which is another rank-based, well, actually score-based test that's very similar to the Kolmogorov-Smirnov test. OK. So as I said before, if you have scores, or values, or log ratios for genes of type A that don't have the annotation, and genes of type B that do have the annotation, one obvious thing to do is to test for differences between those two distributions. And typically what people use is the t-statistic, which is just a corrected difference of the means of the two distributions. You can calculate the t-statistic this way: you compute it, you get some value, and you test it under the t-distribution by calculating the cumulative area up until that point, and that gives you a p-value, right? Now, you can't use the t-statistic here. Why not? Well, the first thing is that it assumes the distributions are approximately Gaussian, and that's almost never true. Even for log ratios from microarrays, it's almost never true that you're going to see two Gaussian distributions. If you use the t-statistic on the wrong types of distributions, it becomes invalid, and when it becomes invalid, the p-values that you're calculating are actually wrong. So you could have a false positive test that you assign a very low p-value to, simply because of some property that's not correctly modeled in your assumptions. And the second thing is that, essentially, the t-statistic is a test for a significant difference in the means of the two distributions, and you might be asking about other differences between the two distributions. 
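To make the formula concrete, here is a minimal sketch of the (Welch) t-statistic described above: the difference of the means divided by its standard error. The score values below are invented for illustration, and this is the statistic whose p-value you would read off the t-distribution; it is not an endorsement of using it on gene scores.

```python
import math

def welch_t(xs, ys):
    """Welch t-statistic: difference of means over the combined standard error."""
    nx, ny = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / nx, sum(ys) / ny
    var_x = sum((x - mean_x) ** 2 for x in xs) / (nx - 1)  # sample variance
    var_y = sum((y - mean_y) ** 2 for y in ys) / (ny - 1)
    return (mean_x - mean_y) / math.sqrt(var_x / nx + var_y / ny)

# hypothetical log ratios for annotated vs. unannotated genes
annotated = [1.2, 0.8, 1.5, 0.9, 1.1]
unannotated = [0.1, -0.3, 0.2, 0.0, -0.1, 0.3]
print(welch_t(annotated, unannotated))
```

The weakness the lecture is pointing at lives in the `mean_x - mean_y` numerator: two very different distributions with equal means give a t-statistic near zero.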
It's going to become a bit clearer when I show you some pictures on the next slide. So this is the case that I was worried about initially. The black curve is the distribution of scores assigned to the genes with the annotation, and this other one is the distribution of scores assigned to genes without the annotation. You can see one of them looks kind of like a bell, and the other one looks like a bell with a stick hanging off of it. That's not really a good thing to use the t-test on. Now, another thing you could be looking for is, say you're measuring log ratios and you expect genes with a given annotation to be either much more highly expressed in the condition you're querying, or expressed at a much lower level than normal. There, what you would expect to see is a bimodal distribution of gene scores, where the gene score here is the log ratio, whereas the things that don't have the annotation have a nice distribution that looks like that. Well, the t-test doesn't test for things like that, because the mean of this bimodal distribution is right here, right over top of the mean of the distribution of the things that don't have the annotation. You have to use a different statistical test to test for these types of differences. So, say for example you're comparing the fasted state and the overfed state to the freely feeding state in a eukaryotic organism; is that an example of where you might be looking for that kind of bimodal distribution? There would be some genes that are very under-expressed in fasting relative to freely feeding, and a different set, perhaps, that would be very much over-expressed in overfeeding, whereas your freely feeding group would be your normal distribution. I'm trying to get a handle on other biological situations where you encounter that kind of pattern. Well, actually what I'm talking about is comparing two conditions. 
So let's say freely feeding versus fasting. And then you might be asking the question: are there sets of genes that both have higher expression in the fasting condition and have lower expression in the fasting condition? It's difficult for me to imagine genes that would have both those effects, but maybe you could find something like that. Let me think of something that's a little more concrete; I'm trying to be very abstract about it, so I don't want to over-define it. We're talking here about, say, log ratio: a change in gene expression, as measured by log ratio, between two different conditions that you could be running microarrays on. Multiple testing corrections do still matter, but here we're talking about a different case: instead of defining a gene set, we look at these log ratios directly and ask questions about the log ratios. OK. Do you have a sense of what I'm talking about, things that might be over-expressed and under-expressed in the fasting condition? Let's say you're comparing tumor versus normal, and you want to find genes that change their expression in the tumor. You might imagine there are some sets of genes that are over-expressed and some sets that are under-expressed. And in some cases, the annotation you're looking for is itself associated with a cancer. So say you're looking at cancer type A; we're looking at prostate cancer. We have a set of genes that have already been identified as associated with breast cancer, and I want to find out whether my prostate cancer analysis is recovering genes that have already been associated with breast cancer. In that case, you might expect to see a bimodal distribution for the genes previously associated with breast cancer versus those not. But in this case, to generate the log ratio, don't you still need to do the comparison to the reference tissue? 
Isn't your log ratio, doesn't that one number already collapse two states into one? Because you've already got, say, eight-fold over-expressed in tumor compared to reference tissue. So I still don't see how that splits into two distributions. Well, there's eight-fold over-expressed and eight-fold under-expressed. So this is zero on the log ratio axis, right? And say you're doing log base two ratios; this is three and this is minus three. But what's your normal tissue? Does it just have an arbitrary expression right in the middle at zero because you're setting it as the reference, or not? No, no, this is a distribution of log ratios associated with genes with the annotation that you're testing and those without the annotation you're testing. Yeah, sorry. Okay, let's get back to your question. All right, so say you don't want to define a gene set. If I looked at this data and I was defining gene sets, I'd define this to be gene set number one, I'd define this to be gene set number two, and then I might actually have a gene set that contains both these modes. All right, but say in this case you didn't want to define a gene set; you wanted to use all the data that you extracted from your microarray. You wouldn't use thresholds, but you would still have to have some test that would tell you whether or not you're seeing differences between the score distributions, the log ratio distributions, of the genes with that annotation, so genes that had been previously associated with breast cancer, and the genes without that annotation, that had not been previously associated with breast cancer. So Fisher's exact test is a test you can use for gene lists, and this is a test you can use for gene scores. Okay, question answered? Okay, all right. And say you're doing some proteomics and you're looking at spectral counts or peptide counts, for example. In that case, your gene scores are just the peptide counts, or the spectra that mapped to that peptide, and those scores are positive. 
And often the distributions of those scores start at zero and kind of go down, or they're very close to zero with a long tail, so they look like Poisson-type things. In those cases, the distributions will never really look Gaussian unless you have a whole lot of peptide counts, so there too you can't use the t-test. Okay, right, so the two tests are the Wilcoxon-Mann-Whitney and the Kolmogorov-Smirnov, plus GSEA, which is basically a Kolmogorov-Smirnov-type test. Okay, so let's talk about the Wilcoxon-Mann-Whitney test. I've explained here how to calculate it; I don't necessarily want you to know all of this, but it's all in your slides if you're mathematically oriented. The basic idea is that the Wilcoxon-Mann-Whitney test, more or less, is just a t-test on the ranks of the scores. What you do is you take all the genes in your assay, you sort them according to their scores, from the largest score to the smallest, and then you compare the ranks of the genes that have your annotation, which are the black genes, versus the ranks of the genes that don't have your annotation. And you run a t-test on those two sets of ranks, right? And that's valid as long as both of these sets are relatively large. In actuality, what you do is you calculate what's called a rank sum. So here are all the black genes, and you rank them. 
So the highest black gene has a rank of one, the second highest black gene has a rank of three, this one has a rank of four, and so forth. You add up all the ranks of the black genes, in this case the ones with the annotation, and that gives you a rank sum of 21. Then you calculate a Z-score associated with that, which is just its difference from what you would expect the rank sum to be under random conditions, and you evaluate that under a t-distribution, or in this case a normal distribution, because your degrees of freedom are high enough. So it's basically a t-test on ranks. Now, what I've described so far is only valid when there are no tied scores; but if you're calculating a Wilcoxon-Mann-Whitney, you're going to use software, and all the software does a tied-rank correction. Yeah. Exactly, I'm just telling you how to do it for one annotation; you do the same thing that you would do for Fisher's exact test. You calculate the Wilcoxon-Mann-Whitney p-value for every single one of your annotations, and then you do the multiple test correction as before. The only thing you're changing here is you're going from Fisher's exact test to the Wilcoxon-Mann-Whitney. Yeah. The gene list versus the gene scores? Yeah. If you have gene scores, I would use a rank-based method; that's what I would always use if you're able to get gene scores. And sometimes you don't have that; say you do a pull-down. Yeah. Let me add to this. There's a paper that came out showing that if you threshold your ranked list at different cutoffs, your results in the enrichment analysis might change, and sometimes the significance changes. So the problem is, if you turn your ranked list into a gene list, you can do it in many different ways, and depending on where you set your threshold, you might get different results. 
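The rank-sum arithmetic above can be sketched directly. This is a minimal illustration assuming no tied scores (real software applies the tie correction just mentioned). The ten-gene example is hypothetical: the slide only states ranks 1, 3, 4 and a rank sum of 21, so the remaining ranks and the total gene count here are invented to match.

```python
import math

def rank_sum_z(annotated_ranks, n_total):
    """Z-score for the Wilcoxon rank sum of the annotated genes.

    Assumes no tied scores. Under the null, the rank sum W has a known
    mean and variance, so (W - mean) / sd is approximately standard normal
    when both groups are reasonably large.
    """
    n1 = len(annotated_ranks)
    n2 = n_total - n1
    w = sum(annotated_ranks)
    mean_w = n1 * (n_total + 1) / 2                     # expected rank sum
    sd_w = math.sqrt(n1 * n2 * (n_total + 1) / 12)
    return (w - mean_w) / sd_w

# hypothetical: 10 genes total, annotated genes at ranks 1, 3, 4, 6 and 7
ranks = [1, 3, 4, 6, 7]
print(sum(ranks))                 # rank sum of 21, as in the slide
print(rank_sum_z(ranks, 10))
```

A negative Z-score here means the annotated genes sit higher in the ranking than expected by chance; the two-sided p-value then comes from the normal distribution.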
That's one of the main reasons why the preferred approach is the rank-based test, which uses all the scores and all the ranks; it's a problem for the robustness of significant results. Right, okay. So what Daniela was saying is that if you have gene scores and you set a threshold to define the gene set or gene list that you're going to start with, the enrichment that you see depends very much on how you set that threshold. So that's another good reason. In addition to the fact that it's often more likely you'll see enrichment if you use a rank-based method, your results are also more reproducible, because you're not making this arbitrary choice of threshold. Okay. So are you just comparing, are you saying: here's my list of genes that are over-expressed, and the degree to which they're over-expressed is significant compared to whatever the others are? No, because you're talking about annotations, right? Not genes. So you've got your list of genes, and every gene is associated with some score, say a log ratio, which is positive or negative. And then, well, if you're going to do WMW, you've got to define directionality. So you can say: this annotation is enriched among highly-expressed or over-expressed genes at this p-value. And if you're doing it in the other direction, you're saying: this annotation is enriched in genes that decrease their expression, and you associate a p-value with that. Okay, so you're not measuring how over-expressed they are; there's a certain amount of over-expression that you'd just expect, versus real over-expression. Like, if I see it a little higher, say it's expressed at 105%, is there a point at which you say that's a significant thing? 
So what this corresponds to is a test for the significance of a difference in medians between your two distributions. If you want to actually report a difference in the degree of over-expression, you can report the difference between your two medians, for example. And some people actually show this; there's a way of calculating a confidence interval on your medians, which I'm not going to talk about here, but I can talk about afterwards. If you really want to do that, you can report those confidence intervals. So you would use it much the way you'd use a t-test: with a t-test, you say there's a significant difference in the means, you report a p-value for that, and you can also report confidence intervals on what that mean is. For this, you would be reporting the significance of a difference in medians, which makes it kind of a robust t-test. Does that answer your question? Okay, okay. Exactly, yeah. So now there's the Kolmogorov-Smirnov test. In the previous test, you could just throw away what the gene score was, because it only depended on the rank. You can no longer do that in the Kolmogorov-Smirnov test. What you do in the Kolmogorov-Smirnov test is you calculate cumulative distributions, and then you compare the difference between those two distributions. So you're asking: is there a point along this gene score axis where I see a lot more of the genes with the given annotation versus those without that annotation? And you can calculate that by measuring the difference between the two cumulative distributions at each point, okay. 
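The confidence interval calculation on the medians is deferred in the lecture. One common way to get one, not necessarily the method the speaker has in mind, is a bootstrap percentile interval on the difference of medians; the score values below are invented for illustration.

```python
import random

def median(values):
    """Middle value (or mean of the two middle values) of a list."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def median_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and bootstrap percentile CI for median(a) - median(b)."""
    rng = random.Random(seed)
    # resample each group with replacement, recompute the median difference
    diffs = sorted(
        median([rng.choice(a) for _ in a]) - median([rng.choice(b) for _ in b])
        for _ in range(n_boot)
    )
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return median(a) - median(b), (lo, hi)

annotated = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3]      # hypothetical gene scores
unannotated = [0.1, -0.3, 0.2, 0.0, -0.1, 0.3]
print(median_diff_ci(annotated, unannotated))
```

So, just as one reports a mean plus its confidence interval alongside a t-test, one can report a median difference plus an interval like this alongside the WMW p-value.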
Right, and so the formal question that you're asking is: is the largest difference between these empirical cumulative distribution functions statistically significant or not? Okay, so there's the slide. Now let me explain what this is about. Do people know what a cumulative distribution function is? Okay, all right, good. Now I'm gonna go to the board. Can I raise this? Yeah, I'm gonna go to the wall. Okay. So say this is the histogram right here; everyone knows what a histogram is. And I'm gonna call it probability density, so I can draw smooth lines instead of boxy ones. Okay. The way that you calculate a cumulative distribution function is, for each point as you go along increasing gene score, you sum all the mass up until that point. So what do I mean by that? Imagine these were a bunch of boxes. So there's one bar, two bars, three bars, four bars and five bars. Go to the first bar; let's say that covers 0.2 of the distribution. So, I actually took out the slide that explained this, because I assumed everybody knew what a cumulative distribution function was, so here we go. Okay, all right. I'm just gonna show you one histogram right now, and then I'm gonna show you how to turn that histogram into a cumulative distribution function, okay? All right. So say this is your histogram, and on the x-axis here, these are just different bins of gene scores; let's say these are log ratios. Can everybody see? Yeah, okay. All right, and then this is frequency, okay? The highest a frequency could be is 100%, and that's gonna be way up there. And since these bars all have to add up to 100%, otherwise this wouldn't be frequency, they're each gonna be below 100%. So let's just say this is 20% and then this is 40%. 
Right, and we can do the addition ourselves: 20% plus 40% is 60%, plus 20% plus 20% is 100%, right? So what a cumulative distribution function is, and I'm gonna put it on the same axes so it's high enough that everyone can see it, is the sum of the frequency up until, and including, that point. So the first point in the cumulative distribution function is there, because the sum of the frequency up until minus one is 20%. The second point in the cumulative distribution function is here, because it's 20% plus 40%, so this is gonna be 60%, right? And I didn't draw this axis high enough, so I'm gonna run out of room; you can imagine that I'm just gonna be shrinking it as I go up. So the first point in my cumulative distribution function is 20%, the second one is 60%, I add 20% more when I get here, so now we're at 80%, and then here at the end I'm at 100%. Now that's a cumulative distribution function. Okay, and so here are the histograms, but they're smooth instead of boxy; that's fine, there's nothing wrong with that, it just means we have a lot of really, really small boxes. And then these are the corresponding cumulative distribution functions. So this is for the red one: you can see, as you go up in gene score, there's not gonna be anything here, so this is gonna be 0, 0, 0, 0, and then when you get to this lump here, your cumulative distribution function is gonna go up really quickly, because you'll be adding more and more, and the amount that you're adding is gonna keep increasing until you get to this point, and then the amount that you're adding is gonna slow down. Once you're beyond the lump, you've basically got to 100%, because there's nothing left over here. So that's what this shape looks like, right? 
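The board example is just a running sum. Here is the same arithmetic in code, with the same made-up bin frequencies (20%, 40%, 20%, 20%):

```python
def cumulative(bin_frequencies):
    """Running sum of histogram bin frequencies: the cumulative distribution."""
    total = 0
    out = []
    for f in bin_frequencies:
        total += f
        out.append(total)
    return out

print(cumulative([20, 40, 20, 20]))  # -> [20, 60, 80, 100], the four board points
```

The last entry is always 100, which is the statement that the bars of a frequency histogram must sum to 100%.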
Now, this shape here, where you have a bimodal distribution: you get to lump number one and you go up really quickly. About half of the probability mass is in this lump and the other half is in this lump, so this adds up to 50% of the frequency, and this is the other 50%. So when you get to the gene score that's right about here, you're at about 50%, and when you get all the way to the other side, you finally reach 100%, okay? So now, all the Kolmogorov-Smirnov test does is it asks: what's the largest difference between these two cumulative distribution functions? That's just a test statistic that it assigns a probability to. Two distributions that are very different from each other are gonna have very different cumulative distribution functions, so the largest difference between them is something you can calculate a null distribution for and assign a p-value to. But what it's actually measuring is whether or not these are different distributions, not whether one distribution lies to the right or the left of the other. Are there any questions about that? Okay. Now, both of these tests are less powerful than the t-test, which means they require more data points to detect the same amount of difference. The reason is that the t-test makes more assumptions about the types of distributions it's measuring than these two tests do, and if you make fewer assumptions, your tests are less powerful. So if you can run a t-test on your distributions, you should, because it's more likely that your p-value is gonna be smaller; but in most cases you can't run a t-test, so you have to use one of these two. Okay. Now, they give you different answers, right? 
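A toy version of that largest-difference statistic (the KS D statistic) can be computed directly from two samples. The p-value step, comparing D against its null distribution, is left out here, and the sample values are invented.

```python
def ecdf(sample, x):
    """Empirical CDF: fraction of sample values less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    grid = sorted(set(a) | set(b))  # the gap can only change at data points
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in grid)

scores_with = [1.0, 2.0, 3.0, 4.0]       # hypothetical annotated-gene scores
scores_without = [3.0, 4.0, 5.0, 6.0]
print(ks_statistic(scores_with, scores_without))  # -> 0.5
```

Because the absolute gap is taken, the statistic is large for any kind of difference between the distributions, including the bimodal case where the means (and the t-statistic) tell you nothing.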
So, as I said, the WMW test is mostly like a robust t-test. It tells you about the significance of a difference in medians, so it tells you if one of the distributions is translated, if it's larger than the other: if, in general, genes with a given annotation have a higher expression level or a lower expression level. Whereas the KS test is just gonna tell you whether or not the distributions of gene scores for these two sets of genes are different, and that's a very different thing, right? Okay. And this is a very rare problem that I wouldn't worry about, but I like to be complete about things: if you have a lot of gene scores that are the same and you have a small number of observations, some people don't implement this test correctly. So just a comment: if you use a statistical package like R or something, you can be confident that they've implemented the test correctly. If your friend down the street has some web interface that he's just made, and he says there's a WMW test in it, there's a small chance that he didn't implement it correctly, because the implementation is actually hard in these cases. Okay, so now we can go back and look. Yeah. If you're not convinced that it's Gaussian, don't use the t-test. I shouldn't have said sometimes use the t-test; I would never use the t-test, because why be wrong, right? But if other people have previously used the t-test on the same data, use the t-test, yeah. There are ways of telling whether things are Gaussian or not, which I can tell you about after class, okay. All right, so let's go back to the three distributions that we started looking at. One is the one where you just have a long tail, so this is obviously non-Gaussian; here you can use any of these three tests. 
In this case, I would only recommend the Kolmogorov-Smirnov test or the GSEA test, because it's not like one distribution is translated with respect to the other; they both have about the same mean, but they're certainly very different. And in this last case, you could again use any one of the three tests, okay. All right, so what have we learned? Well, we learned that the t-test is not valid most of the time, almost all the time, with functional genomics data, and you shouldn't use it, especially when one or both of the score distributions is not normal. Actually, you know what, I'm going to answer your question. Cases in which you can use the t-test are ones in which you have a fairly balanced annotation and you have some well-behaved functional genomics data, like microarray log ratios. In those cases, you'll tend to see more Gaussianity, because if you're querying 20,000 genes, you have 10,000 of type A and 10,000 of type B. There might be problems with outliers, but a lot of the time those problems are going to disappear, and most of the distribution is gonna look Gaussian. The reason I said don't use a t-test is that a lot of the time, when you're looking at gene attributes, these attributes are only true for a very small proportion of the genes, and in that case, the scores of the genes that have that annotation are rarely gonna look Gaussian. And you need both of your distributions to look Gaussian to make a t-test work nicely. It's not worth it, right, yeah. I mean, I think you could probably get away with that case; again, I would look at the distributions first before you did anything. And again, if you only have positive data, you're back in the situation where it's probably not gonna be very Gaussian. So, yeah. Okay: so this robust t-test, the WMW test, is a test for a difference in medians, and to test for an overall difference between two distributions, use the KS test or the GSEA test. 
Okay, now, just to be complete, there are a couple of other things that you might see. You might see the chi-square test; these are tests that are done on contingency tables. When you're doing Fisher's exact test, you're doing it on what's called a two-by-two contingency table, and Fisher's test is exact for the two-by-two contingency table. The chi-square test makes an approximation, so I would only use it for larger contingency tables, because there's not really an easy-to-calculate version of Fisher's exact test for the larger ones. And sometimes you also see people do a binomial test in cases where you would normally use Fisher's exact test; again, this is a way of approximating it. All right, so if you see either of those two things, essentially what they're doing is a Fisher's exact test. Yeah, but we're talking about, so for example, can anyone think of an example where an attribute would have more than two values? Well, you could use the number of target sites for a given microRNA on the gene: zero, one, or more than one. That's an attribute that has more than two values, and you might want to distinguish between those three different categories. Daniela? Another case might be if you have a property like localization, where a protein can be found in several compartments. Okay, so localization to different cellular compartments, right? Like nucleus, cytosol, and some sort of membrane, or let's say ER. So those are the cases where you might have attributes that are not binary-valued but multi-valued. Everything we're showing you today is gonna be binary, and usually, yes, you're right, people do have a binary situation. Any more questions? All right, so now we have a short coffee break, and then Daniela is gonna take over at three o'clock, so.