My name's Quaid Morris. I'm an associate professor at the University of Toronto, at the Donnelly Centre. My expertise is computational biology. My training was in machine learning and neuroscience. Right now I work on a number of things, but largely I work on post-transcriptional regulation and RNA-binding proteins. So today, please feel free to stop me during my presentation if you have any questions, if anything comes up; I like things to be very interactive, and that's worked well in the past when I've given this type of lecture. You can see the Creative Commons notice there. What I'm going to be talking about today is finding over-represented pathways in gene lists. The idea here is that you have a gene list, and you want to find out whether or not some predefined pathway, or set of genes, is over-represented in that gene list. There's the Donnelly Centre; that's where I work. Do I have a laser pointer? Yeah. That's my office right there, in case you're wondering. OK. So again, the outline is: I'll give you a short introduction to what enrichment analysis is, and there we're just going to define terms. I'm going to be talking about a gene list and a gene set; those are both lists of genes, and I just want to distinguish between them. Then I'm going to give you a quick demo of DAVID, just to motivate what we're going to be talking about for the rest of the class. And then I'm going to give you the theory of enrichment analysis, to answer a lot of the questions you'll have whenever you see any of these enrichment analysis tools. Now, there are a lot of different tools for doing enrichment analysis, so what I'm going to try to give you is a general theory covering the things you'll encounter in these various tools. The reason that we teach DAVID is that all the enrichment analysis tools have different advantages and disadvantages.
And for us, right now, DAVID is still the one that's surprisingly easiest to use and has all the features that we think remain important for enrichment analysis. But DAVID hasn't been updated since 2010, and going forward there are going to be new and different tools that you'll have to evaluate by yourself. So the theory here is just to tell you the statistics underlying these things, and to let you identify the right options to use among the hundreds of different tools that are available. The workhorse of enrichment analysis is called the hypergeometric test; you'll sometimes hear it called Fisher's exact test. The binomial test is actually an approximation to the hypergeometric test that's useful when you have long gene lists. I'll also be talking about multiple test correction. This is very important when you're testing your gene list, or your results, against a whole bunch of different gene sets that come from, say, GO, or from lists of genes associated with diseases. And then I'll talk briefly about other types of enrichment tests. The one people are using most often these days is called GSEA, Gene Set Enrichment Analysis, which is a bit of a confusing name because the whole field is sometimes called gene set enrichment analysis. But this is the tool from the Broad Institute. And then the lab we have, as I said, is just going to use DAVID, because we think that's the tool that at this point is easiest to teach, and it's a tool that I still use in my lab's research. So even though it hasn't been updated, I think it's still a worthy tool. OK, so what's an enrichment test? I'm sure everybody is already familiar with this, but let's just go through it to make sure we get all the terminology correct. For the last five or six years, we've always talked about this using a microarray experiment.
Now the world has changed a little bit: people have RNA-seq instead of microarrays, or they have ChIP-seq to give them a list of genes that might be associated with their peaks. But nonetheless, what you have is a set of genes and some value associated with those genes. I'm going to call the data that you've generated the gene list. Then you take your gene list and you compare it against a set of databases that contain gene sets. Gene sets are things that have been predefined by somebody else. You want to test your gene list to see if it's enriched for a gene set. Now, this gene list could in fact be a gene list, that is, just a set of genes, sorry, a list of genes that you've come up with. Or it could be a ranked list of genes, where they're ranked by their expression level, or by the relative expression level between the two conditions that you're testing; maybe they're ranked by the number of peaks, or the height of the peaks, in a ChIP-seq experiment. OK. Based on this enrichment test, you come up with an enrichment table where the rows are gene sets. In this case those are genes or proteins that are supposed to associate with the spindle, and genes or proteins that are associated with the biological process apoptosis, along with a p-value that is some measure of the enrichment of that set in your list. OK. So, formally: given a gene list, and given these gene sets, which could come from Gene Ontology, which I guess Gary talked to you about yesterday, or could be transcription factor binding sites in promoters, which Wyatt talked to you about this morning and yesterday, the question you want to ask is: are any of these gene annotations, or sets, surprisingly enriched in the gene list? The details are: where do these gene lists come from, which I'm going to talk about; how to assess "surprising"; and how to correct for repeating the test.
Because you'll be testing this list over and over again with different sets of genes, right? And if you do that, you're going to find something that's significant; you've got to correct for that fact. OK. So, your standard two-class design: let's say you have some case and some control. You have expression levels that you've measured under these two conditions, and then you rank the genes by some differential statistic. This could be a simple ratio; often it's the result of some sort of t-test, or some t-test-like statistic. Then you have some genes that are up-regulated and some genes that are down-regulated, and you select based on a threshold: OK, I'm going to look at the genes that are up-regulated, or the genes that are down-regulated, or the set of genes whose regulation has changed between the case and the control. There are various ways to do that. We're not really going to go over them, but I'm happy to answer specific questions you might have after the lecture, in the lab, because I have some expertise in these areas. The other way to do it is a time-course-like design. You have an expression matrix, and you cluster it, so that each row of the expression matrix corresponds to a gene and the columns correspond to time points; each row is the expression profile of a gene over time. You group these profiles over time into various different groups: a set of genes that are being up-regulated, a set of genes that are being down-regulated, a set of genes that maintain the same level and then suddenly drop. You cluster these things into groups using algorithms like K-means or K-medoids; there are various clustering algorithms that you can use. These clusters give you the gene lists, and then you want to say whether there's anything special about the lists that fall into these clusters.
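As a minimal sketch of the two-class design just described, assuming expression is held in genes-by-samples NumPy arrays (all gene names and values here are invented for illustration), you can run one t-test per gene and rank by the statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
genes = [f"gene{i}" for i in range(6)]
case = rng.normal(loc=5.0, scale=1.0, size=(6, 4))     # 6 genes x 4 case samples
control = rng.normal(loc=5.0, scale=1.0, size=(6, 4))  # 6 genes x 4 control samples
case[0] += 8.0                                         # make gene0 clearly up-regulated

# One two-sample t-test per gene (per row), giving a differential statistic.
t, p = stats.ttest_ind(case, control, axis=1)

# Rank genes from most up-regulated to most down-regulated by the t statistic.
ranked = sorted(zip(genes, t), key=lambda gt: gt[1], reverse=True)
top_gene = ranked[0][0]
print(top_gene)
```

Thresholding the top or bottom of this ranking is then what produces the up- or down-regulated gene list.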
And those are kind of the standard ways that people generate gene lists. Sorry, questions? Yeah? [Audience question.] Yeah, so the question is what you would do if you wanted to assign a rank to a gene based on its membership, or non-membership, in a cluster. And your intuition about how to do it is, I think, the way that I would do it. I've never actually done this myself, but a way that might work well is to say: OK, some clustering algorithms, like K-means, define what's called a centroid, the mean of the cluster, and then you can rank genes by how far away they are from that mean, using Euclidean distance or whatever distance metric you want to use. So it would be like what you mentioned: some measure of how close they are, or how well they're clustered, in that cluster. And even with a ranked list of genes, you don't have to be able to assign a rank to every gene. So in this case, you could just rank the genes that are near the cluster and assign the same large rank to all the genes that are roughly equally far away from it. Does that make sense? OK. All right. So this is what we're going to talk about today. We have a gene list; let's say the genes are up-regulated. Then we're going to have a background, and we're going to have a gene-set database. I'm going to spend some time talking about the background. I'm going to start talking about it now, and I'm going to keep coming back to it throughout my lecture, because it has become exceptionally important now that people are working with things like RNA-seq, or with mass spec and protein assays. You have to define this background to measure enrichment against. In the days of microarrays, this was a lot easier, because the background would simply be all the genes on the microarray, since those are all the genes that you would be able to detect.
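The centroid-distance idea I just suggested can be sketched like this (toy expression values and hypothetical gene names; ties and metric choice are up to you):

```python
import numpy as np

# Toy expression matrix: one row per gene, one column per time point.
expr = np.array([
    [1.0, 2.0, 3.0],   # gene A
    [1.1, 2.1, 2.9],   # gene B (profile close to A's)
    [5.0, 1.0, 0.0],   # gene C (very different profile)
])
genes = ["A", "B", "C"]

centroid = expr[:2].mean(axis=0)             # centroid of the cluster {A, B}
dist = np.linalg.norm(expr - centroid, axis=1)  # Euclidean distance to centroid
order = [genes[i] for i in np.argsort(dist)]    # nearest-to-centroid first
print(order)
```

Genes far from every centroid could all be given the same large rank, as mentioned above.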
But now, with the newer technologies, it's not clear which genes are detectable and which genes aren't. For example, when you do RNA-seq, some genes are much less detectable because they contain repeat sequence, so when you map the RNA-seq reads, that repeat sequence is essentially ignored in the mapping. Even if those genes are being expressed at a high level, you may not get as many reads mapping to them, because of the fact that they contain a lot of repetitive sequence. So they're simply less detectable. Let me take a step back; I went a little deeper than I wanted to. Let me say more specifically what I'm talking about here. When you do gene-set enrichment analysis, you have to define the background, and the best way to interpret it is as the set of genes that you could have selected the gene list from, right? Because what's going to happen is that your null hypothesis is going to be that the gene list is a random set drawn from the background, right? Now, in the early days of microarrays, there were things like the ImmunoChip, which was a microarray that had just immune-associated genes on it. Any random subset of genes from this microarray would be enriched for immune function, right? So it shouldn't be surprising at all that a gene list you get using this microarray is going to be enriched for immune function, right?
Similarly, if you're doing RNA-seq on just cells from the immune system, say you have white blood cells and you're extracting RNA from them, then no matter what test you do on those cells against the background of those cells (say you're stimulating the immune system in some way and you want to see the impact of the stimulation), you're going to get some enrichment for immune function, even if you're just randomly subsetting from the genes that are expressed in white blood cells, right? So it's very important to define this background appropriately, OK? Does that make sense? If it's not clear, I'm going to try to make it clear throughout my talk, but this is a very important point these days. OK, so here's the gene set. The gene set overlaps with both the gene list and the background, and the question is how significant that overlap is, right? What the enrichment test asks is: if you look at the size of the overlap between the gene set and the gene list, what is the probability that you would get an overlap at least as large as that if you were just randomly sampling genes from the background? So if you put all the genes in a big pot and pulled out an equal-size sample, and you did this again and again, thousands of times, how often would you see an overlap with the gene set as large as the one you see for the gene list that you're testing? That probability is called the p-value. Does that make sense? [Audience: So it's kind of a permutation test?] Exactly. I mean, they're all permutation tests; essentially what's underlying everything in statistics is some sort of permutation test, or some sort of null hypothesis. The reason the various tests, like the hypergeometric test and the binomial test, have names is that there's an analytical way of doing the computation.
There's some equation you can use to compute what would happen if you did this an infinite number of times, right? But underlying that is this basic idea of random sampling; permutation is one way of doing the random sampling in this case. [Audience: In these programs, do they actually come up with how many times you need to do it, depending on your number of genes?] So, are you talking about GSEA or something? We can talk about this, certainly. Let's talk about it afterwards; can you bring this up again when we talk about GSEA? Yeah, OK. OK, all right. Oh, I guess now it's the DAVID demo. All right, OK. Here's DAVID. I find DAVID by typing "david functional" into Google, because I can never remember the URL. OK, and this is what it looks like at first. So I went to the wiki; here's the gene list. I just clicked through it, I get this list, I'm going to select the whole thing and go back to DAVID. Make DAVID big. OK. So this is what you get when you go to DAVID. OK, how long do I have, an hour and 15 minutes? OK, yeah, so we're going to go through this in the lab; I'm just showing it to you to motivate your interest in the rest of the talk. All right, so here's DAVID. I'm going to paste my list in. Oh, that didn't work. There we go, there's my list. I don't know how much you took away from Gary's talk, but this is a list of Affymetrix probe identifiers, and I know that because they end with "_at". OK, so I paste the list into DAVID. Now what I've got to tell DAVID is where this list came from, how to interpret these identifiers. The nice thing about DAVID is that even if you get it wrong and tell it the wrong type of identifier, it's going to try to figure that out for you. These are actually identifiers from this array right here. Let's see what happens if we give it the wrong array.
SNP ID; I bet that screwed it up. OK, and then you just have to tell DAVID what type of list it is. This is the gene list; it's not a background, right? And I'm going to submit the list, and DAVID's going to be upset with me. Yeah, OK, all right, so I'll just go to the conversion tool in this case. Oh, it's taking a bit longer than I would expect; usually it doesn't take this long at all. So here's DAVID. DAVID's beautiful, right? So this is a conversion summary. It's not sure what type of gene ID it is, but this is its best guess, and it's right, right? So I could just tell it to convert all the genes. You can go gene by gene and tell DAVID what the source is; this is its guess, but I'm just going to convert it all. The reason this happens is, as Gary talked about, there are all these different identifiers, and it's not always clear where the identifiers come from. DAVID does this automatically. If you use other tools, sometimes you have to do the identifier mapping yourself. And Gary, I hope (I didn't see his talk, sorry), but in the past he's introduced Synergizer, which is one great way to do this. You can also do it through BioMart. Woo-hoo, there we go. And so now DAVID has converted the list. I think these are internal DAVID identifiers, and this is the gene name that DAVID has assigned to each. So, Quaid's great list. OK, whoops, did I screw it up? I screwed it up. Sorry, folks. Yes. OK. OK, DAVID. All right. Did it really? That was really annoying. OK, so here's my list. Now here you can choose the annotations; these are the gene sets. The way DAVID organizes things is that it groups gene sets into groups of gene sets, and you can select which groups to look at. So here's the Gene Ontology. When we go through the lab, I'll give you a bit more information about what all these different Gene Ontology identifiers mean. These are general annotations.
These are disease links; this is OMIM, which is a source of gene-to-disease links. These are protein interactions. I mean, DAVID's beautiful in this way. And then, so it's actually already done the enrichment analysis; it chooses some categories by default. You ask it for the chart, and this is what DAVID gives you back. These are the categories, the collections the gene sets came from. This is the name of the term. Here's what's called the nominal p-value; this is the p-value that you would get using DAVID's version of a hypergeometric test, and I'll say a few words about that in a second. And then this is a corrected p-value, and I don't know what that corrected p-value means; I've never been able to figure it out. So I actually use the, oh, I do know what that corrected p-value means. I use the Benjamini; that corresponds to what I'm going to call the false discovery rate. I don't know what DAVID's FDR column is; I've never been able to figure it out, and it seems wrong to me. So I would always use the Benjamini or the Bonferroni correction; we'll talk about those in a second. OK. Unless somebody else knows what it is? No? OK. All right. And now you can also download this file, and that's what happens if you don't right-click: it looks terrible. I actually have a slide about that, so right-clicking is important. I've forgotten how to right-click on the Mac; anyways, we'll do that later. OK, good. Let's get back to the theory. Sorry, I've been doing this for six years and I still never remember what my next slide is. OK. So, what we did: we copied and pasted the list in. Or we could upload a gene list file, or we could have chosen an example gene list. That's what happened there: I chose the wrong identifier. And I had to tell DAVID whether it's a gene list or the background. If you want, you can click over to the background, and you can choose backgrounds that DAVID has already defined. Now, these are all arrays.
So for those of you who are still using arrays, DAVID can be very helpful. Those of you who aren't using arrays are going to have to figure out how to define your background list; ask me during the module, or when we get to that point in the lab, about various ways of defining background lists for different types of data. I have some ideas, but nobody has a perfect way of doing it yet. OK. So now DAVID is going to ask you to check the list and the background, and essentially all that means is DAVID is going to tell you the name of the gene list and of the background that you're using. This is important so that, if you've uploaded multiple lists or have multiple backgrounds, DAVID is doing the right test. And then here I just chose the functional annotation chart, and that gave me a list of all the enriched gene sets along with their associated p-values and their multiple-test-corrected p-values; that was the Benjamini in this case. You can also group these things together using functional annotation clustering. It's worth fooling around with that; I find it very useful. OK. So here, when we went through this screen, I allowed DAVID to use the default selections, but you can change them; the red ones are the defaults. And these are categories, that is, sets of gene sets, like GO biological process, or in this case disease-gene associations that came from OMIM. Once you've decided about that, you press "functional annotation chart" to get the chart that I showed you. OK. And so this is the gene set name; this is the category that the gene set comes from; this is the number of genes in your list that have the annotation; and this is the EASE p-value, which is closely related to the hypergeometric p-value that I'm going to tell you about in a second. OK. I used to be worried about this, but I am no longer, so I feel happy just reporting the Benjamini p-value.
That's the multiple-test-corrected p-value. OK. And then you can download the spreadsheet. As I showed you, when you just click on it, it gets dumped into your browser; but if you right-click, or do whatever you're supposed to do on a Mac (and I use Macs all the time, so I should know), you can save it as a file. This is what it looks like if you don't right-click, and this is what it looks like if you do right-click and open it in Excel. [Audience question.] Yeah, so I did that; that's if you want to get the maximum number of categories out. The reason I did that in the past is that you might want to do your own multiple test correction instead of relying on DAVID's; but now I'm happier with DAVID's multiple test correction. OK. Questions about DAVID? There's going to be a lab about DAVID. I'm here to help you with it, and Veronica's here to help you with it as well when you go through the lab. But I can answer any specific questions right away. And then we're going to talk about what these tests are that DAVID is actually doing, what the multiple test correction is, and why you need to use it. [Audience question.] Oh, I've never encountered that, but if you can reproduce that behavior today, I would be very happy to help you with it. Yeah. Whoa. OK. So you'll probably have to measure the overlap yourself. What I mean by that is you have to take your gene list, compare it against the gene set, and figure out how big the overlap is. And then I've got a link on one of the slides that takes you to an online calculator that calculates the p-value. There are other tools too. The DAVID paper that's linked actually contains a list of the other gene set enrichment tools that were available as of 2009. There's also something called BiNGO in Cytoscape, and there's GOstats in R. And I don't think those have those intrinsic problems, but I'm not sure. Sorry. Yeah.
Right, so I think this is the overlap: the overlap between the gene list and the gene set. Do you have a question about another column? Yeah, so that's just for the one category. It's by category rather than by gene list. OK. Yeah. Sorry, let me get the terminology straight: it's by gene set. A group of gene sets is a category. Each row here corresponds to a different gene set, and the category just tells you where the gene set came from. In this case these are GO molecular functions; that's what this means. No. No. What you report in the paper is the significant associations. You can, if you want. I mean, if you report a p-value, that tells you how significant the association is, that is, how unlikely it is to have happened by chance. But a lot of people also report what's called the effect size, and the effect size is how large the enrichment is. That's given by fold enrichment over here; fold enrichment is how much larger the overlap is than what you would expect by chance on average. It's not those percentages, no. Oh, I see. OK. These are kind of complicated questions that I will get back to at the end; please remember to ask them, because what you're talking about is actually something subtly different. OK. Other questions? I'm sorry I keep pushing the questions off; I will answer questions now if I can. OK. I don't want to scare you. OK. All right. So we're going to talk about the hypergeometric test. This is the same test used to calculate the EASE score; that's essentially just a slightly modified hypergeometric test, but it's fine. Then we'll talk about multiple test corrections. There are two that you've probably seen before, and those are the major ones that everybody uses: the Bonferroni correction and the Benjamini-Hochberg FDR. If you see FDR, it's usually Benjamini-Hochberg. And then other enrichment tests.
So we're going to talk about GSEA for ranked lists. OK. The hypergeometric test gives you what's called a nominal p-value, and these multiple test corrections give you a corrected p-value, or a q-value, or a false discovery rate. And then this gives you a nominal p-value again, so you can use this test and then correct it with this. OK. So, the hypergeometric test. This is sometimes called Fisher's exact test, and I can't get a straight answer about what the relationship between these two things is, but certainly calling it a hypergeometric test is correct. OK. So what we have is a background population, and we're going to represent that background population by balls. Some of those balls are going to be red, and some of those balls are going to be black. There are going to be 5,000 genes, so let's say we're talking about a yeast genome here. And we're talking about some gene set, which is represented by the black balls. There are 500 genes in there, so it's probably some important process, like translation. And then, through some experimental assay, we've come up with a gene list. Here's my gene list, and there are four black balls and one red ball. And my question is: is this surprising? Is this a surprising sample from this population? And if it is, how surprising is it? Every time you do a statistical test, you have to have something called the null hypothesis, and the null hypothesis is essentially what you mean by random chance, right? In this case, it just means that the list is a random sample of the same size selected from the background population. That's what I mean by "random chance alone"; that's my random sample. And the alternative hypothesis here is that there are more black genes than expected. The reason people define an alternative hypothesis is essentially to ask whether it's what's called a one-tailed test or a two-tailed test, right? The one-tailed test says: there are more black genes than expected.
A two-tailed test would be: there are more, or fewer, black genes than I would expect. Do you see the difference there? OK. All right. So how do you do this? Once I've told you what random chance is, you can compute how likely it is that, if you just take a random sample of five balls, you would get zero black balls out of five, one black ball out of five, two black balls out of five, and so forth. You could do this by just sampling again and again, counting, and then dividing by the number of samples. But there's a function, the hypergeometric function, that will let you calculate these probabilities directly without having to do the random sampling. If you look it up on Wikipedia, you'll see it's permutations and combinations, a lot of factorials; but essentially you plug in the numbers and it gives you all the probabilities. So these are the probabilities of the different states, right? Now, to define a p-value, I ask: what is the probability that I would get four black balls or more? So I sum up these two things, and that's my p-value, my nominal p-value, which tells me the significance of this sample. Now, the thing that I might screw up, and that people often screw up when they talk about p-values, is that smaller is better, right? People will talk about "larger" p-values meaning more significant p-values, but more significant p-values are smaller, obviously, right? OK. And so the p-value associated with this sample, for this gene set, is this number right here: 4.6 times 10 to the minus 4. That's it; that's the hypergeometric test. OK. Sometimes you'll see it called Fisher's exact test, and again, I've heard different stories, but Fisher's exact test might be the two-tailed version of the hypergeometric test: do I have more black balls, or fewer black balls, than I would expect? Yeah. [Audience question about thresholds.] You get to choose your threshold. No, no, I understand. So, standardly, people use 0.05, right?
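The worked example above (5,000 background genes, 500 in the gene set, a sample of 5 containing 4 black balls) can be computed analytically; here's a sketch using SciPy's hypergeometric distribution, which reproduces the number on the slide:

```python
from scipy.stats import hypergeom

M = 5000   # background population size
K = 500    # genes in the gene set ("black balls")
n = 5      # size of the gene list (the sample)
k = 4      # observed overlap (black balls in the sample)

# One-tailed p-value: P(X >= 4) = P(X > 3), given by the survival function.
p = hypergeom.sf(k - 1, M, K, n)
print(f"{p:.2g}")  # about 4.6e-4, matching the lecture
```

The `sf(k - 1, ...)` idiom matters: the survival function gives P(X > k), so you pass k − 1 to get "k or more".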
And that's for the nominal p-value. OK. When you're doing, say, next-generation sequencing, sometimes people use a smaller p-value threshold because they've not actually done the multiple test correction, right? So with a threshold like 10 to the minus 5, the multiple test correction is kind of built in; that's a quick and dirty way of doing a Bonferroni correction, right? Now, every once in a while people use a smaller nominal p-value threshold because, when you're generating a whole lot of data, a very small artifact or a very small bias in the way the data is generated can produce a tiny effect size, right? In that case the fold enrichment might be, say, 1.001. But your p-value also depends on the number of genes in your gene list, right? So if you have a long gene list and a small effect size, you can get an extremely significant p-value, and sometimes that can be purely artifactual, because of some small bias that you didn't realize was there in your RNA-seq analysis. That's a somewhat more complex answer to your question, but I wanted to provide a full one. Do people have questions about what I just said? OK. OK, and this thing here is called the null distribution; sometimes you'll see that term. OK, and in answer to your question, Lynn: sometimes you can't use DAVID, say because you're working on an organism without a lot of support, and then you'll sometimes have to measure the overlap between the gene list and the gene set yourself. In that case, you need to make what's called a 2-by-2 contingency table. The rows are "in the gene set" and "not in the gene set"; the columns are "in the gene list" and "not in the gene list", so those are the ones in the background, right? And then you can input that 2-by-2 contingency table. In this case, I found an online version; I'm not sure if it still works, but it worked as of two years ago. If you search Google for Fisher's exact test, you can get that done.
You can just plug in the 2-by-2 contingency table and it'll give you a p-value. If you don't have that, and you use R, you can do the same thing there. So in the previous slides, I was telling you how the hypergeometric p-value is computed, and you could do that computation yourself, using the hypergeometric function and doing the sum that you need to do. What I'm saying here is that there are various online tools, or software tools, that will do that calculation for you without you having to code it up. Great. So, important details. If you want to test for under-enrichment of black, and you want to continue using the hypergeometric test, you can test for over-enrichment of red instead. Now, the EASE score used by DAVID subtracts one from the observed overlap between the gene list and the gene set. They do that to make sure that you never see a significant association when the overlap is only one, and that can happen if you have a very small gene set and a very small gene list: the overlap can look significant. For example, say you're testing against the entire genome, let's say 10,000 genes (I don't know what genome has 10,000 genes, somewhere halfway between yeast and human). And then you have a gene set with only 10 genes in it, so only 0.1% of the genes in the genome are in that gene set. Now you have a gene list that also has 10 genes, so only 0.1% of the genome is in that gene list. If I randomly sample genes from the genome, and 0.1% of the genes in the genome are in the gene set, there'd be about a 1% chance that any particular gene list of size 10 would contain one of those genes; you just sum it up, 0.1% times 10 is 1%. That looks like it's very significant, but the overlap is actually only one gene. And if you have a whole bunch of gene sets, you could get that overlap of one quite often. So DAVID is very worried about that.
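If you do have to measure the overlap yourself, the 2-by-2 table I described can be fed straight into Fisher's exact test. This sketch uses invented counts (the same 5,000-gene example as before), and the last line is only a hedged illustration of DAVID's EASE idea of subtracting one from the overlap, not DAVID's exact implementation:

```python
from scipy.stats import fisher_exact, hypergeom

# Made-up counts: 4 of the 5 list genes are in a 500-gene set,
# against a 5,000-gene background.
in_list_in_set = 4
in_list_not_set = 1
not_list_in_set = 496       # 500 in the set minus the 4 in the list
not_list_not_set = 4499     # everything else in the background

table = [[in_list_in_set, in_list_not_set],
         [not_list_in_set, not_list_not_set]]
_, p_fisher = fisher_exact(table, alternative="greater")  # one-tailed test

# EASE-style variant: penalize the overlap by one before testing.
p_ease = hypergeom.sf(in_list_in_set - 1 - 1, 5000, 500, 5)
print(p_fisher, p_ease)
```

The EASE-style p-value is always more conservative (larger), which is exactly the point of the subtraction.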
And so they subtract one from the overlap, so you never get a significant association where the overlap is only one between the gene set and the gene list. I personally think you can correct for these things using multiple test corrections, but this is the way David does it. OK, and this is the point that I mentioned early on, but I'm going to come back to again and again: you've got to choose the background population appropriately. And I gave you the example earlier. So let's say we're doing some sort of case-control type study, and in your control sample, you're only looking at immune cells. The genes that are expressed under those conditions are going to be already enriched for those that have immune function. So any random sample from that set is going to be enriched for those that have immune function. And if you don't appropriately indicate your background population, which in this case is genes that are expressed in immune cells, you're going to get a whole bunch of spurious associations with immune function. It just comes from the fact that that enrichment's already there. And if you want to test enrichment for more than one independent type of annotation, here I said red versus, say, black squares, you just apply the test separately for each one of those gene sets. And then you have to correct. OK, questions about the hypergeometric test? Yeah. Yeah, so as I said, the binomial is an approximation to the hypergeometric test. They measure the same thing. Yeah. If you're sampling those types of immune cells, do you just ignore those ones, like those immune-related ones that come up? I'm just anticipating problems coming up with the correct background rate. So we could talk about how to do that. So one thing to do is, if you have a case-control situation, all genes expressed under either condition become your background, right? And then you ask about the genes that are expressed under the case. For example, that's one way to do it.
The problem with just looking at the list of enriched functions that you get and subtracting off the ones that might come from the background is this: say you're looking for a specific type of immune function. So you're looking at, like, T cell activation, right? The question is, do I see T cell activation because I stimulated the immune system in the appropriate way? Or do I see T cell activation just because there are already a lot of T cells in my control? So it would be better to work on defining an appropriate background set than to try to remove functions that you think are likely there anyway. But it's certainly something good to keep in mind when you do these types of analysis. The other thing that you could try doing is, well, let's talk about it afterwards. I don't want to go on record claiming strange things. Other questions? OK, so let's talk about multiple test correction. Everybody wants enrichment, right? So how do you get enrichment? Well, the best way to get enrichment is just to keep trying a whole bunch of gene sets until you find one that works, right? And so why is that a bad idea? Well, it's a bad idea because if you just keep taking random draws from the background population, you'll eventually get one that shows very significant enrichment, right? And essentially, the number of draws that you need to take, on average, to get an enrichment p-value equal to whatever number you're looking for is one over that p-value. Now, this is a problem because there are lots of gene sets, thousands, right? And often people want to compare against all the thousands of gene sets, right? So essentially, when you do this, you're taking 10,000 random samples from this distribution. But 10 to the minus 4, which is 1 over 10,000, sounds like a pretty significant p-value, right? If you didn't know that in the background what's happening is that you have all these random draws.
So you have to correct for that so that you don't get what's called a type I error: spurious enrichments at a given p-value. Okay, so how do you do this? Okay, that's what I've just talked about. So the easiest way to do it is something called the Bonferroni correction. And what the Bonferroni correction is, is you take the nominal p-value and you multiply it by the number of tests that you do and you get a corrected p-value, okay? And then you test this corrected p-value against the original significance threshold, right? Your significance threshold is 0.05, right? And you do 10,000 tests. You take all the p-values that you get, the nominal p-values, you multiply by 10,000, and you ask which ones are less than 0.05, right? When you do this operation, sometimes people get upset because the p-values become larger than one, and what does that mean? It's meaningless. Well, the corrected p-value is a bound on the probability under random sampling; when you do this Bonferroni correction, it's not precisely equal to the probability under random sampling except in some cases. Okay, does that make sense? Don't worry if you get a p-value greater than one. If that upsets you, you could instead take the original p-values and divide the significance threshold by the number of tests that you do, and it's the same thing, right? So, yeah. Well, the p-value you're reporting is not corrected for multiple testing, but the enrichments that you report are still gonna be appropriate. No, no, it's not the p-value, it's the significance threshold. So if you wanted to test against a significance threshold of 0.05, you could instead test against a significance threshold of 0.05 divided by 10,000 and take your original nominal p-values. But it's better just to correct the p-values. And then you could threshold them at one if you want. It doesn't really matter. Yeah. Exactly. Yeah. And this is called controlling the family-wise error rate.
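The Bonferroni procedure just described can be sketched in a few lines of Python (the function name and the example p-values are just illustrative):

```python
def bonferroni(p_values, alpha=0.05):
    """Multiply each nominal p-value by the number of tests, capping
    at 1, then compare against the original significance threshold."""
    m = len(p_values)
    corrected = [min(1.0, p * m) for p in p_values]
    significant = [p_c <= alpha for p_c in corrected]
    return corrected, significant

# e.g. three tests: only the first survives correction at alpha = 0.05
corrected, significant = bonferroni([1e-5, 0.03, 0.4])
print(significant)  # [True, False, False]
```

Capping at 1 is exactly the "don't worry if the corrected p-value exceeds one" point: a corrected value above one carries no extra information, since a p-value is a probability bound.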
And you don't have to make any other assumptions about the relationships between the gene sets. This is based on what's called a union bound, and it's gonna work regardless of what the associations between the gene sets are. Okay, yeah. Sorry, can you clarify what m is? The genes in your list? Sorry, that's the number of gene sets that you're testing against. So if you look, yeah. If you selected, say, two check marks in David, it would affect your m, but your m is not two in that case, because when you select in David, you're selecting categories of gene sets. Precisely. Yeah. Yeah, and we'll talk about strategies to use to avoid having to correct against 10,000 gene sets. Right, so if you're gonna test the entire GO, and they're up to, what, 25,000 different terms or something by now, you don't wanna multiply your p-values by 25,000. Right, then for something to be significant, you'd need a p-value of 10 to the minus 5. Right. Okay, but there are strategies to avoid doing that and still not cheat and still have significant enrichments that you can rely on. Okay. Okay, so this is the problem that I've been talking about. This is very stringent, right? So it can wash away real enrichments, which leads to false negatives. There are categories or terms that are enriched in your gene list that you're not gonna be able to detect, because you've had to do this very strong correction, because you've tested so many gene sets. Right? And so often one is willing to accept a less stringent condition. This is called the false discovery rate, and it's a gentler correction when there are actually real enrichments. But here I'm gonna stop and tell you about other strategies for dealing with this problem.
And the basic strategy for dealing with this problem with the Bonferroni correction, having to multiply your p-values by 25,000, is to only test the gene sets that you think you're gonna be able to find an enrichment in. Right? So if you're asking a specific question of your data, only ask that specific question. Now, you have to decide what gene sets you're gonna look at before you see your gene list. Right? Because once you've seen your gene list, you could be ruling out gene sets in the back of your mind without knowing that you're doing it. But if, ahead of time, you say, okay, I'm doing experiments on immune cells, I'm interested in T cell activation, I'm only gonna look for enrichment of gene sets that are associated with T cell activation in some way, right? That's one way to do it. The other way to do it is, when you take GO, you don't have to test against all the terms that are in the GO hierarchy, all 25,000. Right? You can look at your background. You can say, okay, these are the genes that are gonna be in my background anyways. Let's find the categories, sorry, the terms, the gene sets, that are actually already represented in the background. Right? So let me be a bit more clear, because I see some puzzled faces. Let's go back to the immune system example. So let's take all the genes that are expressed in the immune system. You know? Those genes probably don't have a lot to do with, like, retinal function. Right? There are probably no photoreceptors expressed in the immune system. Right? So you can take those gene sets out, because you don't need to bother to test for those. First of all, you don't care about them. And second of all, those genes aren't gonna be expressed in the immune system anyways. Right? So regardless of the gene list that you come up with, you'll never find enrichment for that. Right? Does that make sense?
So you just remove gene sets that aren't gonna be expressed in your background. The last thing that you can do is you can say, I'm not interested in all gene sets. I'm only interested in gene sets that are, say, specific, and gene sets that are sufficiently large that I will be able to detect their enrichment. So what do I mean by that? Well, if the gene set is small, if it only has, like, three genes in it, you don't have enough of what's called statistical power to detect enrichment even if you have a very large gene list. So what I typically do when I do these types of tests is I take a threshold. I say, I don't care about any gene set that has more than 300 genes, because it's not specific enough to tell me anything interesting. And I don't care about gene sets that have fewer than 10 genes, or 30 genes, because the gene set isn't large enough to have sufficient statistical power for me to be able to detect it with my gene list. And then you only test against gene sets of a given size. So David offers this functionality in terms of letting you look at specific levels within the GO hierarchy. And as you go down the levels of the GO hierarchy, you get smaller and smaller gene sets. The other thing that people do, and this gets back to the question that you asked about the pie charts, is they look at something called the GO slim. So what the GO slim is, is a very small set of categories that are often mutually exclusive. They're things like development, right? Or metabolism, categories that are very broad like that. And GO slims are defined for different species. I don't know if Gary went over this in his presentation. So often there are only, like, 20 or 30 different categories within a GO slim. It gives you some broad information about what might be enriched. Those sets are quite large, so you have a lot of statistical power to detect enrichment in those sets.
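That size-filtering idea can be sketched as follows; the 10 and 300 cutoffs are the rough rules of thumb mentioned above, not fixed constants, and the names are mine:

```python
MIN_SET_SIZE = 10   # smaller sets: too little statistical power to detect
MAX_SET_SIZE = 300  # larger sets: too unspecific to be interesting

def filter_gene_sets(gene_sets, background):
    """Keep only gene sets whose overlap with the background population
    falls inside the size window; every set you drop is one fewer test
    you have to correct for. gene_sets: dict of name -> set of genes."""
    kept = {}
    for name, genes in gene_sets.items():
        expressed = genes & background  # only background genes can overlap
        if MIN_SET_SIZE <= len(expressed) <= MAX_SET_SIZE:
            kept[name] = expressed
    return kept

# toy example: 100 background genes, three candidate gene sets
background = {f"g{i}" for i in range(100)}
gene_sets = {
    "too_small": {"g1", "g2", "g3"},
    "keep_me": {f"g{i}" for i in range(15)},
    "not_expressed": {f"x{i}" for i in range(15)},  # photoreceptors, say
}
print(list(filter_gene_sets(gene_sets, background)))  # ['keep_me']
```

Note that intersecting with the background first also handles the "no photoreceptors in immune cells" case automatically: a set with no expressed genes falls below the minimum and is never tested.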
And that's what I would do if you're asking unspecific questions. So if you have an unspecific question, you can get a vague answer, yeah. So in David, there are, I'll show you later, but there are levels. And so, one thing: I don't think David has GO slim. David has what's called GO FAT. And GO FAT is, like, everything but GO slim, right? But they do let you use different levels. So often GO slim is the first or second level of the GO hierarchy, yeah. Well, GO ALL is everything. That's, like, GO slim plus GO FAT. Yeah. I just wanted to ask about this: say you started with a big question and you used GO slim, and you found that metabolism is enriched. Would you then go and use David, only doing a test on all the metabolic pathways? Or is that too biased? Or is that a way to discover which specific metabolic pathways might be enriched? Yeah, you can't really do that. I'd have to think about this, and this comes up on occasion. I've thought about this myself, but it doesn't feel right to me, because what you're doing is using information about your gene list to determine what gene sets you're then going to test, right? And when you're deciding what correction you should use, it should be based on the gene sets that you could have tested depending upon the outcome of your first analysis, right? So if you're going to use GO slim to find metabolism and then you're going to drill down in the metabolism term, then you'd have to correct for the gene sets in your drill-down. But what if your GO slim told you that development's important? Then you'd have to consider what you'd drill down to in the development gene set, and then your correction would have to consider both the metabolism and the development gene sets, because that could have been one of the outcomes, right? So, I mean, I don't know a way around this, right?
And then maybe someone has solved this in the literature, I don't know. I mean, one possible way to do it is to take your, well, let's talk about it offline. All right, other questions? Nope, okay. So we're going to talk now about the false discovery rate. Okay, so the family-wise error rate, this is the thing that you control with the Bonferroni correction, is a bound on the probability, or is the probability, that one or more of the enrichments that you report as being significant are false positives, or are due to random chance alone, okay? So this is just saying something about one or more of what you say is enriched, right? What's different with the false discovery rate is that it doesn't say anything about one or more; it says something about the proportion of the enrichments that you report that could be due to random chance alone, right? So if you report five enrichments and you're controlling the false discovery rate at 10%, that says: on average, I expect no more than 0.5 of these to be due to random chance alone. But if you report 500 enrichments and you're controlling at a false discovery rate of 10%, that's like saying I expect no more than 50 of these, on average, to be due to random chance alone, right? That's what the false discovery rate is. Whereas with the family-wise error rate, the Bonferroni correction, if I report 50 and I say I've controlled my p-value at 0.1, let's say, then I expect there's a 10% probability or less that any one of these 50 enrichments is false. And it's the same if I have 500: there's a 10% probability or less that any one of these enrichments is false. So you see the difference: the family-wise error rate is about the absolute event, one or more false, for the Bonferroni, and the false discovery rate is about the proportion of the enrichments that you report.
So if you report more enrichments, you're saying that there are gonna be more that might be due to random chance, and we're just gonna call those false positives, right? But it doesn't say anything about any particular one, okay? And the reason this difference is important is that if you have more enrichments, it becomes a gentler and gentler correction, because you're allowing more and more false reports, or false discoveries, okay? Does that make sense? We'll get back to it, and if you have questions, ask them; I don't mind trying to explain the difference later throughout the talk, but if you have questions now that you can articulate, this would be a good time to ask them. Okay, all right. Okay, and the vast majority of all false discovery rate corrections, and there are various ways to compute them, are done using what's called the Benjamini-Hochberg procedure, right? So if someone says false discovery rate, they almost always mean this. And the false discovery rate threshold is often called the q-value, as opposed to the p-value, because it's not really a p-value. It's something different. Okay, and in your notes, I've combined old slides with new slides. And I didn't fix this because I didn't wanna confuse you as you're going through your notes, but I'm gonna tell you right now which are the old slides and which are the new slides, okay? Sorry. Okay, so here's the Benjamini-Hochberg example. I'm gonna show you an example of how to calculate the false discovery rate. And the idea is: first you take all your categories, here I've got 53 gene sets that I've tested, and you calculate the nominal p-value for every single one, and we're gonna use, let's say, the hypergeometric test to do that. And then you take these p-values and these gene sets and you sort them from smallest to largest, right? And then for each one you compute what the rank is, right? Okay, that's gonna be the same: old slide, old slide, new slide, okay. So I've gone three slides forward.
So now I've just changed the names of the gene sets a little bit, same gene sets, new names, and I've changed the p-values, right? So the original p-values are here, and then for each gene set and each p-value you compute what's called an adjusted p-value. And you do it in the following way. You take the p-value, you multiply by the number of tests that you've done, which in this case is 53. Maybe I can use this thing; this is what the pointer is useful for, right? Okay, so in this case it's 53, because that's the number of tests that I've done over here. But then you divide by the rank of the p-value in the sorted list, right? So this right up here at the top is a Bonferroni correction, but the adjustment that you make, the correction that you make, gets milder as you go down the list. And when you get to the bottom of the list you're only multiplying by one, because it's 53 divided by 53. Okay, so that's the adjusted p-value, right? And I've chosen these p-values so that it's 0.053 for the first three. And so these don't reach significance. But here, 0.04: now the adjusted p-value is less than 0.05, okay? So once you have the adjusted p-values, you can use them to compute the false discovery rate, or the q-value, that's associated with each p-value, okay? And you do that by taking this list of adjusted p-values, and for each row you take the minimum adjusted p-value that you see at that rank or further down the list. So here, the smallest adjusted p-value you see at this rank or below is 0.04, right? That's the same here, that's the same here, and that's the same here. And now the smallest adjusted p-value, just trust me, here is 0.05, so you use that. We go down all the way to the bottom of the list. The smallest adjusted p-value at this rank or below is 0.99, so its q-value is 0.99, and here it's 0.99, okay? Well, I've just ranked them. I've ranked them by their p-value. So I just sorted them by their p-value.
Just put it into, oh, so I mean, I have a procedure for computing the q-value, and that just says the q-value associated with an adjusted p-value, or with a rank, is that adjusted p-value or a smaller one further down the list. Right, but you go all the way down the list. So it's like you take all the p-values from all the tests that you've done, you sort them all, and then go down the list. What you're getting confused by is the fact that I have these dots here, and I'm claiming that there's no adjusted p-value below here that's smaller than 0.04. But it's always: if you're in row one, you're calculating it based on everything, the adjusted p-values in row one or below. If you're computing the FDR for row two, it's everything at that rank or below. And so once you pass row four, right, so now you're doing row five, the 0.04 from the adjusted values is no longer relevant. Great, thank you. Oh, right. Okay, and so what I've done here is, once you compute the q-value, you can decide what your threshold is. And so if you're correcting for an FDR of less than 5%, well, all of these pass the threshold. And the p-value threshold that's associated with an FDR of less than 5% is this one right here. And because, when you assign the q-value, you look at this row or below, the q-value is always going to be at least as small as this one as you go up, okay? And that's the Benjamini-Hochberg procedure. That's it. Okay, and so these four up here would be the ones that you report as being significant at a false discovery rate of 5%. So with respect to, say, David: remember, is this the column we choose? So this is called the Benjamini. Yes, you select that one. Exactly. The column labeled Benjamini is the FDR q-value column. Exactly, yeah, yeah. Just wondering how you would explain it; like, on a list like this, I find it's always confusing what the FDR means.
Say you sorted that list by category, just alphabetically, and then you embed your q-value, your FDR-corrected q-value; how would you describe it? I would say that it's an estimate of the probability that this specific term is a false discovery, is due to random chance, given all the tests that I've done. Does that make sense? Okay, so this is the significance threshold, it's called alpha, and the correction to this threshold, or the correction you have to make to the p-value, depends on the number of tests you do. So no matter what, the more tests you do, the more stringent the correction needs to be. So you can control the stringency of this correction by just reducing the number of tests, and I've gone through strategies for this. You use GO slim, you restrict the testing to appropriate GO annotations or appropriate gene sets, you select only the larger GO categories, or you set a minimum threshold below which, if the gene set is too small, you're not even gonna test it, right? And that minimum threshold thing actually works pretty well, because as you go down the GO hierarchy, there are more and more gene sets with smaller and smaller numbers of genes in them. So once you choose that threshold, you're really reducing the number of tests that you end up doing. Well, I mean, really the threshold should depend upon a power calculation, but I use 10. That's what I wanted to say. So, in David, the false discovery rate is the Benjamini-Hochberg column. The column that David labels FDR, I've never seen them report what that means, so I don't trust it. And when I've done some computations on it, I can't figure out how they've derived it. So I don't use it. David also does the Bonferroni correction. So when you open up that options thing at the top of the chart, which I very quickly went through but can go through again, you can click on Bonferroni and get the Bonferroni correction as well.
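The Benjamini-Hochberg computation from the worked example a few moments ago can be sketched in a few lines (the function name is mine; this is the standard BH procedure, sort, scale by m over rank, then take the running minimum from the bottom of the list up):

```python
def benjamini_hochberg(p_values):
    """Return BH q-values in the original order of p_values."""
    m = len(p_values)
    # sort the indices by p-value, smallest first
    order = sorted(range(m), key=lambda i: p_values[i])
    # adjusted p-value: p * m / rank (rank is 1-based in the sorted list)
    adjusted = [min(1.0, p_values[i] * m / (rank + 1))
                for rank, i in enumerate(order)]
    # q-value: the smallest adjusted p-value at this rank or further down
    for rank in range(m - 2, -1, -1):
        adjusted[rank] = min(adjusted[rank], adjusted[rank + 1])
    # map the q-values back to the original (unsorted) order
    q_values = [0.0] * m
    for rank, i in enumerate(order):
        q_values[i] = adjusted[rank]
    return q_values

# toy example: four gene sets with unsorted nominal p-values
qs = benjamini_hochberg([0.04, 0.001, 0.03, 0.9])
print([round(q, 3) for q in qs])  # [0.053, 0.004, 0.053, 0.9]
```

Notice how 0.03, which has a larger adjusted value (0.06) than the 0.04 below it in the sorted list (0.053), inherits the smaller q-value from further down, exactly the "this rank or below" rule from the slides.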
Okay, so David does multiple test corrections separately within each category of gene sets. So if you add more categories, meaning you click off more check marks, that actually doesn't change any of your corrected p-values, because it's correcting within the category, not across categories. So when it's counting gene sets, it doesn't count all the gene sets you use. It only counts the gene sets within that category. So, I mean, I would be careful how you report that. You should certainly say which categories you've checked off in David. Okay, and I'm sorry, because when I talked about the false discovery rate, I called those things categories and they were actually gene sets. In David, a category is a set of gene sets, a group of gene sets, like the GO ontologies. Okay, questions before we go into GSEA? No? The take-home messages are: David does give it to you. David gives you the Benjamini, David gives you the Bonferroni. But when you're reporting the analysis that you've done with David, you need to report which categories, which check marks, you have ticked off. Right, because the p-value that you get out is only valid within the category; it's not valid across categories. It's fine, you just need to say which categories, which check marks, you selected. What we've talked about so far is the case where the enrichment test is based on a gene list that you define by thresholding, say, a continuous value, which is the measured differential expression in this case. And what I'm gonna talk about now is what to do if you have a ranked list. So somehow you're able to rank the genes from, like, most differentially expressed to least differentially expressed, or most highly methylated to least highly methylated, I guess. And there are a number of statistical tests for this that have been around for a long time. One is called the Wilcoxon rank-sum test, and that's identical to what's called a Mann-Whitney U test. And this test is like a t-test.
These are essentially ways of doing a t-test, it's called a U test, but it tests for differences of medians instead of differences of means. There's also something called a Kolmogorov-Smirnov test, which most people call a KS test. And that's actually almost identical to what is done in GSEA. And the reason I'm mentioning these is because you might see them, right? People do use these types of tests. And the difference between them, and remember these first two are exactly the same test, is that the Wilcoxon/Mann-Whitney tests for differences of medians, while the KS test tests for differences of distributions. We'll get back to that in a second. But I'm gonna talk about the GSEA test right now, which is essentially identical to a Kolmogorov-Smirnov test. Okay, so why would you do this? Well, the reason you might do this is that there's no natural value for a threshold. You've gotta establish a threshold in order to establish a gene list. If you don't know how to do that, well, then maybe you should just look at the ranking. So you could get different results at different threshold settings. And in theory, maybe you lose some statistical power due to thresholding. But it's not clear that that's always the case. But there certainly are difficulties in choosing the threshold. Okay, and so what's the GSEA enrichment test? It's exactly the same thing as before, except you put in a ranked list. You have gene set databases. You get an enrichment table that has the gene sets and the p-values, and GSEA does its own multiple testing correction for you, so you get an FDR. Okay, and so in four bullet points, what does GSEA do? Well, it takes your ranked list, calculates an enrichment score, and then it asks whether or not this enrichment score is surprisingly large. And this is one of the situations in which you actually have to do a permutation test in order to compute the null distribution. You can't do it analytically like you can with everything else.
And then you calculate the empirical p-value. This is the nominal p-value that you then correct using the false discovery rate correction. And the paper that first described this and described the software to do it is down here. So this is the citation that you would use. It's got about 4,000 citations, so it's widely used. Sorry, I didn't catch you. So it depends what you're asking, right? So often it's differential expression. Okay, so this is how it works. I keep saying I'll change this slide. Okay, so here's the ranked list. Yes, I should really change this slide. Okay, so the color just indicates rank. And this is a really long list; let's say the pixels are so close together that you can't actually see the individual genes in it. So there are, like, 20,000 genes in this list. And now some set of these genes is shown by the black lines. So the black lines are the positions, in this ranked list of 20,000 genes, of the genes in this gene set right here. And then the question is: is this gene set surprisingly close to the start and/or the end of this list? That's the question we're asking, right? Or is their distribution in the list random, right? And if I were to look at this, it looks like it's surprisingly close to this end of the list, right? Because there are very few genes in the bottom half of the distribution, right? And a lot of them are kind of pushed up against the top there, right? So how do you quantify that? And the way you quantify that is with what they call the ES score plot. And essentially what this plot is, is you start at the top of the list, and remember, you can't actually see the lines for most of these genes, and if the gene is not in the gene set, you go down a small step. And if the gene is in the gene set, you go up a slightly larger step. The step sizes depend on the sizes of the gene set and the list.
So as you're going through the list from the top to the bottom, you go down, down, down. You hit the first gene in the gene set, you pop up, maybe go down a bit. You pop up when you hit the second gene in the gene set, and down, down, down, pop up, up, up, up, right? And so if you have enrichment towards the top of the list, and here's sort of the zero line, you'll start going up, right? And the step sizes are chosen so that you get right back down to zero at the end of the list, regardless of where your gene set is, okay? And so the ES score, and the y-axis here is the ES score, I think it's called an enrichment score, for this particular gene set, given this ranking, is the furthest that you get away from the zero line, right? Okay, so now notice, if the gene set was actually enriched over here, you'd go down, down, down, down a lot until you get to the first thing in the gene set, then you come up a bit, down, down, up, down, down, up, up, up, and come back to zero at the end, right? So the ES score that you'd come up with would actually be negative, because that would be the furthest that you are away from the line. But in this case, it's positive. Okay, right? And sometimes the point where this cutoff is chosen is called the leading edge. You'll see that sometimes. So you have this leading edge, which is the set of genes at or before the point in the list where the ES score peaks. Okay, right? And so this is the leading edge. And this here is the ES score whose significance we're gonna test. Okay, and so then, to figure out whether or not this ES score is surprising, we have to ask the question: if we had a random ranking, what would be the distribution of ES scores that we would expect? And is this ES score surprisingly larger than what we would expect from a random ranking, right?
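That running-sum walk can be sketched as follows. This is the unweighted, KS-like version described here; GSEA proper additionally weights the up-steps by each gene's ranking statistic, and the function name is mine:

```python
def enrichment_score(ranked_genes, gene_set):
    """Walk the ranked list: step up for genes in the set, down
    otherwise, with step sizes chosen so the walk ends at zero.
    The ES is the signed maximum deviation from the zero line."""
    hits = sum(1 for g in ranked_genes if g in gene_set)
    misses = len(ranked_genes) - hits
    up, down = 1.0 / hits, 1.0 / misses  # guarantees the walk ends at zero
    running = best = 0.0
    for g in ranked_genes:
        running += up if g in gene_set else -down
        if abs(running) > abs(best):
            best = running  # keep the sign: enrichment at the bottom is negative
    return best

# toy list: the gene set clusters at the top, so the ES is large and positive
ranked = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(round(enrichment_score(ranked, {"a", "b", "c"}), 3))  # 1.0
```

Running the same call with the set `{"f", "g", "h"}` instead gives an ES near -1.0: the walk dives into the negative first and only climbs back at the end, which is the "enriched at the other end" case described above.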
And in this case, for some versions of the GSEA test, you have to do a permutation test to actually figure out what you would get with a random ranking. So what you do is you randomly re-rank things thousands of times, calculate the ES score for your gene set each time, plot those as the distribution, and then you ask: how many times did you get that ES score or higher across all your random rankings, plus the one ranking that you already have, right? What that means is, in order to get a p-value of less than one in a thousand, you have to do at least a thousand re-rankings, at least in the standard way that they do this. Okay, and so you randomize the data, calculate the null distribution, and then calculate the p-value there. That's your nominal p-value for the gene set. And this is called an empirical p-value, because you're not computing it analytically; you're randomly reordering the list a whole bunch of times and just counting how often you observe an ES score that high or higher. And then GSEA actually does the FDR correction for you. Yeah. The score? So this is the threshold. So where the ES score is the highest gives you a point on the ranking, right? And the leading edge would be anything that's above that point on the ranking, yeah. Your ranking, that's not... Yeah, so it could be the absolute fold change. So this could be where the absolute fold change is higher than one. Well, sorry, sorry, let me answer your question, because you didn't ask the question that I thought you did. At first I thought that was up-regulated on one end and down-regulated on the other, so let's set that aside for a second. This is just a very general and vague way of representing a ranked list. So you get to decide how you're gonna rank the list. So this could be up-regulated and this could be down-regulated, and so it could just be fold change.
[Question about whether each end of the ranking means something different.] That depends on your gene set. The coloring is throwing you off; you can just ignore the color and think of this as rank one through rank 20,000. If your gene set were, say, down-regulated, then you would see it on the other end. Yeah, you'd see it on this end, and the way this plot would look is it would dip down into the negative and then come back up. [Question: can genes in the middle of that continuum have equivalent ranks to each other?] Yeah, I think you can have tied ranks. [Question about how you choose the ranking, say fold change versus something else in your data.] Yeah, you have a choice in how you're going to rank them. You can rank them by p-value. You can rank them by the test statistic. The difference is what question you're trying to ask, right? So if your p-value comes from a two-tailed test of whether a gene is up- or down-regulated, then you might have the situation you're describing, where the ranking represents something like absolute log fold change: genes that change the most, whether up- or down-regulated, would be at the top, and genes that change the least would be at the bottom. But if you rank by, say, the t-statistic, genes at the top would be the most up-regulated and genes at the bottom the most down-regulated. You get to choose how you want to rank things, and it depends what question you want to ask. You have to choose one score that you're going to use to rank. [Question: what if genes sit in the center?] So if you do a KS test or a Wilcoxon-Mann-Whitney test, these are the other tests that I mentioned to you before.
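As a toy illustration of that choice, with made-up scores: ranking by the signed statistic puts up-regulated genes at one end and down-regulated genes at the other, while ranking by absolute change puts the biggest movers, in either direction, at the top.

```python
# Two ways to rank the same toy differential-expression scores
# (gene names and values are made up for illustration).
scores = {"geneA": 2.5, "geneB": -3.1, "geneC": 0.2, "geneD": -0.1}

# Signed score: most up-regulated first, most down-regulated last.
by_signed = sorted(scores, key=scores.get, reverse=True)

# Absolute change: biggest movers first, regardless of direction.
by_abs = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)

print(by_signed)  # ['geneA', 'geneC', 'geneD', 'geneB']
print(by_abs)     # ['geneB', 'geneA', 'geneC', 'geneD']
```

Notice that geneB sits at the very bottom of the signed ranking but at the very top of the absolute one; the two rankings answer different questions.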
They can deal with that. What you're describing is called a tied rank. It's like you have genes that are expressed at a high level, genes with no change, and genes expressed at a low level, right? And among all the no-change genes, you have no way of distinguishing between them. So you could handle it in one of two ways. You could take those genes out of the ranking altogether, which adjusts your background a little bit. Or you can assign them all the same rank, and that's called a tied rank. Statistical tests like the Wilcoxon-Mann-Whitney make a correction for tied ranks, so they can handle that, yeah. [Question about the gene set in the figure.] So the blue circle at the top, that's the gene set you're asking about: whether its genes are enriched in the ranking. And as you change that gene set, right? Take the example of a ranking with up-regulated genes at one end, down-regulated genes at the other, and things that don't change in the middle. If you pick a gene set representing a pathway that doesn't change, its genes will either be spread out flat or cluster in the middle, right? But when you move to a gene set that is, say, interferon-stimulated response genes, that would cluster at one end of the list, right? [Is that correct?] Yes, right, yeah. You got it. Michelle? Well, yeah, the nice thing about this is the background is intrinsic in the list. The list provides its own background, right? And the question being asked is whether the gene set sits surprisingly towards the top or surprisingly towards the bottom of the list. So thanks for that question, Michelle, but it solves that problem for you. If you had something cluster in the middle of the list, say genes with low fold change, it might just be uninteresting because it's not changing. But it could be interesting, right?
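The usual convention for tied ranks, the one the Wilcoxon-Mann-Whitney correction is built around, is the midrank: tied values all receive the average of the ranks they would have occupied. A minimal sketch, with made-up values:

```python
# Midrank assignment for tied values (1-based ranks).
def tied_ranks(values):
    """Sort the values; any run of equal values shares the average
    of the rank positions that run covers."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

print(tied_ranks([0.5, 0.0, 0.0, 2.0]))  # [3.0, 1.5, 1.5, 4.0]
```

The two zero-change genes would have occupied ranks 1 and 2, so each gets 1.5; that is exactly the "assign them all the same rank" option above.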
If you see something that sits in the center of the list, there are tests that can detect that, right? The Kolmogorov-Smirnov test detects it. I'm being a little hesitant about whether GSEA detects it. It should be able to in theory, but it has a way of giving greater weight to enrichment towards the top or the bottom of the list, which is why you have to do the permutation test. If it didn't have that weighting, it would just be the Kolmogorov-Smirnov test, which tests for differences in distribution. In that test, if you see something enriched towards the middle of the list, you can detect that as a significant difference from being uniformly distributed along the list. Does that make sense? [Question about weighted data.] I don't know the answer to that question. Veronica, do you know? So I can look into that and get back to you. In terms of the math, there's no reason it can't. You could imagine weighted data; it depends what you mean by a weight. Are you weighting membership in the gene set, like you don't know for sure whether this gene is in the gene set? Or are you weighting the genes in the list, somehow saying some genes matter more to you? OK, so let me look into that. I might be able to answer you later today, but I don't know the answer off the top of my head. Here's one way to think about it, though: if you want genes to be more important, you just duplicate them. And what would happen then? If a duplicated gene were in the gene set, there'd be two steps up here, so you'd get more of a jump, and it would have a greater impact on the ES. So as long as you had the right null distribution, the right permutations, I don't see why the test shouldn't work with that just fine.
So whether or not that's actually an option available to you in GSEA, I'm not sure, but there's no reason the test shouldn't work. OK, questions? All right, so what did we learn today? We started talking about enrichment analysis. I told you about statistical tests. If you have a gene list, there's really only one test: it's called either the hypergeometric test or Fisher's exact test. I've used both names in my papers and no one's ever complained. Gene rankings: if you have a gene ranking, you have to make a couple of choices, and here they are. Are you going to use something like GSEA or the Kolmogorov-Smirnov test? If you make that choice, that's a test of distribution. It asks whether the gene set differs from being distributed uniformly along the ranking, whether there's enrichment at the top, at the bottom, or perhaps in the middle. The Kolmogorov-Smirnov test can recognize all of those things. The other gene-ranking test is the Wilcoxon rank-sum test, also called the Mann-Whitney U test. The way to think about that is that it's just a t-test. In fact, except for tied ranks, you can compute it by taking your gene scores, converting them into ranks, and running a t-test on the ranks themselves. So it's very much like a t-test; it just doesn't make the same distributional assumptions. It doesn't assume Gaussian-distributed data. And so you can do that on gene rankings as well, but it will not detect things like gene sets that are enriched at both the top and the bottom of the list. Okay. And we also talked about multiple test corrections. There I explained two corrections to you. The Bonferroni correction, which is very easy to calculate: you can either multiply your nominal p-values by the number of tests that you've done, or divide your significance threshold by the number of tests that you've done. It's the same thing.
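For the gene-list case, that hypergeometric tail probability can be computed directly. A minimal sketch, with made-up example numbers:

```python
from math import comb

# One-sided hypergeometric enrichment p-value (a.k.a. Fisher's exact test,
# one-tailed): probability of seeing at least k gene-set genes in the list.
def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing a list of n genes without replacement
    from N genes total, of which K belong to the gene set."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# e.g. a 20,000-gene genome, a 100-gene set, a 200-gene list, and an
# overlap of 10 (only about 1 overlap would be expected by chance)
print(hypergeom_pvalue(20000, 100, 200, 10))
```

With an expected overlap of roughly one gene, observing ten is a tiny p-value, which is the kind of result the test is built to flag.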
And then there's also the FDR correction, which is a little bit more forgiving than Bonferroni. This controls the expected proportion of your discoveries, the gene sets you call significant, that are false positives, that is, type 1 errors. This typically uses the Benjamini-Hochberg procedure, and it's much less stringent.
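Both corrections can be sketched in a few lines; `benjamini_hochberg` below returns BH step-up adjusted p-values, and the input numbers are made up for illustration.

```python
def bonferroni(pvals):
    """Multiply each nominal p-value by the number of tests (capped at 1);
    equivalent to dividing the significance threshold by the test count."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (controls the FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from least to most significant
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

ps = [0.001, 0.01, 0.03, 0.5]
print([round(p, 4) for p in bonferroni(ps)])          # [0.004, 0.04, 0.12, 1.0]
print([round(p, 4) for p in benjamini_hochberg(ps)])  # [0.004, 0.02, 0.04, 0.5]
```

Note how BH is gentler: the middle p-values survive a 0.05 cutoff under BH but 0.03 does not under Bonferroni.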