Okay, my name is Quaid Morris. I'm an assistant professor at the University of Toronto. My expertise is in computational biology, primarily in machine learning and statistics. And I'm going to be talking to you about analyzing gene lists, so doing enrichment analysis or over-representation analysis. I like it when people stop me with questions if I'm unclear, so don't hesitate to do that. So my computer crashed last night, which is why I wasn't here this morning, as these things always seem to happen when you need the computer most. It wasn't a Cytoscape-related problem. I have a Mac Air, and their shelf life, 50% of their shelf life, is about a year or two. And so I hit the year limit. Apparently I ate too much over my keyboard, too. But I don't think that's the problem. Oh, okay, okay, now I have this thing as well. All right, so this is just a kind of motivation slide: examples of the sources of gene lists. You guys probably already know this because you have gene lists in your hands. Well, one way to get gene lists is to do some sort of microarray profiling study. So here, I guess, the relevant variable is time. Each one of these rows represents some sort of microarray. It looks like a two-color array, and that's probably from yeast. And the genes are sorted by their similarity to the adjacent genes. This is just a cluster plot, and you grab particular parts of this cluster so that you have a cluster of genes. You can get a gene list this way. Another way to do this, and this is from a co-IP study with, in this case, Pumilio, is to just run some sort of differential expression analysis. In this case, this is expression, or presence, in a co-IP fraction of Pumilio versus a mock co-IP fraction. And you can see here red indicates high expression as measured by microarray, green indicates low expression, and black means absolute expression.
You just take the genes and sort them by the difference between red and green, and this will also give you a gene list if you choose some point to cut it off. And the only reason I wanted to show this plot is that one thing we're going to talk about is how to do enrichment analysis if you have a gene list, but another thing that's developing is that people say, well, choosing this cutoff is a little bit problematic, ultimately. And so what you might want to do instead is just capture this gene score, and this gene score here is fold enrichment, and use that gene score to measure enrichment given a ranked list of genes. So we're going to talk about one tool today to do that as well. And so what is an over-representation analysis? You start with a gene list, or a set of gene scores, and a set of gene annotations. What I mean by gene annotations are just binary variables that have been associated with all the genes in your list, meaning, for example, does this gene have a binding site for a particular transcription factor in its promoter region? Is this gene associated with a particular functional annotation, like the Gene Ontology annotations that Gary just talked about? And so you can compare this gene list or these gene scores to these binary values, and you're going to ask whether any of these annotations are surprisingly enriched in the gene list. And the details of this are how you assess "surprisingly", and that's all statistics, and how you correct for repeating the test, because when you're testing for surprise, you want to find what annotations your gene list is enriched for. And if you do a lot of tests, you have to correct for that fact. And I'm going to tell you about the major ways that people do that type of correction. Okay, so here's my overview slide, and there's theory and, for practice, we're going to have three labs.
And the one that's actually going to be most important, the one we're going to concentrate on, is lab number two, and that's going to come at the end, where you're going to be analyzing your gene list for gene ontology annotation enrichment, right? But I wanted to talk about a couple of these other labs, and we'll go through those as I go through the talk. So the fact that it's separated into module and lab, well, that's going to be a little bit flexible. We have a little bit less lab time at the end, but we're going to have some labs embedded in the middle of the module. Now, one of the things that Gary said, and I think it's true, is that the tools for doing these types of analysis are still evolving, right? So what's going to be most important for you to get out of this is the concepts that I'm going to talk to you about, because these concepts are going to appear in every single one of the tools that you're ever going to see. And if you know the concepts, it's actually pretty easy to pick up the tools. Okay, but we'll show you specific tools that you can use. And they're just a subset of the tools that are available. Okay, so I'm assuming by your presence in this room, and the fact that you've got this far, that you know what a p-value is and you know what the t-test is, right? I'm just going to go quickly over a review of what those things are, but I'm assuming that's where you're starting from. And then we start from those and ask, why can't we use the t-test to do this over-enrichment analysis? And what are the right tests to use? And then how do we do this correction for multiple testing? People have actually come up with two separate corrections, and we'll talk about both of them. Okay, so this is probably text you've seen in a statistics book before, but in a complicated way of saying it, the p-value is a bound on the probability that the so-called null hypothesis is true.
And the null hypothesis is essentially saying that whatever it is that you're testing for, in this case enrichment, isn't there: there's no enrichment, so there's the same distribution of annotation in the genes outside your list as in the genes inside your list. And that's a very common null hypothesis, right? And the p-value is calculated by computing a test statistic using your data, and you're testing the probability of observing that statistic, or one more extreme, given a sample of the same size distributed according to the null hypothesis. So by defining the null hypothesis, you're able to calculate the distribution of the statistic that you're going to measure in your data, and then ultimately you're going to find out whether the statistic that you do calculate is more extreme than one that you would get if the null hypothesis were true. It's complicated, but intuitively, the p-value is the probability of a false positive result. Some people call this type 1 error, right? But that's just a summary. So if you do an enrichment test and you get some score, the p-value for that statistic is the probability that the result of the enrichment test is a false positive. Okay, so here's the good old t-test. I'm assuming this is your background distribution, right? So these are all the genes, say, in your genome, or these are all the genes that are on your microarray, and some of those genes are annotated in a particular category, which I call the black category, and some of those genes are not annotated in that category, and I call that the red category. And then how would you apply the good old t-test to this? You ask, okay, is the mean of the score that I assigned to the black genes significantly different from the mean of the score assigned to the red genes? Because I wanted to put the numbers inside of the balls,
I could only put numbers that were discrete values, so now we have a histogram that shows the distribution of those numbers on the black balls versus the red balls. And everyone who's seen a t-test before knows that these curves have to be bell-shaped, or approximately normal, and then once you have that, you can calculate a t-statistic, which takes into consideration the number of black balls, the number of red balls, the mean and standard deviation of the gene score on the black balls, and the mean and standard deviation on the red balls, and there's this kind of complicated equation which will allow you to get the t-statistic. Basically what it is, is the difference between the means divided by some measure of the variation of the data, right? And the question you're asking is, what's the probability of observing the t-statistic that you measure, or one more extreme, if the means of the two distributions were the same? So this here is the distribution of what the t-statistic would be, given the number of black balls and the number of red balls, if the null hypothesis were true, meaning that they came from distributions that had the same mean. Okay, now this should be review, but I'm happy to answer questions about doing a t-test. Okay, so I mean, this is actually kind of a simple over-enrichment analysis, right? This is saying, if I have some annotations and I look at some score, which might be a fold enrichment, do I see higher fold enrichment in set A versus set B? Unfortunately, this test isn't appropriate for most of the types of enrichment analysis that you need to do. Okay, but be comforted by the fact that you almost already know how to do enrichment analysis. Okay, so now we're going to talk about the bread and butter of over-enrichment analysis. What people use is called Fisher's exact test, and it's sometimes called the hypergeometric test, or the hypergeometric p-value.
Okay, and so here the idea is you do away with the gene scores. You have a gene list, the set of genes that you've decided are significantly enriched, or that are regulated differently, under the conditions that you're measuring, right? And then you're looking at overlap. So say this is your gene list, and the question is, what is the probability that this selection that I took from the background population is one that I would see at random? And the way of asking that formally is, what's the probability of finding four or more black genes in a random sample of five genes? And you can actually calculate that probability using a hypergeometric distribution. You can calculate, if I were sampling randomly, how many of the sets that I came up with would have four black balls and one red ball, right? And then to turn that into a p-value, you say, okay, here's the probability that if I were sampling randomly I'd get zero black balls, because there are not many black genes here. Here's the probability of one black ball, two black balls, three, four, five. And when you calculate this p-value, you just say, okay, what's the probability that I got four or more black balls? And then that p-value is just the sum of these probabilities. And I guess in this case, it's 4.6 times 10 to the minus 4. But that's what's going on ultimately when you're doing this enrichment analysis. And this is called Fisher's exact test. And all I want you to understand is that's what's under the hood. Okay, any questions about that? Is this the first time? Who's seen this before? Is this too fast? No? Okay, all right. So that's almost everything that you need to know about gene enrichment analysis. The rest are sort of details, things that developed from this initial point of Fisher's exact test. And Fisher's exact test is related to the chi-square distribution.
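As a rough sketch of what's under the hood in that calculation, the tail sum of hypergeometric probabilities can be written in a few lines of Python. The population numbers here (5000 genes, 500 of them black) are made up for illustration, since the counts behind the slide are not given, so the result is not meant to reproduce the slide's value. The function name `hypergeom_tail` is also just an illustrative choice.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of drawing k or more annotated ("black") genes
    when n genes are sampled without replacement from a background of
    N genes, K of which are annotated."""
    total = comb(N, n)                      # number of possible gene lists
    upper = min(n, K)                       # cannot draw more black genes than exist
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / total

# Hypothetical background: 5000 genes, 500 annotated (assumed numbers).
N, K = 5000, 500
n = 5                                       # size of the gene list

# Probability of exactly 0, 1, ..., 5 black genes in a random list of 5.
p_each = [comb(K, i) * comb(N - K, n - i) / comb(N, n) for i in range(n + 1)]

# The p-value for "four or more black genes" is the sum of the last two bars.
p = hypergeom_tail(4, n, K, N)
print(p_each)
print(p)
```

Testing for under-enrichment of black would use the same function with the roles of the two colours swapped.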
And if you want to know how, I'll tell you about it after. Okay. So here are some random important details. So far, we've tested for higher enrichment of black, so more black balls than we would expect. If you want to test for more red balls than you expect, which would be an under-enrichment of black, you test for enrichment of red. Now, I think the most important thing on the slide is that you have to choose the background population appropriately. Right? What I showed you before is just a big bucket that's got black and red balls. But when you're doing this exact test, you have to know what that background population was. So for example, if you have a microarray and you're looking for differential expression under some conditions, and that microarray only queries 500 genes, well, your background population is those 500 genes. Right? So your background population is the things that could have been measured, that could have been on your gene list but weren't. And controlling for the background population is actually really important. I've done work on immune arrays that were targeted towards genes that are important in the immune system. And of course, if you don't correct for the background population, any gene list that you choose from this immune array is enriched for genes with immune function. Right? You don't want to publish that and have someone look carefully at your statistics and realize you made a mistake. I didn't make a mistake, but I could have. Okay. And then if you want to test for the enrichment of more than one independent type of annotation, so instead of black versus red you ask, you know, is this ball circular or more square (I guess we'll make it not a ball), you just apply this exact test separately for each type. Right? And that's kind of obvious. Okay, so what have we learned? And this is what we've learned.
Why don't I skip this slide, because we're all pretty smart. Okay, so just to get you warmed up, I'm going to show you something. This is one of the first interfaces that will allow you to do Fisher's exact tests. Now, there's some... And here's the URL that goes to that website. So I'll just give you a chance to type in that URL, and you probably have it in your slides. This thing was published about seven years ago. Okay, you actually have this in your book too, so go ahead and get this thing up. Okay. Okay, this is called FunSpec, because it's fun. And FunSpec has actually been updated more recently than my slides were. It was updated in September 2008. So one of the problems with FunSpec used to be that it was never updated, but now it's updated. And this is the easiest possible interface. So if you are a yeast biologist, you're in a lot of luck, right? Because you just put in your gene list here, and there are only two different ways you can name each gene, more or less. You can use the common name, which, as Gary pointed out, is called the gene symbol, or you can use the systematic name, which just depends on chromosomal location. The systematic names all start with Y and end with W or C, indicating the Watson or Crick strand. Okay. And then, helpfully, they've actually already provided you with a gene list, but we can copy and paste another gene list in, and we're going to do that. And then you also get to check off all these annotation sources. So they've helpfully collected together all the annotation sources. So here's the Gene Ontology biological process, cellular component, and molecular function. Yeah, and these annotation sources, this is MIPS. This was a precursor to Gene Ontology for gene annotation, but they're still going. SMART domains, Pfam domains, these are conserved protein domain sequences within the gene. There's just been analysis there.
And then there are other ways of annotating the genes. And there's something down here that says Bonferroni correction, and we're going to talk about that later. You can say yes or no if you want. I would put no. And then you just put in your p-value cutoff. And once you submit the query, so let me just show you how it is if it's not cached. I'm going to take one of the genes out. Okay, so now that's not going to be cached, right? But when you submit the query, it's pretty fast. Right? In that time it's tested for enrichment in probably about a thousand different categories of annotation. So someone asked me about whether or not you should use a binomial approximation to this Fisher's exact test. This is what you used to have to do before computers became fast. But now that they're pretty fast, you don't have to do that approximation anymore, and people calculate it exactly. Okay, so let's just start at the top. I've done that. We put in the gene list. And as the background population, FunSpec assumes that the yeast genome is the background population. That's another limitation of FunSpec: you can't put a different background in. But there you go. And let's go down. So this is MIPS. So if we go down to the GO categories... Oh, there we go. Looks like it hasn't got through the GO categories yet. Okay, so let's look at the MIPS functional classifications instead. So here's just the category name. Right? RNA degradation. This is the p-value that's been associated with it. And you see that there's a less-than sign here. That's something you'll see occasionally. What that means is the computer doesn't have the accuracy that it needs to calculate p-values smaller than that. These are the genes that were in the cluster. And F is the cluster size. And K is the number of genes that overlap between the gene list that you put in and the cluster.
And you can see it reports a bunch of different categories. And one of the problems here, which we'll talk about in a second, is that RNA degradation is actually a type of ribosomal RNA processing in some ways, right? Part of ribosomal RNA processing is getting rid of the stuff that you don't need. And you can see there's a lot of overlap between these two categories. Okay. Oh, K is the overlap between the gene set that you put in and the members of that category. Yeah. And F is the size of the category. These are parameters that you have to put into the hypergeometric p-value calculation. And what is the F? F is the size of that category. So in the MIPS functional classifications, there are 52 yeast genes that have been assigned the annotation RNA degradation, meaning that they're involved in RNA degradation. And 11 were in the list? Yeah, 11 were in the list. How's the p-value calculated? The p-value is calculated by saying, if I were selecting randomly from the yeast genome, and I selected sets of size equal to the original size of the gene list... And I can't remember how... Oh, here we go. The original size of the gene list was 25. So in random subsets of the yeast genome that contain 25 genes, what proportion of those random subsets have 11 or more genes from this category, RNA degradation? You can see that it's actually a huge enrichment, right? So 11 divided by 25 is about 40%. So 40% of the genes in this list are annotated as being involved in RNA degradation, but less than 1% of the genes in the yeast genome are involved in RNA degradation. So it's like a 40-fold enrichment in the list. It's an even bigger enrichment for ribosomal RNA processing: all 25 of the genes in the original list are ribosomal RNA processing genes. Some of this software also gives a Z-score in addition to that. Is that important? I'm not sure what software you're talking about.
So... Just to address the Z-score. You want to know what a Z-score is? Yeah. This is important. It's a way of measuring enrichment. Sorry. Wyatt, do you have an answer to that? You just come up. Oh, I see. Okay. Okay. So a Z-score is only for the yeast? Yeah, it's only for yeast. But this was just to get you warmed up. If you're a yeast biologist, your day's done. Unfortunately, most people aren't yeast biologists. Yeast is so easy. I don't know why everyone doesn't just work on it. Okay. But there you go. Okay. And then while we were waiting, it actually found the time to compute GO enrichment. And you can see the GO biological processes are very similar. Right? So here's ribosomal RNA processing, which was also a MIPS category. All 25 genes in the list are involved. But GO has like twice as many genes involved in ribosomal processing. That's because GO is a bit more advanced than MIPS. Okay. So that's FunSpec. You can have fun with that. Does anyone want to see me paste the gene list into the box? Because I can do that. If you do that, you get different results. So there's one important thing that I want to show you with this, so let's just look at it. There are six different categories here that are enriched. And you can see that these p-values are kind of high. Our cutoff is 0.01 and this p-value is pretty close to the cutoff. Right? So one thing that really determines significance is the size of your gene list. Right? If you have a larger gene list, as long as that gene list has the same proportion of genes that are, say, involved in RNA processing, your p-values get much smaller. And actually the p-values vary exponentially with the number of genes in the gene list. So if I cut the number of genes in that gene list in half, I'm not changing the proportion of genes that are involved in RNA processing, but suddenly the number of categories that are significantly enriched goes from six down to two. Right?
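To make the list-size effect concrete, here is a small sketch that keeps the proportion of annotated genes in the list fixed at 10% and only halves the list. All the counts are invented for illustration (a background of 6000 genes with 120 in the category is an assumption, not a number from FunSpec), and `hypergeom_tail` is an illustrative helper name.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) for a list of n genes drawn from N, with K annotated."""
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(n, K) + 1))
    return favourable / comb(N, n)

# Assumed background: 6000 genes, 120 in the category (2%).
N, K = 6000, 120

# Same 10% proportion in the list, two different list sizes.
p_full = hypergeom_tail(10, 100, K, N)   # 10 of 100 genes annotated
p_half = hypergeom_tail(5, 50, K, N)     # 5 of 50 genes annotated

print(p_full, p_half)
```

With the proportion unchanged, the shorter list gives the larger (less significant) p-value, which is the behaviour described above.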
And that's just the size of the gene list. And this is something that happens a lot in genomics: people report these massive p-values, or rather very small p-values, primarily because they have very large gene lists. Right? Certainly the size of your gene list is what's going to give you significance or not, as long as the proportions don't change very much. Okay. Any questions about... How long do you think the gene list is? In your gene list, right? Yeah. In the yeast case. Well, it's... Is it more than 100 genes? Yeah, that's it, right? Yeah, about 100 genes. But again, it's the proportion of your gene list more than anything else, right? So if you take a gene list that's enriched in some category and you add 50 random genes, your p-values aren't going to get better. They're going to get worse. Right? So you want to sort of trade off those two things. But if you have a gene list and you only have three genes on it, even if those three genes all have the same process, it's not going to be significantly enriched for anything. Right? But the best thing is actually to use one of these methods that does gene ranking. Right? Because then that considers your whole gene list and you don't have to choose this arbitrary cutoff. Okay. So back to the talk. Okay. So that leads nicely into enrichment analysis with gene ranking. Right? So this is just an overview of what I'm going to talk about. Oh my goodness, what happened to the slide? So there are only three things that you need to know about for enrichment analysis with gene ranking. The first thing is why you can't use a t-test. I can talk to you about that. But there are three tests that people use. One of them is called the Wilcoxon-Mann-Whitney test, or some people call it the Mann-Whitney test, or the Mann-Whitney-Wilcoxon test, or the Mann-Whitney U test, or the Wilcoxon rank-sum test.
But if you see any combination of the words Wilcoxon, Mann, and Whitney, that's the test that they're using. I'm also going to talk to you about the Kolmogorov-Smirnov test. People like to call it the KS test because it's easier to say. And then I'm going to talk to you about GSEA, which is another gene-ranking-style test. But it's primarily software driven rather than statistically driven. And we're going to see that in a moment. So the Wilcoxon-Mann-Whitney test, which I'm going to explain, is kind of a complicated-looking test. But the thing that you really just need to remember is that the Wilcoxon-Mann-Whitney test is a t-test on ranks. So if you take all of your gene scores and you replace each score with its rank and then do a t-test, the answer you get is something very similar to the Wilcoxon-Mann-Whitney test, as long as you have large enough positive and negative sets. And the Wilcoxon-Mann-Whitney test, because it's basically a t-test on ranks, carries with it all the problems that you have when comparing distributions with t-tests. You can only really ask the question, is the set of genes that I've annotated in general larger or smaller than the set of genes that don't have the annotation? But if you're asking a more complicated question, like, do the genes that have this annotation tend to be expressed either more highly or more lowly, or do they tend to be expressed around zero? Unless you do something fancy, you can't really ask that question with the Wilcoxon-Mann-Whitney test. But there's another test that will give you the answer to that question, and that's the KS test. The KS test asks, in general, are there differences between the gene score distributions? Okay.
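The "work on ranks instead of scores" idea behind the Wilcoxon-Mann-Whitney test can be sketched as follows. This is a minimal illustration, not what any particular package does: it ranks from lowest to highest, uses a plain normal approximation for the p-value with no tied-rank correction in the variance, and the scores are invented. The function names are hypothetical.

```python
from math import sqrt, erfc

def average_ranks(values):
    """Rank values from lowest (rank 1) to highest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1               # 1-based average rank of the tied block
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def rank_sum_test(annotated, unannotated):
    """Return (rank sum W, U statistic, two-sided p from a normal approximation)."""
    n1, n2 = len(annotated), len(unannotated)
    ranks = average_ranks(list(annotated) + list(unannotated))
    w = sum(ranks[:n1])                     # Wilcoxon rank-sum statistic
    u = w - n1 * (n1 + 1) / 2               # Mann-Whitney U, a linear function of W
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # no tie correction (assumption)
    z = (u - mean_u) / sd_u
    return w, u, erfc(abs(z) / sqrt(2))

# Invented scores: the annotated genes are mostly low, with one extreme value.
annotated = [1, 2, 3, 4, 5, 6, 7, 8, 9, 200]
unannotated = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

w, u, p = rank_sum_test(annotated, unannotated)
print(w, u, p)
print(sum(annotated) / 10, sum(unannotated) / 10)   # the raw means, for comparison
```

Because only ranks enter the calculation, the single extreme score counts as just one more rank, however large it is.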
And then GSEA, they call it a Kolmogorov-Smirnov test, but actually what they're doing is something similar to the Wilcoxon-Mann-Whitney test. And we'll see that in a second. But anyways, those are the three tests that you'll see. I don't know of any other tests that people use, probably, but... Okay. So everybody remembers the t-test, right? So here's the distribution of scores for the annotated genes and the distribution of scores for the unannotated genes. These are histograms instead of smooth distributions because I couldn't fit numbers with more than one digit on the balls. But in general you might have smoother distributions, especially if you have a lot of genes. Okay. And then the p-value is just the shaded area, right? Is the statistic that I calculated more or less extreme than I would expect? Okay. So why can't we just use that test? Well, there's this assumption that's implicit in the t-test, that the two distributions are bell-shaped. And there are lots of situations in which these distributions aren't going to be bell-shaped. So you want something that's a little more robust to those sorts of changes. The other problem is, as I said before, the t-test tests for differences in the means of the two distributions but doesn't test for arbitrary differences between the distributions. Okay. And so here's one instance where the t-test probably won't give you the right result. So say you have a few outliers. Here's the black distribution, and you can see most of the mass of that distribution lies to the left of the red distribution, right? But there's this awful tail that goes way out here. Maybe you have a few outliers, like someone didn't take off the bad spots in your array, or there's something weird with some of your genes, or for whatever reason you have this kind of long distribution here.
You want to test for differences between those distributions, but because the t-test is based on means and it's not robust to these sorts of outliers, your means might actually end up being pretty close to each other. So the t-test isn't going to report a difference. Now in this case, your set of annotated scores is more extreme than your unannotated ones. And there are different ways of dealing with this. You could take the absolute value of the fold change, but really, in sort of arbitrary cases, you can see distributions like that, and you want to be able to measure differences, to say something different is going on here than is going on here. That makes sense so far? Okay, and here's another one where the t-test doesn't really work that well. These are cases where the gene scores are all positive, right? So let's say you have sequencing data where you have counts, or you have some kind of fold enrichment. So now you have two strictly positive distributions that have their mode at zero, but one of them has got a much longer tail. So this is sort of biased towards this rather than that. There are transformations you could do, but maybe it's better to just use a test that isn't sensitive to these sorts of things. Okay, and so I'm just going to talk for the next little while about the WMW test and the KS test, but for GSEA, anywhere the WMW test is appropriate, GSEA is also appropriate. Okay. Oh, the slide used to have the answers, but I took those off. Okay, so here you can use either the KS test or the WMW test, right? Here you can only use the KS test, because the WMW test is basically asking if one distribution is to the right or to the left of the other, right? It's asking if there's a difference in means. You can't use that here. And then here you can use KS or WMW.
Okay, so what is this famous WMW, Mann-Whitney U, Wilcoxon-Mann-Whitney, Wilcoxon rank-sum test? The question is, as I said already, are the medians of these two distributions significantly different? And it's really easy to calculate this stuff. So you take your gene scores (oh, you can't see that, can you?) and you sort them, in this case from highest to lowest, right? And once you've done that sort, you can translate the gene score into a rank, and then all your computations are done on the ranks, the gene ranks. Okay. And so here you've got the rank sum, and then there's a complicated equation that gives you your statistic. If you're doing the Mann-Whitney U test, this statistic is called the U statistic. If you're doing the Wilcoxon rank-sum test, this statistic is called the rank-sum statistic. But really they're just related by this linear function, so that's why people call it the Wilcoxon-Mann-Whitney test. And as I said before, here's kind of a complicated way of doing it, but really where it ultimately ends up is you calculate the ranks and then you do a t-test on the ranks, and then you get a score, as long as the positive and negative sets are sufficiently large. So I'm showing you all this stuff because I want you to sort of understand what's going on under the hood, but probably no one's ever going to ask you to do this. Okay. So yeah, for the method I described, this t-test on ranks is only appropriate when the black and red sets are sufficiently large and you don't have to do any additional correction for tied scores. But most WMW software, so if you use R, has what's called a tied-rank correction and does the right thing when the sets are small. Okay. And it's robust to a few outliers. Okay. So that's the WMW test. Any questions about that? All right. Now I'm going to tell you what the Kolmogorov-Smirnov test is. All right. So here's our probability distribution.
You separate the histogram into one with the annotation, which I call black, and one without the annotation. You take this probability density and you make what's called a cumulative distribution from it. And the idea simply is that the value at any given gene score on this plot is the proportion of black genes that have that score or less. So here the proportion of black genes with that score or less is like 0.1. So we put 0.1 there. Right. Now the proportion with that score or less is a bit higher, so now we're at, say, 0.2, even though the gene score hasn't increased that much. Anyways, you continue through this thing, and the cumulative distributions of these probability density functions look something like this. Right. The KS test asks what the largest difference is between these two cumulative distributions. Right. And so you can see that's going to work for sort of arbitrary distributions. And that largest difference is the statistic that the KS test calculates, and then, oh goodness, okay, if the null hypothesis is true, that statistic has a certain distribution, so you can assign it a p-value, and that's your KS test. Okay. So this is why you don't use these two tests for everything. Neither test is as sensitive as the t-test. What that means is, if there are real differences and the t-test is the appropriate test to use, these tests are less likely to detect those real differences than the t-test. So they're going to have more false negatives. Right. They can also give you different answers, because WMW detects a difference of medians and KS detects a difference of distributions, but in that case you probably want the answer from the KS test. Okay. Here's the slide that I thought the other one was. So if you have something like this, WMW or KS is appropriate. Here KS is appropriate. Here WMW is appropriate.
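The largest-gap statistic described above can be sketched directly from the two cumulative distributions. The scores below are invented so that both groups have the same mean but very different shapes; the sketch only computes the statistic and leaves the p-value to a statistics package. The names `ecdf` and `ks_statistic` are illustrative.

```python
def ecdf(sample, x):
    """Proportion of genes in the sample with a score of x or less."""
    return sum(1 for v in sample if v <= x) / len(sample)

def ks_statistic(a, b):
    """Largest vertical gap between the two cumulative distributions."""
    points = sorted(set(a) | set(b))        # the gap can only change at observed scores
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Invented scores: annotated genes sit at both extremes, unannotated genes
# sit around zero, and both groups have a mean of exactly zero.
annotated = [-5, -4, -3, -2, 2, 3, 4, 5]
unannotated = [-1.5, -1, -0.5, -0.2, 0.2, 0.5, 1, 1.5]

d = ks_statistic(annotated, unannotated)
print(d)
```

A test that only compares the centres of the two groups has nothing to work with here, while the cumulative curves are clearly apart.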
Most of the time I use the WMW test. Okay, so anytime you see these things, now you know what people are doing. So five years ago, sorry, four years ago now, this paper was published describing GSEA, which really is a new way of doing exactly the same thing that the WMW test does. What's nice about GSEA is that there's a really nice software package that goes along with it, and it solves many of the problems that you need to solve when you do the WMW test. So let me tell you how this works. Let's say you're doing a microarray study and you have a list of genes, and in this case you've done 20 different experiments, and these are divided into two phenotype classes. So say this is some mutant and this is the wild type. You can make a clustergram as per usual, or, in this case, what they've done is rank the genes by how well they're correlated with these phenotype classes. What I mean by that is that things that are high in A and low in B are at the top of the list, things that are low in A and high in B are at the bottom, and the score is the correlation with the phenotype. You rank by these scores, and then what you do with this ranked, scored gene list is compare it to a gene set, which is what I've been calling a gene annotation. So here, all the lines indicate genes that have the given annotation. Let's say these are all genes that are involved in ribosomal processing. Now what GSEA does is make this really nice figure, and the way the figure works is that, as you go from things that are highly correlated with the phenotype to things that are highly negatively correlated with the phenotype, every time you see a gene in the gene set you go up a little bit, so that this is a slight enrichment, and every time you see something that's not in the gene set you go down a little bit, and then you just make this plot.
You go up and then you go down, and what they calculate is the maximum deviation from zero in this enrichment. If you were seeing a random walk, you would go up, go down, go up, and go down, because if there were no enrichment in the gene set, its genes would be sort of evenly distributed throughout the ranking. So you shouldn't be able to go up that much before you come back down. The statistic they use is this maximum deviation from zero, which they call the enrichment score, or ES. Now, for the WMW test and the KS test, we know what the correct distribution is to use for the null hypothesis. Here they don't know what the correct distribution is, but there's a trick that you can always use in statistics. If you don't know the distribution of the statistic when the null hypothesis is true, you can just generate a lot of cases where the null hypothesis is true, look at the value of the statistic in each, and build an empirical distribution. So the question you're asking is: is this ES significantly different from random? You just create a whole bunch of random sets, measure the ES for each, and see what the highest scores you get are, like the top 5% of the ES values. And this is exactly what they do: they take the phenotype labels and randomly permute them across the samples, calculate the maximum ES that they see, and then the p-value that they assign to the ES is the proportion of random permutations in which you get an ES higher than the one you measured when everything was correctly ordered. It's just a trick that's used to calculate the p-value. But ultimately what it's asking is whether there are more genes from the set at the top of the list or at the bottom of the list, which is exactly the same question that WMW answers. Okay.
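The running-sum idea and the permutation trick can be sketched in Python as follows. This is a simplified illustration, not the GSEA implementation. It assumes an unweighted walk (equal step sizes, whereas GSEA can weight steps by the correlation score), and it builds the null distribution by shuffling the gene order rather than permuting phenotype labels across samples as described above. All names and the toy gene list are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Signed maximum deviation from zero of an unweighted running sum."""
    n_hit = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hit  # assumes 0 < n_hit < len(ranked_genes)
    running, best = 0.0, 0.0
    for g in ranked_genes:
        # Step up for a gene in the set, down for a gene outside it.
        running += 1.0 / n_hit if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def empirical_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Proportion of shuffled rankings whose ES is at least the observed ES."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # simplification: shuffle genes, not phenotype labels
        if enrichment_score(shuffled, gene_set) >= observed:
            count += 1
    return observed, count / n_perm


# Toy ranked list in which the three set members sit at the very top.
genes = [f"g{i}" for i in range(1, 11)]
print(empirical_p_value(genes, {"g1", "g2", "g3"}, n_perm=200))
```

The p-value here is one-sided (enrichment at the top of the list), and its resolution is limited by the number of permutations.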
So one other thing that I want to say is about where this ES reaches its maximum. The genes in the set that appear before you get to this highest enrichment level are what they call the leading edge subset. So they treat this as a sort of obvious cutoff. Okay, so what have we learned so far? The t-test is not valid when one or both of the score distributions are not normal. So if all you need is a robust test for a difference of medians, you can use either WMW or GSEA; in this case they're interchangeable. If you want to test for an overall difference between distributions, use the KS test. There are other common tests and distributions that people use, and you might encounter the binomial or the chi-squared. I just put notes down here, but I'm not going to talk about them. Okay, so now we're going to skip over that too, because we're going to come back to it. We're doing lab 3-3, which is using GSEA to evaluate ranked lists. So this is the website from which you can download GSEA. But what we're actually going to do is run GSEA as a Java Web Start application. Now, in order to use the software you're going to have to register, but the registration process is not that hard. You have to give them your email address and your name. If everyone's okay with that, you can just go ahead and do that right now. If not, you can use my email address. I'm not sure if multiple people can use my email address at once, but wait. Okay, good. All right. I'm glad that I asked you to install it, and too bad that I forgot that I asked you to do it. Okay. You're ready to go? Do you want to install the sample data sets? Okay. All right. Open it up. You should see something that looks like this. You probably don't have these data sets, so I'm going to show you how to get them right now. Okay, here's the website that you want to go to. It's also on the wiki. Can everyone see that?
Is there anyone who can't see that? They're still there right now. Okay. So I don't know if you've gone through the tutorial, but we're just going to download the data sets suggested by the tutorial. This first data set here is the microarray data, P53HGU, so just download that and save it. Remember where you save it, because when you open up GSEA you'll have to browse to get the data set. Yeah, that would be a good place. Sometimes you don't have control over that. Yeah, P53_hgu95av2.gct. The .gct just indicates that it's one of these GSEA-specific data formats. It's not a very complicated data format. I've just opened up the text file in Notepad so you can see what the file that you're downloading actually looks like. Typically, if you want to do this with your own data, you already have a file that looks a lot like this. So the columns correspond to different samples, and the numbers are expression levels. The rows correspond in this case to Affy probes, but they can also correspond to gene names. The file format requires two things at the top of the file. This is the number of probes in this case, or if it's not probes, genes, and then this is the number of samples that you have. And then here, these are just descriptions of the samples. This is the gene name, and this is the gene description. They've actually not put any gene descriptions in; you can see this NA. And then these are all the numbers that are associated with it. It has to be tab-delimited text. So there's one other file that you need, and I'll write it up on the board here. The third one, p53.cls. Okay, I've lost my internet. p53.cls is the third one in that list. So I'm just going to open it up so you can see what it looks like. Yeah, you just download it. You should save it on your desktop. I'm just opening it up for people who want to see what the format of this file is.
All this file tells you (go back to the slide there) is the phenotype that we're looking for correlation with in order to rank the genes. You have to provide GSEA with information about what that phenotype is. So in this case, some of these samples are classified as mutant and some are classified as wild type. What there is in this file is, again, how many samples you have, which is 50, how many different phenotypes there are, and then how many rows there are. So here the samples have been classified, so we've got 50 of these text labels, and they're classified as either mutant or wild type. Okay, now you should be in Load Data. I've already preloaded it, but we're going to redo it. So, in the software, press Browse for Files, and you want to load in these two files that you just downloaded. Yeah, you've got to press Load Data and then press Browse for Files. I'm just going to move around so you can see: I've pressed Load Data there, and then you press Browse for Files, and you need to open both of these files: the .gct, which is your microarray data, and the .cls, which is your phenotype classification. I've actually done all the hard work already, so what I'm going to do is start from the beginning: kill this and relaunch. No, I'm just doing it because I've made it too easy. So this is what the screen should look like when you press Run GSEA. The expression data set that you want is the one that you previously loaded. I actually have to reload the data, and I'll just do that quickly. Okay, so now you have a bunch of required fields. You need to put in the expression data set and the gene sets database. These are the files that specify the annotations. And this is one of the reasons people use GSEA so much: they have very extensive gene set data. Okay.
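For anyone preparing their own input files, here is a hedged Python sketch of the two text layouts described above. Some details come from my reading of the GSEA data-formats documentation rather than from the lecture and should be checked against it: the `#1.2` version line at the top of the GCT file, the `NAME` and `Description` column headers, and a trailing `1` as the third number on the first CLS line. The function names, probe names, and sample names are hypothetical.

```python
def make_gct(row_names, sample_names, values):
    """Build tab-delimited GCT text: version line, dimensions, header, data rows."""
    lines = ["#1.2", f"{len(row_names)}\t{len(sample_names)}"]
    lines.append("\t".join(["NAME", "Description"] + list(sample_names)))
    for name, row in zip(row_names, values):
        # "na" is a placeholder description with no spaces in it.
        lines.append("\t".join([name, "na"] + [str(v) for v in row]))
    return "\n".join(lines) + "\n"


def make_cls(labels):
    """Build CLS text: counts line, class-name line, one label per sample."""
    classes = list(dict.fromkeys(labels))  # class names in order of first appearance
    header = f"{len(labels)} {len(classes)} 1"
    return "\n".join([header, "# " + " ".join(classes), " ".join(labels)]) + "\n"


# Hypothetical two-probe, three-sample data set.
print(make_gct(["probeA", "probeB"], ["s1", "s2", "s3"],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
print(make_cls(["MUT", "MUT", "WT"]))
```

The sample labels in the CLS file are assumed to be in the same order as the sample columns in the GCT file.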
So, number of permutations: you actually want to set that to five. And the phenotype labels is the .cls file that you downloaded. This tells it what phenotypes to use. Now, this one is Collapse dataset to gene symbols. This is just doing the identifier mapping that Gary talked about previously. In the expression data set that we downloaded, all the identifiers for the probes were Affy identifiers, but gene sets tend to be associated with gene names. So what Collapse dataset to gene symbols does is take whatever probes you have, or however you've specified your genes, and translate them into gene symbols. And they say collapse because on this Affy array there are actually multiple probes for some of the genes, so if you have multiple probes for a gene, you do the averaging. Then the chip platform tells it how to collapse the data set to gene symbols. And the permutation type you can just leave as phenotype. So I'm going to fill in these fields. The expression data set should be pretty easy, because you've already loaded it and you only have one choice. It says there are 12,000 different probes and 50 different samples, and I guess I don't know what chip NA means. Okay, so now for the gene sets we're just going to follow what the tutorial told us to do. If you don't mind, when you looked at the text file, you may have said this but I might not have heard: the values for the different probes were not transformed, they were just raw values? They were raw values, yeah. They can be transformed, though; there's no problem with them being transformed. So all you need is the probe set and the untransformed expression? Yes. And you don't really need the description line? No, it didn't look like it used the description at all. But a column probably needs to be there to satisfy the formatting. What you can do, if you have Excel, is just copy the gene names. Make an extra column. Make an extra column.
Put in some rubbish and call it descriptions. But what you want to make sure of is that you don't put any spaces in there; sometimes spaces are confusing. Yeah, the NA is fine. Okay, so here is a list of the gene set databases. KEGG is included here. GO is included at some point. I guess MIR corresponds to microRNAs. We're going to use a very simple one. Use the top one in the list, the c1. This just classifies genes by their chromosomal location. Okay? The first one. So the gene sets are actually chromosome locations, which is important, say, if you're looking for differences in gene copy number in different cancers. Is that the best one to use? I'm using it for this particular data set. You can see descriptions if you want: you go to the Molecular Signatures Database on the GSEA website, and there they talk about all their different curated gene sets. So like I said, what we're using is the chromosome one. There's a whole bunch of different gene sets that they curate, and you can select the gene sets that you want to look at based on what you're interested in. Okay. So GSEA takes a long time to run. Remember, I told you it uses these permutations to calculate the p-values. The more permutations you give it, the more accurate your p-value estimates are going to be. But unfortunately, because it takes a long time to run, and we don't want to sit around and wait, we're going to put 5 instead of a thousand. And it's still going to take about 5 minutes. Okay. Now for the phenotype labels, I guess you all only have one choice here; this is the source file of the phenotype labels. And then here it's going to ask what comparison you want to do. In this case we only have two different phenotypes, so there's actually only one comparison we can do, but we can make it go both ways. Can you choose multiple by holding Control? Let's find out. Nope. What? There you go. Only one at a time.
It doesn't really incorporate multiple phenotypes, right? You get stuck at that one enrichment plot. Okay. And so now for the chip platform: because we've asked it to collapse the data set to gene symbols, GSEA has to know how to do that. So you have to tell it what chip platform your microarray data is from. Here, this is the P53 study on this chip platform; that's how they named this data set. So this is actually an easy way to locate the chip platform. It's this one right here, HG_U95Av2.chip. Yeah. And if you haven't found it yet, you just look at this, right? Ah, you press Run. Right down there. Okay. And this is going to take about five minutes. But you can see it's telling you down here what it's doing, and this here tells you the status. If it comes across any problem, it's going to say error. Okay. Has anybody got running? Yeah. Okay. Has anybody got error? If you're getting the error that the gene set sizes are too small, that seems to suggest that when it's trying to do the collapse, it's not collapsing appropriately. So it's not getting the mapping from the probes to the gene names right. They have helpfully provided on the website a collapsed version of the file, and that's what the second one is, P53_collapsed. So I'm just going to save that to wherever it is that stuff saves, and then I'm going to open it with Notepad to show you what the difference is with this file. Don't open it with Notepad yourself, because it seems Notepad might mess your file name up. But now all the probe names, or most of the probe names, have been replaced with gene symbols. Okay, so now let's go back to Load Data, browse for files, and get the collapsed symbols file instead. Once you have the collapsed symbols file, you can use that as your microarray data instead. And now, because you no longer need to collapse the symbols, you put false. I think it might be this stuff.
Because everybody has problems with Vista, I think. You have Vista, right? Okay, I have absolutely no idea why it doesn't work for some people. You have Vista and yours worked? Yeah. Oh, that's because of your permutations. Looks like it's 5, 6, 5. Oh, it's random. So you had different permutations than they did. Yeah. So the permutations are how it's assessing significance. If you have a lot of permutations, you're bound to get the same answers, but if you have a small number of permutations, it's sort of randomly decided. Okay, so I don't know why it's not working for some people. Normally, when you do your own data set, basically what you're going to be able to get is that you got GMT5, right? No, no, sorry, the first one. This very much depends on what microarray you're using. Yeah. So if you're supposed to use the human genome, you want it to last to change it. Okay. So then, in that case, you put your data in. You might have to update the spreadsheet. The spreadsheet would then correspond to the absolutely raw data, with the raw expression values? You could use the raw expression values, yeah. So the first column would be your gene ID or probe set ID, the second column would be the description, and the rest would be all the raw expression values, either log-transformed or not log-transformed? Yes. Yeah, and that's kind of up to you. It'll give you different answers, but it should give you primarily similar answers. So that is the one you're going to put in the first field, expression dataset. The gene sets database would be the gene sets database. You have to decide which one you want to use? From GSEA. Yeah, if you go to the website. And then the number of permutations is fine. And you want to use phenotype for your permutation type. Okay. It's on the website. Yes. You need to have your gene symbols match the gene symbols used in the gene sets. Yeah, but you've designed your own microarrays, right? Right.
If I have a copy of the gene, I can use it. Okay. Right. If I have, for example, a custom microarray, right. Yeah, so I'm not actually sure how to update the database. One thing that you can do is translate from probe names into gene symbols. And that's what's done right here in the fix I made for people who were having problems downloading the chip platform. These are the collapsed symbols, and those are just the gene symbols themselves. So the primary problem that you have is mapping between the gene identifiers that are used in the gene sets database and the gene identifiers on your microarrays. But if you can do that mapping yourself, say using Synergizer, which Gary's going to show you shortly, there's no reason you can't use GSEA. So it does make a little bit of difference, because it's calculating the correlation in order to rank the genes. If you use fold enrichment or log fold enrichment, that's going to change the outcome of the correlation, and in some cases it might change the gene ranks. There's a way to pre-rank. So is that under advanced fields, Gary? Yeah. That's for if you rank your genes outside of GSEA and you don't want GSEA to rank them. Okay, so these are the pre-ranked plots, and these are exactly the same plots that I went over in the slides. Oh, okay, sorry. This is very easy to do. Assuming you get success, you press on success, and I went to the snapshot of enrichment results. Here are the plots. So this is for a specific gene set. This gene set is called chromosome 11 q13, which is just a chromosome location. These black bars correspond to genes in that location, and this is the enrichment score plot. And this is the maximum enrichment, and that's what's been used to assign the p-value. And the p-value they assign is 0, which means it's probably less than 10 to the minus 16. It's probably less than?
It's probably less than 10 to the minus 16. But it might be, just let me have a quick look. No, okay. Sorry, I'm wrong. It's an empirical p-value. Yeah. Exactly. I'll show you. Okay, so I got the figures by going to snapshot of enrichment results, and I got this table by going to detailed enrichment results. So the thing about these p-values is that they are, as I mentioned before, calculated as a proportion: you take the ES that you observe using the data ordered according to the phenotype, and you ask how often the ES you get when you permute the phenotype assignment of the samples is larger than that. So, can we have a coffee break and come back to this during the next lab? Because now I want to tell you what the FDR is, which is something that you've seen, and the family-wise error rate. And then in the following lab, we're actually going to analyze the gene list. So, what? Yeah, we're going to have a coffee break. If you go back to this measure, the gene's rate is double than very often. Well, the problem is that we only did five permutations, which means that your p-values can only take one of a handful of different levels. So how many permutations should you be doing? One thousand. One thousand. Actually, I would recommend doing as many permutations as you're willing to do. So if you run this thing overnight, you could do ten thousand. Really? Why should it? So then how is it possible that people sell these as a package for money? This is free. Oh, that's why it's free. Is that what you're saying? This is totally free. I know. They're probably doing Fisher's exact test. Yeah. I'm going to show you a free package that does that in five minutes. Okay. But people are increasingly using this package, so it's good to know what's going on here. Yeah? I don't understand how. You're not alone in that. Okay. All right. Well, why don't we have a 20-minute coffee break?
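To see why five permutations give such coarse p-values, here is a tiny Python sketch. It assumes the empirical p-value is computed as (number of permuted scores at least as large as the observed one) divided by (number of permutations); the function name is made up for illustration.

```python
def possible_p_values(n_perm):
    """Every value an empirical p-value of the form count / n_perm can take."""
    return [count / n_perm for count in range(n_perm + 1)]


print(possible_p_values(5))     # very coarse steps
print(1 / 1000, 1 / 10000)      # smallest nonzero p-value with more permutations
```

Under this assumption a reported p-value of 0 only means that none of the permutations beat the observed score, so the true value is somewhere below one over the number of permutations.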
And if you want, you can set it up to run a thousand permutations, or five thousand permutations, because we're not going to get back to this for about an hour. Okay, I'm going to go back to the lecture. I'm going to tell you about correcting for that. Hi, everybody. Okay, we're back. All right. So we talked about Fisher's exact test, and for those of you with gene lists and only gene lists, that was what was most relevant to you. We also talked about enrichment analysis with rankings. Some people have fold enrichment, or some way of scoring their genes, and that was what was relevant to you. And then we looked at GSEA. So if you have a lot of microarray data, GSEA is the tool for you. One thing that I should have mentioned and didn't is that we used GSEA in the default mode, where GSEA is actually calculating the ranking based on the phenotypic classes that you give it. But there are other modes you can use, where you supply it with the gene ranking, or you supply it with fold enrichment, or you simply supply it with two different microarray experiments and ask it to calculate fold enrichment. If you look closely at the options, you'll be able to find those different ways that GSEA can work. But at the end of the day, GSEA is really just a way of taking gene rankings and transforming them into enrichment. Okay, but now the last part: we're going to talk a little bit about correcting for multiple testing. You've seen these corrections in both the labs so far, but I haven't really said too much about them. The big problem is that when you're doing enrichment analysis, you're going to be asking for enrichment in a lot of different gene sets. And if you do this again and again and again, you're eventually going to win something called the p-value lottery. So we have to correct for the fact that you can do this. And there are two corrections that people use in genomics.
There's a correction for the family-wise error rate, and there's a correction for the false discovery rate. I'm going to explain what those two things are in this next part of the module. Who's seen FDR before? Okay, so about half of you have seen FDR. What does it stand for? All right. It's good to know that everyone can read. So we don't really need to go through this slide, right? The point is that if you have a false positive rate of 5% and you do 20 tests, you expect one of those tests to be a false positive. So if you're testing for enrichment in a thousand different gene categories with a p-value threshold of 0.01, you'll probably get 10 tests that come in at a p-value of 0.01 or below just by chance. We need to correct for that. There are two ways to correct for that, and one is very easy. So let me just tell you what the corrections are. One is correcting for the family-wise error rate. This controls the probability that any one of the tests is a false positive. It's really stringent. If you do a thousand tests, you get p-values for each of those, you correct those p-values for the family-wise error rate, and you get a final p-value of 0.01, then there's at most a 1% chance that any one of those thousand tests is a false positive. So it's a very, very stringent correction. It actually turns out to be too stringent sometimes, so instead people use what's called the false discovery rate. This controls the proportion of tests that are false positives. So if you do a thousand tests and you get a false discovery rate of 1%, that means that on average you wouldn't expect more than ten of your tests to be false positives. Ten of your successful tests. Sorry, let me rephrase that.
If you do a thousand tests and ten of them are deemed to be significantly enriched categories at a false discovery rate of 0.01, then you would expect on average only 0.1 of those ten tests to be false. So with a hundred significant tests, you'd expect one of them to be false. And it turns out that the false discovery rate is a much less stringent threshold in some conditions. Okay. So there are a couple of ways to control the family-wise error rate, but the most common and easiest way is something called the Bonferroni correction. Everyone's seen the Bonferroni correction before, right? You just take your p-value and you multiply it by the number of tests that you've done. Seems totally reasonable. And if you do that, then this is a bound on the probability that any one of the tests is a false positive. So instead, as I said, you can use the false discovery rate. The false discovery rate starts off being similar to the Bonferroni correction for the first test. What you do is take all the p-values and rank them from lowest to highest. Then you step down this list and compare each p-value to something called the q-value. So at the top of the list for the q-value, it's sort of the wrong way. Okay, so this is sort of from top to bottom. So this is what I was actually talking about. Say you've done 100 tests. The Bonferroni correction for 100 tests would be taking your p-value and multiplying by 100. Or, in this case, it's taking the q-value, dividing it by 100, and comparing that to your p-value. Does that make sense? So the Bonferroni correction says you take all your p-values and multiply them by the number of tests. Say you've done 100 tests and your smallest p-value is 0.01. The Bonferroni correction says that your smallest p-value has now become 1. Which is kind of a weird p-value, but that's what happens.
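The Bonferroni arithmetic above can be sketched in a few lines of Python. The function names are made up for illustration, and capping the corrected value at 1 is a common convention that I am assuming here.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def passes_bonferroni(p_value, n_tests, alpha=0.05):
    """Equivalent view: leave the p-value alone and shrink the threshold instead."""
    return p_value < alpha / n_tests


# 100 tests whose smallest p-value is 0.01 (the other values are hypothetical).
p_values = [0.01] + [0.5] * 99
print(bonferroni(p_values)[0])            # corrected smallest p-value
print(passes_bonferroni(0.01, 100))       # compared against 0.05 / 100
```

Both views give the same decisions: multiplying the p-value by the number of tests and comparing with 0.05 is the same as comparing the raw p-value with 0.05 divided by the number of tests.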
Okay, so what you can do instead is say, okay, I'm going to compare my p-values to this threshold, 0.05. So you want the p-value to be less than 0.05. The other way of doing the Bonferroni correction is that, instead of multiplying the p-value by 100, you change your threshold. So instead you compare it to 0.05 divided by 100, which is the same as 0.05 times 0.01. Okay, so that's what's going on with this slide. Down here, we're comparing the smallest p-value to a threshold, and the threshold keeps increasing as the p-values get larger. The smallest p-value is compared to a threshold of 0.05 times 0.01, which is the Bonferroni correction if you have 100 tests. Then the second p-value you compare to 0.05 times 2 divided by 100, and the third p-value you compare to 0.05 times 3 divided by 100. So the threshold that you're comparing against increases as you go up the list. And you continue going up the list until you find a p-value that satisfies the thing you're testing, which is 0.05 times, I guess, 100 minus the rank plus 1, divided by 100. So in that case that's 97, right? 0.97. So this threshold is greater than this p-value, and this actually becomes our false discovery rate threshold. So the false discovery rate, if we're able to find a p-value that passes this test, is 0.05, and the corresponding q-value is 0.05 times 0.97. That slide is a lot worse than I thought it was. Okay, I'm just going to do it on the board, okay? Who understood that explanation? Question: this is a very funky technique. Is this deeply grounded in statistical theory, or is it a heuristic that somebody found? This is deeply grounded in statistical theory. I'm just telling you how to do it. What we're doing right here is called the Benjamini-Hochberg procedure. Right?
And if you perform this procedure, which I'm going to do a better job of explaining in a second, you're controlling something called the false discovery rate. So you're controlling a bound on the expected proportion of false positives among all the tests that you deem to have been successful. If you do a test on 10,000 different gene sets, you find that 100 are significantly enriched, and you bound the false discovery rate at 5%, then you would expect that at most 5 of those tests would be false positives. So the false discovery rate is a bound on the expected proportion of the tests you called significant that are false positives. I've said that in kind of a complicated way because I wanted to be clear about what it is. Now, the Benjamini-Hochberg procedure guarantees this bound under the condition that either all your tests are independent or your tests are positively correlated with each other. For the most part, when you're doing gene set analysis, you tend to fall under the condition that your tests are positively correlated with each other, so people forget about this condition. But nonetheless, this is the procedure that everybody uses. Okay, so let me just describe it for you on the board, because the slide was awfully confusing. You take your 10,000 categories and you calculate a p-value for all of them. We're going to use Fisher's exact test, because we're going to compare the overlap of your gene set with the category itself. Fisher's exact test is going to give us a p-value for each one of those 10,000 tests, and we're going to take all those p-values and sort them from smallest to largest. So let's come up with some good p-values here. With that many tests I'd expect to get a good p-value of like 2 times 10 to the minus 5. So 0.00002, right?
And then the next largest p-value, let's say, is this, and then so on and so on. At some point I'm going to get to some awful p-values that are probably never going to be significant. So let's say that's one. And now we've got the test, and this is the rank here. Can you see this? If I write in black, can you see it? Maybe I can take this off the floor. No? Okay. So here's the rank, here's the p-value, and GSEA calls this the nominal p-value, which is a good enough name, so I'll use that. Let's say our lowest p-value is 2 times 10 to the minus 5 and our second lowest p-value is 3 times 10 to the minus 5, and I'm just going to write et cetera. Then we get down to the bottom of the list, and at the bottom of the list we have very high p-values, and the highest nominal p-value we can get is 1. Okay. So if we were doing the Bonferroni correction... All right! I don't know why I use slides when I have whiteboards. Okay, so these are the nominal p-values. Can everybody see this now? Can anyone read my writing? Okay. Here's the rank (remember we did 10,000 tests), here's the nominal p-value, which goes from 2 times 10 to the minus 5 all the way up to 1, and then here is the threshold. This is the p-value threshold that we're testing against, and I'm going to come up with something called the q-value threshold. Now, as I said before, if you want to correct for multiple tests using the Bonferroni correction, before you compare to your threshold you have to multiply the p-value by the number of tests that you did. What I'm going to do instead is divide the threshold by the number of tests that I did. So I'm going to take my p-value threshold, which is 0.05, the standard one that people use, and divide it by 10,000, because that's the number of tests that I did, and then I'm going to ask the question: is this p-value smaller than that? Okay.
0.05 divided by 10,000 is going to be 5 times 10 to the minus 6. So the answer to my question is no. This p-value is not significant after the Bonferroni correction. In fact, none of these p-values are going to be significant, because this is actually the lowest p-value. Okay. So now for the q-value threshold, instead of the family-wise error rate, we're going to control the false discovery rate at 5%. Now, when we're testing against the threshold, we're going to take this original Bonferroni threshold and then we're going to multiply it by the rank of the p-value. So this is like p-value threshold (you're never going to be able to read this) times rank. Okay. So now the test we do here is 5 times 10 to the minus 6, the test we do here is 1 times 10 to the minus 5 because we multiplied 5 by 2, the test we do here is 1.5 times 10 to the minus 5, then the next is 2 times 10 to the minus 5, so you can see that this threshold increases as we go down the list, and at the bottom it's going to be 5 times 10 to the minus 6 times 10 to the 4, so we're going to get back to our normal p-value threshold, which is 0.05. Right.
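The whiteboard arithmetic can be reproduced with a short sketch. It uses the numbers from the example (a threshold of 0.05, 10,000 tests, a lowest nominal p-value of 2 times 10 to the minus 5); the variable and function names are illustrative assumptions.

```python
ALPHA = 0.05       # standard p-value threshold
N_TESTS = 10_000   # number of categories tested

# Bonferroni: divide the threshold by the number of tests.
bonferroni_threshold = ALPHA / N_TESTS


def q_value_threshold(rank, alpha=ALPHA, n_tests=N_TESTS):
    """Rank-scaled threshold: the Bonferroni threshold times the rank."""
    return alpha / n_tests * rank


lowest_nominal_p = 2e-5
passes_bonferroni = lowest_nominal_p < bonferroni_threshold
print("lowest p-value passes Bonferroni:", passes_bonferroni)

# Thresholds for the first few ranks and for the last rank.
for rank in (1, 2, 3, 4, N_TESTS):
    print(rank, q_value_threshold(rank))
```

At rank 1 the rank-scaled threshold equals the Bonferroni threshold, and at the last rank it equals the uncorrected threshold of 0.05.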
Now, the advantage of doing this is, let's say that these nominal p-values are all going to be kind of similar to each other, so this is like 3.1 times 10 to the minus 5, 3.2 times 10 to the minus 5, so now when we get down to like the rank 3,000 p-value we're still not that high, right? Let's choose a number that's going to come out more nicely; say at rank 2,000 we're at 4.0 times 10 to the minus 5. So we got kind of lucky with all these p-values; we got a whole bunch of p-values that are all sort of similar to each other. So now the q-value threshold that we compare against is our Bonferroni-corrected threshold, 5 times 10 to the minus 6, multiplied by the rank. But in this case the rank is actually 2,000, so this makes this number very high, right? This makes this number 10,000 times 10 to the minus 6, which is 10 to the 4 minus 6, so it's 10 to the minus 2. So this new q-value threshold is actually 0.01, which is higher than the nominal p-value, right? And now suddenly our luck ran out, and our p-values jump to above 10 to the minus 2 at rank 2,001, right, and then all the rest are going to be larger still, above their own thresholds. So now we cut our threshold here, so our new p-value threshold is actually 0.01. So let me be more clear about this. Because this is the p-value furthest down the list that is still below its q-value threshold, we say that every test ranked 2,000 or better has passed the test, controlling at an FDR of 0.05. And that's the Benjamini-Hochberg procedure. So what it says is that even if your top p-value isn't significant, if you have a lot of p-values that are approaching significance, or you have a lot of small p-values near the top, you can still get significance under a different test. So what the guarantee is here is that of these 2,000 different categories that we deem to show a significant enrichment, no more than 5%, which is 100, on average are going to be false positives, and often
the FDR gives you many more significant categories, or can give you significance when you don't get any significant categories with the family-wise error rate or the Bonferroni correction. Okay. Why? You'll see the FDR with enormous numbers of p-values, but then which do you start with, the Bonferroni or the FDR? Yeah, the Bonferroni correction is what you use first, because it's easy to do and it's a much more stringent condition, but typically people end up, as I said, with the FDR. Okay. And then, so when you do the Bonferroni, right, suppose two particular genes pass that test and come out to be significant... But not genes, it's categories, your annotations. Yeah. Does that mean that those really have passed your stringent test and you should be paying attention to them, or is it that when you go back to your FDR and get a bigger list, you just go with the bigger list and don't even think about the FWER category? You know what I'm saying? I understand what you're saying, I'm not quite sure, but... so there's two conditions, yeah. So anything that passes FWER is also going to pass FDR, right? That's always going to be true. The FDR is only going to give you more things, right? Because the most stringent test you do in the FDR is the Bonferroni correction. So if you've done the Bonferroni correction and you have things passing the Bonferroni correction, they're also going to pass FDR at the same level of significance. Does that make sense?
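The whole procedure, and its relationship to the Bonferroni correction, can be sketched as follows. This is a minimal plain-Python version of the Benjamini-Hochberg step-up rule; the function names are assumptions, and the p-values are invented to have the same shape as the whiteboard example (2,000 small, similar p-values followed by 8,000 large ones), not real data.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag tests whose p-value is at or below alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Flag tests that pass the Benjamini-Hochberg step-up rule.

    Find the largest rank k with p_(k) <= k * alpha / m, then flag every
    test ranked k or better, even if some of them miss their own threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


# Invented p-values shaped like the whiteboard example.
lucky = [2e-5 + i * 1e-8 for i in range(2000)]
unlucky = [min(1.0, 0.02 + i * (0.98 / 7999)) for i in range(8000)]
p_values = lucky + unlucky

bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print("significant after Bonferroni:", sum(bonf))
print("significant at FDR 0.05:", sum(bh))
```

With these invented p-values nothing passes the Bonferroni correction, while the top 2,000 categories pass at an FDR of 0.05. In general, anything flagged by `bonferroni_reject` is also flagged by `benjamini_hochberg_reject` at the same level.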
Also, biologically, if you have additional information about this test, it might help you make that decision. Okay, yeah. I think biologically this would be probably one of the things you would do, if you know a lot about the system, to see whether what's coming up is what you would expect. Yeah, I absolutely agree with that. I mean, I agree that you want to look at the things that pass your test and see whether or not they make sense, but there's one other thing that you have to be concerned about, especially if you're using Gene Ontology sets. As Gary pointed out, the Gene Ontology is arranged as a hierarchy, so there's more specific and less specific sets, which means that there's a lot of overlap between the sets. This is what induces the positive correlation between their p-values. So sometimes you'll see a lot of categories coming up as being significant simply because they contain the same set of genes, right? So you'll get, like, RNA processing and ribosomal RNA processing, which is a subcategory of RNA processing, right? And so then after the fact you have to go back and try to figure out what all these categories mean, and there's tools to do that. So Gary's going to talk about one of those tools that Daniel in his lab has developed, called Enrichment Map, and then in the tool that we're going to see in our lab there's a different way of doing that, which sort of clusters the gene sets together. But I think this is still an unsolved problem: if you have a lot of gene sets that pass significance and you have a lot of overlap in the genes in those gene sets, how you relate that information, or how you understand what that test has told you, ultimately. Okay, so that concludes multiple test corrections. Okay, so we are going to start our last lab, and we're going to attempt to do an enrichment analysis on a gene list, but in order to do that enrichment analysis you're probably going to have to do ID mapping from whatever identifiers you used in your gene list
to identifiers that are going to be recognized by the tool. So Gary's going to spend a little bit of time telling you about various ways of doing that ID mapping. And also the tool that we're going to be using, GoMiner, sometimes requires you to put in a background set. So if you know the genes that are on your array, you might have that background set in hand. If you don't have that background set in hand, you can get that background set from BioMart, and so Gary's also going to show you how to use BioMart. Yeah, that's problematic, right? Not all genes, for example, are going to be expressed in that tissue. So that depends on your analysis. So what I would do in that case is I would get the list of all genes that are expressed in that tissue, and then I would get the list of all genes that respond to your perturbation. The genes that respond to your perturbation are your gene set, and the background is the set of genes that are expressed in that tissue. But with BioMart I think you can actually filter for genes expressed in a given tissue. You may also have that from your experiment, if you are analyzing a specific tissue. That's hard; it's actually hard to figure out what's in the tissue, because not every method is as sensitive as you would expect, so some of those levels come out as zero. So, um, oops. So this is a continuation from this morning. I just wanted to spend 5 or 10 minutes showing you the Synergizer and BioMart tools that I mentioned, just because we didn't get a chance to do it this morning, and some of you, I know, I've seen have already figured these tools out. They're very simple, so hopefully I'll only need to quickly show you this. So the idea for that exercise, and you can do that later as part of this lab or whenever you have time, was to take a gene set, and the one that's available on the wiki is a set of yeast genes, and those yeast genes are represented as gene
names, use Synergizer to convert the gene names to Entrez Gene IDs, and then input the Entrez Gene IDs into BioMart, get the Gene Ontology annotation, and then look at the results to see how many times the different Gene Ontology terms appear and how much of the evidence comes from different evidence codes. And that's printed in your book. The goal of that little exercise is just to get you to try out these tools with a given gene list. If you have your own gene list you don't really need to do it that way; you can use your own gene list. So let me find the yeast gene list that's on this computer. I'm just going to cut and paste it. Oops. So I just noticed that with this file, the yeast genes that you may download from the wiki, in a text editor it's just all one line, and that's just a Windows/Macintosh formatting issue, so you can open it with WordPad and it should fix that problem. So now each gene is on one line. So I'm just going to copy all of these genes, which are yeast gene names, and I'm going to put them in the Synergizer, wherever that went. Okay, so Synergizer. The first thing you do is pick an authority. This is the source of the gene ID mappings. So remember, this is really just all about mapping different identifiers, like Affymetrix to Entrez Gene or Entrez Gene to UniProt, and the one that we typically use is Ensembl, because Ensembl supports a lot of different species. So if we pick Ensembl, then we have a lot of different possible species that Ensembl supports, and you pick the species that you're interested in. In this case, just for the purposes of this exercise, I'm picking yeast. And then it says "from namespace" and "to namespace". From namespace is the identifier that you are giving to Synergizer, and to namespace is the identifier that you get from Synergizer. And the nice thing I like about Synergizer is that it shows you examples of the identifiers here. So one of the problems
sometimes with these other tools is that they don't give you examples, and there's 20 different ID types, but you don't know the difference between UniProt accession and UniProt ID; there's like three different ones that are called UniProt. So this is one of the reasons why I recommend this tool, because it's basically just easier to use for the simple reason that it gives you an example. So you can look at your file (where'd it go?). So in this case I look at this WordPad file and I see everything is of this format: all the genes look like YHR and a number and a letter. So in Synergizer I can pick the ones that look like that, these guys: Ensembl gene ID, peptide ID or transcript ID. So I'll just pick one, Ensembl gene ID, and then I want to convert it to something else; say I want to convert it to Entrez Gene ID, which is a number, but there's other ones here you can play with. Let's pick Entrez Gene ID, and you just paste the genes that you want to convert in this box. The authority is the source of where Synergizer gets the mapping information. So some of these other authorities are related to specific organisms, like PomBase is just for yeast, pombe yeast, and EcoCyc is just for E.
coli, WormBase is just for worms. Ensembl is a general all-around one. This Ensembl 49: Ensembl has different versions; every few months they come out with a new version of the Ensembl website. You can see what the current version is; I think it's at 54 or something. So I'm not going to pick 49, because for some reason it's in there but it's probably older. So I'm just going to pick Ensembl, and hopefully this is the latest version of Ensembl, but it doesn't actually tell you that here. The question was sort of, what is the authority? So that's the source. Anyway, so it's pretty simple. You can choose to output the results as a spreadsheet, in which case you get an Excel file back, which might be useful for you. But if you don't do that, when you submit it just gives you all of the results here on this side, and you can copy and paste those into Excel. It gives you the original ID that you gave and the Entrez Gene ID, and it says that IDs in red are not recognized by Ensembl in yeast. So if you go down here, there's one thing that wasn't recognized here. So this one wasn't recognized; it doesn't get an Entrez Gene ID, but all the others actually get a nice Entrez Gene ID, so that's good. I think all of the others... this one too doesn't get an ID. So that's a good example: sometimes it recognizes the name, it's not red, but it doesn't know the ID. And so that's a case where that ID may not be in Ensembl, sorry, I keep on mixing up Ensembl and Entrez Gene ID; the Entrez Gene ID for that gene may not be in Ensembl. Ensembl doesn't know about it. Maybe there is actually an Entrez Gene entry for that, and you could find it perhaps by searching in Entrez Gene. And so that's what we were talking about before: usually when you do this, the majority of genes come back with a nice one-to-one correspondence; you have a gene and you have an Entrez Gene ID. But if you were able to put this into Excel and sort it you would see the
exceptions. So there's some exceptions that are red, in which case you may want to fix this gene name; maybe it's a problem with your gene name, maybe this gene name needs a dash or something in it. There are some where they're recognized but you don't get anything back, so you can verify whether there is in fact an Entrez Gene ID for that one by going to Entrez Gene and searching for it. And I actually noticed one other exception here, which was there's multiple gene IDs for this gene, and I'm not sure what the reason is, but this is sort of where just investigating it more comes in. So if we actually put these different Entrez Gene IDs into Entrez Gene and compare the different records, you'd be able to see why they're different. Entrez Gene is not supposed to be redundant, it's supposed to be non-redundant, so it would be surprising to me, and it would be a mistake in the Entrez Gene database, if there was really a redundancy. But so let's actually... the gene name is YCL067C. It may have been split. So anyway, we can look. Oh, okay. This is a known gene that's present in multiple loci. So it's the mating factor gene that skips around, so there's actually three different places for this gene, the active site and the two different mating factors, so that's probably why Entrez has those different loci in different places. I'm sure Francis can tell us more about that. Sorry? Yeah, okay. So that's an explanation for why that happens; it makes perfect biological sense, and you can decide whether you need to correct for that or just choose one or the other. There's another example of two different Entrez Gene IDs, which have got... Yeah, you mean some other examples where two genes come back? Not only two genes, but two... So look at the 7th one, the 7th or 9th, something like that, from the top, those YPR080W and YPR... Yeah, so these also have two. This is the
same problem, same two numbers. Okay, yeah. So again, you'd have to go look at these cases, and some of them may make biological sense like the other one, and some of them may be mistakes in the database, which you could fix. So this is repeating what I said before: typically you get 90% or 95% success, and then for the extra little things it pays to go and manually check them over, just to make sure that you're using the right identifier. As I said, using the wrong identifier will potentially cause a serious error. So that's Synergizer. You can try it out with your own gene lists. It supports plants and animals, quite a lot of different species here, all the sequenced genomes basically. Actually, Arabidopsis is not present in here, because I guess the lab is not currently using it. So yeah, maybe this isn't the best one for Arabidopsis, but for many other genomes it's pretty good. Later we'll mention another resource that's useful for Arabidopsis, and I just have to test if it does gene ID mapping. It's the Botany Array Resource or Bio-Array Resource, BAR, at the University of Toronto. The link is on the wiki; for all the people that are working in plants, there's a wiki link about that. Anyway, so that's Synergizer. I'm going to click this one-column view here and I'm going to copy all of these Entrez Gene IDs. Oh, okay, that's a good one. I haven't tried NCBI in this particular case, but yeah, good point. So the point was that if you use another authority like NCBI, they have Arabidopsis here and a bunch of other organisms, so that's nice, right? That's a good point. So that's something I mentioned this morning that it's good to try. You may not get 100% coverage from one authority, and also there may be version issues; one authority is using an older version. So if you use different authorities and compare them, you might identify some mistakes. So you can cross-compare mappings from different authorities and see if one is better than the other, or you can
evaluate if one is better than the other, for instance, or if you should combine them both. Okay, next I'm going to show you Ensembl BioMart. BioMart, as I mentioned, is a one-stop shop for a lot of different types of gene annotation data, and there's a lot of places where BioMarts exist, but the one that I find particularly useful is Ensembl. So it's ensembl.org, E-N-S-E-M-B-L dot org, and then when you go to the Ensembl site there's a link here that says "Mine Ensembl with BioMart". So Ensembl is a genome browser; you can browse around Ensembl if you're not familiar with it. But if you click "Mine Ensembl with BioMart" you get the BioMart screen popping up, and, as I showed you this morning, you can enter different types of data. How much time do I have? Okay. So I reached Ensembl, and I'm just going to take you through this fairly quickly. But the key thing: it's very easy to use once you understand the sequence of events that you have to follow, and that sequence of events, I find, is really not obvious for first-time users. So I'm just going to show you the sequence of events; I mentioned it this morning as well. So first you choose a database. That's obvious, because you can't really do anything else. It's at ensembl.org, and there's a link called "Mine Ensembl with BioMart"; it's right here. Does everybody see that?
Anybody? But it's a little bit different, and also there's a lot of different BioMarts. So you choose a database. There's a bunch of different databases here. The one that has the most information about genes is Ensembl, and it will have a number after it, which is the version of Ensembl, which increases by one every couple of months. So I'm going to choose Ensembl 54. You can play with these other ones, but there are many different types of databases. And then it says "choose a dataset", and really it should say "choose an organism", because this is all the organisms that are supported by Ensembl. And again, Arabidopsis isn't here, but Ensembl is currently expanding their system so that it covers complete genomes including bacteria, plants and everything else, and I expect that that will eventually make its way into the system. I'm not sure when that's going to happen, but there may be a BioMart for plants; I don't know if you guys know. I was using yeast genes, so I'm going to select Saccharomyces cerevisiae. So we've selected Ensembl and then we've selected yeast, and that's the easy part. The hard part is now knowing what to do. So on the side you see "Dataset"; it says Saccharomyces cerevisiae, that's good, it's from SGD, it's a particular version. And then it says "Filters: none selected" and "Attributes", and there's a couple of attributes. These are default attributes; you'll always get these back. But the idea is that, as I mentioned this morning, you first define some filters. So the idea with BioMart is it starts out with the entire genome and then you define some filters to narrow down to the set of genes that you're interested in. So there's lots of different types of filters. In this particular case I have Entrez Gene IDs and I want to just get more information about those Entrez Gene IDs, so I go to "Gene", since the filter is related to genes, Entrez Gene IDs, and there's an "ID list limit" thing, which I check, and then I can choose an identifier type to
limit to, and Entrez Gene ID is one of the standard ones that's in this list, so I can click that and I can paste all of my Entrez Gene IDs that I found before in this box. So all of those are just put in here. And now the next thing to do is to test that BioMart is actually working, and the way to do that is to click "Count" at the top here. This is a part that's not very obvious, but if you click Count it basically tests that it recognizes the data. So it should say the dataset is now 331 out of 7,124 genes. So Ensembl BioMart for yeast understands 7,124 genes, and of the genes that I gave it, it selected 331. And ideally you would look in your original list and check: if you knew that you had 500 genes, you'd expect to have some number that's close to 500. Unfortunately, Ensembl BioMart doesn't really tell you easily right here which ones it doesn't recognize; it just tells you that it recognizes a certain number, so you have to kind of know the number that you expect. So that's it for filtering. You can filter by other types of criteria, which you can experiment with. But the next step, the third step after selecting the dataset and the filters, is clicking on "Attributes", and this is where you can select all sorts of information about these genes. This is the shopping mall; this is the "mart" part of BioMart, like the Walmart thing. There are different types of information. So there's information from Ensembl, all of these things about GC content of the gene, or the source, or the status, or the strand. There's also a lot of external information, like Gene Ontology: GO IDs, GO evidence codes that I mentioned. There's references to other databases; you can get Affymetrix IDs. So I'm going to select that. I can select protein domains; maybe I'm interested to see the InterPro ID and the InterPro short description, transmembrane domains, signal domains, lots of things. That's just in this "Features" thing. If you click "Structures" or "Variations" or "Sequences" or "Homologs" you can get
more information, like you can download all the sequences of the genes, or the DNA, or specific exons, or all sorts of different things. So this is a really neat tool. When you're finished, you click up here at the top left on "Results" and it gives you a preview of what it will be giving you. And so we gave it some IDs, and it gives you the Ensembl gene ID and the Ensembl transcript ID, and this is the stuff that we asked for, GO IDs and InterPro. So this has a homeodomain-like domain, and this is a homeobox domain. It only shows you the top 10 by default, but you can select to show all of them if you want, and you can show it as comma-separated values or tab-separated values. If you're having a problem with a lot of repeated elements here, which sometimes happens, you can click "Unique results only", but that only works if the entire row is the same as another entire row; if there's any differences it won't be able to merge them. So one of the things I notice here is that I gave it Entrez Gene IDs but I don't have Entrez Gene IDs in my results. So that's going to be a problem, because I want to match up my Entrez Gene IDs to various different types of information, so I need to see that in the result. So I can go back to... A good question. So the question is, why does the GO description not come up? And it's supposed to come up, and Francis spoke to someone who works upstairs who built BioMart, and he said that it's a problem with Ensembl: this particular version is missing the yeast GO description. I noticed that someone else was using this for human, and the GO descriptions did come up for human, so it's just a bug in Ensembl for yeast in this particular query, and hopefully they've been notified now so they can fix it for the next version. Sometimes I notice that happens with Ensembl, because a lot of the data comes from a large computational pipeline which takes weeks to run, and occasionally things break in there and they
don't notice, it's such a huge system, and usually it's fixed in the next version. The other thing: if there's some particular piece of information that you really need and it's not there, you can go back to any previous version of Ensembl from the Ensembl homepage, and there'll be a BioMart that works on that previous version, and you should be able to get the information; they maintain all the old versions. So I was saying we need to see the Entrez Gene IDs here to make the connection between the gene IDs and the evidence. So I'm going to go back to the attributes, and I'm going to select, under "External", Entrez Gene ID here, and every time I select a check box here it comes up over here, so this is the list of things that I want. So now I'm going to go back to Results, and now the Entrez Gene ID is here, so that's great. Now I can match up what I gave as my original data to these other things, and now I'm happy. This just showed me the first 10 results, as a check to make sure that it's working, and now I can save the results to a file in different formats: one of them is Excel, one of them is HTML, one of them is tab-separated values.
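Once the results are saved as tab-separated values, the counting part of the exercise (how often each GO term appears, and how the evidence codes are distributed) can be scripted. The following is a minimal sketch: the column headers, the function name, and the sample rows are all invented for illustration, so check the header line of your own export and adjust the names accordingly.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented sample rows standing in for a BioMart TSV export.
# The column names here are assumptions, not BioMart's actual headers.
SAMPLE_EXPORT = (
    "EntrezGene ID\tGO ID\tGO evidence code\n"
    "1001\tGO:0000001\tIDA\n"
    "1001\tGO:0000002\tIEA\n"
    "1002\tGO:0000001\tIMP\n"
    "1003\t\t\n"
)


def summarize_export(handle):
    """Count GO terms and evidence codes in a tab-separated export."""
    go_term_counts = Counter()         # number of genes annotated per GO term
    evidence_counts = Counter()        # number of annotation rows per evidence code
    terms_by_gene = defaultdict(set)   # Entrez Gene ID -> set of GO IDs
    for row in csv.DictReader(handle, delimiter="\t"):
        gene_id = row["EntrezGene ID"]
        go_id = row["GO ID"]
        if not go_id:
            continue  # gene came back with no GO annotation
        if go_id not in terms_by_gene[gene_id]:
            terms_by_gene[gene_id].add(go_id)
            go_term_counts[go_id] += 1
        evidence_counts[row["GO evidence code"]] += 1
    return go_term_counts, evidence_counts, terms_by_gene


go_term_counts, evidence_counts, terms_by_gene = summarize_export(
    io.StringIO(SAMPLE_EXPORT)
)
print(go_term_counts.most_common())
print(evidence_counts.most_common())
```

For a real export, replace `io.StringIO(SAMPLE_EXPORT)` with an opened file, for example `open("results.tsv", newline="")`, where the file name is whatever you saved from BioMart.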
If I click HTML and I click Go, then I just get a web page with all of the information. Um, looks like it gave me a text file. Oh, cancel. That's because I asked for a file. Oh no, yeah, maybe, I don't know. Anyway, I guess it's an HTML file that you can open up. If you want to see it on the web, I guess you have to click here, click "All", and then it goes to another web page and it will show you all of the information. And the nice thing about viewing it on the web is that you can click on these things, and if you click on any of these links you'll find more information about that. So this links to InterPro and tells you more about the homeobox domain, including nice pictures and other things like that. So this is kind of a nice jumping-off point for information about your gene list: if you give it a set of genes, you can ask for information back, and you can make a custom web page that you can click through to get more information about all of the things. So it's really nice for just looking through long lists of genes with only the things that you're interested in seeing, and you can also save things to Excel, which is useful for follow-up work inside Cytoscape or other places, GSEA. Any questions? So, fairly simple; try it out during the lab. So I'm going to pass it back to Quaid. Okay, I guess we only go up to 5 o'clock. I read 16:30 as 6:30, just something I tease my wife about all the time. No one's going to tell her that I screwed that up too. Okay, so we're going to use GoMiner now. GoMiner is a nice, easy tool to use. You can also do GO enrichment analysis from inside Cytoscape, but GoMiner for me is very straightforward, so that's what we're going to use. And the nice thing about GoMiner compared to FunSpec is it does handle other organisms, and it allows you, to some extent, to filter by GO evidence code. So, all right, I thought I'd bookmarked it for you, but I probably closed it. So the way that I always find these things is I go to Google and I say GoMiner. That's how I
find BioMart too. So if you're worried about finding BioMart and you can't remember how Gary told you to do it, I just put BioMart in here and look, there's BioMart, and I've got to choose Ensembl, and there we go, now I'm in BioMart. So that was pretty easy. So, GoMiner. Okay, so GoMiner actually has a standalone application, but I just use the High-Throughput GoMiner setting; I like using things on the web, and then I just go to the web interface. Okay, all right. So GoMiner, like all these other tools, has to do three main things for you. The first thing it's got to do is it's got to be able to match up what it calls the changed set, and that's your gene list, with the background list. If they use different identifiers, GoMiner is going to get confused, and I'm going to show you how it gets confused in a second, but we're going to do the easy thing first. And then the other thing that GoMiner has to do is it has to be able to match up your background list and your changed set list with GO annotations. Okay, and so if GoMiner fails at one of those things, it's going to complain to you when it fails. We're going to try to make it fail, but first we're going to make it succeed. Okay, so step one: this is what they call the total file here; I don't know why they don't just use "background". But the nice thing is you can actually ask it to auto-generate, so we're just going to do that for the time being: we're going to ask it to auto-generate our background set. But if you have a specific background set, you're going to have to upload it using that browse box. And the pointer went away; I probably took it to my seat, so I'm going to use this thing instead. Okay, so your gene list, you actually put it in step two. So step two is the changed file. So for the gene list I'm going to use that module one yeast gene list that Gary's been using, and these are the genes that are involved in GAL4. Okay, and that's on my desktop. I'm going to upload it right now, but I'm
also going to go back to the desktop to show you what it looks like. It's pretty easy; it's just a list exactly like that, right? So one gene identifier per line, that's all you have to do. Okay, and we're going to kind of hope for the best with GoMiner, that GoMiner will be able to recognize everything. So now step three is we have to select the data source, and what that means is this is where GoMiner is getting their GO annotations from. And so there's only one data source that provides GO annotations for yeast, and that's SGD, which is the model organism database. Some of these organisms... here's plant right there, so we're covering plant. Some organisms, like Homo sapiens, human, have multiple data sources that you can get annotations from. In that case you have to put in the identifiers for the data sources in a semicolon-separated list here. I've never actually tried that; I don't think it's hard. But SGD, that's pretty easy. Okay, and now you have to select the organism, so that would be yeast. Okay, and this is where we get to select the evidence codes. We get to choose the evidence level from the list, so we don't have arbitrary selection here; they've ranked the evidence codes in terms of how suspicious they think they are. So the most suspicious evidence code is IEA, so only if you select "all" do you get the IEA. And then GoMiner feels the RCA evidence code is the next most suspicious, so you can get rid of that, and then it feels that NAS is the next most suspicious. But down here you can also do a semicolon-separated list if you don't like their ranking. I'm pretty happy with evidence level 2. So IEA, these are ones that are totally electronic. RCA, these are annotations that have been imputed by computational methods, but someone's published them, so in that sense someone has looked at the publication and uploaded them. I don't fully trust those, even though I'm in the business of predicting gene function, okay, because those
are my competitors. And enhanced names: you've got to take enhanced names off, because if you choose UniProt as the source you need the enhanced names, but we're not choosing UniProt. Okay, and then you can choose just nominal p-value or FDR; we're going to take both as the constraint. And then, so remember you always have to do this multiple test correction, and when you do this multiple test correction you lose significance if you do too many tests. Now, one way to ensure that you're able to detect some statistical signal is to reduce the number of tests that you do. You can't look at your data before you decide which tests you're going to throw away, but you can throw away tests based on whether or not you think they're going to have sufficient power to give you a significant result. And with gene sets that are too small, you're never going to get sufficient statistical power. So what GoMiner allows you to do is set a threshold on the smallest gene set that you can consider. It puts that threshold at 5, which is kind of decent. Like, we throw away anything that's 5 or less; actually, we throw away anything that's 10 or less. What that essentially means is you're doing so many tests that you're never going to get sufficient statistical power in small categories, so just never do that test in the first place. Okay, and by restricting the number of tests you do, you restrict what you have to divide by in the Bonferroni correction. Okay, and the CIM thing I'm going to show you after we've done the analysis. This is one of these ways to deal with the fact that many of the gene sets we're going to be looking at contain the same genes over and over again. Okay, and then we're just going to look at the biological process, but you can choose all the categories if you want. Let's go crazy. No, I'm going to choose biological process, because I don't want to go off the script. And so this analysis actually takes a few minutes.
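The point about restricting the number of tests can be sketched in a few lines of Python. This is only an illustration, not how GoMiner itself is implemented: the gene sets, the `min_size` parameter, and the choice to keep sets with at least `min_size` genes are all assumptions made for the example.

```python
# Minimal sketch: dropping small gene sets shrinks the Bonferroni divisor.
# The category names, gene names, and min_size boundary are hypothetical.

def filter_gene_sets(gene_sets, min_size=5):
    """Keep only gene sets with at least min_size genes."""
    return {name: genes for name, genes in gene_sets.items()
            if len(genes) >= min_size}

def bonferroni_cutoff(n_tests, alpha=0.05):
    """Per-test p-value cutoff after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("need at least one test")
    return alpha / n_tests

gene_sets = {
    "category_A": {"g1", "g2"},
    "category_B": {"g1", "g2", "g3", "g4", "g5", "g6"},
    "category_C": {"g2", "g3", "g4", "g5", "g6", "g7", "g8"},
    "category_D": {"g9"},
}

kept = filter_gene_sets(gene_sets, min_size=5)
cutoff_all = bonferroni_cutoff(len(gene_sets))   # divide by every category
cutoff_kept = bonferroni_cutoff(len(kept))       # divide only by tested ones

print(f"all {len(gene_sets)} tests: p < {cutoff_all}")
print(f"kept {len(kept)} tests: p < {cutoff_kept}")
```

The filter uses only the size of each category, never the experimental results, which is what makes it legitimate to apply before testing.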
So I'm just going to put in my email address and I'm going to submit the query. Okay, and if you're successful, GoMiner will tell you that you're going to be emailed, as just happened. I'm going to try to be unsuccessful to see what it looks like, so you don't get upset if something weird happens. Okay, and the way that I'm going to be unsuccessful is I'm going to use an unsuitable background set. So to get this unsuitable background set I'm going to go to BioMart, and I always find BioMart by putting it into Google and then going to BioMart and choosing Ensembl. Okay, and as Gary said, you've got to choose the Ensembl database, and remember we're using yeast, so we've got to go down and find Saccharomyces. Okay, and the major thing I use Ensembl for is to get a list of all the genes in the genome; well, that's one of the things I use BioMart for. But for the filters we're going to be a little bit more specific, so we're going to limit... how do I do this again? Yeah, we're going to limit to genes that have GO IDs, right? So I don't know if any of you are yeast biologists, but 7,000 genes seems like a lot of genes for yeast, and that's because they're including RNA genes, non-protein-coding genes. So what we're going to do when we establish our background set is limit to ones that have GO IDs, so these are ones that actually have been annotated. Okay, and then I'm going to go to Attributes, and I just want a single list, so I'll actually take some of these features off. Okay, and then I'm going to go to Results to make sure that I got the thing that I need. Now this is actually, oh, this is perfect, actually, these are what I need. So these are the systematic names for yeast. Oh, no, no, no, actually I have to make a greater effort to screw up. So instead of using the systematic names for yeast, I'm going to use the common names, so the gene symbols for yeast. So I chose SGD ID. It took me a while to figure out which one that was, but the way that I
do that is I always go to Results to get a preview of what I'm doing. And now, for some reason you have to click "unique results only", because sometimes you get the same row over again, and then I'm going to ask BioMart to give me a text file. Okay, so let's open up Notepad and, oh, it looks awful. But you know what we're going to do, we're just going to save and hope for the best. Okay, yeah, so what's happened is Notepad doesn't read newline characters from different systems appropriately, so you get everything on the same line. Okay, but now I have saved that file, which I believe to be an inappropriate thing for GoMiner to use, and I'm going to try to make GoMiner fail. Why is it inappropriate to select this? It's the wrong ID. It's the wrong ID, so the list that I put in has systematic names and the background list has common names. And I managed to get GoMiner to fail doing it the other way, so I'm going to try to get GoMiner to fail using this way. Okay, and so let's go to Recent Places, oh, Documents, here we go. It's down somewhere that I don't know where it is, so I'm going to go back and get BioMart to send it to me again, and I'm going to open up WordPad. Okay, I'm going to do this now and hope for the best. That is not going to work, but mart_export is going to tell me where it is. It's on the desktop. Okay. Okay, GoMiner. Okay, while we're waiting for this to happen, I'm going to check my email to see whether or not I've got the file yet. Okay, has everybody else sent their thing to GoMiner? Okay, so it's going to take a little while then. All right. Okay, where do you want me to start? Okay: search for BioMart, then, right, Ensembl. Ensembl, yeah. Choose dataset: choose the data set as Ensembl, and the dataset, that's just the organism name. Yeah. Okay, so filters. If I don't filter I get 7,000 genes, and that includes non-protein-coding genes, so I don't really want those in my background set. So I'm going to filter for things that are protein coding, and you can actually do that here, I believe. I'm not
sure. Oh, there we go, you can filter by gene type so that you only take the protein-coding genes. Do you see that? I want to see how many protein-coding genes Ensembl thinks are on the yeast genome. Okay, Ensembl thinks there are 6,600, but then I can filter it even more by limiting to genes that have GO IDs. Oh yes, yes. Okay, now I'm limiting to genes with GO IDs, and I press Count again to see how many I've got, and I've now got 5,900, which seems a bit more reasonable to me. So I know there's around 6,000 protein-coding genes. We're currently filtering the background? Yeah, so this is just to establish the background. Now, if you have some other reason, there's other ways to do the filtering, which I can show you afterwards, or I can show individual people. If you choose human, you can filter for expression in certain tissues, for example, right? So it's worth your while to look around at all the different filters you can use. So here you can actually filter by chromosome, you can filter by chromosome region by putting in the chromosome coordinates. I mean, BioMart on Ensembl is really powerful. The first thing I always do is go and look at BioMart to see if it can solve my problem. You can filter by protein domains, which is pretty cool. Look at this, you can filter by whether or not they have a transmembrane domain. And what you can filter by depends upon the species that you choose as well. Okay, but we're just going to filter by GO ID, so this seems like a reasonable background. Okay, but then we have to change the attributes, because I want to have a file at the end of the day that I'm going to be able to upload into GoMiner, and by default BioMart applied to Ensembl gives you the Ensembl gene ID and transcript ID, which in this case is actually helpful to me. Sometimes the Ensembl gene ID is not what you want, because it's ENSG blah blah blah blah, and Ensembl likes that but not very many other people do, so you have to go around looking for the gene symbol. So I'll show you how to get to gene
symbol for yeast. Usually the gene symbol that you want is under External, and in this case they're calling it the SGD ID. Okay, so I've just opened up External and I've chosen the SGD ID, and the way that I actually figured out that SGD ID was the thing that I wanted is I looked at Results, and it gives me a little preview, and these things look about right. These are clearly gene names here, and if you're not sure about that you can just make it bigger, right, and then you've got more examples to look at. These are gene names; the ones that are still systematic, these are genes that haven't been assigned a name yet. Oh yeah, let me go up to the top: in the view you can choose the number of rows that you see. Yeah, so 10, mostly I just use that, because that gives me a sample of what I've actually chosen. But now that I've got what I need, I can get it to give me a file, and like I said, you have to choose "unique results only", because sometimes you get duplicated results. I'm not sure why that happens, probably some database lookup thing, but there you go. Okay, so now I've chosen "unique results only" and pressed Go. Okay, so I'm going to try saving this file and then looking for it. If this were my laptop it would be easy, but since I don't know Vista really well, I'm going to try. Okay, so I've just saved the file that I've gotten back. Okay, so let's try mart_export. Okay, good, now where do you live? All right, that's what I want. Okay, I'm actually going to have to open this file up in WordPad. I didn't find it. Does anybody know Vista a little better than I do? You can tell me. The file is always called mart_export.txt. I did do a search, but I'm looking for the location. Yeah, it's the file, it's the second... it's Downloads, open Downloads and then right-click. Yeah, it's on the desktop, right-click. Okay, hopefully this is the right mart_export. What's the latest date there? I'll use this mart_export instead. Okay, now I'm going to dump it somewhere I can find it again, so I'm going to dump it into
Documents. Okay, now we're going to go back and we're going to make our attempt to make GoMiner fail. Has anybody got the results from GoMiner yet? Yeah? Okay, looks like I got mine too. Yeah, okay, so I'm going to browse the results. Okay, all we have here is just a summary of the tests that we did. These are the genes that we uploaded, right, that file should be familiar. These are the parameters that we used, and this is the database version that we used, and this is important because the GO annotations get updated all the time, so it's a good thing to report. Okay, and you can download the whole thing or you can browse your results in HTML, so you can download this thing to your computer and then you can do HTML browsing locally, but I'm just going to do it here. Okay, and so I want to look at the results for each of the changed files, so that brings me here. Okay, so the first thing I'm going to look at is the gene category summary. Okay, so what we have here, I wish I had my pointer, but at the very top, these are currently sorted by p-value, and here the reported p-value is the log base 10 p-value, so this would be something times 10 to the minus 10, because this is minus 9, minus 9.58. Okay, enrichment here is just fold enrichment, and so this is the total number of genes in this category, which is 200... is 209 (and I think this is log fold enrichment), and the number of changed genes in this category, which is carbohydrate metabolic process, was 39. All wonderful, good. Let's see if I can lose this one too. Okay, and so here we are, right, and then that's the log base 10 p-value, and here's the false discovery rate, and the false discovery rate is effective. Okay, now this file is from a GAL4 screen, so we shouldn't be surprised that we have carbohydrate metabolic process coming up here, and cellular carbohydrate metabolic process. And now if you click through, these are actually probably related to each other. So this is the description of it. Okay, so this is, this is
where it sits in the GO hierarchy, and cellular carbohydrate metabolic process, as we thought, is actually a child of carbohydrate metabolic process, which means that carbohydrate metabolic process contains all the genes that are in cellular carbohydrate metabolic process. So now again, here's a list of all the categories. We can actually download that as an Excel file, which is useful, but then we can also look at this sort of cluster analysis. So what's happening in the cluster analysis: I guess this is a two-way hierarchical clustering of the annotation matrix. What I mean by that is, along the x-axis are all the different GO categories that there's significant enrichment for, and along the y-axis are the various genes that participate in those categories, and a red element here indicates that that gene is annotated in the specific category. So you can see that there's these blocks, and what these blocks indicate is that there's a set of categories that all contain that same gene set, and there's a set of genes that are all in the same categories. So let's look up here, this is a good block to look at. Does that just reflect redundancy in the GO annotation? It reflects redundancy in the GO annotation, right. So the first and the third ranked sets were categories that are actually related to each other, so when you start reporting these things you're going to report a whole bunch of things that basically say this gene is involved in carbohydrate metabolic process, right? So right here, I went through here. Okay, now let's try to get GoMiner to fail. Okay, so I used the same changed set, which is our original gene list, but I put in a background set that used different gene identifiers. There we go. Okay, so here the problem is that it couldn't match this gene name that was in the changed set, in my gene list, with the set of total genes. It's not smart enough to go and fix it by itself, so you have to fix it for it. Okay, but that's what
happens when you see this screen, right? And it just means that you have to get your changed set and your background set in line. Okay, so there's GoMiner. It's in... very well, no, because in that case your background file might not be appropriate. What is the appropriate background file? So you have a set of genes in your gene set, right, and the background file is the set of genes from which that gene set is chosen, right? So if you have a microarray that only has genes that are expressed in the brain on it, there's no way that you could pull a gene set from that microarray that contained genes that weren't expressed in the brain. Say my set of samples came from the brain, right? So even if they are all of the genes, maybe it's appropriate to have at least the genes that are expressed in the brain, or is it okay to use all of the human genes as the background? Again, for me it depends what your test is, right? But the first thing I would do is just use everything as your background set. Okay, there's one other error that I wanted to point out here, right at the bottom. So the export that I got from BioMart has a header line that says SGD ID, right? Now if I opened it up, let me see if I can show that... you're going to have to just trust me on that. And GoMiner doesn't like that, so what you have to do, if you were actually using this as a background set, is open that up in WordPad and take out the header line, or open it up in Excel. Let me see if I can find that. So here is the header line. So Notepad has actually thrown away all the stuff, but the thing that came back to me from BioMart actually had a header line that said SGD ID, and GoMiner is a little bit sensitive to that sort of thing, so you would have to edit that out in WordPad. I think you could also do it in Excel, so let me see if I can open this thing up in Excel. Failure. Oh, there's WordPad. Okay, now I'm just going to take a subset of these as my changed set to show you that... I decided against doing that, because we only have 15
minutes left. So with the remaining 15 minutes, and likely later tonight, why don't you try doing the same thing with the gene sets that you brought, and we'll be around to help you with that. The things that you have to be able to do: your background set, you can let GoMiner choose your background set, which I think is a good thing to start with, and GoMiner has to be able to match up your gene identifiers to the identifiers used in the annotations, right? So the thing that I would try first is just take your initial gene list and upload it into GoMiner and choose the appropriate organism.
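The two failure modes from the demonstration, a leftover header line in the background file and a changed list whose identifiers don't match the background, can be checked before uploading anything. The following is a minimal Python sketch under stated assumptions: the helper names `read_gene_list` and `unmatched` are made up for illustration, the header text is taken to be "SGD ID", and the identifiers are just illustrative examples of systematic-style and common-style yeast names.

```python
# Minimal sketch: sanity-check a changed gene list against a background list.
# Function names, header text, and the example identifiers are hypothetical.

def read_gene_list(text, header=None):
    """Parse one identifier per line; optionally drop a known header line.

    str.splitlines() copes with \n, \r\n and \r, so a file written on
    another operating system does not end up as one long line.
    """
    ids = [line.strip() for line in text.splitlines() if line.strip()]
    if header is not None and ids and ids[0] == header:
        ids = ids[1:]
    return ids

def unmatched(changed, background):
    """Return the changed-list identifiers missing from the background."""
    known = set(background)
    return [gene for gene in changed if gene not in known]

# Changed list uses systematic-style names; background uses common-style
# names and still carries its export header line.
changed_text = "YBR020W\nYBR019C\nYLR081W\n"
background_text = "SGD ID\r\nGAL1\r\nGAL10\r\nGAL2\r\n"

changed = read_gene_list(changed_text)
background = read_gene_list(background_text, header="SGD ID")
missing = unmatched(changed, background)

print(f"{len(missing)} of {len(changed)} changed genes not in background")
```

If every changed gene comes back unmatched, as here, the two files are almost certainly using different identifier types rather than genuinely different genes.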