Welcome back everyone. I hope the first day wasn't too harsh on you and you still have some energy for today. There's still one more module that I'm supposed to cover, but I think it's more important that we make sure we finish this one, because it's probably the one that is the most interesting to you anyway. And then if we have some time left we'll talk about the last module. Before I start, are there any questions about what I covered yesterday, some of the material we went over? No questions? No, there are no requirements; read.table is very flexible. There are a lot of options you can play with. For example you could have a header or no header, and you can skip a few lines at the beginning if you want to when you read the table. Of course there's a limitation of space; it all depends on your computer and how much memory you have on it. You could have names for the columns and rows or no names, you could have different separators, you could have quotes around values. It's very, very flexible; you just have to play with it to get a feel for how to get it to work on your data set. But typically the defaults will work for most data sets if they're in a nice enough format. And to answer the other question: we're going to see some examples in gene expression. We're not going to look at gene signatures and things like that because that's beyond the scope of this workshop. We're going to look at the HIV data set and try to find differentially expressed genes, and hopefully you're going to get a feel for how to apply that to other data sets; it's going to be very similar. Any other questions? Okay, so yesterday we finished by talking about the t-test: the one-sample t-test, the two-sample t-test and the paired t-test.
And basically for microarrays you would either do a two-sample t-test or a paired t-test, which is just like doing a one-sample t-test on the paired differences when you've got a cDNA microarray. And remember I asked you a question about what the p-value is: is the p-value the probability of making an error? In fact it's not, because we said the p-value is only related to one kind of error. And this is what we're going to talk about now. These are actually very, very important when you do a statistical test, and they are called the type I and type II error rates. So let's look at a small table. This is the decision we're going to make with respect to the test: we could either reject the null hypothesis and say the gene is differentially expressed, that is, there is a change, or we could say no, it's not differentially expressed, and we accept the null hypothesis. But this is just the decision we make, right, and we don't really know if we're right. And there are two possible truths: either H1 is true, that is, we should really reject, or H0 is true and therefore we should accept. So if you look at this small table you're going to see a couple of cells where we're going to make a mistake. Here we accept but the null hypothesis was wrong, and here we reject but the null hypothesis was correct. So these are the two possibilities where we're going to make a mistake, and these are the two cells where we're going to be correct. The first one here is called the type II error: we accept the null hypothesis, we don't reject it, but the gene was in fact differentially expressed, for example. There was a difference; we should have rejected the null hypothesis. So this is like a false negative, we're missing something, right? We say it's not differentially expressed but in fact it was.
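The equivalence mentioned here, that a paired t-test is just a one-sample t-test on the per-pair differences, is easy to check directly in R. The numbers below are made up purely for illustration:

```r
# A paired t-test is the same as a one-sample t-test on the differences.
# The data here are made-up log-intensities for 6 hypothetical arrays.
before <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7)
after  <- c(5.9, 5.2, 6.4, 6.1, 5.3, 6.2)

p_paired    <- t.test(after, before, paired = TRUE)$p.value
p_onesample <- t.test(after - before, mu = 0)$p.value

all.equal(p_paired, p_onesample)   # TRUE: identical p-values
```

Internally the paired version simply forms the differences and runs the one-sample test, which is why the two p-values agree exactly.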
The other kind of mistake we can make, and it's probably the one we're more afraid of, is the type I error, where we reject something that was in fact not differentially expressed. So we're saying that gene is differentially expressed but in fact it's not, okay? So there are two kinds of mistakes you can make. Is this table sort of clear? Okay, so let's try to understand when we're going to make these mistakes, based on the test statistic. Here let's assume that the null hypothesis is mu = 0. So we're looking at the log ratios and we're asking ourselves: is the mean of the log ratios equal to zero or not? If it is equal to zero the gene is not differentially expressed; if it's not equal to zero it is differentially expressed. Remember that what we typically do with a statistical test is fix the type I error rate. So we fix the significance level at, let's say, 0.05, which is what people typically use in practice. And then we say: if the p-value is less than 0.05 I reject; if it's greater than 0.05 I don't reject. So let's say this is the cutoff value here, chosen so that the area under the curve in red is 0.05. If I get a test statistic above that value or below its negative, I'm going to reject. But if the null hypothesis is true, if the distribution really is the one centered at zero, it's still possible that once in a while I get something fairly extreme in the tails, beyond the cutoff. That is, it's possible that we make a type I error, and the probability of making a type I error is 0.05, right? Because it's possible that by chance alone I get something in the tail of my null distribution. Of course it's very unlikely, and we're okay with making an error with that probability; that's what we set at 0.05.
On the other hand it's possible that the null hypothesis is not true, and in this case the true mean is 3, which you can see here with that distribution. And because the true mean is 3, it's going to be very likely that we get something above the critical value, right? So we're going to reject, because most of the time the test statistic is going to be above the critical value. However, by chance alone it's also possible that I get something below it, and then I'm not going to reject the null hypothesis even though it's false. The greater the true mean is, the further away it is from zero, the greater the chance that I reject the null hypothesis and therefore don't make an error. So here in red is the type I error rate and in green is the type II error rate. Is what we see in this picture sort of clear? This is important, so if you don't really understand, please let me know and I can try to explain it again. Okay, so we're going to use the board a little bit. Remember, the null hypothesis is mu = 0. We want to know whether the mean is equal to zero or not. If it is equal to zero, then this is going to be the distribution of my test statistic; remember it should be a t statistic with n minus one degrees of freedom. Then typically you compute the p-value and you say: if the p-value is less than 0.05, I reject the null hypothesis. But rejecting if the p-value is less than 0.05 is the same thing as comparing against the critical values here, which we typically call z of alpha over two and minus z of alpha over two. I'm going to explain what that means: these are the values such that you get alpha over two under the curve on this side and alpha over two under the curve on the other side.
Okay, and if you take the significance level alpha to be 0.05, this means that you've got five percent covered in the tails of that distribution. Then, if the t statistic that we observe is greater than this value or less than this value (so let's say this is the value we observe over here, this is our t statistic), we say: well, if the mean was really 0, if the null hypothesis is true, and I get something that extreme, in the tail, even though I know it's pretty unlikely because the probability is 0.05, I'm going to reject. So if I get something above that value or below the negative of it, I reject the null hypothesis, because it's very unlikely that I get something like that by chance alone; and in fact the probability that I do is 0.05. And remember, if I get this value here I can also compute the p-value, which is the area above t over here plus the area below minus t over here. So you can see the relationship between the p-value that we compute, shown in green, and the significance level. If the p-value is less than 0.05, which it is here because you've got less area under the curve, then you reject. So it's the same thing. You can reject either if p is less than 0.05, which is the same thing as saying your t statistic is above z of alpha over two or below minus z of alpha over two. So these are two ways to look at it: I reject if the p-value is less than 0.05, or if my test statistic is beyond the critical value that puts alpha over two in each tail. Does that make sense? If it does not make sense, please let me know. So let me give you an example. If alpha is 0.05, if I want 5%, then the value will be about 1.96 and minus 1.96, and you can get that from R.
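The 1.96 quoted here is the standard normal critical value; in R you can compute it with qnorm, and the small-sample t analog with qt. A quick sketch:

```r
alpha <- 0.05

# Two-sided critical value from the standard normal: about 1.96.
z_crit <- qnorm(1 - alpha / 2)
z_crit

# With a small sample you use the t distribution instead; with n = 5
# replicates there are n - 1 = 4 degrees of freedom, and the cutoff
# is noticeably larger than 1.96.
t_crit <- qt(1 - alpha / 2, df = 4)
t_crit

# Sanity check: the area in the two tails beyond the cutoff is alpha.
2 * (1 - pnorm(z_crit))   # 0.05
```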
Okay, so you can actually get that from a table, and then if your test statistic is above that value, or below the negative of that value, you reject. On the other hand, what I show you in that figure is that it's possible that the null hypothesis is wrong. So in fact the true mean is not 0, but mu is equal to, let's say, 3, in which case the true distribution is not that one but something like that. Of course, if this is the true distribution then it's very likely that I get something above the critical value, and therefore I reject. But it's also possible that I get something below the critical value, right? Because by chance alone I can get something below the critical value over here, and therefore it's possible that I make a mistake, that is, I don't reject when I should reject the null hypothesis. And the further away the true mean is, the less likely I am to make that mistake, the type II error. So this, in red, is the probability of making a type I error, and in green is the probability of making a type II error. Okay? Now, I'm not really going to go through the sample size calculation, because it's a bit tricky. But this sort of tells you that using these kinds of plots you can figure out how many sample points you need in order to decrease the probability of making a type II error. So I'm not going to go through that, but I'm going to give you a good reference for it, so you can actually go and do the calculation; it takes a little bit of algebra. I've got a good colleague at the University of British Columbia who has a sample size calculator. But what I want to say is that, if we go back to the plot, for the sample size calculation you just input a few things, like your standard error and the difference between the true mean and the null hypothesis, and then it will tell you how many...
So: what sample size do you require to have the power that you want, or to have a given type II error rate? If we go back to this plot, what we see is that if I move this distribution to the right, then the green shaded area is going to become smaller and smaller and smaller. So if the true mean is very far from the null hypothesis, say it's 100, then of course this is going to be almost zero. It's going to be so easy, so far from the null value, that if you take a sample you're going to see right away that the mean is not zero. If it's very close to the null value and you take a small sample, it will be hard to know whether the mean is really zero or not, but the more data you get, the better and better your estimate of the mean is going to be, and therefore the better you're going to be able to tell whether it's zero or not. So that's the idea behind the sample size calculation: if you increase the number of data points, then you're going to be able to discriminate better and better between the null hypothesis and the true mean. Does that make sense? So if you ever run into a problem where you need to do a sample size calculation, you can go on this website, enter a few numbers, and in a few seconds it will give you the sample size. The drawback of a sample size calculation is this: sometimes people come to me and say, oh, what sample size should I use for my study? And I say, well, I don't know anything about your study, so how can I tell you how many data points you need? The problem with the sample size calculation is that you need a rough idea of a couple of things.
First, the standard error, but also the difference between the true mean and the null hypothesis, because if you don't know that you cannot really compute the sample size that you need. But in reality, in practice, you never know that, right? You never know what the true mean is, because if you did you wouldn't do the experiment to begin with. So it's a bit tricky. Sometimes, when you want to do a sample size calculation, what you can do is take a small pilot sample, get an estimate of the standard error and an estimate of the difference between the true mean and the null hypothesis, and then maybe you can work out some kind of number. But it's tricky enough that usually I just say, well, I don't know how to do it because I don't know anything about your experiment. And typically people don't really want to take a few data points first, because if they're asking the question it's because they want the answer without having to do these two steps. Any questions about this? [Audience question, partly inaudible: so you would take a pilot sample from the distribution and look at what the actual...?] Pretty much, yeah. Once you've done an experiment and you've got your data, you can compute the sample mean from your data set, and this gives you an estimate of the true mean. So let's say I compute the sample mean and in this case it's 2.8. Of course it doesn't necessarily mean that the true mean is not zero, because there's some variability; it could be that I was very unlucky and got a bad estimate of the mean. But the more data points you take, the less likely it becomes that you got that number by chance alone, right?
So that's the idea of the sample size calculation: take a few numbers, maybe compute the sample mean, and then plug that into the sample size calculator, and that will tell you how many data points you need in order to control the type II error. [Audience question, partly inaudible: with a sample mean of 2.8, is the null hypothesis in this case not mu equal to zero but equal to that value?] No, it's still... I mean, the null hypothesis is something you want to test; it has nothing to do with your data set, and you should never change it after you observe the data, because you want to test it. In this case you typically know what you're testing: here we really want to know whether the mean log ratio is zero or not, so you know that beforehand. Then once you observe your data you're going to be able to test the null hypothesis. But you can't just load the data, compute the mean, and say, okay, I'm going to test that. You always have to fix everything beforehand, before you see the data, and then you see the data and you decide whether or not to reject the null hypothesis, okay?
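For the sample size calculation itself, base R's power.t.test does the algebra the speaker alludes to. Here is a sketch using the numbers from the board (true mean 3, null mean 0, standard deviation 1); the specific values are just for illustration:

```r
# How many replicates does a one-sample t-test need to detect a shift of
# delta = 3 (true mean 3 vs. null mean 0) with sd = 1, alpha = 0.05 and
# power 0.8 (i.e. a type II error rate of 0.2)?
res <- power.t.test(delta = 3, sd = 1, sig.level = 0.05, power = 0.8,
                    type = "one.sample", alternative = "two.sided")
ceiling(res$n)   # required sample size, rounded up to a whole number

# Conversely, given a fixed n you can solve for the power instead:
power.t.test(n = 5, delta = 3, sd = 1, sig.level = 0.05,
             type = "one.sample")$power
```

The same function handles two-sample and paired designs via the type argument, which is usually all that's needed in place of a web calculator.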
Okay, let's move on to something that's very interesting. What we've seen so far in the examples is single hypothesis testing, right? We ask: is gene 1 differentially expressed? Is gene 4 differentially expressed? But this is not really what you do in practice. You want to know which genes are differentially expressed in your data set. So we're not going to do a single hypothesis test; we're going to do lots of tests at once. When you do a single hypothesis test, the rule is to say: okay, let's fix the type I error rate, the alpha I've described, typically 0.05, and then we try to minimize the type II error rate. So we fix alpha, the probability of having a false positive, and then we try to minimize the probability of having a false negative. Typically you don't really have to think about minimizing the type II error rate, because the traditional tests, the t statistic for example, are designed to minimize the type II error rate. So once you fix alpha, you just compute the p-value as you've done before, without even thinking about the type II error rate. In practice people are very often most concerned with the type I error rate, with the false positives. Of course you don't want too many false negatives either, but false positives are generally more costly than false negatives, because you need to validate: you're going to spend a lot of time, and then at the end you realize it's just wrong, so you wasted all that money on validation and so forth. So the problem is that we have many tests at once, not just a single test. Will the overall type I error rate still be alpha = 0.05? Let's say I do 10,000 tests, and for each of them I fix alpha at 0.05 and do my t-test. Do you think, at the end, the probability of making at least one type I error, that is, of having a false positive, is really 0.05? It's not, because if
you do that many tests, you're going to have so many places where a false positive is possible that at the end the probability is certainly going to be more than 0.05. We're going to see a small example to illustrate that. In our case the error probability is going to be much greater because we're looking at many genes, for example. So multiple testing correction is used to control an overall measure of the error: instead of controlling the type I error rate for each test, we try to control a global measure of the error. And there are two things you can do: the first one is called the family-wise error rate and the second one is called the false discovery rate. Here's a small illustration of that. I do a thousand t-tests: I've generated some data from a normal distribution with mean 0 and standard deviation 1, basically a thousand genes with 5 replicates each, and for each of these I do a t-test and then look at how many genes come out significantly differentially expressed. All of the null hypotheses are true, so I should never reject anything, because the data are generated from a distribution with mean 0; none of the genes are differentially expressed. But we know that it's possible to make an error, right? And for a single test we know the probability is 0.05, because we use a significance level of 0.05. So the probability of making a type I error, of having a false positive, is 0.05 for each test. Okay, let's look at that in R. Once again, here I set the seed so that we all get the same data. Then I generate 5,000 data points (1,000 times 5) from a Gaussian distribution with mean 0 and standard deviation 1, and I form a matrix with 1,000 rows and 5 columns. Then for each of these rows I'm going to
do a t-test. To be slightly more efficient I use the apply function, and for that I create my own function: you give it the data, it runs a one-sample two-sided t-test, and it just returns the p-value. So here I get a bunch of p-values; if you type p you will see the p-value for each of the tests. Then I count how many null hypotheses I'm going to reject, that is, how many p-values are less than 0.05, because typically we reject if the p-value is less than 0.05. And I see that I have 44, out of 1,000 tests, so the observed false positive rate is about 44 divided by 1,000. So what we can see is that although I fixed the type I error rate at 0.05 for each test, overall I get quite a few false positives, 44 in fact, which is certainly a lot more than you might naively expect from "an error rate of 0.05". The idea then is to say: okay, what if we try to fix something else, a global error rate? The first one that people started to use is called the family-wise error rate. Instead of fixing the probability of a type I error for each single test, let's fix the probability of making at least one type I error across all the tests. So I'm going to control the number of errors I make over all the tests at once. One way to do that is via a multiple testing correction: you modify your p-values so that this global criterion is satisfied, because if we just work with the classical p-values that we got from each individual t-test, this doesn't work very well. So what we do is be a little bit more conservative: we work with these p-values,
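The R session described above can be reconstructed roughly as follows. The seed is arbitrary (the original script used its own, so the exact count of 44 is not reproduced here); the point is that roughly 5% of 1,000 true-null tests come out "significant":

```r
set.seed(100)                       # any fixed seed; the session used its own
# 1,000 "genes" with 5 replicates each, all drawn from N(0, 1):
# every null hypothesis is true, so every rejection is a false positive.
x <- matrix(rnorm(1000 * 5), nrow = 1000, ncol = 5)

# One-sample two-sided t-test per row, returning only the p-value.
p <- apply(x, 1, function(row) t.test(row, mu = 0)$p.value)

# Number of tests rejected at alpha = 0.05: about 5% of 1,000, i.e. on the
# order of 50 false positives, even though each individual test controls
# its own type I error at 0.05.
sum(p < 0.05)
```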
making them slightly larger in order to control the overall error rate. The first such procedure is called the Bonferroni multiple testing adjustment. Say you've got capital G genes, maybe G = 10,000. What you do is very simple: you take all of your p-values and multiply each one by the number of tests. If you then call differentially expressed all the genes whose adjusted p-value is less than alpha, you know that the family-wise error rate, the probability of making at least one type I error, will be less than alpha. So you control the overall error rate. What's the problem with this? Can you try to guess? Yeah. Imagine you're doing 100,000 tests, say a genome-wide study with a tiling array: you've got millions of probes across the whole genome and you're doing all of these tests. Well, you take a single p-value and multiply it by a million; it's going to be very unlikely that your adjusted p-value ends up small. So if G is too large you might be too conservative: the number of tests is so large that when you multiply your p-values by that number they all become large, and nothing is significant. This is one of the drawbacks of Bonferroni: it's very simple and it guarantees that you control the family-wise error rate, but it's slightly too conservative. In fact there are other procedures that are slightly more powerful, that is, not as conservative, but they don't change that much. The deeper issue is that controlling the family-wise error rate wasn't a very good idea for this setting: maybe you don't care if you make one mistake when you have a million tests; maybe you're okay with making a few mistakes. So maybe this is not the right quantity to control, because here you're saying: I want the probability of making at
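The Bonferroni adjustment itself is one line. A sketch showing that multiplying by G (and capping at 1) matches what R's built-in p.adjust does, on some toy p-values:

```r
# Toy p-values from G = 5 tests.
p <- c(0.0001, 0.004, 0.02, 0.3, 0.9)
G <- length(p)

# Bonferroni by hand: multiply by the number of tests, cap at 1.
manual <- pmin(p * G, 1)

# Built-in version:
builtin <- p.adjust(p, method = "bonferroni")

all.equal(manual, builtin)   # TRUE
```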
least one mistake extremely small, and therefore it's going to be very unlikely that you find anything; because if there's even one possible false positive in your list you're saying no, no, no, I don't want that. And that's why, with these types of corrections, you often don't find many things. Is that sort of clear? Yeah. So then, I think it was in the 90s, along came the false discovery rate. The false discovery rate was invented by a couple of statisticians, and it really came about because most real-life problems started to become high-throughput: you had many tests at once that you needed to do, and the only thing we had before was the family-wise error rate, which was not well adapted to these new high-throughput technologies. The idea of the false discovery rate is to say: I don't care if I have a few false positives; I'm okay with a few false positives as long as the proportion of false positives among what I declare significant is small enough. You say, you know, I'm okay if I've got ten false positives, as long as my list contains a thousand things that are true; I can deal with that. So the idea is to control, not the probability of making a single mistake, but the proportion of false positives among the genes that I call differentially expressed. Okay. This was derived by Benjamini and Hochberg in 1995, and although there have been many modifications of the FDR procedure since then, it hasn't changed very much. The procedure is slightly trickier to understand. I'm going to give you the formula and then show you a plot, because it's much easier to understand on the plot; in fact, once you see the plot it's very easy to do. Once again, the truth is that you'll probably never do it by hand, because R does it for you: there's a function that does the multiple testing adjustment. You say you want FDR, you input the p-values, and it outputs
the corrected p-values, and then life goes on: you can declare your genes differentially expressed. So here is what you do for this procedure: you've got your G tests, and you order all of the p-values from the smallest to the largest, p(1) <= p(2) <= ... <= p(G). Then you pick the largest value k such that p(k) <= (k / G) * alpha, where capital G is the total number of genes and alpha is the proportion of false positives you're okay with. If you then call differentially expressed all the genes whose p-value is at most p(k), the false discovery rate is controlled at level alpha. Typically for the false discovery rate people are okay with 5% or 10% false positives. This formula probably doesn't tell you very much; it's not very intuitive why it gives you the right thing, and in fact even the proof is not very intuitive, though it's not very difficult to show that it's true. Let's look at it on the plot and you will understand what it means. One key point here: when you do the FDR correction in that way, the tests need to be independent. So this might be kind of a problem. Do you think that, when you're looking at gene expression data, all the genes are independent? No, because there are probably a lot of pathways and mechanisms that make one gene differentially expressed whenever another one is, so the hypotheses are not actually independent. However, it's been shown that even though the procedure formally requires independence of the test statistics in order to be valid, the price you pay when it's not satisfied is not too high, and in practice it works out quite well. But that's a good point: if you could cluster the genes into pathways or something, where you know that if this one is differentially expressed then so should be that one and so forth, it would be a lot easier, because you could sort of cluster the genes
and then do a test for each of the different pathways or clusters. In fact you would gain power, because you would have more data for each of the clusters. But typically you don't know that beforehand. People have actually tried to work on things like that, where they say: okay, let's cluster the genes and at the same time test whether each of the clusters is differentially expressed. [Audience question about the choice of FDR cutoff.] I don't really know, I mean, that's a good question, but I think the reality is that people say they're okay with 25%, and then you give them a list of, say, a thousand genes, but at the end they're only going to look at the top ten. So you shouldn't ask the question to me, because I don't really know. It's like a 5% chance of rain: of course, it's all about the cost at the end, right? Okay, so I'm going to show you the procedure here. This is again a toy example: I've generated some data, run a t-test for each gene, and collected all the p-values. I've generated the data so that there are 10% true positives: there are 10,000 genes, and you can see there are two pieces, first 9,000 generated with a mean of 0 and then 1,000 with a mean of 5. So there are 10% true positives, and you would expect to see quite a few significant genes in here. Visually, the procedure is quite easy to understand. You order your p-values; I'll only show you the smallest 1,000 here, because with all the genes the plot is too big. So these are the p-values, ordered from the smallest to the largest. And remember, the procedure says: if the p-value is less than or equal to its index divided by the number of tests, times alpha, you reject. If you look at (g / G) * alpha as a function of the index g, this is just the equation for a line, so you
can actually draw that line, y = (g / G) * alpha. Everything up to the last ordered p-value that falls below the line is called differentially expressed, and everything above is called non-differentially expressed. So it's pretty easy to see on that graphic how you do it: you order your p-values, and then visually you can just read off which are the differentially expressed genes. Of course you don't really need to do that by hand, because there's a function called p.adjust that will do it for you, and you can specify the method; you can also ask for Bonferroni. It's difficult to explain all of these things on a PowerPoint presentation, so let me draw it: you make a graphic with the p-value on one axis and the index g of the gene on the other. You order the p-values, so the first one here is the smallest; g is the index of the ordered gene. The formula says you reject if the p-value for gene g is less than or equal to g divided by capital G, times alpha. If you look at this as a function of the index g, it is just a line with slope alpha over capital G and intercept 0: y = (alpha / G) x, where capital G is the total number of genes. Therefore all of the genes below that line satisfy the equation, and you reject them, but not these guys up here. So this is actually very easy to understand graphically, but it's something that people never explain that way: if you look at a book you will find this formula, but they will never show you a plot that says this is just a line, and everything that's below it you reject. [Audience question:] Do you need to give it any other variables, like the value of G? No, because you input all of your p-values, so it knows G because it knows
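The step-up rule and the line in the plot can be sketched in R on a simulation like the toy example above; the built-in p.adjust(..., method = "BH") makes the same accept/reject decisions as the by-hand version:

```r
set.seed(1)
# As in the toy example: 9,000 null genes (mean 0) and 1,000 truly
# changed genes (mean 5), with 5 replicates each.
x <- rbind(matrix(rnorm(9000 * 5, mean = 0), ncol = 5),
           matrix(rnorm(1000 * 5, mean = 5), ncol = 5))
p <- apply(x, 1, function(row) t.test(row, mu = 0)$p.value)

alpha <- 0.05
G <- length(p)

# Benjamini-Hochberg by hand: sort the p-values and find the largest k
# with p_(k) <= (k / G) * alpha, i.e. the last ordered p-value that
# falls below the line y = (alpha / G) * x.
o <- order(p)
below <- which(sort(p) <= seq_len(G) / G * alpha)
k <- if (length(below)) max(below) else 0
rejected_by_hand <- o[seq_len(k)]

# The built-in adjustment yields the same set of rejections:
rejected_builtin <- which(p.adjust(p, method = "BH") <= alpha)
setequal(rejected_by_hand, rejected_builtin)   # TRUE
```

Rejecting genes whose BH-adjusted p-value is at most alpha is mathematically equivalent to the step-up rule, which is why the two sets agree.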
the length, and everything else it needs. So the output will not be a list of differentially expressed genes; it will just be the corrected p-values, and then it's your call. You can call a gene differentially expressed if the adjusted p-value is less than 0.05, for example, and if you do that you know the FDR will be controlled at 5%, which is nice: it adjusts the p-values for you, you still make the final call, and you can try 5%, 10%, and so forth. Say that again? The false discovery rate will be given by the value that you pick. Of course this is just an estimate; you never know the true value of the false discovery rate. If all the assumptions are correct, if the test you're using is correct, if the tests are independent and everything, then the theory tells you the false discovery rate should be controlled at 5%, so it should be smaller than 5%. But of course in reality not all the assumptions are going to be valid and satisfied, so you might get something slightly bigger than what you hoped for. Still, it can help you decide on a cutoff, even if it's not the exact false discovery rate you were hoping for. Okay, so, very good question. The q-value is something that was introduced by a friend of mine, John Storey. John was a student at Stanford and now he's a full professor at, what is it, the institute for genomics at Princeton or something. And the q-value is kind of like what I computed here: the adjusted p-value would be the q-value in that case. The q-value is the analog of the p-value, but based on the false discovery rate: if you use a 5% cutoff on the q-value, you know the false discovery rate is 5%. But people don't use the q-value that much, I don't think; some people do, but nowadays I think we're moving away from it, because it's been a bit confusing for some people. John might still be using the q-value
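Since p.adjust is easy to treat as a black box, here is a minimal sketch of the same step-up rule in Python rather than R; the function name and the toy p-values are mine, purely for illustration.

```python
def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule: order the p-values, find the
    largest index g with p_(g) <= (g/G) * alpha (the points at or below
    the line), and reject for every gene up to that index."""
    G = len(pvals)
    order = sorted(range(G), key=lambda i: pvals[i])
    k = 0  # largest ordered index still under the line
    for g, i in enumerate(order, start=1):
        if pvals[i] <= g / G * alpha:
            k = g
    rejected = set(order[:k])
    return [i in rejected for i in range(G)]

# five toy p-values; with alpha = 0.05 only the first two fall below the line
print(bh_reject([0.001, 0.008, 0.039, 0.041, 0.60]))
```

R's p.adjust(p, method = "BH") returns adjusted p-values instead of a reject/accept list, but comparing those adjusted values to alpha leads to the same decision.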
a lot, because he introduced it, but I've never liked it that much, because I think it's pretty confusing for some people. But you're right: here, the adjusted p-value is just the q-value. Other questions? So we've got all the code, and I prefer to show you some things and let you play with the code. And I should say that even though I'm going to be leaving after this module, of course you can ask me questions now, but also if in a couple of days, a couple of weeks, a couple of years, you look at the slides and think, oh, I don't understand what we're doing, feel free to drop me an email; I'll be happy to try to explain some of the things that are in the code. Okay, so here's what I do with the HIV data set. I do a one-sample t-test on all of the genes, and then I call a gene differentially expressed if the raw p-value, with no correction, is less than 0.05. This is what I show you here: this is the MA plot, and I'm just highlighting the genes that are called differentially expressed by the t-test. This is sort of doing the right thing: I expect the genes that are far away from the line y = 0 to be differentially expressed, either up- or down-regulated, and you do have these positive controls over here. But you get tons of stuff, and there's tons of stuff even towards the middle, where things probably shouldn't be differentially expressed. So you can see that if you just use a p-value cutoff of 0.05 with no correction, you get a lot of false positives, which is what you expect when you don't control the overall error rate. Now, if I do a Bonferroni correction, this is what I get: two little guys up there, and I completely miss even the genes that are so obviously differentially expressed, the HIV positive controls. Bonferroni here is way too conservative; you see it right away. If I do the FDR, well, it's better: I get most of the HIV genes called differentially expressed, but I do get one guy over here, and one guy over here, and one
over here. So it's a bit strange that I get a couple of things that seem to have very low log-ratios. Let's try to understand why we get this kind of behavior. What I'm showing here is the mean of the log-ratio against the standard deviation of the log-ratio, both computed across the four replicates. What we can see is that many of the genes that were found differentially expressed have high log-ratios, but some of them have low log-ratios, and those also have low standard deviations. So how come genes with very low log-ratios get called differentially expressed? Yeah, exactly: if the standard deviation is very small, you can still get them to separate, right? The t-test is the ratio between the difference and the noise. Remember, when you compute the t-statistic, it's the ratio of the mean divided by the standard deviation, something like that. So there are two ways the t-statistic can be big (I'm using a different notation for the log-ratio here, but it's like the y): either the mean is large, or the standard deviation is small. If the standard deviation is very, very small, you're dividing by something very close to zero, so you get a very large number no matter how big or small the mean is. In theory this is how it should work, but in practice it's not really how it's going to work, and I'm going to tell you why. One of the problems with the t-statistic is that for many genes s is small. Well, in practice the true standard deviation is probably not that small, but since we only have four replicates, some of the estimates we get are going to be very noisy, and therefore, maybe by chance alone, you're going to get a tiny estimate of the standard deviation. If your sample size were large you would probably be fine, but this is one of the big problems when you're doing high-throughput expression arrays and you're
doing gene expression testing: the sample size is small, and because the sample size is small, the estimate of the standard deviation is going to be very noisy. When you've got lots of genes, it's very likely that by chance alone many of them will have a very small estimated standard deviation, and therefore very large t-statistics, even though maybe they're not differentially expressed. Then there's also the question: is t really t-distributed? We said, well, if the data are normal it's fine; if they're not normal but the sample size is large, it's okay; and if neither holds, then you're in trouble and you need to do something else, a permutation test, or maybe a non-parametric test, or whatever. And here we just did the typical t-test. Yes, exactly: all of the procedures we've talked about for multiple testing don't care about the test you use; they just care about a p-value. If you've got a list of p-values you can use them, whether they come from a t-test, a Wilcoxon test, a permutation test, whatever gives you p-values works. Okay, so for many genes s is small, and that's bad news. So what can you do? A very simple thing you can do is add a small positive constant to the standard deviation. Then, if the standard deviation gets too close to zero, you're protected, because you have that small constant added to it. It's kind of a hack; this is not really a t-statistic anymore. But you can define your statistic however you want, and then the problem is going to be how you compute the p-value, and we're going to see that. In that case, even before that modification, and certainly after it, you might ask yourself: is t really t-distributed? Well, here it probably isn't, because you've modified it anyway. So what we can do is something slightly non-parametric: we're going to estimate the null distribution by permuting the labels, as we explained yesterday, and then
we can compute a p-value from that. Under the assumption of no differential expression, we can just permute the columns of the data matrix, and then, by counting how many times we observe something as extreme or more extreme than the observed statistic, we can compute a p-value. So this is what SAM does. SAM stands for Significance Analysis of Microarrays. I've got a good story about this, actually. This paper has been cited many, many times, first of all because it's a good paper, the methodology is really easy to understand and it works well, but also because there's an Excel plugin. (Yes? Oh, okay, good, thanks.) So it's been used a lot because of the Excel plugin, for those who still want to use Excel, though after this workshop I hope you won't. But there's also now an R package that does the same thing. (No, no, no, it's independent.) So that paper gets cited a lot, and at about the same time we actually published a paper of our own: theirs was called Significance Analysis of Microarrays, and ours was called Statistical Analysis of Microarrays. And we actually got a lot of citations, because people thought they were citing theirs but cited our paper instead. So if there's something out there with a good name, just change the title a little bit, keep the same acronym, and maybe you'll get a few free citations. Okay, so the idea is to modify the test statistic. It's almost like a t-statistic: you take the average log-ratio, you divide by the standard deviation, and then we add a small constant to the denominator. Remember, the original t-statistic also has the s over the square root of n; here, because we're doing permutations, we don't even need to put that in. It doesn't really matter, because we're estimating the p-value in a non-parametric way, so it doesn't change anything; it just
simplifies the formula not to put the square root of n in. So how do you estimate this constant? The way they've done it is to take the standard deviations of all the genes, and then take the 95th percentile of those standard deviations. So it's like taking an estimate of the standard deviation using all the other genes, and this is what I talked about yesterday: when you do testing on lots of genes at once, sure, you do each test separately, but you've got some information coming from the other genes. In this case we ask: what is a pretty common standard deviation across all the other genes? And then we add that to the standard deviation of the gene being tested. So it's kind of like we regularize the standard deviation using the standard deviations from the other genes. SAM will do that for you: it will estimate the constant and add it to the standard deviation, and then it will estimate the distribution under the null by permuting the labels of control and treatment, just as we did with the permutation test; for each permutation, you compute the new test statistic. And here's what's also nice about SAM: there are two ways you can control the false discovery rate. You can say, okay, let's compute the p-values and then adjust them; SAM does it in a slightly different way. It fixes the rejection region: we reject if the test statistic is bigger than some value delta, and for that value of delta we estimate how many false positives we have, and then we pick the value that gives us the number of false positives we're comfortable with. But it's very similar to computing your p-values and doing the adjustment afterwards. And then, based on the number of false positives estimated from the permutations, you can actually fix the FDR that way. So there's the Excel plugin that
you can use, and there's the R package that you can use as well; it's pretty easy to use. Here there's a big piece of code, because I do lots of things to illustrate a few points, but in fact you don't need all of it. If you use SAM on this data set, this is what you get: something a lot cleaner. You can really see the up- and down-regulated genes, and you also get all of the true positive HIV genes. This is really the sort of thing you should see on a gene expression data set; this is what you expect. Of course, there might be some false positives here, because I think I fixed the FDR at 10%, so it's possible you're going to get a few false positives, maybe over here; you can see that these could be false positives. But overall I would say it looks very good, and it looks like it's doing something very sensible. So this is basically the idea: regularize the t-statistic, compute the FDR, and this is what you get. Okay, there are other methods that are pretty similar to SAM in spirit; we've done lots of work on this. There's another one that I like very much called limma; it's available in Bioconductor, it's very popular, and people use it a lot. Oh, okay: so the values are the same because the data set is the same, and here I just display the MA plots. What's important are the green circles, and how many of them are highlighted. If you go back, you can see that it's really different: you get lots of stuff in the middle where probably you shouldn't get anything. You can see that the number of green circles you have with SAM is much smaller than this, and it seems to be doing the right thing, because it's picking up the points that are far away from the middle cloud. Do you see the difference between this one and that one? Yeah, so this one is fine, it's doing the right thing, but here you get lots of false positives.
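As a rough sketch of the idea behind the regularized statistic: this is not the actual SAM code, and the toy data, function names, and the simple nearest-rank percentile are my own simplifications, with the 95th-percentile choice taken from the lecture.

```python
from statistics import mean, stdev

def percentile(values, q):
    # simple nearest-rank percentile, enough for a sketch
    s = sorted(values)
    return s[min(len(s) - 1, round(q * (len(s) - 1)))]

def regularized_scores(genes, q=0.95):
    """Per-gene statistic: mean log-ratio over (the gene's own sd plus a
    fudge constant s0 taken from ALL the genes), so that one tiny sd
    estimate can no longer blow the statistic up."""
    sds = [stdev(g) for g in genes]
    s0 = percentile(sds, q)
    return [mean(g) / (sd + s0) for g, sd in zip(genes, sds)]

# toy matrix of replicate log-ratios (a real array has thousands of rows)
genes = [
    [2.0, 1.8, 2.2, 1.9],          # genuinely up-regulated
    [0.1, -0.2, 0.0, 0.1],         # noise around zero
    [0.050, 0.049, 0.051, 0.050],  # tiny mean AND tiny sd: fools plain t
]
plain_t = [mean(g) / stdev(g) for g in genes]
d = regularized_scores(genes)
# the plain ratio ranks the third gene ahead of the real signal; d does not
print([round(x, 1) for x in plain_t])
print([round(x, 1) for x in d])
```

The third gene's near-zero standard deviation gives it the largest plain statistic despite a log-ratio of about 0.05, which is exactly the pathology seen in the MA plot above; the shared constant in the denominator removes it.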
You can see that these guys have almost a zero log-ratio, but they're picked up as differentially expressed by the raw p-values. Does that make sense? Does the number of permutations you set in SAM change the number of genes that come out? Yeah, so what SAM will do is this: if the number of distinct permutations is low, say you've got only two replicates, then there are only a handful of permutations you can do; with three there are more, with four more still. So it will do all of them if there are fewer than a thousand or so, and otherwise it will just pick a thousand random permutations, which is typically plenty for estimating the p-values. Okay, so remember yesterday when we talked about permutations, I think you asked how many permutations you need to do, and I said, well, it depends on how many replicates you have, because if you've got only a few, there are only so many permutations you can do. In that case SAM will do them all; if it's a small number, it will do all the possible combinations, and if that number is too large, it will randomly pick a subset of them, close to a thousand. A thousand is fine, it's not too large, it's not too small; it's a trade-off, right? Because what you're asking me is, what's the number? Is it always a thousand samples? No, no: if you have only four replicates, then the number of permutations of the labels is small, because there's only a small number of combinations you can do; if that number is small, SAM does all the possible permutations. If you have maybe 20 and 20 in each group, there are plenty of permutations you can do, probably too many, and doing them all would take a long time; in that case SAM just says a thousand is enough for estimating the p-value, and it will stop there. Okay.
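The arithmetic behind "do them all if there are few, sample a thousand otherwise" is easy to check: in a two-sample design with n arrays per group, a relabeling amounts to choosing which n of the 2n columns get the treatment label. A quick illustration (the thousand-permutation cutoff is the one mentioned above):

```python
from math import comb

# number of distinct treatment/control relabelings for n arrays per group
for n in (2, 3, 4, 10, 20):
    total = comb(2 * n, n)
    strategy = "enumerate all" if total <= 1000 else "sample ~1000"
    print(f"{n} vs {n}: {total} relabelings -> {strategy}")
```

With 4 vs 4 there are only 70 distinct relabelings, so enumerating them all is trivial; at 20 vs 20 there are over a hundred billion, and a random sample of about a thousand is plenty.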
So if there are fewer than a thousand possible permutations, you include them all? Exactly: if you have a small sample size, it does all the ones that are possible, and if the number is too large, SAM stops at a thousand, because beyond that they're redundant and you don't need them. Any questions about this? Yes: if you fix your false discovery rate, aren't you making an assumption? Not necessarily. You're saying, I want the maximum false discovery rate to be 10%. Aren't you deluding yourself? No, I mean, let me repeat your question to make sure I understand correctly: you're saying that if there are a lot of differentially expressed genes and you fix the FDR at 10%, it's fine, but if there are only a few, then fixing it at 10% will get you a lot of junk. Okay, so fixing the FDR at 10% doesn't mean you're going to get exactly 10% for sure; it means you're hoping that at most 10% of the list is false. Now, in the case where you have only a few differentially expressed genes, let's say there are only two, then if I give you a list of four, the FDR will be 50%. So you're not going to get tons of false positives, because remember, it's the ratio of false positives to the total number of genes I give you: if I give you too many, right away you get a large false discovery rate. In order to have 10%, basically I need to give you only the two genes that are true, and if I give you more than that, I get 25% or 50% right away. So it doesn't change the guarantee; it's just that the list I give you is probably going to be small, or even smaller, because if I give you two genes and one of them is a false positive, then right away the FDR is 50%. So in that case it's of course going to be more difficult for me to give you anything, because as soon as I include one false positive, I pay a high price: one false positive will mean at least a 50% FDR. Okay, so if you fix the false
discovery rate, then when you get the genes after the analysis, you expect that some fraction of the set will be false positives? Yeah, but here we're saying: imagine a data set where there are only two true positive genes. In that case it's going to be very difficult for me to give you a list of differentially expressed genes with a low FDR, because there are only a few possibilities. I give you the two true positives right away, and that's fine, that's 0% FDR. I give you the two true positives plus one false positive, but then right away the FDR is one in three, about 33%. Or I give you two genes where one of them is a false positive, and that's 50%. So the procedure will probably give you only one or zero genes, because if it gives you more than that, it's very likely the FDR will be higher than what you want. So it's going to be more difficult to find anything. But you don't know that; we don't know. And that's one of the problems, right? What I said is that we hope the FDR is going to be less than 10% if all the assumptions are correct, which they're not, because it's real life, real data analysis, real statistics. But we're hoping: we've done some validation, we've looked at the histograms, the box plots, and there seems to be no big violation of the assumptions, so we might think we're doing okay. But you will never know which ones are true and false positives. Here we know a little bit: we know that these are actually true positives, and we do get them, so we would expect to get these guys; beyond that, we don't really know which ones should be positive or not. Yeah, but I think it all depends, and it's what I said: it's always a trade-off between false positives and false negatives. In some cases you're very worried about a false positive, but in other cases you're very worried about a false negative. I mean, take a diagnostic test: you test someone and say, you're fine, you don't have the disease or the virus or whatever,
you know, go on and live a happy life, and then that guy is really sick, right, and you should be giving him medication. In that case you really care about the false negatives, whereas for a false positive you might say, okay, we've done the test, you're positive, therefore we're going to do it again to make sure it truly is a positive case. You know, the truth is, I don't think doctors know much about statistics. Yeah, I mean, that's the hope, I think, but there are people who work with doctors on these things; there are clinical trials and everything, and a lot of us statisticians are involved, so before it gets to the doctor, there's a lot of statistics that's been done on these things. Yes? No, I mean, if I just hand you a p-value, the truth is you don't know anything about the data or the test statistic or whatever. If someone tells you, oh, I've done this and here's the p-value, you have no idea, right? I could just guess a number between 0 and 1, say 0.49, and call it a p-value; it doesn't tell you anything, it's just garbage, because I guessed the number and there's not even any statistics behind it. For you to be able to say something about it, you need to ask: what are the assumptions, was it done correctly? You can't just look at the p-value and be confident that everything's fine and that it's not a false positive or whatever. Yeah, so in that case, of course, if everything's correct, and if we just look at one gene and forget about the multiple testing, you would reject if the p-value is less than 0.05, right? But if the p-value is 0.049, it's pretty close to the cutoff, so you're like, well, under the null it would have been almost as likely to get something that extreme. But if it's 10 to the minus 10, you're like, well, I'm very confident that I should reject the null hypothesis. So yes, in fact, I
mean, this is a good question. Often people will say, okay, let's fix alpha, reject the test, and declare: this is a false positive, this is a false negative, this is differentially expressed, this is not. I like to say, well, okay, you've done that, but then maybe it's good to put a number with it: keep the p-value, put the p-value in your table, because the p-value is kind of like the strength of the evidence. If it's really small, you say, okay, the gene is differentially expressed. If it's not that small, but close to 0.05, then you must say, okay, we called it differentially expressed, but it's right at the boundary, so maybe you should be more cautious about it. So yes, it does tell you something. Yeah, that's a good question. I mean, as a statistician I would sort by p-value, and then it all comes down to whether you really trust the statistical methods you've used. If you do, then what you're talking about is basically statistical significance versus biological significance. I mean, if you know for sure that a fold change of 1.2 is not interesting to you, then don't look at it, right? As far as I'm concerned, that's not the question you asked me. You asked me, what are the genes that you think are differentially expressed, and this is what I do, and these are the p-values I give you. Now, if what's interesting to you is not differential expression but a fold change greater than 2, then it's not the same question, right? So if you don't want small fold changes, maybe you should rephrase your question another way: give me the genes that change at least this much, or that are very different. I mean, if you give me a p-value, take the extreme case of 0.49 versus a p-value of 10 to the minus 10, and you ask me which one I think is the most significant, I'll go with the 10 to
the minus 10. Yeah, so when you reach, say, 0.001, does it matter whether it's 0.001 or 0.00001? I mean, when you say 0.001, you are saying that less than 0.1 percent of the time you would reject by chance under the null; but after a certain level, does the difference matter? It depends. If you say, it costs me a million dollars to follow up one gene, I can only look at one, which one would you pick? I would pick the one with the smaller p-value, so it does matter in that case. If you say, I don't care, I've got lots of money, give me the list and I can look at all of them, then it doesn't matter as much; you can look at the function of the genes and go to the biologist. But if you really want to rank them, yes, it does matter, because you need a way to rank the list and to prioritize some genes to look at. I have another question regarding this issue of fold change versus p-value. With the small sample sizes people usually have with microarrays, power is low, so if I rank by p-value, I'm probably going to miss targets that I can't see because of the low power. What I believe is that I'm willing to accept some false positives rather than throw away a bunch of false negatives, and then sort it out with downstream testing. So it depends: it's always going to be a trade-off between false positives and false negatives, and it's going to depend on your resources, what you can do, and what you want to do. So there's no magic answer to that; it will all depend on the experiment, and probably on what you want to do with the experiment. Okay, so we've only talked about the t-test, but sometimes you've got, say, 10 conditions, and you want to find the genes that are changing in at least
one of these conditions. The way you can do that is with a generalization of the t-test called the F-test. It's exactly the same idea: it tries to find changes across all of the conditions. You compute a test statistic, which is the F-statistic, you compute a p-value, and then you can rank your genes based on the p-value. R can do it for you, and SAM also does it for you: instead of a t-test you do an F-test, and the story is exactly the same. So here it's just saying: I have more than two conditions, let's say three, and I want to find the genes that are differentially expressed in at least one of these three conditions. Yeah, so actually the F-test is ANOVA; it's coming from ANOVA, as on the slide. SAM also uses a modified F-test, because of the same kind of problem: the F-test is also a ratio of two things, so you need to tweak it a little bit so that you don't run into the problem where the standard deviation at the bottom of the ratio is small. And this is in fact very important; the F-test can be very badly affected by that as well. There are other alternatives that we won't really talk about: there are non-parametric versions of the t-test and the F-test, we've seen the modified versions of the t- and F-test, and there are also what I'd call Bayesian methods, either empirical or fully Bayesian approaches. These are very similar to what SAM does: when you do the correction of the standard deviation, it's kind of like saying, well, the constant that I add is almost like my prior information, what I know about the standard deviation from all the other genes, so
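To make the F-test concrete, here is a bare-bones one-way ANOVA F-statistic in Python; this is a sketch with made-up numbers, not what SAM or R's anova actually computes with all their refinements.

```python
from statistics import mean

def f_stat(groups):
    """One-way ANOVA F: between-group variability over within-group
    variability; the multi-condition analogue of the (squared) t."""
    k = len(groups)                         # number of conditions
    N = sum(len(g) for g in groups)         # total number of arrays
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

# three conditions, three replicates each; in the second case the third
# condition shifts, so F should come out large
flat    = f_stat([[1.0, 1.1, 0.9], [1.0, 0.9, 1.1], [1.1, 1.0, 0.9]])
shifted = f_stat([[1.0, 1.1, 0.9], [1.0, 0.9, 1.1], [3.1, 3.0, 2.9]])
print(round(flat, 2), round(shifted, 2))
```

A gene that changes in even one of the conditions drives the between-group term up, which is exactly the "differentially expressed in at least one condition" behavior described above; and as with the t, a tiny within-group term in the denominator can inflate F, which is why SAM regularizes it too.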