Essentially, the issue here is this: after exploratory data analysis has allowed you to identify structure in your data, the question remains whether a statistical model that you might then apply, for example to predict unknown data or the properties of unknown data, actually supports the distinctions and conclusions you draw. Are they statistically valid and significant? Is the hypothesis actually supported by the data? So what does it mean for data to support a hypothesis? What we probably all think of when we hear "hypothesis testing" is p-values. That is something we will really need to clarify, especially since in recent work the use of p-values, or rather the abuse of p-values, has come under fire. We'll talk a little bit about that.

Once we have a statistical model that describes the distribution of our data, we can explore data points with reference to our model, and we typically ask questions such as: is a particular sample part of the distribution, or is it an outlier? Or, a very frequent question: could two sets of samples have been drawn from the same distribution, or did they come from different distributions? This is the classical control-versus-experimental-cohort type of experiment. This is confirmatory data analysis.

There are some concepts we need to clarify. We often talk about the null hypothesis and the alternative hypothesis. Stated somewhat informally, the null hypothesis says that nothing of consequence, nothing of interest, is apparent in the data distribution. The data completely corresponds to our expectation; we learn nothing new, nothing surprising, from looking at it. The alternative hypothesis says that some effect is apparent in the data distribution: the data differs from our expectation, and we need to account for something new. Note that this is not a stringent hypothesis in the sense that we are explaining what that effect is; we do not have an alternative account of our data. The alternative hypothesis is simply the rejection of the null hypothesis, and additional work may be required to model and describe exactly what our data looks like instead.

To distinguish between the null and the alternative hypothesis, we use statistical tests. And the number of possible statistical tests is large. No, it's not large: it's huge. It is absolutely, impenetrably, impossibly huge and subtle for the non-initiated. So if you don't have a solid background in statistics, don't spend too much time choosing statistical tests on your own; you will make mistakes that way. If you think something statistical is involved, do get statistical help, and specifically do not come out of this workshop and, just because I mentioned p-values and null hypotheses, jump to unwarranted conclusions about your actual data. What we can do here is discuss a little, and I'll show you some of the simple tests you can do and discuss what they mean. But this is a domain of expertise that should not be underestimated.

Common types of tests we often come across are one-sample tests, where we compare a sample with a population; two-sample tests, where we compare samples with each other; and paired-sample tests, where we match pairs of observations and ask whether their differences are in some way significant (a minimal sketch of all three follows below). All these tests are done with a statistic in mind.
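As a minimal sketch, all three test types are available through R's t.test(); the numbers here are invented for illustration, not workshop data.

```r
# Three common test types, illustrated with invented values.
x <- c(9.8, 10.4, 10.1, 9.6, 10.7)            # one sample
y <- c(11.2, 10.9, 11.5, 10.6, 11.1)          # a second sample

t.test(x, mu = 10)           # one-sample: compare a sample mean with a population mean
t.test(x, y)                 # two-sample: compare two samples with each other
t.test(x, y, paired = TRUE)  # paired: are matched differences centred on zero?
```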
In that sense, a statistic is a kind of measure, the outcome of an algorithm that we apply to our data set. One such measure is the Z-statistic, as used in a Z-test, i.e., a test that compares a sample mean with a normal distribution and asks: by how many standard deviations does my sample mean deviate from what I would expect under a normal distribution? If the sample and the control are normally distributed, I can translate that into a probability. That only works well if the samples are not skewed, are symmetric, and actually follow a normal distribution. A t-test compares a sample mean with a t-distribution, which relaxes some of the requirements on normality, i.e., on having an underlying Gaussian distribution. How do we check normality? For example, with a q-q plot. If you do a q-q plot against a normal distribution, a qqnorm plot, you will find that all the ranked points lie approximately on one line. If you weren't here on the first day: we went through that example specifically, comparing a normal distribution and a t-distribution; do download it if you're looking for an example. Very briefly, you will see deviations: t-distributions have much heavier tails than normal distributions, so outliers are much more common, and you will see them more frequently, and that leaves a very clearly visible signature in the comparison (a small sketch follows below).
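Here is a minimal sketch of that comparison, assuming only base R; the seed and the degrees of freedom are my own illustrative choices.

```r
# The heavier tails of the t-distribution show up clearly in a q-q plot
# against the normal distribution.
set.seed(112358)
x.norm <- rnorm(1000)            # sample from a standard normal distribution
x.t    <- rt(1000, df = 3)       # sample from a t-distribution, 3 degrees of freedom

oldPar <- par(mfrow = c(1, 2))
qqnorm(x.norm, main = "normal sample"); qqline(x.norm)   # points hug the line
qqnorm(x.t, main = "t sample (df = 3)"); qqline(x.t)     # tails bend away from it
par(oldPar)
```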
Something very important to remember is that we often have no reasonable model of what our distribution should be. It might be a highly skewed distribution that corresponds to none of the standard probability distributions; it might be bimodal, or whatever. In such cases we often use non-parametric tests, i.e., tests that translate differences only into rank orders, and these too can sensitively pick up effects.

So what hypothesis testing really means, regardless of which test you apply, is this: you have an observation, you have a model of your data, and you ask about the probability that the model of your data could contain your observation.

There are errors you can make in that, and they are often summarized as follows. If the null hypothesis is true and we accept the null hypothesis, we are correct. If the null hypothesis is false, i.e., the alternative hypothesis is true, and we reject the null hypothesis, we are also correct. But there are two ways to make errors. We can make a so-called type II error, and I always discourage the use of the term "type II", because you can also call it a false negative, and "false negative" immediately tells me what you are talking about. If you tell me something about a "type II error", I need to try to remember the stats knowledge I should have paid better attention to somewhere as an undergraduate, and I'll probably get it wrong anyway. So I use "false negative", I like to be explicit, and I advocate that other people please do the same and consign type I and type II errors to the dustbin of statistical history where they really belong. So, what's a false negative? A false negative means you think it's a negative, but in fact it's not: you accept the null hypothesis, but in fact the alternative hypothesis was true. You think there is nothing going on, but you simply missed the effect. This is when you miss important correlations in your data, or miss the genes you are really interested in, perhaps because there is too much variation in your data. Alternatively, you can get a false positive. A false positive means you identify something that you are really, really interested in, but in fact the signal that caused you to identify it was just stochastic fluctuation, noise. This is where you waste your granting agency's money, because you will be following up something that will never turn out to be productive in any sense. Both errors should be avoided, and the question is: when can we safely decide that something should be believed and pursued, and when is it quite likely that we should in fact accept the null hypothesis?

This is a little bit out of order. In my scripted example, I use t-tests for comparing differential expression values. t-tests apply, in principle, to observations that are independent and normally distributed with equal variance. The one-sample t-statistic, sample versus population, is the difference between the sample mean x̄ and the population mean μ, divided by the standard error of the mean, s/√n, where s is the sample standard deviation, i.e., the root of the mean squared deviations of the values from the sample mean: t = (x̄ − μ) / (s/√n). So the larger the error, the smaller your t-value becomes. In this way we penalize noisy samples, taking into account that for samples to give us much confidence they should not be noisy; they should all have closely the same value.

In a two-sample t-test, we test whether the means of two distributions are the same. Again we are looking at data sets that are independent and normally distributed, each with some mean and variance, and we assume, and this is important, and perhaps the most tenuous assumption across the board in biology, that the two groups are independent. This is really tenuous for biological samples, because there can be dependencies through confounding factors or through subtle interactions you don't yet know about, and that can cause a lot of grief in the statistical analysis. We also assume that the variances are the same, though neither the normality nor the equal-variance assumption is all that critical if you have enough data and enough measurements. Given that, the two-sample t-test works out in a similar way, but with a numerator that describes the difference between the two sample means. Once the t-values are computed, you can translate them into p-values: there is a mapping between the t-statistic and the p-value, just as we can get p-values from a normal distribution. And then you can start talking about your p-values and asking whether a particular observation is significant (a minimal sketch follows below).
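A minimal sketch with invented numbers, computing the one-sample t-statistic by hand as defined above and checking it against R's t.test():

```r
# One-sample t-statistic: t = (xbar - mu) / (s / sqrt(n))
set.seed(271828)
x  <- rnorm(20, mean = 10.3, sd = 1)   # invented sample
mu <- 10                               # hypothesized population mean

(mean(x) - mu) / (sd(x) / sqrt(length(x)))   # computed by hand ...
t.test(x, mu = mu)$statistic                 # ... matches t.test()'s value

# Two-sample version: the numerator is the difference between sample means.
y <- rnorm(20, mean = 10.0, sd = 1)
t.test(x, y, var.equal = TRUE)               # equal-variance two-sample t-test
```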
Now I'd like to point you to a couple of papers that have appeared recently. Please load the hypothesis-testing part of the tutorial; there are three PDF files among the files that should download. If you open "Significance Revisited", you will get a paper from Nature Neuroscience, "Erroneous analyses of interactions in neuroscience: a problem of significance", published in 2011. What the authors did was analyze neuroscience papers for their interpretation of p-values, and in a distressingly large number of papers they found statements similar to this: "The percentage of neurons showing cue-related activity increased with training in the mutant mice (p < 0.05), but not in the control mice (p > 0.05)." So: a significant result in the mutants, but not in the controls. The problem is that what is being done here is two different, independent one-sample t-tests. One group, the mutants, is significantly different from the population; the other group, the controls, is not significantly different from the population. Up to that point, this is correct. However, the inference is that mutants and controls are somehow different from each other. Probably, when you read this, that slipped right past you, but that is what the sentence implies, and it is not supported by the statistics. Imagine you have 10,000 mice, the mutant mice give you a p-value of 0.049, and the controls give you a p-value of 0.051. One would be significant, the other would not, and you would then be arguing for a difference between mutant and control mice based only on a difference in p-values of two in a thousand. That is almost certainly not a real difference. This is where a two-sample t-test would have been needed: you compare the effect in the mutant mice directly with the effect in the control mice, and there is no telling how that will come out. Possibly there was a lot of variation in the control mice, and then the difference may no longer be statistically significant (a small simulation of this pattern follows below).

This is a really important cautionary tale. If I tell you this like that, you'll say it's obvious, and you understand it as soon as I say it. But these are papers that were published in Science, Nature, Nature Neuroscience, Neuron, and the Journal of Neuroscience. These are not lightweight papers, not lightweight authors, and not lightweight reviewers either. So this is a really important cautionary tale about statistical analysis and the many traps and pitfalls it holds. R makes it really easy to do statistical analysis, but there is no guarantee that you are doing the statistical analysis correctly. Don't assume that just because I showed you how to run a t-test, you will be running the right t-test, in the right context, on the right numbers and the right data; that's not going to fly. And that is where I am coming from when I say this: even people who are absolute experts in their fields are prone to very fundamental errors that in some cases will lead to irreproducible results or, in the worst case, to high-profile retractions. So be cautious. This is well analyzed in that paper.
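To see the pattern, here is a small simulation with invented numbers; depending on the random draw, the two one-sample p-values can straddle 0.05 even though the direct two-sample comparison shows nothing of the kind.

```r
# Two one-sample tests versus one two-sample test (all parameters invented).
mutant  <- rnorm(12, mean = 10.6, sd = 1.0)  # slightly shifted from mu = 10
control <- rnorm(12, mean = 10.5, sd = 1.4)  # similarly shifted, but noisier

t.test(mutant,  mu = 10)$p.value  # one-sample test: may fall below 0.05 ...
t.test(control, mu = 10)$p.value  # ... while the noisier group stays above it
t.test(mutant, control)$p.value   # direct comparison: typically nowhere near 0.05
```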
Moreover, and this is what's in the other two papers: in a Nature News and Views piece from this year there is a report, from March of this year, that the American Statistical Association is warning that the p-value is often misused, and the American Statistical Association has issued some guiding principles on how to properly use p-values in the interpretation of your data. P-values alone are probably not sufficient to interpret your data, because all they really tell you is whether the null hypothesis might or might not hold under certain conditions. They don't tell you anything about your experiment; they tell you something about the absence of the effect. So if you think your hypothesis is true because the p-value says so, that is wrong: it simply says that the null hypothesis, namely that your hypothesis is false, has a low chance of being correct. The paper itself is here: the ASA statement, Wasserstein et al. 2016. It has a very nice initial discussion of why the statement was necessary in the first place, and I think the reasoning will be very familiar to you, since you have all been in the field for a while. Why do so many colleges and grad schools teach a p-value cutoff of 0.05? Because that is what the scientific community and the journal editors use. And why do so many people still use a p-value cutoff of 0.05? Because that is what they were taught in college and grad school. So there is a certain circularity in the sociology of science, and it doesn't just apply to p-values; it applies to other things, like ClustalW: we teach it because it's what we do, and we do it because that's what we teach. So this is a nice discussion, but the paper of course also contains the actual recommendations, the ASA statement on statistical significance and p-values, which is quite readable and makes for great background reading.

So, p-values of 0.05. What is a p-value anyway? A measure of how much evidence we have against the null hypothesis? The probability of making an error? Something biologists want to be below 0.05? Or, best put: the probability of observing, by chance alone, a value as extreme as or more extreme than the one observed. Now that is something you really ought to understand. We are talking about the distribution of possible values here, and about a single observation of a statistic. We are not asking for the probability of that exact observation, because the probability of a single number within a continuous distribution approaches zero; it is just one point on the number line. We are asking about that number or anything even more extreme. So picture this as a Gaussian, a normal distribution, with an observation somewhere out in the tail. What we are asking about is the area under the curve from the observation out to infinity, as a fraction of the total area under the curve. That is the one-sided, asymmetric case; you can do the same in the symmetric, two-sided case using absolute values.

Now, this little sketch also immediately illustrates an alternative to rigorous analytical tests for calculating p-values: if you are able to simulate your statistical distribution, you don't need to integrate it. We can run a simulation that follows the distribution and then ask how many of the simulation results are smaller, and how many are larger, than our observation. Then, simply from the simulation counting statistics, you can say how probable it is that a value as extreme as the one actually observed occurs under these circumstances (a minimal sketch follows below). That is powerful, because if you are able to capture the ideas embodied in your experimental design in a simulation, you can use this very simple simulation procedure to get rigorous estimates of p-values for individual observations. You don't need to think very hard about whether something is normally distributed, and you don't need to know the proper conditions for applying a t-test or an F-test or chi-squared test statistics or an ANOVA; you simply run your simulation.
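A minimal sketch of the counting approach, with the standard normal as a stand-in null model; the observed value is invented.

```r
# Estimate a p-value by simulation: simulate the statistic under the null
# model, then count how often simulated values are at least as extreme as
# the observation.
set.seed(161803)
obs <- 2.1                                 # an observed statistic (invented)
sim <- rnorm(100000)                       # null model: standard normal here

sum(sim >= obs) / length(sim)              # one-sided empirical p-value
sum(abs(sim) >= abs(obs)) / length(sim)    # two-sided version, via absolute values
pnorm(obs, lower.tail = FALSE)             # analytical one-sided value, for comparison
```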
So rather than having the statistical and mathematical expertise to know whether your analytical test is correct, and hoping that that particular test models your biology well, your expertise is in the biology, and it needs to go into the simulation: you make sure that the simulation is in fact relevant and not prone to some kind of sampling error or bias. And since the simulation is in your domain of expertise, you have much better confidence that what you are doing is basically correct. So this is another way in which programming with R, or programming with computers, can be extremely helpful: using simulation tests to estimate probabilities.

Permutation tests are one example. Say you have data with multiple categories associated with each observation. You select a statistic, a difference of means, or a t-statistic; it doesn't matter much, most statistics will do. You compute the statistic for the observation of interest, which gives you one value. Then you do a large number of permutations: if you're looking at expression profiles, say, you just shuffle the expression values across the categories. With these permuted observations you count how many of the permuted statistics are smaller, and how many are larger, than your observed value, divide by the number of trials, and that gives you an estimate of the probability. You can even do that many times over, look at the variation in these estimates, and put some kind of confidence interval on your estimated p-value (a minimal sketch follows below). In principle this is a bit of a game-changer: it lets you defer, shift, the required expertise into the biological domain rather than the mathematical-statistical domain.
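A minimal sketch of such a permutation test on invented data, using the difference of means as the statistic:

```r
# Permutation test: shuffle the group labels, recompute the statistic,
# and see where the real value falls among the permuted ones.
set.seed(141421)
a <- rnorm(25, mean = 10.0)            # group one (invented)
b <- rnorm(25, mean = 10.8)            # group two (invented)
obs <- mean(b) - mean(a)               # observed statistic: difference of means

pool <- c(a, b)
N <- 10000
perm <- numeric(N)
for (i in 1:N) {
  shuffled <- sample(pool)                                # permute all values
  perm[i] <- mean(shuffled[26:50]) - mean(shuffled[1:25]) # statistic on permuted groups
}
sum(abs(perm) >= abs(obs)) / N         # two-sided empirical p-value estimate
```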
Now, this is my error: this script expects the data objects to persist, which of course they don't if we reload it in a separate project. What do we do at this point? I will need to fix this and re-upload it so that you have the actual functions available, but I don't think we would usefully get through that today. It is something you can readily do at home: the script is extensively commented and, as with all the scripts, if there is something you're interested in that isn't clear, you're welcome to email me and I'll try to clarify by updating the script. What I'd like to do instead is at least take you through non-parametric analysis, the Wilcoxon test, because this is something that really comes up a lot, and it is appropriate especially when you have relatively small data samples that don't correspond to good statistical models, for example because they are not normally distributed. I also like it because we get to do a final little bit of playing with programming in R. Let's generate two random data samples with slightly different means.

The means are going to be only very slightly different. In this case we define 25 samples in the first data set and 25 samples in the second, and throw them together: one set of samples we draw from a normal distribution with a mean of 10, and the other we draw from a normal distribution with a mean of 11. So, not very different. If we box-plot them, we see the differences between the distributions, and if we plot the values, we can more or less see the trend of difference: the red points are the ones taken from the distribution with mean 11, and the black ones are from the distribution with mean 10. There is a trend there, but it's kind of hard to say, and especially if we plot them in random order it is hard to say whether there actually is a difference between these two groups. If we simply compared the means and standard deviations, I don't know what the result would be.

Now, the Wilcoxon test works by counting, for each observation of one group, how many observations from the other group rank below it: whenever we have, say, a black observation, we ask how many red observations are smaller than that black point. For two equal distributions, regardless of what the distributions are, we would expect that below every observed point there is an approximately equal number of observations from the one distribution and from the other. Let's illustrate this: if I order my observations by size, I get the following picture of red and black points, and I think it becomes a little more visible that the red points trend towards the top and the black points trend towards the bottom. Now, if we run a Wilcoxon test on this, we get a p-value, and that p-value is 0.035 (a sketch of this example follows below). So does this mean we have two significantly different populations? Well, we've just talked about the conventional p-value cutoff of 0.05. Is 0.035 smaller or larger than 0.05? Smaller. So we conclude that this is statistically significant. One thing we could do, however, is ask: how robust is this?
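A sketch of this example; the seed 53 is the one from the original script, but the exact numbers you see may differ from the ones quoted above.

```r
# Two samples of 25, with means differing by about 10%.
set.seed(53)
nObs <- 25
x <- rnorm(nObs, mean = 10)   # "black" sample
y <- rnorm(nObs, mean = 11)   # "red" sample

boxplot(x, y, col = c("grey", "firebrick"))   # compare the distributions

o <- order(c(x, y))                           # plot all values ordered by size
plot(c(x, y)[o], col = c(rep("black", nObs), rep("red", nObs))[o])

wilcox.test(x, y)             # rank-based two-sample test
```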
And that's what I'd like you to ask. First of all, I had set the seed to a particular value, so I would like each and every one of you to repeat this with a random number of your own choice: a random, but different, reproducible value. If you all choose 42, or you all choose 7, as your random number, that's not very random. To make it truly random, use an expression that draws the seed itself at random, which you can rerun: if we each choose from a thousand random numbers, we will probably not have the same random number twice in the room, because the range of numbers is large. If you imagine that somebody asks you for a random number, any random number, and expects you to produce something that doesn't exceed the boundaries of the universe, that's a really poor expectation: the small random numbers that we usually produce as humans are certainly not random, given the space of possible numbers, which is infinite in size and length. Anyway: pick a number, replace the 53, run the same analysis, look at your box plot, see whether you believe the samples are different, look at the ordered plot, and then calculate the Wilcoxon test for your version of randomly chosen numbers, 25 observations per group, with means that differ by approximately 10% of their value (a sketch of the exercise follows below). I'd be curious how often we find something that's significant.
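A minimal sketch of the exercise, reusing the code from the sketch above; mySeed is my own name for the variable.

```r
# Choose a reproducible but arbitrary seed by drawing it at random.
mySeed <- sample(1:1000, 1)  # ~1000 choices: collisions in the room are unlikely
mySeed                       # note it down so your run stays reproducible
set.seed(mySeed)             # use this in place of the 53 above, then redo the
                             # boxplot, the ordered plot, and wilcox.test(x, y)
```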
OK, I think most of you, almost all of you, have run this now. Let's see a show of hands: who got a significantly different result? That's 1, 2, 3, 4, 5, 6, 7, 8, 9. And who got a result that was different, but not significant? 1, 2, 3, 4. So the majority of the room gets a significant result, about three times more often than not. Basically, what this seems to be telling us is that the Wilcoxon test is able to pick out differences even in small samples whose means differ by only about 10% of their value. So it's pretty sensitive. We say that non-parametric tests lose so-called power, i.e., compared with parametric tests we need more measurements to detect the same effect. So if you can do a rigorous t-test, which works on the numerical values rather than just rank values, that will be even more sensitive. I think this illustrates that we are quite able to pick out differences that are not really big.

Now, as a second test, I would like you to repeat this for a single random population with 50 elements and a mean of 10.5: instead of means of 10 and 11, we now use a single population with a mean of 10.5. At first we plot the black and red circles as before; here you change the parameters of your normal distribution, doing the same thing in this loop, but instead of 10 we take 10.5, and instead of 11 we take 10.5. So now these are in fact not different populations. In this case, this is our negative control for the Wilcoxon test. In the positive control, we repeated this several times over with different numbers, and we were able to determine that, if the populations are different, we get a significant result more often than not. Now the question is the other way around: if the populations are actually not different, how often do we get a result that appears different? That is our negative control. So do the same thing for a single random population: no matter what seed value you use, use a mean of 10.5 for both, so they are the same population, plotted again with black and red circles, and see if it looks different (a sketch follows below).

I'll do this myself: keep this the same, this becomes 10.5, and this becomes 10.5. Let's see what we have. The box plots turn out to be quite similar; slightly different, there's more variation in my first set. No obvious trend in the values, and if we plot them ordered, I really can't tell how this is going to turn out. And the Wilcoxon test gives me a p-value of 0.7, which is well above 0.05. So in this case my negative control gave me a negative result, i.e., a p-value that is not significant. Who got a p-value greater than 0.05 with the two means the same, i.e., both 10.5? 1, 2, 3, 4, 5, 6, 7 people in the room. And who got a statistically significant result? 1, 2. What was it? 0.037. So that's about as small as we had last time, with the difference in means. And how much did you have? 0.028. Again, about half of the conventional cutoff. So even with the small number of experiments in this room, we can see that we can be wrong relatively frequently.
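A sketch of the negative control, reusing the names from the sketches above:

```r
# Negative control: both samples drawn from the same distribution (mean 10.5).
set.seed(mySeed)                 # your own seed from the exercise above
nObs <- 25
x <- rnorm(nObs, mean = 10.5)
y <- rnorm(nObs, mean = 10.5)

boxplot(x, y, col = c("grey", "firebrick"))
o <- order(c(x, y))
plot(c(x, y)[o], col = c(rep("black", nObs), rep("red", nObs))[o])

wilcox.test(x, y)                # a p-value below 0.05 here is a false positive
```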
So how often would we expect to be wrong? What is the distribution of Wilcoxon test p-values under these circumstances? One way to find out is simply to illustrate it: do the same thing that you've done before, but do it in a loop; let's say, let's do it a thousand times. We build a vector of p-values, initially an empty vector: numeric(n), where we define n to be 1000 trials. If I initialize the p-value vector like this, I have a numeric vector of 1000 elements, all 0 initially. And now I write a loop that will do the same procedure I've done before with fresh random values each time, or rather not setting any seed, so that the randomness just continues, running the Wilcoxon test and figuring out what the p-values are.

Actually, I need to solve a little problem before that: how do I actually capture this number, 0.7292? Right now I'm just printing it to the screen. Do I have to write it down, or type it into an Excel spreadsheet and take it from there? A thousand times? That would be onerous. So what is this result? Is it a list? What is it, and how do we find out? What would you do intuitively? I think somebody mentioned on our first day that the thing to know about R is that it's not intuitive; well, we've been beating our heads around this for three days now. So, intuitively, what would you do with this Wilcoxon test to try to capture some of its output? Right: assign the output to a variable. That's what I would try, so let's try it. This p is a list of seven different items, and the different items are: statistic, parameter, p.value, null.value, alternative, method, and data.name, plus attributes. And p.value is what we're looking for: the p-value is an element of our list. Now, a little trial and error: I can do this, and now my p-value is captured. That's good. And could I have done the same thing another way? Yes: both forms of the syntax give the same thing; I could also have assigned the result and then used list extraction or the dollar operator on the assigned value. But that's the way we capture our p-value (a short sketch follows below).
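A minimal sketch of that capture step:

```r
# wilcox.test() returns a list (an object of class "htest").
p <- wilcox.test(rnorm(25, 10.5), rnorm(25, 10.5))
str(p)            # statistic, parameter, p.value, null.value, alternative, ...
p$p.value         # extract the p-value with the $ operator ...
p[["p.value"]]    # ... or with list extraction; both are equivalent
wilcox.test(rnorm(25, 10.5), rnorm(25, 10.5))$p.value   # or all in one step
```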
So we need a for loop: for i in 1:n, we need to do something, and as a result of that something we need to assign something to the i-th position of our vector of p-values, which captures the outputs. I'm not going to type out that something, that's up to you; but, without hopefully giving too much away, our calculation of the p-value goes into this spot: we calculate the p-value and then put it into its slot. But of course, if we run our loop like that, we'll obviously get a thousand times the same value; what we still need to do is generate different random values each time. For simplicity, and because it's too late in the day to do any significant thinking, I'll simply copy the earlier code; we could code this more efficiently and more elegantly, but now is not the time. So I have a loop here that a thousand times initializes a new matrix object, and I have to make sure my n is correctly defined, lowercase n. I run that a thousand times, I build a matrix with column 1 and column 2 of 25 elements each, I take both of them from the same normal distribution with mean 10.5, and I think that's it, right? Any mistakes? Well, it gives me something.

So how do I look at these numbers? What should I be doing? I can take the mean. Interesting, what do we expect? I wouldn't know what the mean result of a Wilcoxon test on the same population under these circumstances should be. It is close to zero, actually. That's odd: that would be a hugely significant p-value each and every single time. So let's look at a histogram. Most of our values are really, really small, and if I set the xlim between 0 and 1, I see essentially nothing. That's unexpected. I think you have something there, because I see large numbers of zeros; actually, I see almost only zeros. None of this is expected. So what are we doing wrong here? I'm using i in both loops, and of course the two i's interfere with each other. So let's not even do this; let's actually simplify. Instead of filling a matrix, I can just fill two objects with 25 values each; call them m1 and m2. I don't need that inner loop, because I do this directly; I don't need to initialize the matrix; and I just do a Wilcoxon test on m1 and m2. (Michelle has seen me mistype many more things over the years.) I think this looks a bit more presentable, even though it's late in the day. And now I actually get numbers that I expect, and a histogram of p-values that is pretty much uniformly distributed between 0 and 1. So this means that 5% of my tests under these conditions are going to turn out significant, and 1% of my tests under these conditions are going to be "highly significant", with a p-value below 0.01. What you've seen here is multiple testing in action: even though there is no difference between our populations, if we run the test often enough, we will find examples that are statistically significant. So be doubly and triply cautious, and apply the appropriate multiple-testing corrections. (A corrected sketch of this simulation follows at the end.)

Now, I'm tired, you're tired; I think it's time to close up, and I don't actually have any closing remarks. You had a question: how do you test whether a result is robust? Exactly. What we've done here with this Wilcoxon test is look at the distribution of values we get if the null hypothesis is true, and that tells us how often outcomes of each size occur. Now, there is a very general procedure we can use to ask whether a test is robust, in the sense of not being sensitive to details of the data, and that is so-called bootstrapping procedures. In a bootstrapping procedure, say you have 100 elements and you wonder whether your test depends on the presence or absence of only a very small number of them. What you do is remove 10 of the elements, refill these 10 empty slots with values that you randomly pick from the remainder, and rerun your test; and you do that many, many times over. If your distribution of p-values is very tight and you always get about the same value, then you know that your test is not critically dependent on the details of your data. But if it's all over the place, then your bootstrapping test tells you that something in the individual elements is critical, and unless you are absolutely sure that these elements are what they ought to be, and are not just outliers or artifacts, your results have to be viewed with great caution. There are several ways to define robustness, incidentally; this is the kind of robustness that has to do with sensitivity to the details of the data, to the initial conditions (a rough sketch of this also follows below).

Have a safe trip home, thanks for coming, and I hope you enjoyed these three days as much as I did.
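For working through this at home, here is a minimal sketch of the corrected simulation described above; the variable names are my own.

```r
# 1000 Wilcoxon tests on pairs of samples from the same distribution:
# the p-values should be approximately uniform on [0, 1].
n <- 1000                  # number of trials
pVals <- numeric(n)        # vector to collect the p-values
for (i in 1:n) {           # no seed: the randomness just keeps going
  m1 <- rnorm(25, mean = 10.5)
  m2 <- rnorm(25, mean = 10.5)
  pVals[i] <- wilcox.test(m1, m2)$p.value
}
hist(pVals, xlim = c(0, 1))   # roughly flat histogram
mean(pVals < 0.05)            # about 5% "significant" by chance alone
```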
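And a rough sketch of the bootstrapping-style robustness check just described, removing ten elements and refilling from the remainder; all parameters are invented, and the resampling introduces ties, so wilcox.test() would warn that exact p-values cannot be computed.

```r
# Robustness check: does the test result hinge on a few individual points?
set.seed(173205)
x <- rnorm(100, mean = 10.0)   # invented data
y <- rnorm(100, mean = 10.4)

N <- 1000
pVals <- numeric(N)
for (i in 1:N) {
  xb <- x
  drop <- sample(1:100, 10)                          # remove 10 elements ...
  xb[drop] <- sample(xb[-drop], 10, replace = TRUE)  # ... refill from the rest
  pVals[i] <- suppressWarnings(wilcox.test(xb, y)$p.value)
}
summary(pVals)   # a tight spread: the result does not depend on a few points;
hist(pVals)      # a wide spread: individual elements are critical
```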