This is a video about hypothesis testing. I'm going to introduce this topic by looking at a couple of examples. My first example is to do with flooding. We're often told that in Britain one effect of climate change is to increase the frequency with which rivers flood. And I was looking for some evidence that the frequency of flooding is actually increasing. One source of evidence comes from looking at the Thames Barrier. The Thames Barrier is a flood defence located in central London. You can see it here at the bottom right-hand corner of this picture. And you can also see the Thames Barrier in this picture, which shows the entire river Thames. It starts out near Lechlade and flows through Oxford, Henley and Windsor, then through London and finally out into the North Sea. The Thames Barrier is down here at the bottom right-hand corner again. So how does it work? Well, there are two ways in which London could flood because of the river Thames. One thing that can happen is that an enormous amount of water flows in directly from the North Sea and floods central London. Another possibility is that a huge amount of rainfall upriver of London fills the Thames with so much water that it then floods out into central London. And the main danger comes from the fact that the river Thames is tidal below Teddington Lock. So in the worst possible scenario, a massive amount of rainfall fills up the river upriver of Teddington Lock, so that the river is almost bursting its banks. And at the same time, high tide arrives, bringing a huge amount of water in from the North Sea. And then the combination of the water coming downriver from the rainfall and the high tide coming in off the North Sea produces a huge amount of water in central London, which could have an effect something like this. Of course, this isn't a real photo. This is just somebody imagining what it would look like.
Okay, so the Thames Barrier prevents flooding on the river Thames by holding back the high tide that arrives in from the sea. And in this picture, you can see that the barriers have been raised in order to hold back the water and prevent flooding. Now here's the evidence that I found from the Thames Barrier. It was built in 1983, and in the first 20 years after it was built, it was raised to prevent river flooding about one and a half times a year. But in the last four years, the Thames Barrier has been raised 11 times in order to prevent river flooding. Now, the big question is: is 11 a high enough number to give us good evidence that flooding is happening more often? We'd obviously expect six raisings in four years if the rate is one and a half times per year on average. So is 11 so much bigger than six that we can say, yes, that's really good evidence that flooding must actually be happening more frequently? And this is what we use hypothesis testing for. So let me show you how we would do a hypothesis test in this case. First of all, we've got a random variable, a test statistic, which is the number of times that the Thames Barrier has had to be raised to prevent river flooding. And let's assume that it's got the Poisson distribution, because we're talking about the number of events in a fixed period of time, in this case four years. Then we set up something called a null hypothesis, which is a sort of default statement that tells us how things have been before. In this case, the null hypothesis will say that lambda is equal to six: the expected number of times the barrier has to be raised should be six if the rate is one and a half times per year and we've got a time period of four years. Next, we set up an alternative hypothesis. In this case, this says that lambda is more than six. So what we're saying here is that perhaps in reality, flooding is happening more often.
And so the number of times we should expect to raise the barrier is now actually more than six. Okay, now I said that the actual value of the test statistic, the actual number of times that the barrier has had to be raised, is 11. And the big question that we want to ask is: is 11 a big enough number to be able to say that the alternative hypothesis, rather than the null hypothesis, is true? Now the way that we answer this is really clever. What we do is we assume that the null hypothesis is true. We assume that lambda is six. And then we work out how likely it is that we would get a number like 11 as the number of times that the barrier has had to be raised if the expected number of raisings is six, if lambda is six. And if we find that that probability is really small, if it's really unlikely that we would get 11 raisings when the expected number is six, then we'll say: ah, well in that case, the null hypothesis can't be right. It's so unlikely that we would get this outcome if the null hypothesis is true that I don't believe the null hypothesis any more. So what we do is we work out the probability of getting a number like 11. So we work out the probability that X is greater than or equal to 11 on the assumption that the null hypothesis is true. And obviously we do that by doing one take away the probability that X is less than or equal to 10. And we can find that by looking in the tables. If we find the column headed by lambda equals six and look along the row where x is equal to 10, we'll see 0.9574. So we do the sum one take away 0.9574, which is 0.0426. Okay, so that's quite a small probability. Now at this stage we need one more thing. We need something called a significance level for our test. And we usually use the letter alpha to talk about the significance level. A really common significance level to choose is five percent, but the actual level depends upon how rigorous you want the test to be.
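If you'd rather not use the tables, the calculation we've just done can be sketched in a few lines of code. This is my own illustration, not part of the video: Python, with a hand-rolled Poisson CDF summed term by term.

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

lam0 = 6         # null hypothesis: lambda = 6 (1.5 raisings/year over 4 years)
observed = 11    # actual number of raisings in the last four years
alpha = 0.05     # five percent significance level

# P(X >= 11) = 1 - P(X <= 10), worked out assuming the null hypothesis is true
p = 1 - poisson_cdf(observed - 1, lam0)
print(round(p, 4))                                  # 0.0426, as in the tables
print("reject H0" if p < alpha else "do not reject H0")
```

Since 0.0426 is below the five percent significance level, the sketch reaches the same verdict as the test done by hand: reject the null hypothesis.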
Basically, the smaller the percentage, the more rigorous the test. So now we compare our probability with the significance level. In this case, what you'll see is that the probability is smaller than the significance level. So what that means is that we've got a really small probability, showing that it's really unlikely that we would get 11 raisings of the barrier if the expected number of raisings is actually six. In other words, it's really unlikely that we would have to raise the barrier so many times if there hasn't been any change in the average number of flood events. So what we can do in this situation is reject the null hypothesis. We can say: well, the null hypothesis probably isn't true, it's really unlikely. And in context, we would say that at the five percent level of significance, there's enough evidence to conclude that the rate of flooding has increased, that the number of times you're going to have to raise the barrier is more than it used to be. Okay, so this is how a hypothesis test works. There are a few features to draw your attention to. First of all, the null hypothesis, which remember is the kind of default statement which specifies a parameter for our sampling distribution. Secondly, the alternative hypothesis, which is what we would like to try and show. And thirdly, the significance level, which is the threshold value for deciding whether the probability is low enough. The idea there is that if we get a probability that's lower than the significance level, then we can reject the null hypothesis. And one more thing to draw your attention to is that we found the probability that X is greater than or equal to 11. In other words, the probability that we get the actual number of events we observed or more than that. And that can be quite hard to understand at first, because it feels like we should just be finding the probability that the barrier has to be raised exactly 11 times.
Why are we finding the probability that it's 11 or more than 11? And the answer is that the probability that it's exactly equal to 11 is almost always going to be small. In most situations, the probability that a random variable is exactly equal to one particular value will be small. For example, if lambda were very large, if lambda were 100, then even the probability of getting exactly 100 events would be very small, even though that's the expected value. Because as well as being 100, it could be 99 or 101 or 98 or 102, and the probability is spread out between all those different possible values. So the probability of getting one particular value is usually small, and it doesn't make sense to just look at the probability of the test statistic equalling something. If we want to run a hypothesis test, we need to find the probability of getting 11 or a number like 11. And the way that we actually do that is to find the probability of getting 11 or more than 11. So it's the probability of observing the actual number of events or more than the actual number of events: the probability of 11 closures or more than 11 closures. Note that it's more than 11 rather than less than 11, because here the alternative hypothesis is saying that the expected number of closures is more than six. So if 11 is bad for the null hypothesis, because 11 is more than six, numbers that are greater than 11 will be even worse. 12 is worse for the null hypothesis than 11 is. So it's numbers that are more than 11 which are like 11 in suggesting that the null hypothesis is wrong. If the alternative hypothesis said that lambda was less than six, then we would be finding the probability that X was less than or equal to something. Okay, so here are the definitions that you need to remember. The null hypothesis is a default statement that you're hoping to reject.
But it's precise: it specifies a parameter of the sampling distribution, so you can actually calculate some probabilities. Secondly, the alternative hypothesis is the thing that you'd like to justify. It's what you might believe, and you're trying to find evidence that the alternative hypothesis is actually true. Thirdly, the significance level of a hypothesis test is the cut-off point for rejecting the null hypothesis: you reject the null hypothesis if you get a probability lower than the significance level. And you might be interested to see a graph showing the number of closures due to river flooding at the Thames Barrier since it opened. You can see that as time has gone by, there do seem to be more and more closures, although the closures seem to have peaked between 1999 and 2003. You might agree that if you just looked at this graph in isolation, you might be unsure whether you could say that the need to shut the barrier due to the possibility of flooding has increased. So it's really good that we've got this proper mathematical test, which enables us to say with some confidence that the number of times we have to shut the barrier due to the possibility of flooding has increased. This graph shows us something else that's important about hypothesis testing: because it all depends on data, which is random, a hypothesis test can end up leading us to the wrong conclusion. For example, if we had carried out the test in 2008, there wouldn't have been any recent cases where the barrier needed to be raised, and so we'd have ended up not rejecting the null hypothesis. We'd have said, well, there isn't sufficient evidence to say that the amount of flooding is increasing, and that would have been a mistake, because we can now see that it is. Notice that the opposite thing can happen as well.
You can end up rejecting the null hypothesis even when the null hypothesis is true, if you happen to get freakish data where your test statistic is unusually big or unusually small just by chance. So do remember that hypothesis testing uses data which comes about through a random process, and so it's not an infallible guide to what's true and what's false. It's just the best kind of decision procedure, the most rational decision procedure, that we as statisticians can come up with. One more thing before we finish this example, though. I said at the beginning, let's assume that this random variable has the Poisson distribution, and it would probably be a good idea to think about whether that's actually reasonable. So here are the criteria for a Poisson distribution, and you might like to think about whether they're met. First of all, remember, the events must occur randomly in a fixed interval of time or space. Well, that's probably okay, because we did have a fixed interval of time, it was four years, and presumably we're happy to say that flooding is a kind of random event. It's certainly not a predictable one, or at least it's not predictable more than a week or two in advance. The next criterion is that events must occur at a constant average rate, and you can probably see that that's not really going to be met, because the probability of flooding must change at different points in the year. In particular, it's going to change at different stages of the lunar cycle, because at some times the high tide is higher than at other times, and obviously when the high tide is at its highest, that's when London is particularly vulnerable to flooding, and at those times it's more likely that we're going to have to raise the barrier. Another criterion is that the events must occur independently and one at a time.
And actually that's probably not met either, because if you're going to have to raise the barrier for one high tide, then it's quite likely that you're going to have to raise it for the next high tide as well. If you had flood conditions at one point, then 12 hours later, or however long it takes to get from one high tide to the next, it's quite likely that those same conditions will prevail, and it's still quite likely that you'll have flooding. And actually you can see this, because if the barrier has to be raised for one high tide, it does often have to be raised for the next one as well. So that means that the events aren't really independent. So our test statistic, the random variable which is equal to the number of times the barrier has been raised, probably doesn't really have the Poisson distribution. And sadly that rather spoils our hypothesis test and weakens the conclusions we can draw from it. But this just shows how trying to use maths in the real world is actually very exciting and complicated, because we can't just use the simple distributions which we're learning about at this stage of our mathematical education. If we want to use mathematical modelling in the real world, we need more sophisticated distributions which you don't know about yet. But never mind: I hope that using the Poisson distribution helped you to understand how our hypothesis test would work. Okay, let's move on to another example. And this one's going to be about comedy, because something that people often say is that British people and Americans have a different sense of humour. So let's look at a question that's got something to do with that. First of all, do you recognise who these people are? I think that the man in the middle might be particularly recognisable: that's Michael Palin. And you might recognise the guy at the back in the chef's hat: that's John Cleese. Because this is Monty Python.
Okay, so let's suppose that we show the Monty Python parrot sketch, which is one of their most famous sketches, to a British audience, and 80% of them laugh out loud when they first hear it. But when we take that sketch and play it to a group of Americans, only 19 out of 30 of them laugh out loud. So is this sufficient evidence to say that Americans and British people have a different sense of humour? Or at least that they differ in their response to Monty Python sketches? Well, here we've got a random variable with a binomial distribution, because we looked at 30 Americans and we asked ourselves whether each of them laughed out loud. So our random variable X has the binomial distribution with 30 trials and probability of success P, success being that they laugh. And here we use the null hypothesis that P is 0.8, because 80% of British people laughed, and the default should be that Americans are the same as the British. Our alternative hypothesis this time will be that P is not equal to 0.8. We're not saying that P is less than 0.8 or that P is more than 0.8, because I didn't say that we're interested in whether Americans find Monty Python less funny or more funny. I just said we wanted to know whether Americans differ in their sense of humour. And this time we found that 19 out of 30 Americans laughed, so the actual value of our test statistic is 19. Now, you may remember from studying the binomial distribution that it's a bit annoying to look up probabilities in the tables when you have P equals 0.8, because the tables only tell you about values of P up to 0.5. So we're going to have to think about a different random variable.
That random variable, Y, is the number of people who don't laugh, although how you could possibly not laugh at the parrot sketch I don't know. We'll find it by doing 30 take away X, and that random variable will be binomially distributed with 30 trials and 0.2 as the probability of success. This is assuming that the null hypothesis is true; remember, we always assume that the null hypothesis is true during the course of the test. The actual number of people who didn't laugh is 30 minus 19, which is 11. Now what we want to do next is to find the probability that Y is greater than or equal to 11. Let's just stop and think about that. We don't want to know the probability that Y is exactly equal to 11, because that will usually be small and it's not helpful to work that out. We want to know the probability that we get 11 people who don't find Monty Python funny, or more than 11 people who don't find Monty Python funny. More than 11, because that's even worse for the null hypothesis: 11 people not finding Monty Python funny is quite bad, because if the null hypothesis is true, we'd expect only six of them not to laugh. 11 is bad; more than 11 is even worse. So we find the probability that 11 or more than 11 people don't find Monty Python funny. And obviously we work that out by doing 1 minus the probability that Y is less than or equal to 10. We can look that up in the tables: we look for where n is 30, find the column headed by 0.2, scan across from 10, and discover the probability 0.9744. So the probability that we're looking for is 1 minus 0.9744, which is 0.0256. At this point we need a significance level, and let's choose 5%, because we very often choose 5% as the significance level in hypothesis tests. But at this stage we need to think about something very important. There are two ways that we could end up rejecting the null hypothesis.
One way is if lots more than six Americans stay silent, and the other way is if far fewer than six Americans stay silent. And so when we're asking ourselves what counts as an outcome like 11 people not laughing, we need to include not just 11 and more than 11, but also numbers that are way smaller than six. So when we ask for the probability of getting an outcome like 11 not laughing, we need to remember that at the moment we've only worked out half of it. We've only worked out the probability of getting 11 or more. And if we want all the outcomes that are like 11, we need to include the probability of far fewer people not laughing, of getting numbers much smaller than six. So at this stage, to work out the probability of getting an outcome like 11 not laughing, we need to double 0.0256, so that we're not just counting one tail but both tails: not just the case where more people are staying silent, but also the case where fewer people are staying silent. We double 0.0256 to get 0.0512, and of course this time 0.0512 is greater than 5%, and so in this case we should not reject H0. And all we can say is that at the 5% level of significance, there's not enough evidence to conclude that Americans differ from Britons in their appreciation of Monty Python. Now this doubling is something very important that you need to remember. There are two types of hypothesis test. The first test we did was what we call a one-tailed test: if you're trying to show that a parameter is greater than or less than a particular value, the hypothesis test is one-tailed. But if you're doing a hypothesis test where you're trying to show that a parameter is different from a particular value, then it's a two-tailed test. The example we've just been looking at was a two-tailed test, and that has the added complexity that you need to double the probability that you work out before you compare it with the significance level.
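As with the flooding example, the two-tailed calculation can be sketched in code. Again this is my own illustration rather than anything in the video, with variable names of my choosing:

```python
from math import comb

n, p0 = 30, 0.2   # Y = number who don't laugh; under H0, P(not laughing) = 0.2
observed = 11     # 30 - 19 people stayed silent
alpha = 0.05      # 5% significance level

def binom_cdf(k, n, p):
    """P(Y <= k) for Y ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

one_tail = 1 - binom_cdf(observed - 1, n, p0)   # P(Y >= 11), about 0.0256
p_value = 2 * one_tail                          # doubled: the test is two-tailed
print(round(one_tail, 4), round(p_value, 4))    # 0.0256 0.0512
print("reject H0" if p_value < alpha else "do not reject H0")
```

The doubling is what makes the difference here: the single tail, 0.0256, is below 5%, but the doubled value, 0.0512, is not, so the null hypothesis is not rejected.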
Okay, let's just think about whether the binomial distribution was really an appropriate model in this situation. Here are the criteria for a binomial distribution. First of all, the number of trials must be fixed. Well, that's okay: we looked at 30 Americans, so we had a fixed number of trials, the 30 people. Secondly, each trial must have the same two possible outcomes. Well, that seems okay, because either they laughed or they didn't laugh. But then again, when you think about it, maybe there would be some ambiguous cases between laughing and not laughing where you weren't quite sure whether that counted or not, and that would present a problem. Thirdly, the trials must be independent, and I suppose that depends on how we did it. If we put the Americans all together in one room, then I guess it wouldn't be independent, because probably if some of them laughed, that would make it more likely that some of the others would laugh. But if we put them all in separate rooms and just monitored them, then I guess the trials could well be independent. And finally, the probability of success must be the same for each trial. And it seems clear to me that that won't be the case, because surely everybody's different: some people are going to be more susceptible to Monty Python humour and other people are going to be less susceptible. So I guess that the probability of success, the probability of somebody laughing, isn't going to be the same for each person. So unfortunately this otherwise extremely scientific test about sense of humour isn't quite perfect, because I don't think that the random variable would really have the binomial distribution. But just as before, it's quite nice to think about it in terms of the binomial distribution, and again this is a nice insight into how trying to do this kind of stuff in reality requires some more sophisticated maths and some more sophisticated thinking than these neat examples do. But you know, that's all part of the fun of maths.
Trying to apply all this stuff in reality is even more exciting than these examples suggest. Okay, so this is nearly the end of my introduction to hypothesis testing. It's something which you'll want to practise a lot, and look at a lot more examples, until you get the hang of it, but I hope this has been enough to give you the idea. There are some key definitions that you need to remember. First of all, the null hypothesis is a default statement that you're hoping to reject, and it specifies a parameter of the sampling distribution. Secondly, the alternative hypothesis is the thing that you're trying to justify, the thing that you're trying to get evidence for. And thirdly, the significance level is the cut-off point for rejecting the null hypothesis: you reject the null hypothesis if the probability is lower than the significance level. And finally, there are two types of test. If you're trying to show that a parameter is greater than or less than a particular value, the hypothesis test is called one-tailed, and that's the easier kind. A hypothesis test is two-tailed if you're trying to show that a parameter is different from a particular value, and in that case there's an extra step which you need to get clear about. So that is the end of my video about hypothesis testing. I hope you found it useful. Thank you very much for watching. By the way, I suggest you go and watch the parrot sketch. I suspect it will have had more views and more likes than this video.