Next part, which is when we start really speaking about statistics. So we are going to speak about distributions in statistics, how we can play with them in Python, and then slide into statistical hypothesis testing. I know that probability distributions are not necessarily the sexiest topic, but if you understand a bit better what a statistical distribution is and how we manipulate them, then statistical hypothesis testing makes a lot more sense. So it's very important that we pave the way toward that. OK, so probability distributions. I always start with a little bit of imports: I make sure that I have matplotlib, seaborn, scipy (scipy.stats will be our main library for doing statistics for now), pandas and numpy. All right, so we can ask the question: what is a probability distribution? I'm going to plot here, because it's much easier with a visual representation. A probability distribution is a tool that we use to describe the behavior of a random variable, that is, a quantity or variable whose value varies. And if you will, it varies randomly, but this randomness has a rule to it: some events have more probability of occurring than others, and the pattern of these probabilities is what makes the distribution. We like to represent a distribution by its probability density function when it is continuous, or its probability mass function when it is discrete; that's just a question of terms. But the idea is that you have different events, and each event is weighted by its probability of occurring. And you can either represent it like this, with the probability density function, or in a cumulative way, where you just accumulate the probability. And because we talk about probabilities here, the total probability of all events always sums to 1. Kind of makes sense: the distribution represents the entirety of what can happen, so the total probability of everything is 1 by definition. And that's why in this cumulative plot we start at 0 and arrive at 1. That's a generic property of probability distributions, which we use a lot to simplify our computations later on. All right, so here I show you two distributions which you might have heard about. The first one, I'm sure you've heard about it, is the normal distribution. You can see this little bell shape, very symmetrical. Here it is centered on 0 with a standard deviation of 1, so that's the centered and scaled (standard) normal distribution. And here you have an alternative, which is a binomial law. This one is also fairly common. A binomial law describes how many successes you have among n trials, each trial having a probability p of success. Here that corresponds to 5 trials, each with a 0.4 probability of success. So you can see that you have a certain probability of having 2, 3, 4 successes, and so on and so forth. All right, so far so good, simple enough. OK, so as you can see, when I've talked to you about these different distributions, either the normal or the binomial, I've named them, but I also used numbers to qualify exactly what they were: this is a normal distribution with mean 0 and standard deviation 1; this is a binomial distribution with number of trials n = 5 and probability of success p = 0.4. So that means that on top of just saying what they are, we also have to specify a bit where they are and what they look like.
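For reference, here is a small sketch (not the course notebook's exact code) of how these two example distributions could be drawn with scipy.stats; the parameters are the ones quoted above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Standard normal: continuous, so we plot its probability density function (PDF)
x = np.linspace(-4, 4, 200)
plt.plot(x, stats.norm.pdf(x, loc=0, scale=1), label="Normal(0, 1) PDF")

# Binomial(n=5, p=0.4): discrete, so we plot its probability mass function (PMF)
k = np.arange(0, 6)
plt.bar(k, stats.binom.pmf(k, n=5, p=0.4), width=0.2, label="Binomial(5, 0.4) PMF")

plt.legend()
plt.show()
```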
So for the normal distribution, and for many, many distributions, this boils down to their mean, which we also call their location, because the mean is basically where you are, and their standard deviation, also called scale, because it describes the spread of the points. And we use these two things to specify exactly what distribution we have in front of us. So here are three examples: one with location 0 and scale 1, one with location -2 and scale 1, one with location 1 and scale 3. And there you can see them. You see that changing the mean changes the location, and changing the scale or standard deviation changes the spread around the mean. Again, I think this is simple enough; if you have not done any stats in a long time, this just puts it back in front of your mind. And in terms of notation, we say that we have a random variable, some X, and we say that it follows something specific by using this tilde character: X ~ N(m, s), so X follows a normal distribution of mean m and standard deviation s. That's just the notation that we use. Now I want to spend some time to come back to why everyone talks so much about the normal distribution. The normal distribution is maybe the most well-known distribution, and with good reason. For instance, if you have computed any 95% confidence interval, you may have heard about this value of 1.96. Or you might have heard about two standard deviations: are you at x standard deviations away from the mean, and if you are above 2, then your result is significant. Has any of you heard about this sort of rule of thumb? Yes, no? Yes for some, maybe not many. OK, so this is a very, very classical thing. And it is due to a fairly peculiar but super useful property of random variables. I will cite Wikipedia here: in some situations, when independent random variables are added together, their sum tends toward a normal distribution. So imagine you have a bunch of random variables, completely independent: when you sum them together, they tend toward a normal distribution. You can take almost any sort of random variable; when you have enough of them summed together, you get a normal distribution. So that's kind of a nice property. But it becomes super powerful when you consider that an average is, at its core, a sum. You sum everything in your data, and only at the end do you divide by the number of observations. But it's a sum, and each element in this sum is a different observation of a random variable. And if they were drawn independently, you have your independent random variables. So that means that when you have enough samples in your data and they were drawn independently, the mean of your sample will follow a normal distribution. And that's a fairly strong property, because if you know that it follows a normal distribution, then you can make a lot of predictions about it. And if you can make predictions, that means you can test hypotheses. Okay, makes sense so far? Yes? Of course, you don't want to take me at face value, and I like never to take things at face value, because here it says I need to have enough samples, right? So what is enough? We will see that in a second. First, let's just check this first property.
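As a quick stand-alone illustration of that summing property, before the notebook's own simulations, here is a minimal sketch (assuming Uniform(0, 1) draws, which are clearly not normal):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed, just for reproducibility

# Sum 30 independent uniform draws, repeat many times, and look at the shape of the sums
sums = rng.uniform(0, 1, size=(10_000, 30)).sum(axis=1)
plt.hist(sums, bins=50, density=True)
plt.title("Sums of 30 independent Uniform(0, 1) draws: a bell shape appears")
plt.show()
```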
So I will use a lot of simulation during the rest of the course, because each time I make some sort of grand assertion like that, I like to use a simulation to demonstrate it to you. And it can also be useful to test the limits: how many samples are enough? Is five enough, or do I need at least 100, and so on and so forth? So here, imagine that we have samples of 100 points, and we draw them not from a normal distribution but from what we call an exponential distribution, which has this kind of exponential decay curve. We draw these samples, each made of 100 points, and we compute their mean, with this little function here. So I draw one sample there, and then I repeat it, let's say, 10 times, so I will have 10 means of samples of size 100 each, then 100 means, then 1,000. And then I plot them and we see a bit what this looks like. So one sample looks like this: this is the PDF of the exponential, and you can see, they are a bit tiny, but the individual sample points are the tiny orange dots here. If I only have 10 points, this is what these 10 means look like. These are not individual points; these are the means of samples of size 100. And as I get more and more and more of them, you can see that indeed they take on this sort of bell shape here, which actually corresponds to something close to a normal. Okay, that works. Now you could say, okay, this is with simulation, so we can actually do the same thing with actual data. Here I will play again with my census fraction data, which we created in the previous notebook. What I do is that I won't use the entire data frame, because if I have the entire data frame, I have the whole population. Instead, I will draw a small sample. Imagine that, for instance, you didn't have the full census data: you would do a bunch of polls, randomly choose 10 cities and gather information, then do this again with 10 other cities, again with 10 cities, and so on and so forth. And this is basically what happens if you do one poll. We have here the privilege of knowing the population mean, and we have our little poll, so we can compute the difference between each of our polls and the population mean. You can see that if I repeat that, sometimes I will be slightly under the population mean, sometimes much more under, sometimes above, and so on and so forth. So we can see that the average of my sample is kind of randomly hovering around the population mean. There was a question by Ahmad: he wanted to ask why the numbers were different for him, but as I said, it's random, so that explains it. Exactly, yes. So that's it: we can see a little bit of the randomness; each time we draw 10 other cities, and so on and so forth. And we can look at what happens if we did a lot of these polls: what is the pattern of this difference between the population mean and the sample mean? So I repeat the same thing 1,000 times, each time I keep the mean of the poll, and then I plot that. And this is what we see. That is the raw data, okay? And you can see that the raw data is not really normal; it's kind of skewed a little bit towards zero. But here you have the distribution of sample means, so each of these is the mean of a poll of 10 cities. And here, this is the real mean.
And you can see that the sample means are spread around, not exactly centered on, but almost centered on, the actual population mean, okay? So we have this property for the mean of our sample: if your data is a sample, the mean of your data is a random variable that follows a distribution whose pattern we know. Thanks to the central limit theorem, we know that it should look like a normal distribution whose mean is the mean of the original distribution, so that would be that line here, and whose standard deviation is the standard deviation of the population divided by the square root of n, the number of points in your sample, okay? So if we write it mathematically, the sample mean X bar follows a normal distribution whose mean is the mean of the population and whose standard deviation is s divided by the square root of n. That's basically also why we want a larger sample: as the sample gets larger, the standard deviation here becomes smaller, which means this distribution becomes more and more concentrated around the real value, and thus we have a better estimator of the actual population mean. And because it's a square-root relationship, we have a sort of diminishing return, but it always gets better, okay? Also note that this only works if the sample size is large enough, as I said earlier, and this is the property that is actually hard to judge: what is large enough? Is it 5? Is it 10? Is it 30? There is no fixed answer, unfortunately, because it depends on the shape of the original distribution. For some distributions, for instance if the base variable was already normally distributed, then even a sample of size one would be enough to have the convergence; but if it's not normal, then the farther it is from normal, the more points you need for the mean of your sample to actually be normal. In general, we say that in most cases 20 or 30 is enough, but there are some fringe cases where it's not. So to me, the best is sometimes to play around with this a little bit. That's what I do here: I try different sample sizes, 5, 50 and 500, and each time I draw a lot of samples, 10,000 of them, with this number of points each, so n equal 5, n equal 50 or n equal 500. And I plot what their means look like, okay? I sample again from this exponential distribution, and I also plot the theoretical normal law, so what the central limit theorem tells us this should be. You can see that when your sample size is 5, the theoretical normal law that we would get if the central limit theorem applied is actually a bit different from what you actually see when you do the experiment, when you simulate what would really happen. So that's a case where the central limit theorem doesn't apply: the sample size is too small. But when the sample size is 50, you can see that there is still a small shift, but we are almost, almost there, and personally I would be okay with applying and using the theoretical normal law to do my testing; it's close enough. And when the sample size is 500, then you can see that they are overlapping almost perfectly, and the central limit theorem applies quite well. Okay? Does this make sense? Everything okay? Yes?
I still don't get why we do that. Why we do that? What is the logic behind it? So the logic is that, thanks to this property, the behavior of the sample mean becomes a predictable quantity. That means that, under maybe the null hypothesis that the mean of the population is some certain value, I would expect my sample mean to follow this theoretical normal law, which I can predict. If I can make a prediction based on the hypothesis, then I can compare the result of an experiment to this prediction, and thus I can test this assumption. And that offers me the possibility of doing a statistical test, of rejecting the hypothesis or not, and so of making science advance. This is the whole idea. Does this make more sense? So here, with your population, I can see where you get the original data from, something you can test against. But if I do an experiment in science, I don't have this census data, for example, which you are using right now. So what do I compare it to? So the idea is that if you have the population, you don't need statistical testing, because you already have the result. So you're right: in science in general, you just have your sample, and that's why we have to rely on the theoretical part here. In science, you can't have the blue part; you only have the yellow part, the theoretical thing, that you then use to compare with your experiment. Okay? And that's why this property is so important, because without it we would not have the yellow line: you would get your experimental result, but you would not know what to compare it against. Does this make more sense? I need time to digest. Okay. So I think it will make a bit more sense when we go on to the application of that this afternoon, when we talk about statistical testing and so on and so forth. For now, it's 12:06, so we are going to go on our lunch break for one hour. When we come back, there will be a small exercise just about manipulating probabilities. If you have any question in the meantime, don't hesitate to write it in the chat, and I will try my best to answer. Otherwise, I wish you a bon appétit. Okay. So our exercise was a case where we imagined that we throw a coin 10 times, and we record the number of times that we see heads appear, as a frequency. If the coin is fair (we presume that the coin is fair), then the expected mean should be 0.5, because we have a 50% chance of getting heads. And we also know, that's just something that we know, that the expected standard deviation of a single toss is 0.5 as well. Right. So that's the theoretical part of our coin toss. And then we ask the question: what would be the normal law followed by the mean of a sample of size 10 if the central limit theorem actually applied? So we have here this little function to make some experiments, and we can see that sometimes we get 0.4, sometimes 0.5; this varies a bit. Okay. And so we could make a ton of experiments there and then see what the actual shape of the distribution of the mean of the sample of size 10 is.
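As a rough sketch of what such an experiment function might look like (hypothetical code, not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(0)

def coin_experiment(n_tosses=10, p_head=0.5):
    """Toss a (possibly biased) coin n_tosses times and return the fraction of heads."""
    return rng.binomial(n_tosses, p_head) / n_tosses

print(coin_experiment())  # e.g. 0.4 or 0.5, varies from run to run
```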
But we have the central limit theorem, which, without having to repeat this large number of simulations (which in practice we generally cannot do), already lets us have a look at what this should look like. So if we come back up to this example here, the central limit theorem states that the mean of the sample should follow a normal law whose mean is the mean of the population and whose standard deviation is the standard deviation of the population divided by the square root of n, the number of observations in our sample. So here, in our case, that means that the mean should be 0.5, and the standard deviation of our sample mean should be this value of 0.5, the population standard deviation, divided by the square root of n. And n in our case is 10, so here we have the square root of 10, and we can then print m and s. So in theory, according to the central limit theorem, the mean of this experiment should follow a normal distribution whose mean is 0.5 and whose standard deviation is about 0.16. That's what the central limit theorem states, and that's question one. So far so good? Yes. OK. So here we take the properties of the population, the mean and standard deviation of the population, and then apply to them n, the size of our sample. So that's the theory. Now we can ask a very important question: is the CLT actually good enough to be applied here? I will not write the code from scratch, so as not to take too much time. The idea is that we have, let's say, the luxury of being able to just run some simulations here. In practice, on your data, oftentimes you will not be able to do that; this is a luxury we have because we are using a very simple example. But I want to use it to give you a little bit of intuition of how we could go about testing such an assumption. Because you cannot always do that in practice, it's good to build a repertoire of small methods, small ideas, about when this theorem applies, when it might not, and what you should look for. So, OK, our sample size is 10. Our expected mu, which is usually how we denote the population mean, is 0.5. The expected standard deviation is 0.5 divided by the square root of the sample size. Then I sample 100 means, OK, not too many. Then my normal approximation, the one stated by the theoretical normal law from the central limit theorem, is given there: it's a normal whose location is mu and whose scale is the expected standard deviation. And it so happens that I also know the theoretical binomial distribution there, because of how the problem is set up: I know that this mean is in reality following that distribution. So I can compare the normal law given by the CLT, which is an approximation, with the actual one, which I know it really follows, and then compare both with the empirical result of randomly sampling some means. And then I plot this. OK, so this is what this looks like: in blue you have the empirical sample; in green you have the theoretical values from the binomial, so this one I know is the actual, true theoretical distribution; and then in yellow you have the normal approximation that is given to you by the central limit theorem.
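A compact sketch of that three-way comparison (hypothetical variable names, assuming a fair coin and 10 tosses per experiment):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 10, 0.5                         # 10 fair coin tosses per experiment
mu = p                                 # population mean of one toss
sd = 0.5 / np.sqrt(n)                  # CLT standard deviation of the sample mean

# Empirical: 100 experiment means (each mean = number of heads / 10)
empirical_means = rng.binomial(n, p, size=100) / n

# Exact: the mean can only take the values k/10, with binomial probabilities
k = np.arange(n + 1)
exact_probs = stats.binom.pmf(k, n, p)

# CLT approximation: Normal(0.5, ~0.158) evaluated at the same points
clt_density = stats.norm.pdf(k / n, loc=mu, scale=sd)

print(mu, sd)  # 0.5 and roughly 0.16; the three arrays above are what gets plotted
```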
And so from that, now we can try and judge. When you see this, what do you think? Do you think that the central limit theorem is good enough here, or not good enough? If you think it's good enough, please use a green tick, and if you think it's not good enough, please use a red cross. So far, a majority of green ticks and one red cross. OK, so for the people who put a red cross, or even if you didn't, if you think that it should not apply, can you write in the chat why you think that is? Or you can just speak up. So here we are comparing this blue line with the yellow one. We could also say, OK, maybe a sample of 100 means is not enough, so we could sample a few more, 10,000, and then we see this. Here, if we focus a bit on that, you see that it's a bit strange, because we are sampling something in a discrete fashion, if you will, so the plotting becomes a bit weird, all right? We have to abstract a little bit around that. If you do so, you see that this actually follows fairly closely what we expect. Let me see if I can just tune my histplot a little bit here; I think I can change the bin width to 0.1. That should be a bit better, hopefully. Well, not really, I was unlucky there. It should be about the same scale, but unfortunately there are some rounding effects that create some issues. But you can see with the density there that we have something that actually follows the same distribution, and you see that the binomial law there is very close to the normal approximation. Maybe to convince ourselves of that, it will be much more obvious with a larger sample size, because the discretization will be less of a problem. And the bin width is now not appropriate, so we should have something like this. Oh, sorry, I think I made a mistake. Ah, yes, the sampling mean here should be changed as well, to equal the sample size. OK, so we can play around with these different parameters. Now it takes a bit more time to sample, of course, and we still have this small discrepancy between the two. I'm sorry, I am not able to make it visually compelling there, so I'm going to go back to 10. And we are back there. And here, in fact, what we see is that we have the same sort of shape, the same sort of frequencies. So it happens that even with a sample size as small as 10, the approximation is actually not too bad. In particular, you can see that the actual binomial law is very close to the normal approximation from the central limit theorem. All right. OK, so far so good. So my question is: should the actual binomial law follow the real data, or do we match the blue line with the other lines? Yeah, so here, due to some discretization problems, because this is a discrete distribution, the blue histogram doesn't match too well, but you can see that the KDE line actually matches the other two relatively well. The blue line is what we see empirically with 100 random samples; of course, there is some randomness in there because it's empirically drawn. The green dots are the actual binomial law, the one which I know this follows without any approximation: if I were to repeat that an infinite number of times, I would expect exactly what you observe with the green dots. And then the yellow line is the normal approximation.
So ideally, you would want to have this yellow line close to the green points. In practice, in general, you don't have the green points, so you have to compare the yellow line to the blue line with some simulations. And in most cases you will not even be able to do empirical draws, so you will not even have the luxury of the blue line; you just have to trust the central limit theorem. That's, if you will, the way this works. OK, more questions? All right, so then let's move on. The idea is that, thanks to that, we now have some theoretical properties about the mean of the sample that let us make predictions. In particular, we know the pattern of randomness that it is supposed to follow. That means we are now able to judge whether a particular observation was very likely or very unlikely, given the theoretical distribution it is supposed to follow. It's the same idea as saying: OK, I have some knowledge about the height that humans have; I know that the mean should be somewhere around 170 or 165, depending on sex in particular, and that it should vary broadly between maybe 1.2 meters and 2.10 meters. And from this knowledge, from this expectation that I have, if someone tells me that they have seen a human who was 3 meters tall, I would find that fairly unbelievable. So this is the same idea: thanks to the central limit theorem, we have expectations, under certain hypotheses, about what our sample mean should be, and that will be useful to help us make judgment calls about the likelihood of what we are actually seeing when we do experiments. In particular, we can start using a tool which we use a lot, which is the confidence interval around either a sample mean or a population mean. The idea is that you want to create an interval centered around your mean such that it contains 95% of the probability density. I will just do a small visualization there, I think it will work much better. So basically you have your draws here, and you want this area here in orange to contain 95% of the probability density. You see it's focused on the part where the density is the highest, and then you have 5% outside of this; and because it's symmetrical, you have 2.5% to the left and 2.5% to the right. And these 95% confidence intervals have a fairly nice property: if you imagine that you have a theoretical mean of 1 and a standard deviation of 1, with the theoretical normal law you can predict that 95% of the means of the samples that you draw will fall within this confidence interval, and thus 5% will fall outside. Of course, we can then test that. Say we have a mean of 1, a standard deviation of 1, and a sample size of 500. From that we can derive the theoretical law followed by the sample mean: the mean, and the standard deviation divided by the square root of the sample size. We can then use this theoretical property to compute the 95% confidence interval around the population mean and say: I predict that 95% of the sample means should fall within this confidence interval, which I compute with the PPF function. The percent point function is the way of computing quantiles, so here the 2.5% and 97.5% quantiles of this theoretical normal law that I obtained with the central limit theorem.
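A minimal sketch of that computation, using the values quoted above (mean 1, standard deviation 1, sample size 500):

```python
import numpy as np
from scipy import stats

mu, sigma, n = 1.0, 1.0, 500

# Theoretical law of the sample mean under the central limit theorem
sampling_dist = stats.norm(loc=mu, scale=sigma / np.sqrt(n))

# 95% confidence interval: the central 95% of that distribution,
# i.e. the 2.5% and 97.5% quantiles, obtained with the percent point function
ci_low = sampling_dist.ppf(0.025)
ci_high = sampling_dist.ppf(0.975)
print(ci_low, ci_high)
```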
And then I will test that: I sample 10,000 means with this sample size, and I check how many indeed fall within or outside of the confidence interval. And what you can see here is that out of the 10,000 sample means, I have 9,505, so 95.05%, which indeed fall within the confidence interval that I computed with the central limit theorem, and thus about 5% fall outside, just by chance. So far so good? Yes. OK, this is, let's say, the central property of this confidence interval, and that's where we can see that the central limit theorem, and the way it lets us predict what we should see when we do the experiment, can be useful. And this can then also be kind of reversed. The idea is that now we won't compute the confidence interval on the population mean, because in practice we don't know that population mean. We will compute the 95% confidence interval on the sample mean, which is most of the time what we do, and we will ask the question: when I draw a random sample, how often does its confidence interval actually contain the population mean, which I know here because I am doing the simulation? And here you will see that it's actually the same thing: about 95% of the time, the real mean is found within the confidence interval of my sample mean. So that's a nice reflective property, and that's what lets us compute the 95% confidence interval, and that's also why we call it a 95% confidence interval. OK, all right, so far so good with this property? Yes, no, unsure? Let me know if there is anything still very weird there, because there is an exercise coming. All right, so, a few functions to manipulate all these theoretical distributions; I've used them here and there. The pdf (for continuous distributions) or pmf (for discrete ones) returns the density at a particular point. The cdf returns the probability of drawing this value or lower; so if you want the probability of drawing, let's say, 0.95 or less, you would call stats.norm.cdf(0.95). And with ppf you get the quantiles: you give it a fraction and you get the value below which you have only this fraction of the density. So ppf(0.025) returns the value such that you have 2.5% of the density below it; that's stats.norm.ppf. If I use my standard normal law, with a mean of zero and a standard deviation of one, and I take the 2.5% quantile of this law, I should get -1.96, which is the, let's say, canonical value: 1.96 and -1.96 are what we use to get an approximate 95% confidence interval for a standard normal law. All right, that's how we use them in practice. OK, so now it's your turn to work. We take the same experiment as before: we have 10 coins, we throw them, and we count the number of times we find heads. As we have seen before, this is a case where the normal approximation from the central limit theorem works, and we know that in theory the mean is supposed to follow this law, where the expected mean is 0.5 and the standard deviation is 0.5 divided by the square root of the sample size, so the square root of 10. So we have our expectation there, provided that the coin is fair. And now I'm just going to say: OK, I do the experiment and I find heads seven times. And now I'm going to ask you a bunch of questions about how likely this result is. So, what is the probability of obtaining this result?
How likely, so what is the probability, of having found at most seven heads? That means one, two, three, four, five, six, or seven. Then at least seven, and so on and so forth. So now we want to relate one empirical observation to what is expected according to the theory. OK, so your turn to play, and as usual, don't hesitate to ask questions. And now we are going to go through the correction together. We have already seen question one: how likely was the result, that is, what was the probability of obtaining seven heads out of ten coin tosses? And then we answered the question of how likely it was to get at most seven heads, provided, again, that the coin is fair. I repeat, provided the coin is fair; it's very important, because if we don't presume that the coin is fair, then we don't have a value here for p, and so we cannot make a prediction. Later on, when we have this sort of thing, we will talk about our null hypothesis; our null hypothesis here is that the coin is fair. So question three: how likely were we to come up with at least seven heads? Before it was at most seven, now it's at least seven heads, which means the probability of drawing seven, eight, nine, or ten. Here we could do something like the probability mass function of seven, plus the one of getting eight, plus so on and so forth. But there's a nicer way to do that. The CDF only gives you the probability of drawing one value or less. However, remember, there is one nice little property: the total sum of all the probabilities is always one. And that's something we can use to our advantage, because the probability of drawing seven or more is one minus the probability of the reverse, that is, drawing at most six. Does that make sense to everyone? OK, super. And drawing six or less is quite easy, because that's exactly something we can give to the CDF function. So we have that probability, the probability of drawing six or less, and we just do one minus that. And voilà, we have 0.17, so 17 percent: that's the probability of drawing seven, eight, nine or ten. OK, so far so good. Who was able to find that? It just asked for a little bit of a reversing trick, the one minus. There's a question by Ahmad: he says that instead of the one-minus trick he summed the probabilities up to 10, but that ours is more logical. Yeah, that works as well, indeed. Cool. So, OK, we have that. So now we have the probability of drawing less and the probability of drawing more: at most, and at least. Now we come to this weird question: how likely were we to come up with a result at least as different from the expected mean of five? OK, so the question is very tortured, right? We want a result which is at least as different from the expected mean of five as seven is. If we have seven, the difference to the mean of five is two, so we want something which differs from five by at least two. All right. So we can list which values are at least as different from five as seven is: we have zero, one, two, three.
Then three is exactly as different, because the difference is two. And then we have seven, eight, nine, and 10. OK, so we want the probability of drawing any of these. So far so good? Yes? OK, don't hesitate to stop me at this point, because I know that the formulation can be a bit weird. Now, we cannot get that in one go; we compute first this part and then that part. The first part is given to us by the CDF of three. And the second part is what we just computed, because it gives us the probability of drawing seven, eight, nine, or 10. So on one hand we have that, and on the other hand we have this, and together we get 34 percent. That's the probability of obtaining something which is at least as different from the expected mean of five as what we observed, seven heads. And now we can ask: how about if you come up with only one head out of 10? We can do the same, but now, to be at least as different, only zero, one, nine and 10 qualify. So we take the CDF of one, and here we take one minus the CDF of eight, if I'm not mistaken, so that we cover nine and 10. And we get only a two percent probability of being at least as different. And now my question to you, the very last one: do you think that the coin is fair in that case? You throw a coin 10 times, it gives you heads only one time, and you have this probability there. I ask you: do you think that the coin is fair, or do you think that the coin is not fair? If you think the coin is fair, please vote with a little green tick; if you think it is not fair, please put a red cross. "Excuse me, can you please just explain again? We need a result that has the same difference as between seven and five, right?" Yes, yes, at least the same difference. So, for instance, zero is more different, so we include that, but two is less different, so that's not part of this probability. "So two, three, four, five..." Yeah, two, three, four, five, six, seven, eight are less different than the result of one; they are less extreme, if you want. Does that make sense? "Still, I think... so we have seven heads, whereas the mean is five, so the difference is two. So we need differences of two or more, right?" Yes, exactly. So that's this example there, and that's why we want two or more, and that's why we include zero, one, two, three, and seven, eight, nine, 10. And now we are in the other case, where we found only one head, and in that case, to be at least as different, we have only zero, one, nine, and 10. Okay. All right. Okay, so I see that in the vote, most people voted to say that the coin was not fair. So you take a coin, you throw it 10 times, you find one head, and then you conclude: the coin is not fair. Now write in the chat: what do you think this quantity is, and how did it lead to your decision? Do you have an intuition for what this probability that we just computed is, and what we call it in statistics? So this probability is exactly that: a p-value. Well played, this is indeed a p-value. In our case, we make an experiment, our null hypothesis is that p is equal to 0.5, and our p-value is then the probability of obtaining a result at least as extreme, so at least as far away from the expected mean, as the one that we observed during our experiment.
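For reference, a small sketch of those two-sided computations with scipy.stats (hypothetical code, matching the numbers quoted above):

```python
from scipy import stats

n, p = 10, 0.5                       # 10 tosses, fair coin under the null hypothesis

# Observed 7 heads: values at least as far from 5 are {0,1,2,3} and {7,8,9,10}
p_seven = stats.binom.cdf(3, n, p) + (1 - stats.binom.cdf(6, n, p))
print(p_seven)                       # about 0.34

# Observed 1 head: values at least as far from 5 are {0,1} and {9,10}
p_one = stats.binom.cdf(1, n, p) + (1 - stats.binom.cdf(8, n, p))
print(p_one)                         # about 0.02
```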
Okay, so the p-value is the probability of being as extreme as, or more extreme than, what was observed, under the null hypothesis. So the smaller the p-value, the smaller the probability of being at least as extreme as what we observed, and the more unlikely our observed result was if the null was true. And the idea is that you presume the null hypothesis, but if you observe something that is very, very far away from what you expected, and provided there were no problems when recording the data, of course, then you say that this is a strong argument that your null hypothesis is maybe not true. Rosario, do you have a question? "Yeah, can you hear me?" Yes. "In this last example we did with the p-value, I didn't understand how this and the previous example, where we took the range of probabilities at least as different, are the same. Because here we are stating that we get heads only once; so if we don't get heads, we get tails by definition." So here the idea is that, in that last case, we throw the coin 10 times and we get nine times tails and only one time heads. "Yep. And from the question I understood that we need to find one head and nine tails, right?" Yes. "Okay, I just didn't get why we had to include the other probabilities." Of being zero, nine, and 10? "Yes, why nine and 10 if we want to have just one option? I thought it was just zero and one." So, yes: just zero and one, versus zero, one, nine, and 10, right? That's actually a very good remark. Each time we compute a p-value, or when we do a test, we have the choice of doing a two-sided test or a one-sided test. In a two-sided test, we say that the null hypothesis is that the probability is 0.5, and the alternative, our H1 (we'll use this notation), is that p is just different from 0.5; it could be above or under, we don't make any pre-judgment on that. That's the two-sided case, and that's usually the default one. But there is another possibility: the alternative could be one-sided, for instance that the probability is under 0.5. In that case you would only test for the "under" side, so you would only count zero and one, and not nine and 10. And this is a choice that has to be made by the experimenter in the end. Does that make more sense? "Yes, thank you. It's just that, since we didn't state it, I was seeing nine and 10 as the probability of getting one tail and nine heads, while we were asking about one head and nine tails, so I saw it as two different hypotheses." Yeah, which they are, in this kind of setup. Okay, so I will paste this in the chat, so that everyone has it. And otherwise, of course, everything is in the solution that you can load, so you can find this different information, plus a little bit of the reasoning behind why we compute this and that. Okay. "Yeah, I think I'm a little bit confused now." Go ahead. "We here say zero, one, nine, 10, and these numbers are the numbers of heads, right?" Yes, heads; here I'm only counting the number of heads. "Okay, great. So now I'm confused: what did you want, one or fewer heads? Then I don't know why we included the nine and 10, if we only need one head."
Yeah, here what we presume is that the coin is fair. If the coin is unfair, and that's our alternative hypothesis, it could be unfair in two ways: it could be biased toward heads, or biased toward tails. And that's why we include both sides: we say that the coin could be unfair, or biased, in either direction. "And if it is fair, the result should be 0.5, right?" The result should be 0.5, but we know that there is some randomness, right? So when you do statistical experiments, you say: okay, you find 0.4, is this enough to declare that it's unfair? You find seven heads, is that enough? Here the p-value would be 34%, so we'd say, eh, maybe not enough to declare that it's unfair. That doesn't tell you whether it's unfair or not; it just says that you don't have enough information there to conclude. "Okay. Thanks." Thank you. Do not hesitate to ask more questions. This is, let's say, maybe the hardest part, where we do the most simple basic math. But it's also, as you can see, where we build up the whole theoretical reason why we do testing and how testing works; that's where we open the hood a little bit. And if you understand this, then the rest should follow without too much difficulty, because it is just an application of these principles. Okay. So as I said, do not hesitate to come back to this later on if you are unsure about things. Because now, from there, we move from these considerations about distributions, and how knowing about a distribution helps us set expectations. We can put forward a null hypothesis, and then we say: under this null hypothesis, I know something about how a theoretical sample should behave. And so, by comparing our expectation with our observation of reality, we can make a judgment with respect to our hypothesis, up to a point. So, in statistical hypothesis testing, we have our null hypothesis and our alternative hypothesis. They are statements about the real world, about how we think things could work. Our null hypothesis lets us define a test statistic and its expected behavior under the null hypothesis. This is a theoretical law, given here by stats.binom, or by the normal law given by the central limit theorem. And then we actually perform the experiment, which gives us our measured metric, our measured number. And by comparing that with the expectation under the null hypothesis, we obtain the likelihood, if you will, that what we have seen is probable under the null hypothesis. Okay. So a test statistic is something that you can compute from your sample; the most common one could be the mean of your sample. And then, to put that in the frame of many testing systems, a lot of the time we actually apply a few small transformations to this mean to make it comparable with a standard law, such as a standard normal law centered on zero with a standard deviation of one. So it often comes down to what we are going to see now. Let's imagine that we sample n measurements from a normal law. Let's say that we know that it's a normal law, the mean is unknown, we call it m, and the variance is known. Okay, that's far from the usual case, but we start with something very simple and then we will see the case where it's unknown.
So each measurement follows a normal law of mean m and standard deviation sigma; that m is the theoretical mean of the population. A null hypothesis would be that the unknown mean of the population is equal to a reference value m0; maybe this reference value comes from another study or whatever. The alternative hypothesis is that it is different. This is two-sided: it could be different above or different under. According to the central limit theorem, we know that the mean of this sample of n measurements should follow this distribution there: the mean of the sample will have a mean of m, so the mean of the population, and a standard deviation which is the standard deviation of the population divided by the square root of n, the number of observations in our sample. So under the null hypothesis, presuming that m equals m0, this sample mean should follow that law there. Note that I've switched m for m0, because the null hypothesis states that m equals m0. So far so good? Yep. Okay. Now, to make it comparable with a standard distribution, I take this normal distribution, and if I subtract the expected mean and divide by the expected standard deviation, I shift from this normal distribution to the standard one, with mean zero and standard deviation one. That's a very simple center-and-scale operation to turn it into something super standard. And so my test statistic is the sample mean, minus the expected mean, divided by the expected standard deviation. This test statistic, under the null hypothesis, should behave like a standard normal law, and so by comparing this statistic with the standard normal law, I can compute a p-value. So now let's say m0 equals 1, sigma equals 2, we collected 100 measurements, and our sample mean is 1.42. If we do this little computation, we get a test statistic of 2.1. And then we can ask the question: how likely was it to observe something like 2.1 under this standard normal law? This comes down to this: we have our standard normal law, centered on zero with a standard deviation of one, and all the area in red is whatever is larger than 2.1 or smaller than minus 2.1. Remember, we are two-sided in our alternative there. And so here I can again use the CDF, taking the probability of being extreme below plus the probability of being extreme above, and I get my p-value of 0.036, so 3.6%. Okay. So that's what we have just seen, but now framed as null hypothesis testing with a normal distribution. And then we have this part about two-sided or one-sided tests. If our alternative hypothesis is not that the mean is different, but that it is greater, then we restrict our test to only the cases where it's greater. Then we have a slightly different way of computing our p-value, which only takes into account this part there and not that part; that part is not part of the alternative hypothesis anymore. And there, our p-value is 1.7%. Okay. So there you go; that's just a reframing of what we've just seen. So from this very, very simple test, which uses the not-very-realistic assumption that we know the theoretical variance of the population, we can shift toward maybe one of the most well-known statistical tests: the t-test, which tests the difference between two means.
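Before moving on to the t-test, here is a compact sketch of that one-sample computation, with the values quoted above:

```python
import numpy as np
from scipy import stats

m0, sigma, n = 1.0, 2.0, 100       # reference mean, known population sd, sample size
sample_mean = 1.42

# Center and scale: under H0 this statistic follows a standard normal law
z = (sample_mean - m0) / (sigma / np.sqrt(n))

# Two-sided p-value: probability of being as extreme below or above
p_two_sided = stats.norm.cdf(-z) + (1 - stats.norm.cdf(z))

# One-sided alternative ("greater"): only the upper tail counts
p_one_sided = 1 - stats.norm.cdf(z)

print(z, p_two_sided, p_one_sided)  # about 2.1, 0.036, 0.018
```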
So here we are now in the case where it's the same thing as before, but we do not know the standard deviation; it is not known in advance. And because we don't know it in advance, this adds a layer of uncertainty, if you will: you now have to estimate the standard deviation on top of the rest. And this new layer of uncertainty means that the test statistic stays the same, but we know from the theory that it does not follow a standard normal law; it follows a t-distribution, with a new parameter, the degrees of freedom, whose value is equal to n, the number of observations, minus one. So I will show you a normal distribution and a t-distribution. You see the normal distribution in blue, and for the t-distribution you can see what happens: the central part is a bit less likely and the tails are a bit heavier, a bit more likely. So the t-distribution is like a normal distribution, but where unlikely events are made a bit more probable: it's a bit more probable to have extreme events. And that's the effect of this added layer of uncertainty when you also have to estimate the variance because you don't know it in advance. Here it's with three degrees of freedom. If the degrees of freedom were one, the tails would be even bigger; let me do something like this, and you see, with one degree of freedom, you have even heavier tails there. But if my degrees of freedom were 100, so I have a lot of observations, then the t-distribution becomes very, very, very close to the normal distribution, to the point that we used to say that when n is above 30, the t-distribution is close enough that we can use the normal approximation. That was the case when we didn't have computers to do all this for us. Nowadays, you don't need to use the approximation, but if you only have pen and paper, then the approximation is good. All right. So now, things have not changed much; it's just that we use this t-distribution to compute our p-value, and we also have to compute our degrees of freedom, which is normally n minus one, but things can get slightly more complex. I will directly show you the more modern version of the t-test (Welch's t-test): the one that does not presume equality of variance between the two groups, and where you don't have to have the same number of samples in both groups. So, our null hypothesis: if we say that we have two populations of means mu1 and mu2, and we sample n1 and n2 individuals from these populations, we now have two samples, one from population 1 and one from population 2. The first has a mean x1, the other x2, and the observed standard deviations are s1 and s2. Our null hypothesis is that the means of the populations are the same; the alternative is that they are different. And we have a number of assumptions for that test. The first is that the central limit theorem applies: if it does not apply, then we cannot make this normal approximation, we cannot use the t-distribution, we have to use something different. It will usually apply if you have enough samples, maybe more than 10, more than 20; again, it depends a little bit on the specifics of your distribution, so I cannot give you an absolute rule here. And also very important: the data should be sampled independently from one another in the two populations being compared.
Our test statistic now, with all of this information, is not exactly the same as before, but I guess you can see a few similarities. It's something that has to do with the means: it's the difference in means between our two samples, divided by something that depends on the standard deviations and the numbers of observations. Here it's the variance in sample 1, so the standard deviation squared, divided by the number of observations in sample 1, plus the variance in sample 2 divided by the number of individuals in sample 2. So it's, if you will, a kind of average of the standard deviations of both samples. And these are computed like this: they are almost exactly like the standard deviation that you know, the differences between each point and the mean, divided by the number of individuals in the sample, but with a little minus 1 here. So it's not exactly the usual standard deviation. This minus 1 is there to reflect the added uncertainty that comes from the fact that we estimate the standard deviation: if you don't put this minus 1, you will tend to underestimate the standard deviation, so we have to correct for that, and the proper correction is just this little minus 1. And finally, last but not least, our degrees of freedom is this somewhat hairy expression here, which depends mostly on n, with a little bit of scaling due to the standard deviations. We don't have to focus on the details here; the most important part to remember is that this is just something that depends on the size of both samples and a little bit on their standard deviations as well. Usually we just apply this formula, because there is a very rich body of literature that has demonstrated mathematically that this is the proper way to compute it in this case. So we take this, and now we will try it with an example, of course. I have here some data that was collected on some mice which were subjected to different diets, and which may also come from different genotypes, and we collected weight data on these mice. So let me show that to you. We do a small violin plot, and we have here our mice on the HFD and Chow diets. And you see that the mice under the HFD diet might, on average, weigh more than the mice on the Chow diet. And we ask ourselves: is this actually the case? So what I do here is that I will first do it manually, and then I will show you how to do it automatically. I just take this data and apply this exact formula to it. I separate the mice on the Chow diet and the mice on the HFD diet, and I get their weights. n1 and n2 are just the lengths of these. Then the means are also kept there, so that's my x1 and x2. Then I want the variances, and here there is this little ddof equal 1; that's the delta degrees of freedom, and that takes care of this little minus 1 here when I compute this value, which I will then need here and there. And then this little bit there is just me applying this formula: difference of means divided by the square root of sigma squared for group 1 divided by n1, plus sigma squared for group 2 divided by n2. So this is exactly that, translated into code. And the number of degrees of freedom is this little bit of code there; I will leave it to you, if you are curious, to go and check my code and verify that I've not made any mistake.
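A sketch of that manual computation (hypothetical variable names; `chow` and `hfd` stand in for the two weight arrays, here filled with placeholder data), with scipy's built-in Welch test as a cross-check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
chow = rng.normal(25, 3, size=20)   # placeholder weights; use the notebook's Chow data
hfd = rng.normal(33, 5, size=26)    # placeholder weights; use the notebook's HFD data

n1, n2 = len(chow), len(hfd)
x1, x2 = chow.mean(), hfd.mean()
v1, v2 = np.var(chow, ddof=1), np.var(hfd, ddof=1)   # ddof=1 is the "minus one" correction

# Welch's t statistic: difference of means over the combined standard error
t_stat = (x1 - x2) / np.sqrt(v1 / n1 + v2 / n2)

# Welch-Satterthwaite degrees of freedom
dof = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

# Two-sided p-value from the t-distribution
p_value = stats.t.cdf(-abs(t_stat), df=dof) + (1 - stats.t.cdf(abs(t_stat), df=dof))
print(t_stat, dof, p_value)

# Cross-check with the built-in function (should match the manual computation)
print(stats.ttest_ind(chow, hfd, equal_var=False))
```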
So I have got my test statistic, and I will then compare it with a t-distribution with this number of degrees of freedom, a mean of zero and a standard deviation of one, because I have centered and scaled. That's how I get that: stats.t, not stats.norm, because of this estimation of the standard deviation, which adds this extra layer of uncertainty. And then I take the CDF of minus my test statistic, plus one minus the CDF of my test statistic. Remember, this is two-sided, so I want what's on the left and what's on the right. I apply this, I compute a test statistic of minus seven, the degrees of freedom estimated there are 44, and that corresponds to a p-value of 9.24 times 10 to the power minus 10. And of course, we won't ask you to do that by hand in practice; this is something which already exists, which has been coded: stats.ttest_ind, the independent t-test. You give it the data for group 1 and the data for group 2, and you say that you don't presume that the variances are equal. Not presuming that the variances are equal is why we have to use this slightly barbaric expression there, but it's nice because you have one less assumption for your test. And then we get a test statistic which is exactly what I had, and a p-value which is also the same, so apparently I've not made any mistake. This function returns to you the test statistic and the p-value. Okay. So now, a quick question to you, if everything is good so far: given this result, what would be your conclusion? Green tick if you would accept H0, or at least fail to reject H0, and red cross if, given this p-value, you decide to reject H0 and say that the mice have significantly different means. Let's have a little vote with the reactions. Okay, some reject, some accept. Okay. So most people reject. And indeed, in this particular case, we would tend to reject H0 and say that these two populations have significantly different means, because the p-value is 9 times 10 to the power minus 10, so about one in a billion. That's very, very, very small: it was very unlikely to see such a result under the null hypothesis. All right. And what we would also do, on top of reporting this p-value, is report the actual difference between the means. So you would say that the difference between the means of these two samples is 8.4 grams, which is approximately 25% of the weight of the mice, give or take. So that's far from negligible; that's a significant difference, not only statistically, but also biologically, when you have a difference of mass of 25%. Okay. And if the p-value had been 0.3, I will give that to you here, we would say that we fail to reject H0. We would say: okay, here it was not so unlikely to observe that under the null hypothesis just by chance, and so we conclude that we don't reject H0. Note that we don't say that we accept H0; we just say that we don't reject it, because this high p-value could also come from the fact that you don't have enough samples: if the difference is small, it might still exist, but you would need a lot of samples to be able to detect it. I will go into that in more detail a bit later on, but I wanted to tease it. Okay. How are we doing so far?
Is everything kind of making sense, now that we have seen where we wanted to go and what the context for the test is? I think at least for me, I need some time to digest all of this. It looks okay, but to understand everything I need time to adjust. Sure, we all need some time to meditate on these concepts, some time for them to make complete sense. If I can just guide the process here: I think it's more important to focus on the concepts, on the assumptions of the test and on how we have created a setup where we have an expectation under a null hypothesis, both here and in the previous test, than on the particulars of this formula. We could go into the details of the formula, but that's far beyond the scope of this course. We will just use what the literature gives us and focus on the tests, what their null hypotheses say, and what their assumptions are. It is very important to realize that the p-value is only as good as the null hypothesis and the assumptions of the test allow it to be, because the p-value is directly defined by what we can derive from the null hypothesis and those assumptions. There is a question by Rosario: what is the null hypothesis in the mice weight example? The null hypothesis here is that the average weights of the mice are the same between the two populations. We say populations: here I only have twenty-something mice per category, but imagine that I had an infinite number of mice for each category; then their average weights would coincide. That would be my null hypothesis. Does that make sense? Okay, perfect. So, as we said: we have a test, it has a null hypothesis and some assumptions about the properties of what we collect, and it is thanks to these that we are able to compute the p-value. If we don't make this null hypothesis and don't have these assumptions, we cannot compute the p-value; and conversely, if the assumptions are not respected, then the p-value entirely loses its meaning. That is, for example, why when we call this function we have this option equal_var=False or equal_var=True: there are two ways of doing the t-test, one of which presumes equality of variances, and the formulas are not the same. So if our variances were not equal but we said equal_var=True, we would be making a false assumption, and the p-value would not have a proper meaning; it would be useless. Here we say equal_var=False: we do not presume equality of variances, we use a slightly different mathematical formula, and only then does the p-value make sense in this case. So, coming back a little bit to the assumptions of the test we have just done: the first is that the data used to carry out the test should be sampled independently from the two populations being compared. That one you should take care of when you design the experiment; for instance, here these are entirely different mice. And the second is that the mean of each sample should be normally distributed. The assumption is not that the samples are normally distributed, but that their means are.
That means that even if the samples are not normally distributed, if we have enough points, the central limit theorem ensures that their means will be very close to normally distributed, usually close enough that we don't have to worry too much and the p-value is still meaningful. However, that's more easily said than done, so let's illustrate it a little bit. I will take a case where I draw two samples from the same distribution. So I am respecting the null hypothesis, because they have the same theoretical mean. When I then do some random testing, I should sometimes get results that are significantly different just by chance, and most of the time results that are not significantly different, because the null hypothesis in my case is true. Furthermore, by its very definition, I should see p-values under 0.05 exactly 5% of the time. There will be a little bit of randomness around that because I am doing random simulation, but with 10,000 repetitions we should be fairly close. I will do it once drawing from a normal distribution, which is, if you will, the perfect case where all the assumptions are respected, and then a second time with another distribution, a Pareto distribution, which is quite different from the normal distribution; in particular, it gives a lot of very extreme results. And then we will see whether our expectation that only 5% of the p-values will be significant still holds — just to see what the effect is of breaching the assumption that the samples come from a normal distribution. So I run my simulation; it goes fast, that's cool. When I draw from the normal law, I respect the assumption: the proportion of p-values under 0.05 is about 0.05, so very close to what was expected. You can see here in green the distribution of p-values when I draw from the normal distribution, and you see that it is close to uniform, which is what is expected under the null hypothesis: the test statistic follows its theoretical distribution under the null, and so any p-value is equally likely. Then, when I use a Pareto distribution, the proportion of p-values under 0.05 is only 1.7%. So there is now a huge skew, a deformation, and the p-value that I compute has lost its meaning. When I see a p-value of 0.05, it should not be interpreted as 0.05; that is no longer its correct meaning, so I am now lost with respect to how I should interpret it. And you can see what the distribution of p-values looks like when I draw from a Pareto distribution: it's something weird, definitely not what was expected. So that's the effect that breaching the assumptions of a test can have: it makes the p-value lose its meaning entirely. We have to be mindful of this sort of thing, and I think it is again one of the very important concepts to remember: a p-value is only as good as its null hypothesis and assumptions. So it's very important, before we do any kind of testing, to test the assumptions as much as possible. That means that before we do testing, we have to do testing. Okay, so far so good? I know there's a lot of information.
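A hedged re-creation of that simulation might look like this; the sample size, seed and Pareto shape parameter are my own choices, so the exact fractions will differ a bit from the ones on the slide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_rep, n = 10_000, 30

def simulate(draw):
    # Two samples from the SAME distribution (H0 true), tested n_rep times
    pvals = np.empty(n_rep)
    for i in range(n_rep):
        a, b = draw(n), draw(n)
        pvals[i] = stats.ttest_ind(a, b, equal_var=False).pvalue
    return pvals

p_normal = simulate(lambda size: rng.normal(size=size))
p_pareto = simulate(lambda size: rng.pareto(1.5, size=size))  # heavy-tailed

print("normal :", np.mean(p_normal < 0.05))   # expected ~0.05
print("pareto :", np.mean(p_pareto < 0.05))   # typically far from 0.05
```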
I'm not sure about the p-values under the null: if you assume you have a normal distribution, why do they all have the same likelihood, like the green ones? Yes. So basically the p-value gives you the probability of being equally or more extreme than what you expect under the null hypothesis. If we respect the null hypothesis, which is the case here because I draw both samples from the same thing, then the test statistic will follow exactly its theoretical distribution. Let's take a curve so that it's a bit nicer. So it will follow exactly, for instance, that distribution. And that means that in 5% of cases I should get something which is equal to or below the 5% percentile of this distribution, right? That is the definition. But that's also true for 10%: I have a 10% chance of being at or under the 10% quantile, and the same for 15%, 20%, 25% and so on. So for any quantile, the probability of being under it is equal to that quantile. So far so good? Yeah, makes sense. And because that is exactly what the p-value measures, the probability of getting a p-value below any given value is equal to that value itself, and that's why you see this uniform distribution. Okay. Yeah. Thanks. You're welcome. That's a nice little property, and that's also why I like to use a lot of simulation, to test things and play around, to see what happens when you breach an assumption and what happens when you don't. At least to me, that helps me understand what happens beyond the theory. Okay, so how are we doing so far? Still alive? Yes. All right. So what I propose, before we move on to our next test — which will be testing normality, because if we make the assumption of normality we want to check whether that assumption is correct — is that we take our 15-minute break now. When we are back from the break, maybe this will have had some time to settle, maybe you will have had some time to manipulate the concepts, or, you know, just go and get yourself a coffee. All good? Okay, so I will pause the recording. And as usual, do not hesitate to ask questions. So, we want to test normality. There is no perfect test for that; I will explain why and what the tests are in a moment. In general we want to use both a test and a visual assessment, which can be quite useful. For that, the best graphical overview is the quantile-quantile plot, or QQ plot for short. It compares the observed quantiles, the quantiles of the points in your sample, with the ones you would expect under the theoretical distribution you compare against. Usually that is the normal distribution, which is why it is the default, but in the end it could be any other theoretical distribution you have in mind. To create one of these, I will demonstrate a QQ plot first with a sample drawn from a normal distribution, so a sample that actually respects the assumption, where you should see something close to the diagonal, and then with a sample from something other than a normal distribution, where you will see something different. So then I use stats.probplot.
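For reference, a minimal sketch of what that call might look like; the sample sizes, seed and figure layout are my own choices, not necessarily the ones in the course notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
sample_n = rng.normal(size=100)       # respects the normality assumption
sample_e = rng.exponential(size=100)  # clearly does not

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(sample_n, dist="norm", plot=ax1)  # close to the diagonal
stats.probplot(sample_e, dist="norm", plot=ax2)  # systematic curvature
ax1.set_title("normal sample")
ax2.set_title("exponential sample")
plt.show()
```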
And by default, it compares against a normal distribution. I give it the sample, and you see that I also tell it to plot in a particular axis, because I want to do two subplots. So this is what happens when you do have a normal distribution: you more or less follow the diagonal. There is always a little bit of random noise, especially at the extremities; this is quite expected. And this is what happens when it's not normal: here the points are a bit away from the diagonal, and you can see this pattern of above, then under, then above again. There is a systematic bias; it's not just one or two points moving around. Here, I think it's important to build a frame of reference for what is okay in a QQ plot and what is not, because that's a question we see very often. You have 30 points, you look at the QQ plot, and it's very hard to make a judgment based purely on it if you have never looked at cases where you could trust what you were looking at. Here, I can trust that this comes from a normal distribution, because that's what I drew, and I can trust that this one does not, because again I control it. Without this, if you don't have much of a preconception, it's not so easy to know whether something comes from a normal distribution, and it's only with a bit of experience and habit that you can look at this sort of plot and say: eh, that's quite okay, or: this is fishy. So play around with it. Maybe we can do a few draws. You see here we are very, very close to the expected line, except maybe for one small point which is a bit unlikely; but when you draw 100 points, it's not unexpected that one sits a bit away from the others. And we can draw a few more just to build the expectation. It's also important to see that if we didn't have 100 samples but 10, you would see more variation: with only a few points, they can wander around quite a bit more just by chance, and of course it then becomes harder to detect cases where you depart from the assumption of normality. So here, with so few points, we know in theory that these are not normal, but if you encountered this in practice, in real data, it would be very hard to make a judgment about whether it is normal or not. Making sense so far? Yeah. Okay. So when in doubt with QQ plots, I would say do not hesitate to play around and create a few where you control what happens, to build your expectations. All right. And then, with this graphical overview, we oftentimes pair a statistical test. Here I will propose the Shapiro-Wilk test, which is maybe one of the best normality tests. There are a few others; for instance, you will also see normaltest used a lot, which is also known as D'Agostino's test, and there are a few others as well. The real problem with these normality tests comes from two points. The first is that their null hypothesis is normality. And in statistical testing, what we would like in general is for the null hypothesis to be the thing we want to reject. You know that in science we can only advance by rejecting hypotheses; we do not want to be in the position of accepting our null hypothesis.
That's not really how the system works, because seeing a large p-value does not mean that the null hypothesis is true. It may mean that it's true, but it may also mean that we don't have enough points to detect a difference. So that's the first very important problem with these normality tests. The second is that they often only really look at a few properties of the distribution. For instance, they will test whether it is symmetrical and whether it has the right shape of peak and tails — not too heavy-tailed, not too short-tailed. But that's just looking at a couple of properties of the normal distribution among many; other tests look at other properties, but none of them is all-encompassing. So that's another limit. Furthermore, the Shapiro test is limited to 5000 points maximum, which covers most cases; if that's not enough, there are other tests, but that's the idea. That's why I always couple a visual assessment with the test. I never do only one, because sometimes the test does not detect a departure from normality, since it only looks at a few things, while the eye might catch it. Conversely, when you are in a weird case like this, when you don't have a lot of points, having a test telling you that it's not so unlikely to see something like this if it came from a normal distribution can help you decide. Okay. So, the Shapiro test: I will not go too much into the details, but the idea is this. If we take our sample from a normal distribution and our sample from an exponential distribution, the null hypothesis is that we have normality, and here, with very small samples of size 10, you can see that the p-values are 0.1 and 0.13 for the normal and the exponential data respectively. So remember, in the case where the p-value is large, that does not mean the data comes from a normal distribution, because we know that's actually not the case here. It just means that we may not have enough points to make an informed judgment. And of course, in practice you don't know this truth, so with a large p-value you cannot tell whether your null hypothesis is true or whether you just don't have enough points. That's why we don't say that we accept H0; we always say that we fail to reject H0. It's just a way of suspending our judgment. So, from Rosario: so in this test, we need to get a large p-value to accept? In this test, we need a large p-value to fail to reject the null hypothesis, if I may, but that's the idea, yes. A large p-value means that your draw is fairly likely under normality. With that being said, if my sample size is 10, to me it makes sense to say that it's hard to make a judgment. If my sample size is instead 1000, then finding a large p-value is actually quite indicative of data which is very close to normal. And here you see that with something which is not normal, you get a very small p-value, practically indistinguishable from zero, 10 to the minus 30. So there we also have to exercise a bit of judgment, right?
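If you want to see this sample-size effect for yourself, here is a hedged little sketch; the seed and sizes are mine, so the p-values will not match the slide exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
for n in (10, 1000):
    sample_n = rng.normal(size=n)       # truly normal
    sample_e = rng.exponential(size=n)  # truly non-normal
    print(n,
          "normal p =", stats.shapiro(sample_n).pvalue,
          "exponential p =", stats.shapiro(sample_e).pvalue)
```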
You will not interpret the result in exactly the same way depending on the number of points you have. Here is another example: with 100 points we get a p-value of 8% with normal data and 10 to the power minus nine for non-normal data, right? So that should also color a little bit how you approach these. And you see how, because we are not in the nice frame where we want to reject the null hypothesis, we are much less comfortable with our conclusion. That's the problem. Okay. So with that being said, if there is no burning question, it's time for a small micro-exercise for you. We have done the test, but we forgot to check the assumptions of the test. So now check the normality of the weights for the mice Chow data and HFD data. Make sure that you have them, then maybe do a QQ plot for each and a Shapiro test for each, and say what your conclusions are. All right, I will let you work. Please put a green tick next to your name if you think that normality is quite okay, and a red cross next to your name if you think that it is not. All right, once you have done the test... Okay, so most of you have answered. I see a majority of people who think that normality is okay, and two say no. So let's have a look together. The first thing I did was to lazily copy and paste the code from up there to down here and just change sample_n and sample_e to the Chow and HFD data. I love copy-pasting; it's a very useful art. And then I do the same thing here again, copy and paste, and change sample_e and sample_n to the HFD and Chow data, and we should be fine. Let's look at it. We get a p-value of 15 or 16%, let's say, for the Chow data, and 38% for the HFD data. So in both cases the p-values are not significant at the 5% threshold. It says: you have about 30 points, and they do not depart significantly from normality, so we fail to reject H0. And when you look at the QQ plots here and there, it seems kind of close-ish: this one follows very well, and here there is one point which is maybe a bit unlikely, but again, out of 30 or so points, that is not completely crazy. One thing I can also check is the length of the HFD data and the length of the Chow data: 29 and 21, so HFD has 29 points and Chow has only 21. And if you want to help convince yourself whether this is okay or not, you could go back up there, change the size to 21, generate a few samples of size 21, and ask: if I just showed you this plot, does it seem much more unlikely than whatever you see here? If I were to repeat that 10 times and put this plot in the middle of the 10 others, would this one completely stand out, or would it look about the same? That is the sort of judgment you want to make here. For me, this is really not so different from what's expected, and the p-values are both large, so my judgment would be that if there is a deviation from normality, it is really not so large, right?
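Put together, the copy-paste solution I just walked through might look roughly like this, reusing the chow and hfd weight arrays introduced earlier as placeholders for the course variables.

```python
import matplotlib.pyplot as plt
from scipy import stats

# QQ plot for each diet group
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(chow, dist="norm", plot=ax1)
stats.probplot(hfd, dist="norm", plot=ax2)
ax1.set_title("Chow")
ax2.set_title("HFD")
plt.show()

# Shapiro-Wilk test for each diet group
print("Chow:", stats.shapiro(chow).pvalue, " HFD:", stats.shapiro(hfd).pvalue)
```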
Furthermore, of course, what you need in this particular case, in the very particular case of the t-test, is not necessarily the normality of the data itself, but the normality of the sample mean. And that adds, let's say, a layer of security, because even if your data departs slightly from the normal distribution — not too much, but slightly — the central limit theorem ensures that if you have enough points, your sample mean will converge towards the normal. So here, with 20-odd points and something close to a normal distribution, if not normal, I'm quite confident that we are inside the frame of the assumptions of the t-test. Okay, is that okay for everyone? Does it make sense? Yes. Okay. Good. You will notice that while I talk about this, I try to depart from the very simple and classical way of interpreting, where you just look at the p-value and say: p-value under this, I automatically interpret that; p-value above that, I automatically interpret this. I want to convey to you that it is not as simple as that, and that we have to care a bit about what the p-value actually means in order to have a slightly more nuanced judgment. There's a question in the chat: how can we check that the sample mean is normally distributed, in case we are not convinced by the data set? It's hard to do in practice. The way you would do it is if you know that your data comes from some distribution that is not the normal; for instance, let's say that you know, for some reason, that your data comes from an exponential distribution. So you know it's not normal, but you could still ask: is the sample mean normally distributed? In that case, where you know the theoretical distribution underneath, you could simulate the sample mean — for instance the mean of samples of size 100 drawn from an exponential distribution — and compare its distribution to the expected normal with this sort of plot there. And there you would say: okay, 5 is not enough, 50 is almost there, 500 is more than enough. But of course that presumes that you know the theoretical law underneath, and that's the hard question: in most cases you don't know, so in reality it's very hard.
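A hedged sketch of that kind of simulation: the distribution of the sample mean for exponential data at several sample sizes, to watch the central limit theorem at work. The sizes and number of repetitions are my own choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
sizes = (5, 50, 500)
fig, axes = plt.subplots(1, len(sizes), figsize=(12, 4))
for ax, n in zip(axes, sizes):
    # 10,000 sample means, each computed from n exponential draws
    means = rng.exponential(size=(10_000, n)).mean(axis=1)
    stats.probplot(means, dist="norm", plot=ax)
    ax.set_title(f"sample size n = {n}")
plt.show()
```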
And so, if you don't feel confident making this assumption of normality — and in many cases it's true that you don't want to make it — then we have to change the sort of test that we do: we have to switch to what we call non-parametric tests. Non-parametric tests, non-parametric statistics, are tests which don't make an assumption about the family of distribution of what you are measuring. They don't presume normality, binomiality, whatever. So they offer you much more liberty, much more freedom; they are much broader in application. The non-parametric equivalent of the t-test is what we call the Mann-Whitney U test. Its assumptions are that all observations from the samples are independent from one another, same as before, and that the values are ordinal, meaning they can be compared. The most common case is that they are numbers, which you can of course compare, but they don't even have to be numbers: as soon as you have two things which you can compare, which you can rank, then it's okay. The null hypothesis is not exactly the equality of the population means; it's something a bit stranger, or at least longer to write: the probability that a randomly selected value from the first sample is lower than a randomly selected value from the second sample is equal to the probability of it being greater. So I pick a random element from population one and another random element from population two, and under the null hypothesis I have a 50-50 chance that the element from population one is higher than the element from population two, and vice versa. If we are away from the null hypothesis, that means there is a skew: I have a higher probability of the element from population one being above than below. So it is a slightly different interpretation. The test statistic is computed by going through the observations of one group and counting, for each one, how many observations of the other group it is above. So for instance, this first point is above one orange point, this second point is above three orange points, the next is above four orange points, and then above five orange points. So you get 1, 3, 4, 5; you sum and you get your test statistic, 13. That's the idea. And then this value is compared against the theoretical law for this test statistic, and that is used to give us a p-value. I will not go into the details of how this theoretical distribution is obtained; that's actually quite hard. All right, so in practice, let's say you have two samples: 1.2, 1.5, 2.3, 3.0, 3.1, and then 1.1, 1.8, 2.2, 2.4, 2.8. You see that you have only five observations in each, so testing normality would be very, very hard: we know that basically no test would reject normality with so few points, and we say that we don't want to take that risk, so we prefer to go with the non-parametric option. If we do that manually, we compute U by taking the two samples and, for each value in each sample, going through every value of the other one. If a value of sample A is above a value of sample B, then the U for sample A gets plus 1; if they are equal, it's a tie, so it's plus 0.5; and if the value of sample B is above, then it's plus 1 for the other U. As you can see, I computed it first for the blue sample, but I could do the same for the orange: I would get 0, 1, 1, then 2, because this point is greater than two blue points, and then 3, and I would get a U which is different from the one of the other sample. So for two samples you actually get two U values, and in general you pick the minimum of the two. So I compute these, just by comparing every value against every other, and I get my U computed manually; and then I do the same thing with the mannwhitneyu function of stats, which just takes sample 1 and sample 2; the method will be exact, and the alternative will be two-sided.
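Here is a minimal sketch of that manual U computation alongside the SciPy call, using the two small samples from the example.

```python
import numpy as np
from scipy import stats

sample1 = np.array([1.2, 1.5, 2.3, 3.0, 3.1])
sample2 = np.array([1.1, 1.8, 2.2, 2.4, 2.8])

# Manual U: for every pair, +1 when the value of sample1 wins, +0.5 for a tie
u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in sample1 for b in sample2)
u2 = len(sample1) * len(sample2) - u1  # the U of the other sample
print("manual U:", min(u1, u2))

# Same thing with SciPy (exact method, two-sided alternative)
res = stats.mannwhitneyu(sample1, sample2, method="exact", alternative="two-sided")
print(res.statistic, res.pvalue)
```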
Here you could also say one-sided: remember, the null hypothesis is that they are equal, and the alternative is that they could be either above or under; when it's one-sided, you say it's only above, so there is only one way of departing from the null hypothesis. All right. So we do that, and the U computed manually and the U computed with SciPy are the same. There is a small trick with SciPy: if you want both U values, you have to call it with both orders of the samples, sample 1 first and then sample 2 first, because it doesn't compute both at once. Nevertheless, irrespective of the order, the p-value will be the same and will be valid. Here my p-value in particular is 0.69, which is quite high, so I would fail to reject the null hypothesis that the probability of a random point from sample 1 being greater than a random point from sample 2 is 50%. Right. Very important note: if for some reason you are using an old version of SciPy, specifically older than SciPy 1.7, then the mannwhitneyu function is not valid when the number of observations is under 20. In that case, you have to grab the value of U and manually go and look it up in a table, the old-fashioned way: you look up the number of observations in the first sample, say 10, and the number of observations in the second sample, and then for a p-value threshold of 5% the decision threshold for U would be 23. Hopefully you never have to do that; that's the very old-fashioned way of checking your p-values. If you want to check your version of SciPy, you do import scipy and then scipy.__version__, with two underscores on each side. It's quite important, so test it on your own. It's very likely that you have something above 1.7; in most cases that will be the case, but if for some reason you are on an older computer, or maybe on a cluster or something like that, it can be useful to check this little detail, because it may change things quite a bit. So that's it: this is the non-parametric alternative. A lot of tests have non-parametric alternatives, and oftentimes they are called rank tests, because here I counted the number of cases where a point was above the points of the other sample, but another way of computing this statistic is by summing the ranks of the points of one sample; that's why they carry the rank-sum name. Okay, so now it's your turn to try and play: perform the Mann-Whitney U test on the mice data set. Use the same data as before, but not with the t-test, with the Mann-Whitney instead. What is your conclusion? Same as above: if you fail to reject H0, you use a red cross, and if you reject H0, you use a green tick. All right, I'll let you work. Okay, great, I see that most of you have results, that's nice, so let's look at it together. It's actually not too hard: as you can see, you take the Chow data and the HFD data and feed them to scipy.stats.mannwhitneyu, and I say method equal exact, because I see no reason not to use the exact method, especially when I don't have too many points. When you have a lot of points, computing the p-value with the exact method can take a while, and the asymptotic approximation with a normal distribution is preferred; with a lot of points, that approximation works well. But here, in this sort of case, I see no reason not to use the exact one.
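A hedged sketch of the version check and of the exercise itself, again reusing the chow and hfd placeholder arrays from earlier.

```python
import scipy
from scipy import stats

print(scipy.__version__)  # make sure you are on >= 1.7 before trusting small samples

# Mann-Whitney U test on the mouse weights
res = stats.mannwhitneyu(chow, hfd, method="exact", alternative="two-sided")
print(res.statistic, res.pvalue)
```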
So we get our test statistic, as many of you have found, and my p-value is about 2.2 times 10 to the power minus 9, so also very small. Here too we reject the null hypothesis and say that there is a difference in location between these two groups. Now, you could say to me: well, okay, maybe we could have just used the t-test, it shows that they are different; but the t-test has all these assumptions of normality and so on that are hard to test, and you showed us the non-parametric test, which also rejects. So why not always use the Mann-Whitney U test? What do you think about that? Why would we not always use the Mann-Whitney U test, after all? Here it worked just as well. Yes? As far as I know, they have less power, or they need to be more different from each other before they get significant. So if you have slight differences, then you may miss them. Exactly, thank you, that's the perfect answer. Indeed, we say that they have less statistical power; translated into layman's terms, that means they need either more data or larger differences before they can reach small p-values, whereas the t-test has more power, meaning it is better able to detect differences. If you will, the idea is that the t-test has more information: it has the information that comes from its assumptions, and it can leverage this additional information to be more precise, whereas the Mann-Whitney U test makes no assumption, has no such information, and so is slightly less precise in its judgment. All right. That lets us tie into a bit more of a discussion about p-values, statistical power, and the types of error we can make. So first, types of error. I like this little illustration of the difference between the two types of error, what we call type one and type two. A type one error is the error we make when we reject H0 while it is true. If we take a pregnancy test, H0 is that you are not pregnant and H1 is that you are pregnant; the type one error is telling someone who is not pregnant that they are. And the type two error is failing to reject H0 while it is false: we have a person who is pregnant and we tell them that they are not. So far so good? Then we can always build this little table, which I'm sure you have already seen many times, but it's always good to revisit it. You have the reality, where H0 is either false or true, and the test, which either rejects H0 or fails to reject it. If the reality is that H0 is false and you reject it, you are correct; if it is true and you fail to reject it, you are also correct. And then you have your two types of error: when H0 is true but you reject it, and when H0 is false but you fail to reject it. The probability of making a type one error is what we typically call alpha, and this is the threshold that we use to say whether a p-value is significant or not. So that's the type of error that you control, because you are the one who chooses the threshold: it might be 5%, it might be 1%, it might be 1 in 1000, it might be 10%. This is something that you choose and control.
When you choose that, you decide what sort of bet you want to take on the type one error. But that's not the only way you could be wrong: you could also be wrong with a type two error, and that is another probability of being wrong that you have much less control over. This error is beta, and typically when we speak about it we don't use beta directly, we talk about the power of the test, which is just one minus beta. The power of the test is, for a given setup where H0 is false in reality, the probability that our test correctly rejects H0. All right, so far so good, a little bit of definition there. Then we have our p-values. A p-value is a lot of things, but there is also a lot that it is not, and I would say everyone gets some of these wrong. And when I say everyone: it has been shown that even university professors who teach statistics regularly get the interpretation of the p-value wrong, because it's easy to trick oneself into thinking we know what it is; it often means a bit less than what we think. Of course, when we are asked about it directly and take the time to think, we can get it right, but in practice, when we read papers or analyze our data, it's easy to forget and to misinterpret or over-interpret. So: the p-value is the probability of obtaining a test statistic as or more extreme than the observed one if the null hypothesis is true. So it always depends on the null hypothesis. Remember our example earlier with the binomial law and the probability of getting a certain number of heads or tails: you are always tied to a specific null hypothesis. The way we compute the p-value is not the same if we presume that the coin is fair, that the probability of getting heads is 0.5, or if we presume that it is 0.55 or 0.6 and so on; in each of these cases the p-value would be computed completely differently. It is also very important to say that the p-value is not the probability that the null hypothesis is correct, nor the probability that it is incorrect, nor the probability that we are making an error. It is maybe closer to, although not exactly, the probability of making a type one error; but there is also beta, the probability of making an error of the other kind. A large p-value in no way proves that your H0 is true: it might just mean that you don't have enough data, not enough signal in your data, to reliably detect the difference. And one last point: always report the exact p-value in your papers, please, please, please. Just saying p under 0.05 is not enough. Nowadays there is no reason not to give the actual value, because your threshold for significance might be 0.05, but for other people it might be 0.01 and so on; to let people make a fair judgment on your results, report the real values. And one little thing, to bounce off that: in the presence of a true effect, the p-values will be affected by the sample size — the larger the sample size, the smaller the p-value. But when there is no real effect, the p-value should not be affected by the sample size.
You can have larger and larger sample sizes; the p-values should stay distributed the same way. For example, if I draw samples with no difference between them, compute p-values, and then try with more and more data, I have two situations. Either there is no real difference between the two samples, and then, whether I have 3, 10 or 100 individuals per sample, the p-values are spread in exactly the same way. Or there is an actual difference between the two, and then it matters whether I have 3, 10 or 100 samples: with only 3 points, in most cases I fail to detect the difference; with 10 points, in most cases I am able to detect it; and with 100 points, I am able to detect it in almost all cases. All right? That makes sense as well. Yes, Rosario? Just about this threshold setting, it is always something that I have trouble understanding. What we use is 0.05 generally, in biology or biochemistry, but you say that everybody can choose a different one. Yes. How do we evaluate it ourselves? Okay, so to me it all relates to how much certainty you want to have in your result. If you reject a null hypothesis with a p-value threshold of 0.05, you are making a bet, and you know that you then have a 5% chance of being wrong about it: your bet is that you reject H0, and the probability that you made a mistake is 5%, if you will. So it depends: are you a betting person or not? How much do you want to bet on that? And — I think the person who created the t-test was a brewer for Guinness, and he kind of put it that way — it depends on how much you stand to gain or to lose depending on whether you are wrong or not. You should weigh that against the probability that you are making a mistake by taking that particular bet, and make your decision based on that. You can use an actual betting strategy in the end: if your p-value is, say, 0.045, you could say, okay, maybe it's worth running some validation experiment on this, but maybe I won't spend one million or one billion francs just to validate it; maybe spending a few weeks to test it would be worth it. Okay, thank you. It's just that several times we tell stories like: okay, the p-value is low enough to observe a significant difference between these two data sets, and this is based purely on the p-value number being lower than 0.05 — yes, it is significant; if you get 0.06 you say no, it's not significant, test it again. So this is a misconception, I guess. Yes, it is something that we have to be careful about. Again, if we come back to the early days of p-values, when we had a bit less drag and inertia in our practice, sometimes people would see a p-value of, I don't know, eight percent and would say: right, that's a signal that maybe there is something to look for. You don't accept it as an absolute truth, but it's a signal: okay, maybe I only had 10 samples in each group and got a p-value around five percent, maybe I will spend a bit more time to go from 10 to 20 samples and see where that gets me. And, same thing, always validate, and so on. Thank you. You're welcome, very good question.
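To make the sample-size point from a moment ago concrete, here is a small simulation sketch; the shift, spread, sample sizes and number of repetitions are arbitrary choices of mine, not the ones behind the slide.

```python
import numpy as np
from scipy import stats

# Without a true effect, the share of p-values below 0.05 stays ~5% whatever
# n is; with a true effect, it grows with n.
rng = np.random.default_rng(7)
for shift in (0.0, 1.0):            # no effect vs. a true difference
    for n in (3, 10, 100):
        pvals = np.array([
            stats.ttest_ind(rng.normal(0, 1, n), rng.normal(shift, 1, n)).pvalue
            for _ in range(2000)
        ])
        print(f"shift={shift}, n={n}: fraction p<0.05 = {(pvals < 0.05).mean():.2f}")
```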
Okay, so that is where we get to power. As you can see here, as we have a growing number of elements in our groups, so as n gets larger, for the same difference a larger proportion of the tests come out as significantly different. This proportion of tests which are detected as significantly different, and with good reason — presuming that the null hypothesis is false and that we have a specific alternative hypothesis in mind — is what we call the power of the test. The power of a test is something that depends not only on a specific null hypothesis but also on a specific alternative hypothesis. The alternative hypothesis cannot just be "the two samples are different"; no, you have to define a precise difference between the two, and a specific experimental setup with sample size and variance. Once you have fixed all of that, it's possible to compute the power of the test, and that means you can also compute the power for each n, and then find the n such that you have a bit of control on your beta, on the risk that there is a real difference but, just by chance and lack of samples, you fail to see it. That gives you a minimal sample size, if you will. So let's imagine something: say we have two groups and we presume that the actual mean difference is two — maybe some previous experiment has told us that the difference is likely around two — and we also expect the standard deviation to be three. Say I have a sample size of 20 for each group, and my significance threshold, the alpha level, the bet I want to make, is five percent. I then run a ton of simulations to see what happens: I simulate groups of this size with this difference and this standard deviation, and I check whether I actually get a significant p-value or not. And in the end, this is what you see. Here in red is the expected distribution of the test statistic under the null hypothesis, and colored in salmon and teal we have the observed test statistics with the simulated data, so under the actual alternative hypothesis. You have here the threshold at about 2.1 for the statistic under the null, and we see that about half of the simulations are then correctly rejected, while the other half, by chance, fall in this red region and are wrongly kept: there we fail to reject H0, but that is a mistake, because in this particular case H0 is false. So this fraction of tests that were rejected, and with good reason, is 54 percent, and that's the power of the test in this particular setup. So far so good?
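If you want to reproduce this kind of power estimate yourself, a minimal simulation sketch might look like this (difference of 2, standard deviation of 3, n = 20 per group, alpha = 0.05, as in the example; the exact percentage will fluctuate a little from run to run).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
diff, sd, n, alpha, n_rep = 2.0, 3.0, 20, 0.05, 10_000

# Simulate experiments where H0 is false (true mean difference = diff) and
# count how often the t-test correctly rejects at the chosen alpha: the power.
rejected = 0
for _ in range(n_rep):
    a = rng.normal(0.0, sd, n)
    b = rng.normal(diff, sd, n)
    if stats.ttest_ind(a, b, equal_var=False).pvalue < alpha:
        rejected += 1
print(rejected / n_rep)  # should land somewhere around one half
```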
Can I ask a question? Oh yeah, go ahead. So we could have a p-value of less than 0.05, but we still fail to reject? That depends on your threshold, your significance level. For instance, if your threshold for significance is one percent, then yes, a p-value of 0.03 would not be declared significant. And based on what do I set this threshold? Well, as I was saying earlier, this threshold can be seen as your probability of making a mistake when you reject H0, so it depends on how much you want to bet on your decision and how risky a bet you want to take. Okay, but then the threshold itself could also be wrong, right? Yeah, I mean, how you set this threshold is very hard, because ideally you want it as small as possible, since you want a low probability of making a type one mistake, but the lower the threshold, the lower the power. Here, with a five percent significance level, a threshold at five percent, I have 54 percent power; but if I lower my threshold to one percent, because I don't want to take too much risk, then my power drops to only 28 percent, and in most cases I am unable to detect a significant difference in this particular setup. So it's a balancing act: we always have to accept some risk as we make our bet, and it is our job to keep a critical eye on that, to say: I am making this decision with this sort of risk threshold, so how I interpret my result, what I do based on this p-value, should always be taken with a grain of salt, and the grain of salt should be as big as your significance threshold, if you will. Does that make a bit more sense? Because, for example, if I have two groups of patients, one treated and one not, based on what do I say that these two groups have a good p-value or not? So here, that will be based on what the p-value is when you compare them: if the p-value is something like 10 to the power minus 6, that is super tiny, and you know that if you declare them significantly different you are not taking much of a risk. But if the p-value is, I don't know, 0.02, so 2%, or 0.06, you are in this kind of gray area where you are now making more of a risky bet, and you may declare significance, but you have to be very nuanced about how you interpret your result afterwards; you always have to take into account the fact that you may have made a mistake when you called that significant. That's what I mean. And then it depends on the risk, so I can assess the risk, based on the visuals, and then see how I can interpret my results, more or less? Yes, and you can also conduct a power analysis, to say: okay, in that setup, given the difference I have just observed — for instance a difference of two and a standard deviation of three — how likely was it that I would have detected this difference? That can also help you interpret what sort of situation you are in, and understand the sort of bet that you are currently making. Okay, thanks. There is another question by Rosario. Yeah, sorry, go ahead. I think Rosario is frozen, or maybe it's me that is frozen. Can you hear me? Yes, I can hear you. I was just wondering about the other thing you mentioned before, that the p-value should change with the amount of data that we analyze when there is a significant difference. So basically, if we have a big data set, to understand whether the p-value makes some sense and we can bet on it, we could just subset the data set, test it, and see that the p-value gets lower with a higher amount of data — is this a way to evaluate our data set? That could be a way, but with that being said, it's better to just take the p-value computed on the full data set, because if you have a big data set and there is an actual interesting difference, then the p-value should be small anyway.
Yeah, I mean, just for our own evaluation, if we have to do different trials: if we see that the p-value stays constant whatever amount of data we extract from our data set, we can understand that maybe the test doesn't make much sense, if I understood this point. In a sense, yes, but use that with caution, because it could also be construed as p-hacking by some people. If you want to use it to gain a better understanding of how tests work, of how many samples you would need to detect significance, and to probe the power, then that might be okay. Thank you. Yes. All right, so do we have more questions on these new concepts, power, p-values and so on? Not so far. Okay, so then let's see how we can use this concept of power to actually inform our protocol. First off, here I used simulation, but for simple, very well-known tests such as the t-test, you can find in some libraries functions to compute the power directly. So from the statsmodels library I import TTestIndPower, and the way this function, or rather object, works is that you instantiate one and then you give it an effect size — the effect size is the mean difference divided by the standard deviation — the number of observations in sample one, the ratio between the number of observations in sample two and sample one (here they have the same size, so the ratio is one), and alpha, the significance threshold for the test. So it's the same sort of setup as before: a mean difference and standard deviation, a sample size, a significance level. And this automatically computes the power for me, so it should be about the same number as above: with alpha equal to 0.01, one percent, I get 28.8 percent, which is quite close to the 28.5 percent obtained empirically with simulation. So you don't have to do all the simulations, you have the function directly, and such functions exist for most well-known tests, like the chi-square test, Fisher's test, the t-test, ANOVA and so on. If you are using something which is not among these, then I would recommend doing some simulation like what I've just shown, so that you can always compute your power. All right, so that's one thing: you can compute your power for a given setup. But you can also play with it. Say I have an effect size of one; maybe I've just done a small pilot study with just 10 individuals, which showed me an effect size of one and a p-value of maybe around 0.03. So I think: okay, that was not a very big study, I want to collect a bit more data to confirm this hypothesis. I say: my effect size is one, I would like a significance threshold of 0.01, so that I have better certainty about my result; what would be the minimal sample size I should collect in order to have, let's say, an 80 percent chance of detecting my difference, so a statistical power of 80 percent? I could, for instance, for each sample size between 2 and 50, compute the power for that sample size — the same thing as before, just with the sample size in a loop — or, alternatively, there is also a solve_power function.
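As a hedged sketch of those two statsmodels calls — the direct power computation and the solve_power route described just below — with the numbers from the example; the outputs should be close to, but not necessarily identical to, the values quoted in the lecture.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power for the setup above: effect size = mean difference / SD (2 / 3),
# 20 observations per group, alpha = 0.01
power = analysis.power(effect_size=2 / 3, nobs1=20, ratio=1.0, alpha=0.01)
print(power)

# solve_power: leave out the parameter you want (here nobs1) and ask for 80% power
n_min = analysis.solve_power(effect_size=1.0, ratio=1.0, alpha=0.01, power=0.8)
print(n_min)  # smallest nobs1 reaching the requested power
```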
With solve_power, you give everything the same as before, except that you set one of the parameters to None, and the function understands that this is the one you would like to find, so it returns the smallest value of that parameter. So let's try it out: for an effect size of one, you see the power that goes with each sample size crossing the 0.8 threshold at about 25, and that's what we see here, the minimum sample size is 25.6, so say 26 individuals in each sample. That's also how you would go about it: you run a small study first to see whether it's worth pursuing a slightly bigger study, that also gives you an estimate of your effect size, and then you use a power calculation to say: okay, now I want to do a bigger study, what should the minimal sample size be to have good power, to be able to find the effect again with good confidence? Does this make sense as an overall process? Yeah? Okay. So I hope that you can now start viewing your p-values with a bit more nuance and interpreting them for what they actually mean, rather than just: under 0.05, significant; above, non-significant, end of story. Okay, so we arrive at the end of today. I know that there were a lot of concepts and that it was very intense; I ask of you one last effort with a small exercise. Consider the mice data set that we loaded before, compute the effect size for the diet on the weight, and then compute the statistical power of the corresponding t-test for that effect size. You have here just what you need to have a little look at the data again, and these vectors in case you don't have them anymore; for the rest, it's your turn to play. So I will stop the recording and stop sharing and let you work. Okay, so let's go through it. We have our data. First things first, we gather n1 and n2 as the lengths of both groups, and then the mean of both groups. That part is fairly simple, and by taking the difference between the two means we already have the mean difference. The next step is to divide this by the standard deviation, and that's more easily said than done, because if we do the same thing for the standard deviation, we don't have one standard deviation, we have two: one for the Chow data and one for the HFD data. And as you can see, they are far from equal, or at least it doesn't look like they are: 2.5 here and 4.9 there. So now we have to ask ourselves what we do with that: which one do we pick, how do we solve this? Who had a look at the standard deviations? It's a bit tricky, so there is absolutely no shame if you did not think about it; it's also the end of the day, everyone is tired, so it's normal. Maybe we could do an average of the two; maybe you also remember that there is this ddof=1, that when you estimate a variance from a sample you need the little minus one on the n in order to have a fair estimate. But still, we have this difference between the two groups, so we actually have to come back to what we saw earlier with the Welch t-test, where there was this particular way of combining the variances, which we call computing a pooled variance.
So for that, we compute both variances with this delta degrees of freedom, and then we pool them together with this formula, where we weight each variance by its n minus one, divide by the total, and take the square root of that. Admittedly it's not easy, and I would not fault you for not having thought about it; it's a little bit of a trick, and sorry for springing that on you so late in the day. So there's a question by Sabine: I very quickly checked on Google how to compute this, and it said you just need to divide by the standard deviation of one of the two groups, so why are we doing this pooled variance? I don't know exactly what you read, but I think it may be for the case where equality of variance between the two groups is presumed; in that case, indeed. But here, unfortunately, I don't think that's the case. And then there is a question by Katerina about this ddof: it's the delta degrees of freedom. If I scroll back up — sorry, I'm going to scroll a lot — to the t-test, remember that we compute the squared differences to the mean, which is how we compute a variance, and rather than dividing by n, as we normally would, we divide by n minus one. This reflects, if you will, the extra uncertainty you have when you estimate the variance of a sample, because it is an estimate, so there is another layer of uncertainty; without this slight inflation you would consistently underestimate the variance. It's a very classical theoretical result that the way to correct this effect is just this little minus one. Does this make more sense? Okay, perfect. So, once we actually take that into account, we get our effect size of minus 2.01. Maybe you got something positive: if you did HFD minus Chow you get plus something, and if you do Chow minus HFD you get minus something; the test is symmetrical and two-sided, so it doesn't matter. That's also why I asked what effect size you had found, because depending on how you decided to handle the standard deviation problem, you might end up with a different effect size. And then I give that to the power function, and for this effect size, with this number of observations as n1, the ratio n2 divided by n1, and a one percent alpha threshold, I get a power of 0.99999, so quite high. In fact, if you put the threshold at 0.05 you get 1.0; it's numerically indistinguishable from 1, so it might be 1 minus 10 to the minus 6 or 1 minus 10 to the minus 10, something like that, but in any case something very, very high. All right, so there is this little trick there; it's not something to remember by heart, just remember that sometimes there are some subtleties and it pays to spend a bit of time on Wikipedia and Google and so on to see how to handle them. Okay, so it's 4:50; if there are no burning questions — I see no raised hands for the moment... oh yeah, go ahead. Can you just explain the part from line 12 to the next cell? So this and then that? Yes: here we compute the effect size; the effect size is the difference in means divided by the standard deviation, and the standard deviation we get as the square root of the pooled variance. So far so good? Yeah, okay. Then, once we have this effect size, we can plug it into the TTestIndPower calculator from statsmodels.
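A hedged sketch of that exercise solution, reusing the chow and hfd placeholder arrays; the pooling used here is the classic Cohen's d pooled standard deviation, which is my reading of the formula described above.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

n1, n2 = len(chow), len(hfd)
v1, v2 = chow.var(ddof=1), hfd.var(ddof=1)

# Pooled standard deviation: variances weighted by (n - 1), then square root
pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))

effect_size = (chow.mean() - hfd.mean()) / pooled_sd  # sign does not matter here

power = TTestIndPower().power(
    effect_size=effect_size, nobs1=n1, ratio=n2 / n1, alpha=0.01
)
print(effect_size, power)
```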
All right, so let's use our last bit of energy to add one more thing. I'll go quite fast, but it is a very important concept and I'm sure you've heard about it: multiple testing. As with most statistical concepts, there is an xkcd strip that goes with it; if you don't know xkcd, it's a very nice webcomic about science. The idea is that someone makes a claim, that jellybeans cause acne, and the scientists test it: they test jellybeans overall and find a p-value above 0.05, not significant. But then people say, oh, maybe it's only one of the 20 colors that does it. So the scientists go and test all 20 colors of jellybean, and for all of them the p-value is above 0.05, except for the green jellybeans, where the p-value is under 0.05, and thus in the news you read that green jellybeans are linked to acne, with "only a 5% chance of coincidence", while all the other tests on all the other colors are conveniently forgotten. By now I think you understand the problem: once you have done 20 tests, each test had a 5% chance of producing a p-value below 5% just by chance, even if H0 is true. So even in the absence of any effect, sometimes just by chance you will get a low p-value, and if you repeat the test many times, the probability that at least one test comes out significant just by chance gets high. With a single test that probability is only 5%, but with 20 tests the probability that at least one is significant just by chance, in the absence of any effect, is already around 64%, and as you do more tests it grows and grows and gets very close to one. To the point that if you imagine testing a panel of 600 proteins for a significant difference between two groups, with a 5% threshold, you know that even if there is absolutely no real difference you will find about 600 times 5%, so about 30 proteins, with p-values below 5% just by chance. It can then become hard to tell the ones that are there just by chance from the ones that are actually significant. So far so good? Yes? OK. So we have to account for that, and the traditional way is to apply a correction procedure to our p-values. I will talk about two procedures, without going into full detail for now. The first controls what we call the family-wise error rate: the probability of obtaining any false positive at all, that is, any significant p-value where there is actually no effect. The other tries to control the proportion of false positives among all of your findings: say I call 50 proteins significant, and I want to guarantee that no more than 5% of those 50 are false positives. That is the false discovery rate.
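A tiny sketch of both numbers quoted above, assuming the tests are independent and H0 is true for all of them:

```python
alpha = 0.05

# probability of at least one false positive among k independent tests under H0
for k in (1, 20, 100):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>4} tests -> P(at least one 'significant') = {p_any:.2f}")

# expected number of spurious hits when screening 600 proteins with no real effect
print("expected false positives out of 600 tests:", 600 * alpha)
```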
We have several procedures for that. For controlling the FWER there is the Bonferroni method, which is very well known: if you are doing n tests, your threshold becomes the original threshold divided by n. The original threshold was 0.05 and you are doing 10 tests, so your new threshold becomes 0.005. Very simple, but also very conservative: you lose a lot of statistical power, but you are quite sure you are not making many false positives; it's always a balancing act. Then the Benjamini-Hochberg procedure controls the false discovery rate, and it's a bit more complex: the idea is that you sort all your p-values and, at any candidate threshold, compare the number of discoveries you would expect just by chance (alpha times the number of tests) with the actual number of discoveries at that threshold, and that ratio tells you where to place the cut. I will not go into the details here; just understand that it is a slightly different method that optimizes something slightly different from the FWER. Let's test this briefly in the last minutes I have. Imagine a setup where I do 10,000 tests in which there is absolutely no difference: both groups are drawn from a normal distribution with mean 100 and standard deviation 1, the exact values don't matter much. I run the tests, and then from statsmodels.stats.multitest I import the multipletests function. I give it the list of all my p-values, the alpha threshold for significance, and whether I want to apply the Bonferroni method or the FDR Benjamini-Hochberg method. It returns four elements: which tests were rejected, the corrected p-values, the corrected alpha under the Šidák method, which is not what we are using, and the corrected alpha under the Bonferroni method, which is what we are interested in. For the FDR method it's the same, except that the second element contains the FDR-adjusted p-values. Once we have that we can compare them: here are our raw p-values, about five percent of which are spuriously significant, and here are the corrected p-values, which are all pushed toward one; after correction, no test was spuriously rejected using either the FWER or the FDR procedure. And in a slightly different scenario, where 100 out of the 10,000 tests are actually different, let me check which correction this is... yes, in this scenario the raw p-values flag 583 tests as significant, the FWER correction only 65, and the FDR 109. So the FWER is the most conservative. The number of correctly significant tests out of the 100 real differences is 100 for the raw p-values, so the raw p-value detects every real difference, but it also produces a lot of spuriously significant tests, 483 of them: out of almost 600 significant tests, only about a sixth are real, and the rest are significant just by chance. With the FWER correction you only detect 65 of the 100 real differences, but you make absolutely no mistake: you are more conservative, you detect fewer things but you make fewer type one errors. And with the FDR you detect 99 of the truly different tests, but 10 of your discoveries are spurious, so roughly 10 percent of them, which is exactly what the FDR is designed to control. All right, so that's roughly how this works.
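A compact version of that simulation; the group size, seed and means are arbitrary illustration choices, and only the "no real difference" scenario is shown:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_tests, n_per_group = 10_000, 20

# scenario with no real difference: both groups ~ Normal(100, 1)
pvals = np.array([
    stats.ttest_ind(rng.normal(100, 1, n_per_group),
                    rng.normal(100, 1, n_per_group)).pvalue
    for _ in range(n_tests)
])
print("raw 'significant':", (pvals < 0.05).sum())  # expect roughly 500 by chance

# Bonferroni controls the family-wise error rate
reject_fwer, p_bonf, alpha_sidak, alpha_bonf = multipletests(
    pvals, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the false discovery rate
reject_fdr, p_bh, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("after Bonferroni:", reject_fwer.sum())  # expect 0
print("after FDR (BH): ", reject_fdr.sum())    # expect 0
```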
So, yesterday we first spent some time just learning how to play around with matplotlib and seaborn, in order to load, manipulate and represent some data. We won't come back to that too much, but I think it is part of the core skills you need to properly conduct data analysis: you can be quite skilled at pure statistics, but if you don't know how to look at your data, how to read it, subset it and filter it, you will not get very far. Once we had that fairly well in place, we started to play with statistical concepts. We discussed a few more theoretical aspects, in particular distributions. We built up what a probability density function is: a function that describes the probability of each event across the set of possible results. It is a tool that lets us describe a certain flavor of randomness, if you want: the pattern of the randomness of a given random variable. We played a little with that and saw that a given family of distributions, for instance the normal distribution, comes in different flavors depending on its parameters; for the normal distribution these are just the mean and the variance, or mean and standard deviation, also called the location and the scale. Then we saw why we should always care a bit about the normal distribution, why it is so pervasive, and that is because of one neat little property called the central limit theorem. Let me go back to what I think is the most convincing part. Even though the data you sample may not follow a normal distribution, the central limit theorem states that the mean of your sample will be distributed according to something very close to a normal distribution, provided the sample size is large enough. "Large enough" is admittedly a fuzzy notion, because how large is large enough depends on the underlying distribution, but you know that as the sample size grows you get closer to this normal distribution. I tried to demonstrate that with a simulation: I drew thousands of samples from an exponential distribution, so something quite different from a normal distribution; for each sample you compute the average, and you do that many times to see how the sample mean is distributed. We tried that for samples of size 5, 50 and 500, and compared the result with what the central limit theorem predicts the distribution of the sample mean should be, if the "large enough" condition holds. And this is what we see: with a sample size of 5 we are not too far off, but not yet at the point where the sample mean follows the theoretical normal law; when the sample size grows to 50, or better still 500, we become nearly indistinguishable from what the theoretical normal law predicts. This is a fairly important property. In particular, the mean of a sample will follow a normal distribution whose mean is the mean of the population.
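A sketch of that simulation, assuming an exponential distribution with scale 1 (whose mean and standard deviation are both 1), so the CLT prediction for the sample mean is Normal(1, 1/sqrt(n)):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
n_repeats = 10_000

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, n in zip(axes, (5, 50, 500)):
    # means of many samples of size n drawn from an Exponential(scale=1)
    sample_means = rng.exponential(scale=1.0, size=(n_repeats, n)).mean(axis=1)
    ax.hist(sample_means, bins=50, density=True, alpha=0.6)

    # CLT prediction for the distribution of the sample mean
    x = np.linspace(sample_means.min(), sample_means.max(), 200)
    ax.plot(x, stats.norm.pdf(x, loc=1.0, scale=1.0 / np.sqrt(n)))
    ax.set_title(f"sample size n = {n}")
plt.show()
```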
So that's nice: it means we can expect our sample mean not to fall too far from the truth, the population mean. And its standard deviation depends on the standard deviation of the population divided by the square root of the number of individuals in our sample, so as we gather more individuals in our study we get a more precise estimate of the population mean. This is all good and well, but why do we actually care? It matters because when you sample a limited number of elements from nature for your experiment, you don't have the luxury of repeating it endlessly: you cannot run 10,000 experiments with samples of size 50, that is simply not feasible in most research. But if you believe you are in the conditions where the central limit theorem applies, then the single sample you have collected already gives you expectations about the properties of its mean and how they relate to the properties of the whole population. That gives you a framework to do inference from the properties of a sample onto the properties of the population, and to do it in a quantitative way, where you can state the probability that your sample mean lies within such and such a distance of the population mean. That is typically what we summarize with a 95% confidence interval. This is also what we saw yesterday: once you can use the central limit theorem to make predictions about the mean of your samples, you can predict a lot of things. The 95% confidence interval in particular lets you build an interval around your sample mean such that there is a 95% chance the population mean falls within the 95% confidence interval of the sample mean; that is essentially the defining property of this confidence interval.
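To make the confidence interval part concrete, here is a minimal sketch using the t-distribution; the sample values are purely hypothetical numbers for illustration:

```python
import numpy as np
from scipy import stats

# hypothetical sample (e.g. one group of mouse weights)
sample = np.array([27.1, 31.4, 29.8, 26.5, 30.2, 28.9, 27.7, 32.0])
n = len(sample)

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean: s / sqrt(n)

# 95% confidence interval based on a t-distribution with n - 1 degrees of freedom
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```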
Being able to make that kind of prediction means you can set up a framework in which you rationally estimate the probability of making certain types of mistakes, and that is how we introduced the statistical hypothesis testing framework. We define a null hypothesis and an alternative hypothesis, both statements about the real world, and from there a test statistic, basically a metric such as the mean of a sample. The null hypothesis, coupled with statistical properties such as the central limit theorem, lets us predict the expected behavior of our test statistic under the null hypothesis. Once we know what to expect under the null hypothesis, we gather our data and check whether it looks plausible under that hypothesis or whether it deviates from our expectation; if it deviates enough, we can say that such data was fairly unlikely to be obtained under the null hypothesis and, reversing that reasoning, that the null hypothesis is itself relatively unlikely, so we tend to reject it. The p-value, which is the probability of observing a test statistic at least as extreme as the one we just observed with our empirical data, assuming the null hypothesis, is the probability we use to gauge how much of a bet we are making when rejecting the null hypothesis; it is related to the probability that the null hypothesis would be true given that we rejected it. All right, so far so good. From there we saw a number of examples; I will focus mostly on the t-test, because it is arguably one of the best-known tests out there and it is worth spending a bit more time on it. The t-test is a test where you have two samples, coming for instance from different populations or different groups, and you want to ask whether the means of the populations they were drawn from are different. You also don't know the variance of each sample, or rather you want to allow the variances of the two samples to differ. What happens then is that we know from the literature that the scaled difference between the means of these two samples should follow a particular distribution called the t-distribution, which looks like a normal distribution except that it has what we call heavier tails: the tails carry a bit more density, so improbable events are a bit more likely. This reflects the extra layer of uncertainty added by the fact that we do not know the variances of the samples, compared to the case where we would know them. From there we build our testing setup. Our null hypothesis is that the means of the two populations, mu1 and mu2, are the same; the alternative is that they are different. Since my alternative hypothesis is just that they are different, this is a two-sided test; we always have the option of saying instead that H1 is that mu1 is above mu2, in which case the test would be one-sided and we would only look at one direction of deviation. Then you have to consider the assumptions: this test presumes that the means of the samples are normally distributed and that the samples are independent from one another. If, and only if, these assumptions hold, the difference of the means of the two samples will follow a t-distribution; if they do not hold, it will not. That matters because the t-distribution is exactly what we use to compute the p-value: the p-value is the amount of density that lies beyond a given threshold on this density function, so if this density function is not the right one because the assumptions are not respected, the p-value is meaningless. Our test statistic, and you will see that this is often how testing works, is the quantity we are interested in, here the difference between the means of the two samples, almost always normalized by something that makes it comparable to a standardized version of the distribution of interest. Here the distribution of interest is the t-distribution with a scale of one.
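A tiny numerical illustration of the "heavier tails" point; the choice of 2 standard units and 5 degrees of freedom is arbitrary:

```python
from scipy import stats

# tail probability beyond 2 "standard units": heavier for the t-distribution
print(stats.norm.sf(2))    # about 0.023 for the normal
print(stats.t.sf(2, 5))    # about 0.051 for a t with 5 degrees of freedom
```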
Of course, as you know, that scale can vary; it is one of the parameters of the distribution. So to bring ourselves back to the simple case where the scale is one, we normalize our test statistic by something that depends on the variance of the samples and on their size. It is a bit like the central limit theorem telling us that the standard deviation of the sample mean is the standard deviation of the population divided by the square root of the sample size. That's the main idea; for this test it is a bit more involved because you have two samples which may have different sizes and different standard deviations, so you have to account for that, which makes the formulas heavier, but behind them it is still just a normalization by something that behaves like a standard deviation divided by a sample size. Here we have our two terms, which are basically the variances of each sample, normalized with an n minus one term; as before, without the n minus one we would tend to underestimate the variance, and we have to correct for the extra uncertainty that comes from estimating both the mean and the variance at the same time, which is exactly what this little n minus one instead of n does. Last but not least, the t-test has another parameter, the degrees of freedom, which governs how heavy the tails are: if the degrees of freedom are small, the tails are heavier, there is more spread and more uncertainty; if the degrees of freedom are large, the tails are lighter and we get closer and closer to a normal distribution. The formula for the degrees of freedom is a bit complex, as you can see, but if you spend a little time on it you will see that it again depends on the variances and the sizes of the samples, and in the end it is usually relatively close to the sum of the sample sizes, with a correction for the difference in variances. So we performed a t-test together on this mice weight data and obtained a test statistic and a number of degrees of freedom: you can see the degrees of freedom are about 44, and if you remember, the two mice groups have respectively 21 and 29 elements, so 50 in total. The degrees of freedom are related to that; not exactly 50 because, as I said, there is an adjustment based on the respective variances of the two groups, but it is in the right ballpark, in the order of magnitude of the total number of samples. And this test statistic, for a t-distribution with 44.25 degrees of freedom, has a two-sided p-value of 9.24 times 10 to the power minus 10. So it was very, very unlikely that we would have observed such a difference in means if the two groups had been drawn from populations with the same mean, and therefore we reject H0, the hypothesis that the population means are the same.
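A sketch of that test with scipy, again assuming the two groups are in `chow` and `hfd`; the Welch-Satterthwaite degrees of freedom are recomputed by hand to show where the "about 44" comes from:

```python
import numpy as np
from scipy import stats

# Welch's t-test (does not assume equal variances)
res = stats.ttest_ind(chow, hfd, equal_var=False)
print(res.statistic, res.pvalue)

# Welch-Satterthwaite degrees of freedom, computed explicitly
v1, v2 = chow.var(ddof=1), hfd.var(ddof=1)
n1, n2 = len(chow), len(hfd)
se1, se2 = v1 / n1, v2 / n2
df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
print(df)  # should land in the ballpark of n1 + n2 (here around 44)
```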
And what we are really saying when we say that is: provided the assumptions of the test were true, I am making a bet that the population means are actually different and not the same, and the risk associated with that bet is this p-value of about 9.2 times 10 to the minus 10, so on the order of 10 to the minus 9. I can therefore say I am making a bet with very little risk: the probability that I am wrong when making that call is quite small. On top of that, this only tells you about statistical significance; you also want an evaluation of the biological significance. So you can add that the observed difference in means between the two samples is minus 8.43 grams, relate that to the actual weight of the mice, and say, OK, these eight grams are roughly 20 to 25 percent of the weight of a mouse. There is a question by Rosario: how can we use and evaluate the test statistic that comes out of a t-test? The test statistic is the value that gets compared to the t-distribution: to compute the p-value for a test statistic of, say, minus 7, I go to that distribution, take all the density below minus 7, likewise all the density above plus 7, and the sum of the two is my p-value. That's essentially what happens when we run this sort of test, and as you can see, by the time we are at minus 7, out in the tail, that density is already very, very small.
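To make the answer to that question concrete, here is how the tail density translates into a two-sided p-value; the test statistic and degrees of freedom below are just illustrative values in the spirit of the example:

```python
from scipy import stats

t_stat, df = -7.0, 44.25  # illustrative values, not the actual mice result

# two-sided p-value: density below -|t| plus density above +|t|
p_two_sided = stats.t.cdf(-abs(t_stat), df) + stats.t.sf(abs(t_stat), df)
# equivalently: 2 * stats.t.sf(abs(t_stat), df)
print(p_two_sided)
```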
So we saw all of that together, and then I hammered home the point that all of this presumes, every time, that the assumptions of the test are true. We therefore also have to learn, first, how to check whether these assumptions are true, or true enough, and second, what to do when they are not true at all, or when we don't want to bet that they are. That's when we talked about testing for normality. We discussed the fact that there is no perfect normality test as of yet; maybe in a few years there will be, who knows. To work around that, I advocate a dual approach. First you make a plot: you use your nice human eye and human brain, which catch many patterns easily, and you draw a QQ plot, the theoretical quantiles versus the observed quantiles. If you use the normal distribution for the theoretical quantiles and the data follows something close to a normal distribution, the points should lie close to the diagonal line, and in particular you should not see any systematic pattern in the deviations from the diagonal. This is what a normal sample with 100 points looks like, and this is what an exponential sample with 100 points looks like: most of the time it follows the diagonal, and at the extremities you are talking about fairly unlikely events, so a little deviation there is expected, but here you can clearly see a pattern of deviation around the diagonal line, and that is what should alert you to the fact that you are dealing with data which is not normal. So that is one prong; our second angle of attack is an actual test. The catch is that normality tests always take normality as their null hypothesis, and that is a problem for us, because the way we conduct testing does not control well the error we make when we are in the "accepting H0" scenario, which is exactly the scenario we want to be in when the null hypothesis is normality. That means we can use such a test to reject normality, but it does not allow us to comfortably accept normality: we never really accept H0, we merely fail to reject it. We are in a slightly uncomfortable situation, which is why we also want the visual approach. In practice you use, for instance, the Shapiro-Wilk test, which works very well up to about 5,000 points, so that should cover most cases. It will either give you a fairly high p-value, when your data is close enough to normality that no deviation is detected, which might be because your data really is normal or because you don't have many points and your data is not far enough from normal for the difference to show up, and that is the grain of salt you have to accept; or it gives you a very small p-value, which is in a sense the best case for you, because then you can fairly comfortably reject H0, and that is using the test for its actual purpose, rejecting the null hypothesis. But then it means you cannot comfortably use the t-test, because one of its assumptions is breached, so you have to turn to something else. All right, is everything still making sense? "I have one question: if I have many, many groups or samples, as yesterday, for example genes or whatever, how many of my samples would I need to test for normality, or would the correct way be to really test all of them?" All of them; you want to test all of them, separately. That happened to me not so long ago, and the thing is that you then have to deal with the result: what do you do when maybe 50 percent of them look like they could be normal and the other 50 percent look like they are not? That is, I would say, the typical case, and in that sort of situation I personally err on the side of caution and use non-parametric tests for everything. "So, to come back to my initial question: I test all of them for normality?" Yes, and you test them separately. That's something I did not discuss much, but the groups should be tested for normality separately, and it's fairly easy to see why: imagine two groups which are both normal but very different. If you take the whole dataset together, the density plot looks something like this, group one here and group two there; each group taken separately is normal, but if you pool them together you get this sort of bimodal distribution, which will not be detected as normal by any of the tests. That is why you should always test them separately. So at this point we are able to reflect on whether or not our data is normal.
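A sketch of that dual approach on a toy sample; the exponential data is deliberately non-normal, and the variable name `sample` is just for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.exponential(scale=1.0, size=100)  # clearly non-normal toy data

# visual check: QQ plot against a fitted normal distribution
sm.qqplot(sample, line="45", fit=True)
plt.show()

# formal check: Shapiro-Wilk, H0 = "the data is normal"
stat, pvalue = stats.shapiro(sample)
print(pvalue)  # a small p-value lets us reject normality; a large one proves nothing
```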
Ideally we would want it to be, if not normal, at least close enough to normal. We do have to be careful: some tests require strict normality, for instance we'll see later today that ANOVA requires the data to be normal, while other tests are more relaxed, in the sense that they don't require normality of the data itself but only normality of the mean of the data. And remember, the central limit theorem tells us that with enough data points, even if the original data is not normal, its mean is normally distributed. So sometimes, even when you see a deviation like here, remember this is an exponential distribution and it is clearly not normal, there are 100 points in this sample, and if you recall our little experiment from earlier, already with only 50 points in an exponential sample the mean of the sample was nearly normally distributed. So this would be a case where I can detect non-normality of the data but still argue that there are enough samples that, despite the data itself not being normal, its mean is normally distributed, the assumption of the t-test is not breached, and I can still use a t-test. That is something we have to keep in mind, along with checking whether a given test requires strict normality of the data itself or only normality of the mean. And of course there is a question, by Ying: "If we have enough data to use the central limit theorem, then why bother checking normality, since the integrity of the t-test is not compromised?" That's a very fair question. As I said, it depends on what sort of test you want to apply, since some tests do require strict normality. It also helps you determine how much data is "enough": the CLT says that with a large enough number of samples you are fine, but "large enough" is not easy to pin down, so to me it is always worth looking at how much you deviate from normality, because that helps you judge whether enough is enough. With 100 points it is likely to be fine, but you can always check; and in this particular case I can see a very clear pattern and maybe even identify it as something known, for instance an exponential distribution, in which case using a test specific to that distribution, or running a small simulation to check whether 100 points is enough, would be perfectly feasible. So it is not time lost. "OK, thank you." You're welcome. And most of the time you will not have 100 samples but maybe 20 or 30, and there you are in a gray area where, if the data is too far from normal, like an exponential, the CLT might not apply yet; that is also why you run the test.
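The small simulation mentioned in that answer could look something like this: both groups come from the same exponential distribution, so H0 is true, and we check whether the t-test still rejects in roughly 5% of runs at this sample size (the sizes and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_sim = 100, 5_000

# both groups drawn from the SAME exponential distribution, so H0 is true;
# if n is "large enough", the t-test should reject in about 5% of simulations
pvals = np.array([
    stats.ttest_ind(rng.exponential(1.0, n), rng.exponential(1.0, n)).pvalue
    for _ in range(n_sim)
])
print("observed type I error rate:", (pvals < 0.05).mean())  # should be close to 0.05
```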
So, if you are not confident about the assumptions of your test, be it normality, equality of variances for some tests, and so on, you have to use another test, and typically we fall into the family of what we call non-parametric tests. These are great because they don't make any assumptions about the distribution of your data, but they come with a loss of statistical power, which means they have a bit more difficulty detecting fine, small differences between two datasets. We saw one example of a non-parametric test, the alternative to the t-test, which is the Mann-Whitney U test. We won't go into the detail of how it works; just remember that its test statistic is linked to the difference in ranks between your two samples, and that it gives you a p-value even when you have very little data and cannot presume anything about normality. That is a really great property of these tests, and I personally like them a lot.
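For reference, a one-line sketch of that test with scipy, reusing the hypothetical `chow` and `hfd` arrays from before:

```python
from scipy import stats

# non-parametric alternative to the t-test: Mann-Whitney U
u_stat, p_value = stats.mannwhitneyu(chow, hfd, alternative="two-sided")
print(u_stat, p_value)
```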
That then brought us to the whole question of what p-values are, what statistical power is, and the different kinds of error we can make when testing. Remember, the type one error is the error of rejecting H0 while it is true, and that is the probability described by your p-value: the p-value gives you the probability that you are making a type one error when you reject H0, in other words the probability that you are making a bad bet. The second kind, which is a bit more insidious, is the type two error, the converse: the probability of accepting H0 while it is false. There may be an actual difference, but you are unable to detect it, so you fail to reject H0 and commit a type two error. The type one error we control very finely, because it is our alpha, our threshold for significance, the five percent or one percent or whatever you decide it should be. Beta is more complex, because it depends on the existence of an alternative scenario: it can only be defined for a precise version of the alternative hypothesis, finer than just "the means differ"; you have to specify a difference of means of this much, for a standard deviation of that much, and so on. There are nevertheless ways to compute it, but we have to fix the sort of difference we actually expect to exist, which means you need prior knowledge about your data, either from the literature or from a small pilot study. And we saw that you can use power calculators, for instance the ones from statsmodels, which exist for the t-test, for chi-square, for most of the tests we'll see today. You give them what we call an effect size, the difference in means divided by the standard deviation of the samples, and for each sample size you can compute the power, that is, the ability to detect that actual difference at a given alpha threshold, typically 0.01. You can either draw a nice line like the one I showed, or have the calculator compute the minimal sample size directly; or, alternatively, you can fix the sample size and ask what the minimal effect size would be for a given level of power. That tells you, for example, how big an effect needs to be for you to have an 80 percent chance of detecting it with a sample size of 30. It also gives you an idea when you already have a study: you already have your data, you know you have 30 individuals in each group, and you can see how big an effect you would be able to detect. Will you only catch the biggest, most important effects, or will you also be able to detect finer differences between the groups? That's the idea. From there we played a little and came back to this mice dataset, and we saw together that computing the power of this particular t-test is complicated a bit by the fact that the two samples have differing standard deviations; to feed a single standard deviation into the power calculation we had to use the pooled variance formula, which estimates the overall within-group variance, basically the variance of both groups weighted by the size of each group, to put it simply. Once we have that, we can compute the power, which turns out to be quite high here. So far so good, everything still making sense? Yes, there is Rosario. "Hi, maybe a very basic question: I didn't quite get the type two error. If I understand, it is when we wrongly accept the null hypothesis, but can this only arise in certain statistical tests, like normality tests, where we need to accept the null hypothesis? In normality tests we like a big p-value in order to conclude normality, so there we can have a type two error, but in the other tests we need to reject the null hypothesis, so how can we have a type two error there?" OK, so there are two things to disconnect. The first is what we would like to see: for most tests we would like a small p-value, so that we reject H0, and for a normality test we would perhaps like a big p-value, so that we fail to reject H0. We have to disconnect that from what actually happens: in the end you always get a p-value, that p-value is above or below your threshold, and depending on that you reject or fail to reject H0. And because you always have both possibilities, any time you fail to reject H0, even in a t-test, you run the risk of committing a type two error. Does that make more sense? "Yes, thank you." We then finished the day with the concept of multiple hypothesis testing, which is important, maybe not to know by heart, but to be aware of, because very often we don't just compare two groups: we test many different groups, or we check many genes for a significant difference between two groups, that sort of thing. And remember, we are in a framework where every decision carries a probability of error: any decision you make from a test is not the truth, it is a bet. If you reject H0 and call a difference, the probability of being wrong is your p-value; and if you fail to reject, the probability of being wrong is related to the type two error, so to one minus the power of the test.
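A small sketch of why beta only makes sense for a precise alternative: here we posit a specific shift of 0.5 standard deviations, simulate experiments where that alternative is true, and count how often the t-test fails to reject (the group size, shift and seed are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_sim = 20, 5_000
true_shift = 0.5  # the precise alternative we posit, in units of standard deviations

# simulate experiments where H1 is true and count how often we FAIL to reject H0
pvals = np.array([
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(true_shift, 1, n)).pvalue
    for _ in range(n_sim)
])
beta = (pvals >= 0.05).mean()
print(f"estimated type II error rate: {beta:.2f}, power: {1 - beta:.2f}")
```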
So you are making a series of bets, and it stands to reason that even if you control the probability of error on each individual bet, when you make a lot of bets you will, statistically, lose some fraction of them. We have to think about what that means and what the implications could be, because losing a fraction of your bets might not be a problem in itself, but we also know that positive results tend to be over-represented and negative results tend to be hidden, and that combination can distort the confidence we can place in results if we don't account for it. So we have frameworks that account for the fact that you are testing many hypotheses at the same time, and there are two main ones. The FWER, the family-wise error rate, is handled very simply: if your significance threshold is 0.05 and you do n tests, you only consider p-values that are under 0.05 divided by n; if you do 10 tests, your new threshold becomes 0.005. Very simple, and what we call very conservative: under this framework you make very few type one errors, so if you call something significant there is a very high probability it is real, but conversely you lose statistical power and run a much higher risk of committing type two errors, real differences you fail to detect. The alternative is the FDR, the false discovery rate, which aims for a middle ground. The way it works, simply put, is that you sort all the p-values you obtained and define a set of significant tests such that no more than, say, five percent of the tests you declare significant are expected to be false positives. You are saying: I know I will make some false positives, that's a given, but I will make sure they are no more than a small fraction, for instance five percent, so that I can trust 95 percent of the tests I declare significant. This has been shown to control your type one error, which is still present but at least controlled, while conserving a good amount of the power of the tests. So it is a middle ground between using the raw p-values, which give you good power but very poor type one error control, and the FWER, which gives you poor power but very good type one error control. The procedure is that you sort your p-values and, at each candidate threshold, you estimate how many significant results you would expect just by chance, the threshold times the number of tests, and divide that by the actual number of discoveries at that threshold; that ratio estimates the fraction of false discoveries you would make if you chose that threshold, and you pick the largest threshold where this fraction stays at or below 0.05, or whatever your target FDR is.
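As a sketch, here is that Benjamini-Hochberg cutoff written out in its standard textbook form; in practice the multipletests function shown earlier does this for you, and the toy p-values below are purely illustrative:

```python
import numpy as np

def benjamini_hochberg_cutoff(pvals, q=0.05):
    """Return the p-value cutoff of the Benjamini-Hochberg procedure at FDR level q."""
    p = np.sort(np.asarray(pvals))
    m = len(p)
    # keep the largest k such that p_(k) <= (k / m) * q
    passes = p <= (np.arange(1, m + 1) / m) * q
    if not passes.any():
        return 0.0  # nothing is declared significant
    return p[np.max(np.nonzero(passes))]

# toy usage: reject every test whose p-value is at or below the cutoff
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.22, 0.49, 0.74])
cut = benjamini_hochberg_cutoff(pvals, q=0.05)
print(cut, (pvals <= cut).sum())
```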
Someone then had a very specific question about this. Oh, go ahead. "So, for example, if I have a cell and I look at the proteins it expresses, and then I have another cell that I treat with a drug, and I want to see which proteins are, say, over- or under-expressed, I compare them; do I then have to do this correction, and is the number of tests the number of proteins I look at?" Exactly: in that setup you would do one test per protein, so yes, that is exactly the use case. And then I tried to demonstrate all of this; I won't go through it again unless you want me to, but you can come back to it later if you are interested, to get acquainted with how these correction procedures work and how they offer you different balance points, different strategies, when it comes to controlling one or the other type of error. So that's the recap. It took a bit of time, but we are pretty much on schedule and we had the time for it, and I think it was important, given how much information there was yesterday, to go through it again so that we can start the rest of the course on a good basis.