So we have seen, I would say, most, if not all, of the theoretical background that we need for the rest of the course. There will still be a little bit of it this afternoon when we talk about regression. But for now, we have put everything in place such that we can enter a sort of catalog mode, where I stop introducing new statistical concepts and I just show you new tests, one after the other. Because now you know what the p-value is, what power is, what the assumptions are and why they are important. So I don't need to explain new concepts to you, just what happens in this scenario and what happens in that other scenario, which test you should use, and so on. So it should be much simpler. All right, so notebook three. As usual, I import some stuff and make my figures bigger. And we will start with two tests which are quite widely used, and I put them together because one is the approximation of the other. The first is Fisher's exact test, which is, as its name says, exact. And the second is the chi-square test. I'm fairly certain that you've heard about one or the other, or both. So before, we were playing with a continuous variable and a categorical variable: when we do the t-test, we compare for instance the weight, that's our continuous variable, between two groups, that's our categorical variable. Now the Fisher's and chi-square tests are for the case where you have two categorical variables: you have two groupings. Fisher's exact test, as the name says, is exact, but it can sometimes be too complex to compute. So we sometimes need the chi-square test, which is only an approximation. So it's never perfect, and in some cases it should never be used, because, as it's an approximation, under some conditions the approximation is very bad. But it's easy to compute, and so sometimes we have to use it; we don't have any other choice. All right, so in both tests the null hypothesis is that there is no particular association between our two categorical variables, and the alternative hypothesis is that there is an association between the two variables: being of, let's say, category A in one variable influences the proportion, the probability, that you are of category one or two in the second variable. All right, so let's see our little experimental setup. That's actually, or at least that's how the legend goes, how the Fisher's exact test was invented: this is the lady tasting tea experiment. The story goes that at the beginning of the 20th century, the researcher Fisher was drinking tea with some colleagues, and one of his female colleagues, the lady, declined having milk added to her tea, because she said that she preferred milk first and then tea. Fisher was skeptical as to whether she could actually tell the difference. And so he devised the Fisher's exact test to be able to test this, where you would give the lady four cups where the milk was poured before the tea and four cups where the tea was poured before the milk, and the lady had to say for each cup whether the milk was poured before or after. And then, using this table, he made a test to compute the p-value for whether the lady was actually able to detect the difference between milk before and milk after. This is not the historical result; I think the historical result was a perfect success for the lady, so she actually proved that she was able to do it.
But it's a bit nicer to have this example there from the standpoint of the computation, of how we actually do this test. So, with this little bit of history: the idea behind the Fisher's exact test is to just take this table and then, as usual in this p-value-computing framework, we want to count how many tables are as extreme or more extreme than this one. And so we have to ask ourselves the question: what does "more extreme" mean for a table? Here, you can see that we are in a case where you have fixed margins: we have exactly four cups with milk poured before and four with milk poured after, and the lady answered "before" for four cups and "after" for four cups. And so we can say, okay, here we have threes and ones. A more extreme case would be one with fours and zeros; that's more extreme, right? And less extreme would be when we only have twos, where it's more spread out, more equal between the columns, less biased toward the categories. So that's being more extreme on one side, but we could also be more extreme on the other side, where instead of threes here and ones there, we have the fours and zeros the other way around. All right, so far so good? Yes? Okay, so once we have defined what a more extreme or less extreme table is, there is an additional little trick, and that's what actually makes the test hard to compute: you have to consider the combinatorics that underlie this sort of thing. Because then you can say, okay, if the lady answers completely randomly, she could have picked one "milk after" cup among the four that actually had milk before, and she could have done that in four different ways: there are four different ways of choosing one cup among four. So there is an additional layer of computing the combinatorics, the number of possible tables that are as or more extreme. And then you compute the number of possible tables for each category and divide that by the total number of possible tables, which again is computed by taking into account all the combinatorics of putting counts into such a table. And so if you take the ratio of all the tables which are as or more extreme, divided by the total number of tables, you get your p-value in that particular case. If we just focus on one of the numbers here, since it dictates what the others will look like: there are 16 tables where you have a three in the number of correctly detected cups with milk before, 16 tables where there is only a one, based on the combinatorics, then one table only where there is a four and one table only where there is a zero. So that's 16 plus 16 plus 1 plus 1, which you divide by the total number of tables, that is 16 plus 16 plus 1 plus 1 plus the number of ways that you could have just twos, and you get your p-value. Yes? I'm sorry, I don't have your name. So, can this test be used for large data, like if we have 100 or more observations, but still a 2-by-2 table, right? So there are two things to consider. It used to be, when we had to compute this, let's say, manually, that we could not go much beyond a dozen experiments. Nowadays, with modern computers, I would say that you can use the Fisher's exact test even if you have maybe thousands of observations, tens of thousands; you might hear your computer complain a bit, but it should still be doable.
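To make that counting concrete, here is a minimal sketch; the two-sided convention of "as or more extreme means at least as unlikely under random guessing" is an assumption on my part, but it reproduces the counts just described.

```python
from math import comb

# All tables with fixed margins (4 cups poured each way, 4 answers each way).
# k = number of "milk first" cups that the lady labels correctly.
total_tables = comb(8, 4)                                    # 70
tables_per_k = {k: comb(4, k) * comb(4, 4 - k) for k in range(5)}
# -> {0: 1, 1: 16, 2: 36, 3: 16, 4: 1}

observed_k = 3                                               # 3 out of 4 correct
# Two-sided p-value: fraction of tables at least as extreme (at least as rare
# under random guessing) as the observed one.
p_value = sum(n for n in tables_per_k.values()
              if n <= tables_per_k[observed_k]) / total_tables
print(p_value)                                               # (16+16+1+1)/70 ≈ 0.486
```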
Where it becomes harder, much, much harder, is if you have more than two categories in either of the categorical variables. Because if you have more than two categories in either of these variables, then the combinatorics explodes, most computers would not really be able to cope with it, and so it's not so common that this has actually been implemented in libraries, because it's a hard problem. Okay, so in this case, for example, we can just use this test for treated versus untreated patients, right? Yeah, exactly, when you have two categories. Okay, thank you. You're welcome. All right, so in practice you don't compute that yourself; what you do is you just create a table, you see that it's a list of lists here, a 2D table, and then you call stats.fisher_exact on your table, and it gives you two values, the odds ratio and the p-value. The p-value, of course, is what you want to look at: as usual, it gives you the probability that you would make a type I error if you were to reject. And then the odds ratio here is the ratio of the row-wise odds. So in our practical case it's 9.0: it's 3 divided by 1, divided by 1 divided by 3. That, if you will, gives you a measure of effect size, in the sense that it tells you how strong the bias, the association between the different categories, is. All right. So far so good? Yes, okay. So now let's also talk about the chi-square. The chi-square is an approximation of the Fisher's exact test, and it doesn't work when the number of observations is low. The rule of thumb is that the number of observations in each cell should exceed 5. So, for example, typically here, that's not the case. And actually, maybe it's written very small here: it's not exactly the number of observations, it's the number of expected observations. The chi-square works by computing the expected table if there were no association between the columns and rows, and then it compares the observation with the expectation in each cell. So what you do is you take observed minus expected, squared, divided by expected, and you sum all of that over the cells. And at the end, you compare that with a chi-square distribution, hence the name of the test, which is a distribution with one parameter, the degrees of freedom, computed here as the number of columns minus one times the number of rows minus one. So in this particular case, and I repeat, here you should not apply the chi-square test, it's not applicable because the number of observations is too low, but for demonstration: the expected numbers here are two everywhere. That makes sense: if there is absolutely no association, then you expect all of these to be equal. And our actual observations are three, three, one, one. So we have our test statistic, the sum of observed minus expected squared divided by expected: (3 minus 2) squared divided by 2, (1 minus 2) squared divided by 2, and so on and so forth; the total is equal to two. And the number of degrees of freedom, number of columns minus one times number of rows minus one: two columns, minus one, that's one; two rows, minus one, that's one; so one times one, one. And so that's how this would be computed: you would compare this test statistic with the chi-square distribution with one degree of freedom.
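As a minimal sketch of both computations on the lady-tasting-tea table (the orientation of the 2-by-2 table is an assumption, but either orientation gives the same result):

```python
import numpy as np
from scipy import stats

table = [[3, 1],
         [1, 3]]            # rows: actual pouring order, columns: lady's answer

# Exact test: odds ratio (3/1) / (1/3) = 9.0, plus the exact p-value
odds_ratio, p_exact = stats.fisher_exact(table)
print(odds_ratio, p_exact)                        # 9.0, ~0.486

# Chi-square statistic by hand, purely for illustration: the expected counts
# are all 2 here, well below 5, so this approximation should not be trusted.
observed = np.array(table)
expected = np.full((2, 2), 2.0)
chi2_stat = ((observed - expected) ** 2 / expected).sum()    # = 2.0
dof = (2 - 1) * (2 - 1)                                      # = 1
print(chi2_stat, stats.chi2.sf(chi2_stat, df=dof))           # 2.0, ~0.157
```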
In practice, again, you just want to call the chi2_contingency test from stats. You give it the table, same as for the Fisher test, and there is this correction=False argument here. This refers to the Yates correction, which is a correction specifically for when the counts are low. Here the counts are so low that the test is useless anyway, but you can play around with it a bit to see how it affects the result. You can see here that our chi-square statistic is two, as I said, and this is the computed p-value. What you can see is that here the p-value is 0.15 or 0.16, whereas the exact p-value given by the Fisher's test is 0.48, 0.49 maybe. So this is the sort of case where we can see directly that the chi-square gives us a fairly bad approximation of the p-value: the exact one is about 0.5, and here it gives us 0.16. You may say, okay, it's not too problematic, because we still don't reject either way. But that's just an example: you could imagine the same case but slightly more extreme, or with slightly more data, and you would still have this difference, and it might actually change your interpretation. And even if you just consider this from the standpoint of the sort of bet that you want to make: in one case you have a bet where the probability of being wrong, at least for one type of error, is about 0.5, and the chi-square says it's 0.16, so it drastically under-estimates the probability of making a type I error, which is quite problematic. There's a question by Ahmad: so based on what we determined, the p-value of the chi-square is bad or wrong? You said that the 0.48 from the Fisher's test was the correct one and the chi-square one was bad, or something like that? Yes. By definition, if you will, the Fisher's exact test always gives you the exact p-value; this is the gold standard for that scenario, it is always good. The chi-square gives you an approximation, and how good that approximation is depends on the expected counts: it's a good approximation when the expected values are high, and a bad approximation when the expected values are low. And that's what we see here. Let's see, go ahead. Okay, so if we have, like, 10 cases and we get a p-value of the same amount, 0.16, in this case would we say the p-value is correct? I'm sorry, I did not get the end of your sentence. You said that it's a bad approximation because we have a low number of cases, right? Of observations, yeah. So if we have, let's say, 10 or 20 observations and we still get a p-value of 0.16, in that case will the p-value be correct or bad? So the rule of thumb, and again it's a rule of thumb for when the approximation becomes not so bad, is that the number of expected observations in each cell should exceed five. And that ties into some of what is returned: you see here that the chi-square function returns not just the test statistic and the p-value, but also the degrees of freedom, that's one thing, and also this expected part. And this expected is shaped like your table: this is my table there, and it gives me the expected values in this table here, so these values of two. So if anything in this expected table is under five, you can be sure that this is a bad approximation and that you should not trust this p-value.
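Here is a minimal sketch of that call on the same small table, just to show what comes back:

```python
from scipy import stats

table = [[3, 1],
         [1, 3]]

# correction=False switches off the Yates continuity correction, so we see the
# raw chi-square approximation; the function also returns the expected table.
chi2_stat, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)
print(chi2_stat, p_chi2, dof)        # ~2.0, ~0.157, 1
print(expected)                      # [[2. 2.], [2. 2.]] -> all below 5: don't trust p_chi2

# For comparison, the exact p-value
print(stats.fisher_exact(table)[1])  # ~0.486
```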
If everything is above five, then you can judge that it should not be too bad of an approximation, and you can more or less trust this p-value. Okay, great. So this is a good point. And then if we have, let's say, 20 observations, and we get a chi-square p-value different from the Fisher's test, in that case which one should we consider? Fisher, always. Let me make it as simple as possible: if you can do Fisher, do Fisher, because it's the exact one. We only use the chi-square when we have, I would say, either tens of thousands of samples or more than two categories for one of your variables. Okay, I got it, thank you so much. You're welcome. And there's a question by Dennis: so we now looked at the p-value, but then you have this chi-square value, or the odds ratio above; do you also look at these, are they somehow important, or just the p-value? Me personally, I do not necessarily look too much at the chi-square, because this is a test statistic; it's interesting if you want to re-compare it against a specific distribution yourself, which can sometimes be useful in some multiple-correction scenarios. But the odds ratio, I do look at it, because this is a measure of biological significance for me. It's a bit like not looking just at the p-value but also looking at the difference in means between the weights of the mice; this is the equivalent here, it's a measure of how strong the bias between the categories is. Okay, perfect, so it is informative in that sense. All right. So then we can imagine, as I said, in that particular case, don't use the chi-square, it's a bad approximation. If the number of samples is larger, let's now say that we have a very, very patient lady who will drink 80 cups of tea for us. Then we can try the same scenario and do a Fisher's exact test and a chi-square test. And now you can see that there is still a difference between the two, but the difference is much smaller. Before, the ratio between the two p-values was something like a factor of three or a bit more; now it's just a factor of 1.5, which is still not ideal, it's not the same, but it's, let's say, less bad. And that's what we mean when we say that the approximation is not so bad anymore. And in that particular case, in the expected table, you see everything is well above five. So that gives you a little idea. Again, if you have just two columns and two rows, I would say always use the Fisher's exact test. It made sense back in the day, when we had to compute things by hand or we did not have powerful computers, to use the approximation; nowadays, with the computing power we have at our disposal, even just the smartphone we have in our pockets, there is absolutely no excuse not to use the Fisher's exact test when you have just two rows and two columns. It's the one that gives you the exact p-value, so why not use it? Okay. So now it's your turn to play around. Come back to our census data set; your goal is to try and test the association between the majority religion and the majority language. And I give you a little tip here, a little formula: you have this nice pandas pd.crosstab function to create the contingency table. So I will give you, I think, five to ten minutes to solve that one, and also to get back in the mood for coding and so on and so forth.
And after that, we'll do a correction. Please put a little green tick next to your name when you are able to solve it, in particular the first question; the second question is maybe more theoretical. So put a green tick when you have solved the first one, and then think about the second one while you wait for the correction, please. All right, so I will resume the recording. Okay. So we want to test whether there is an association between majority religion and majority language. For that, I gave you this little trick here to create a table between the two: the crosstab function is quite nice, you give it two categorical columns and it gives you a table. That's what we get. So what is the first remark that we can make here? We have two categories in the majority religion column but four categories in the majority language column. So if we try to apply the Fisher test and give it this table, what we get is an error. The error says that the input table should be of shape two by two. That's what I said to you: whenever you have more than two categories, the Fisher test becomes very, very difficult to do, and thus many libraries have just not implemented the Fisher test for that particular case. You can find some obscure libraries out there that will do it for you, but be aware that for such a high number of samples, especially with more than 1,000, the actual computation would take a very, very long time. So here we can rely on the fact that there are enough observations, a lot more than 2,000 in total I think, to use the chi-square. This is of course not ideal, but we don't have much of a choice here. So, stats.chi2_contingency, and then we give it our table. It will return to us the chi-square statistic, the p-value, and then, I think, expected and then df, or is it the converse, df and then expected? Yeah, df and then expected. So that's something you sometimes have to check and test. And so this runs, and maybe before I even look at the p-value, I will look at the expected table. The expected table says to me, okay, there are at least 55 samples in each cell. So I am fairly confident that the chi-square should be a good approximation with at least 55 expected samples per cell. I would then look at the p-value with a bit of confidence, and the p-value is very, very small, something like 10 to the power of minus 80, so 80 zeros and then a one or a three. Saying that it would be extremely unlikely, if there were no association between these two columns, for the null to produce the repartition that I'm seeing here; let's not over-interpret our p-values. And indeed it kind of makes sense, because here you can see that you have a very large majority of Catholics among Italian speakers, a huge skew, whereas this is reversed among French and German speakers. So already this is a sign that there is something happening there, especially with such high numbers. If I only had 10 in each category, I would ask myself some questions, but here it's fairly clear. One small limitation of the chi-square, maybe, is that it tells you that there is some association, but it doesn't necessarily tell you exactly where the associations are. So that's something where we sometimes have to dig a bit more to learn.
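For reference, a minimal sketch of that correction; the data frame and column names (df, "majority religion", "majority language") are assumptions on my part and need to be adapted to the actual census notebook.

```python
import pandas as pd
from scipy import stats

# Contingency table between the two categorical columns (names assumed)
table = pd.crosstab(df["majority religion"], df["majority language"])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
print(expected)   # check first: every expected count should comfortably exceed 5
print(p_value)    # here, a vanishingly small p-value -> reject the null
```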
Now, about the second question. We have this table, and we have seen that we cannot apply the Fisher's exact test because we have more than two categories. So my second question was: how could you make the Fisher's test work here? Do you have any idea, what would you try, what would be your instinct? You can write in the chat or just raise your hand and speak if you want. Yes, Sabine? I would have tested the languages individually, so for each language, whether there's an association with one or the other religion. Okay, so then you have several tests, but each one is always just two by two. That's an idea. Then Ahmad? I think we could maybe make two or three tables out of this one, the Catholic and the Reformed with French, German and Italian, and then we can switch between them. Yeah, okay, so same thing, we test them two by two. And then there's Dennis as well, and Rosario, who tried merging the Latin languages, adding the Italian and French speakers together, to try and make something with that. And then Engie? I'm sorry, I don't know how to pronounce your name, maybe you can tell me how to pronounce it. Just imagine there's a U in front. Okay, thank you, no worries. I'm just wondering, because I tried to make it into a two-by-two table, but instead of doing a sub-table like what you have done, I tried to limit the table when I input it into the Fisher exact test: inside the brackets I put square brackets and tried to do a colon, comma, colon two, but it failed. So you wanted to do something like this, this and then that and that, and then you would just input the numbers? Okay, 426, 693, is that what you mean? What I meant was you have odds ratio, comma, p-value, equals stats.fisher_exact, bracket, table. Right, and then you need to put parentheses around the table. Yeah, I did that, but on the table I put square brackets to limit it, and it failed. Yeah, I see, something like this, on the table? Exactly, then table, and I limit it to all of this. Yeah, I see. So that may work, but then you have to play around a bit with how that would look. For instance, if you do something like this, I think it will not really work, because this table is indexed on labels. So what you might need is to actually use an indexing method that works for that table; sometimes you have to fiddle a little bit with this stuff. Hmm, would that work? Actually, I've never tried to do it directly like that. So there, you see, it acts a little bit like a data frame, so you maybe have to say that you want to use .loc. Yeah, that works. No, it doesn't want to. So you would have to, I think, find a way to select both the French speakers and the German speakers; but I'll be honest with you, I don't know if that exact syntax works, usually I work at a higher level. Okay, yeah, you see, that works. Then if you put this into the fisher_exact, would it work? Normally it should, yeah. So you give it that, and then you should get your specific p-value for that pair. So that's one way of doing it exactly: you cut the table down, but then that means that either you only do one test for a small selection of the categories, or you do one, two, three, four, five, six tests. Okay.
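A hedged sketch of that "two categories at a time" idea, assuming `table` is the crosstab from before and that the language labels are spelled as below (adapt them to whatever actually appears in the table):

```python
from scipy import stats

# Select a 2x2 sub-table with .loc: both religions, but only two languages,
# then run the exact test on that sub-table.
sub_table = table.loc[:, ["French", "German"]]    # column labels are assumptions
odds_ratio, p_value = stats.fisher_exact(sub_table)
print(odds_ratio, p_value)
```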
So that means that, because you do six tests, you will want to do a correction for multiple testing. Yes, exactly. So that's one way of doing it, and it works well. The other one was given by Rosario and Dennis: indeed, you can also try to combine the Latin languages. That works only if you have, let's say, expert knowledge of what you can merge and what you cannot. Here you have the expert knowledge, or at least the knowledge, that these are Latin languages, and so it might make sense to merge them together. Might, might not; you could discuss with a specialist in the domain to know whether that's worth it or not. And for this, of course, it's mostly about being able to compute the numbers. So either you do the sums manually, and I would say that works fine if you are not at ease with pandas, sometimes you just do the sums by hand and create your table manually; or there are ways to do it in pandas. For instance, you can take the majority language column, and I've shown you this equality comparison before, so you can ask for, say, French speakers, and that gives you a column of False and True. But you can also use .isin, and then you can ask for French speakers or Italian speakers, and that will again give you False and True, but now with True wherever the region is either French- or Italian-speaking. And that's where you can start to use this to compute the different things together. So what I would do is something like this: I would build my mask for Romance speakers, my mask for Latin languages, and I can do value_counts on that. And then, on the other side, I have my majority religion, so I would do a pd.crosstab of my majority religion column and my Latin-language mask, and that's what I would get, knowing that True corresponds to the Latin languages and False to the non-Latin languages. That's what I meant earlier when I said that having a good command of these pandas structures makes the analysis a bit easier. So once you have that, you have your table and you can feed it to the Fisher's exact test. Okay, so let me paste that in the chat so that you have access to this little solution; there's also a sketch of it below. So that gives you two ways of reaching your goal of applying a Fisher's exact test, and not necessarily the chi-square test, to your table. All right, so do we have questions? What would you do if we wanted to test the association between three categorical variables? So, my first idea, maybe, would be to test for the association between each pair of the three. And then another way of thinking about it could be... let me think. It would depend a little bit on exactly how many categories I have in each. But if I can give you a, let's say, complex answer: what I would do is a multinomial generalized linear model of one category given the two others. So then we suddenly jump in terms of complexity, we add at least three layers of complexity to the modeling, but that is how I would test this. To translate that into maybe more layman terms, what this comes down to is that I would try to create a linear model where I define the probability of being in one category or the other as a function of being in the other two categories, and eventually also the interaction between those two categories.
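And here is the sketch of the merging approach mentioned above; again, df and the column and category names ("majority language", "majority religion", "French", "Italian") are assumptions taken from the transcript.

```python
import pandas as pd
from scipy import stats

# Boolean mask: True where the majority language is a Romance language
is_romance = df["majority language"].isin(["French", "Italian"])

# 2x2 table: religion (two categories, as in the exercise) against the mask
table = pd.crosstab(df["majority religion"], is_romance)
print(table)                          # columns: False = other, True = Romance

odds_ratio, p_value = stats.fisher_exact(table)
print(odds_ratio, p_value)
```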
So, all right. If there are no other burning questions, it is 10:46, so I propose that we go on a 15-minute break until 11. All right. So, we have just seen the Fisher's test and its approximation, the chi-square test. Now we are going to look at a test which is non-parametric, the Kolmogorov-Smirnov test. So it's non-parametric and it is made to compare entire distributions. It's not like the t-test, which just compares the means of two distributions; here we test the mean, the scale, so the standard deviation, and also the shape, all at once. There are two, let's say, ways of doing this test: the first is what we call a one-sample test, and the second is a two-sample test. The one-sample test will check one distribution against a theoretical distribution, for instance your sample against a normal distribution. And the second will test two samples against one another, to see if they come from the same, or nearly the same, distribution. The way that it works is that you look at the cumulative distribution of values and try to find the point where you have the maximal distance, either between your two samples, or between the sample that you want to test and the theoretical quantiles. So let's try a small example. I have my observations, my sample, which comes from a little bimodal distribution that you can see in blue, and I will do a one-sample Kolmogorov-Smirnov test against a normal distribution. So I ask the question of whether or not it is close to a normal distribution: the null is that it follows a normal distribution, and the alternative is that it does not. Incidentally, this can also be used as a test for normality. So this is what the PDF looks like, both for the sample data and for the expected data. And then what the Kolmogorov-Smirnov test does is that it looks at the CDF of both and finds the point where the distance between them is maximal; that should be this black bar here. And this distance follows the Kolmogorov-Smirnov distribution, which goes far beyond the scope of this course, sorry, but that is what is then used to compute a p-value: they have built a distribution that describes how this maximal distance should behave under the null hypothesis. So in practice, stats.kstest, the Kolmogorov-Smirnov test: here I give my sample, and I want to compare it against a particular distribution, so here a normal distribution, stats.norm.cdf, or any function that is able to return the cumulative distribution curve I need. And what I get out of that is a KS statistic of 0.15, that is the length of this little bar, and the associated p-value, which you can see is very, very small, so I can reject the null hypothesis that this sample distribution follows a normal distribution. So far so good? Yes, no, unsure? Yes, okay. So here you see that I have chosen to use a normal distribution, but I could just as easily have changed that to an exponential or whatever: it's a test that is very easily adapted to checking for other kinds of distribution.
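A minimal sketch of the one-sample call; the bimodal sample here is made up for illustration, so the exact statistic and p-value will differ from the notebook's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# An obviously bimodal sample, tested against a standard normal distribution
sample = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])

ks_stat, p_value = stats.kstest(sample, stats.norm.cdf)
print(ks_stat, p_value)   # large maximal CDF distance, tiny p-value -> not normal
```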
So from there, we have our one-sample test, and we can do the same thing for two samples. The idea stays the same: we have two samples, sample one and sample two, and we show the CDF of both, now both observed. So there is maybe more uncertainty, because this additional curve, which before was a theoretical line, is now also estimated through a sample. And you find the point where they are farthest apart, and that is the point that you will use as your test statistic. And then again, it's stats.ks_2samp, sample one, sample two, and here you get this Kolmogorov-Smirnov test statistic, which is associated with, again, a very, very small p-value. All right? So it's a non-parametric test; that means that you don't have any assumptions to check, which is quite nice, quite useful. And it accounts for variation much more general, much broader, than just a shift in location: you can also check for differences in scale or shape, as I have said. The reason why we don't use it all the time is that it has much worse statistical power than the t-test, as with any non-parametric test: the less information it has as input, the more it has to account for a lot of uncertainty. There is a question by Rosario: is it needed to normalize the values within the range before we compare the two samples? Well, if you want to be able to detect a difference, for instance in location or scale, and you normalized the values in the samples, you would actually erase that information about differences in location and scale, so that would not work. If you just want to test for a difference in shape, then maybe you could indeed normalize first; so it depends a little bit on your objective. Then from Meghana there's a question: can I use this test for more than two samples? No, it's really at most a two-sample test. If you want to use it for more than two samples, then you would have to do multiple tests for the different combinations, and of course then you would have to correct your p-values. And then there's a question by Celina: what is better to use, the KS test or the Welch's t-test, to compare two samples with unequal variance? If you want to compare two samples with unequal variance and you are only interested in detecting a difference of location, so a difference in the mean, and provided that the assumptions of the Welch's t-test are respected, then the Welch's t-test is better: it will have more statistical power. So let's actually demonstrate that. Let's imagine that we have samples of size 10, a difference of 1.0, a significance threshold of 0.05, and we will repeat that one thousand times, because I love simulations. We have two samples, and they just differ by this 1.0 difference; so they differ by one standard deviation, because we are using standard normal distributions, where the standard deviation is equal to 1. We draw our two samples, we check if the KS test is able to detect the difference, we check if the t-test detects the difference, and we count. And we see that out of a thousand tests, the KS test has a statistical power of only 0.25, whereas the t-test has a statistical power of 0.58. So we can see that the t-test is twice as powerful for this small difference of one standard deviation.
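A hedged re-creation of that simulation (the seed, and therefore the exact powers, will differ slightly from the lecture's run):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n, diff, alpha, n_repeats = 10, 1.0, 0.05, 1000
ks_rejects = ttest_rejects = 0

for _ in range(n_repeats):
    sample1 = rng.normal(0.0, 1.0, n)          # standard normal
    sample2 = rng.normal(diff, 1.0, n)         # shifted by one standard deviation
    if stats.ks_2samp(sample1, sample2).pvalue < alpha:
        ks_rejects += 1
    if stats.ttest_ind(sample1, sample2, equal_var=False).pvalue < alpha:
        ttest_rejects += 1

print(ks_rejects / n_repeats)      # KS power, around 0.25 in the lecture's run
print(ttest_rejects / n_repeats)   # Welch's t-test power, around 0.55-0.60
```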
So hopefully, I think, this convinces you that when you can, when the assumptions are met and when this corresponds to what you want to do, the t-test will give you much better results. But when they don't hold, it's useful to know what to apply instead. And if you want to test for a difference finer than just a difference in location, then the Kolmogorov-Smirnov always works, and that's nice. Okay, so then a small exercise. You will see that this is how we go: we see a test, we do an exercise to check it right away. So your goal is to come back to this nice mice data set, but in this data set there is another column, not just the diet, but also the genotype. And you can try and use the Kolmogorov-Smirnov test to determine whether the distribution of mice weights differs between the wild type and the KO. You see here in the KO, maybe it's like one big blob, maybe there are two blobs, maybe a bimodal distribution, not so sure. And the question is: could that just be due to chance, or are they significantly different? And we can use the Kolmogorov-Smirnov test for that. So I will stop the recording, and, resuming it, making sure that I have access to the chat, there we go. So our goal is to compare these two distributions, these two samples: are they significantly different or not? For that, we just call stats.ks_2samp, and we want to give it the wild-type mice and the KO mice, specifically their weights. So for that I use .loc: my data where genotype equals wild type, and my data where genotype equals KO. And with these, I get a p-value of 0.29, so that would mean that these are not significantly different, statistically speaking. Whether this is because I don't have enough points or because they are actually not different, I cannot say, but at least that's what I have here. All right. This one was fairly simple, I think: we just take this and we apply the function, but I think you got the gist of it now. Okay, so any further questions on this exercise or the Kolmogorov-Smirnov test? All right. Then let's get on to another test. When we were playing with the t-test, we used the version of the t-test, the Welch's t-test, that does not presume equality between the variances of your two samples. And remember, I said that there is a version, the Student's t-test, that does presume equality of variances. And there are also other tests, such as the ANOVA, which do presume equality of variances between the different groups being tested. That's the property we call homoscedasticity, when you have homogeneity of the variances, and of course we have a test for that. Although, again, this is a test where the null hypothesis is the equality of variances, so it's in that somewhat imperfect setup where we would like to not reject the null hypothesis; that's not what the test is actually made for, but we have to make do. So, assumptions: we presume that the data is normally distributed, and here it's the data itself that is normally distributed, not the mean of the data, so we are operating under a stronger set of constraints. And our test setup is that we have m groups, so it works with more than two groups, but it works with two groups as well, which each have their own variance and their own number of observations, for a total of N observations. And then your null hypothesis is that all the variances are the same.
And the alternative is that at least one of these variances differs; you might have all of them equal except one, and that would be enough to get out of the null hypothesis. The test statistic, I will not go into the detail of it, but if you just take a little bit of time to look at it, you can see that on one side you have the pooled variance of all groups together, and you compare this variance of all groups together with each group's variance taken separately, with a normalization based on the size of each group. And so, intuitively, you can guess that if there is not a lot of difference between the variance of all groups pooled and the variance of each group taken individually, it makes sense that this would be judged more toward the variances being equal; and if these differences are large, then it makes sense that this would be judged toward the variances being different. All right. Here I just define briefly the different terms: you have on one side the sample variance for each group, and then here the pooled variance; this is the same thing we have seen before, where we try to estimate the variance for all groups taken together. And under the null hypothesis, this statistic should follow a chi-square distribution, same as what we have seen in the earlier setup, with m minus 1 degrees of freedom, so the number of groups minus one. Now, this is an approximation, it should approximately follow it, so we can actually test this little approximation with some simulation. So here, I imagine that I have three groups containing three elements each. I will randomly draw these three samples of three elements. So I'm putting myself in a case where the chi-square approximation might be bad, because my sample size is very small, but I am under the null hypothesis, because these all have the same variance: I draw the random samples from normal distributions with a standard deviation of one, the default. So I make these three points in three groups, and I give that to stats.bartlett, which performs Bartlett's test, and I keep my test statistic. Then I will compare my test statistics with a chi-square distribution, because the theory says that under the null hypothesis this statistic should follow a chi-square distribution, and I want to verify this. And so once I have all my test statistics, I will compare them to the chi-square, and I will use a Kolmogorov-Smirnov test, which we have just seen, to check whether the test statistics are significantly different from the chi-square with the appropriate degrees of freedom, which is the number of groups minus one. In terms of pure code, you can see that here I cannot just give stats.chi2.cdf, because I have to fix this df argument, so I use what we call a lambda function. That's a way of writing, if you will, a very quick function in just one line: I say there is a small function with one argument x, and it should return this, which is the CDF of the chi-square where I have just fixed the degrees of freedom, and x remains as the one free variable. All right, so let me execute that one.
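For reference, a minimal sketch of that simulation (seeded arbitrarily, so the exact p-value will differ from the one shown next):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 3 groups of 3 points, drawn under the null (same variance everywhere);
# collect Bartlett's statistic many times and compare it to chi2(df = m - 1).
n_groups, n_per_group, n_repeats = 3, 3, 10_000

statistics = []
for _ in range(n_repeats):
    groups = [rng.normal(0.0, 1.0, n_per_group) for _ in range(n_groups)]
    stat, _ = stats.bartlett(*groups)
    statistics.append(stat)

# One-sample KS test against chi2 with df = 2; the lambda fixes the df argument
# so that kstest receives a one-argument CDF.
ks_stat, p_value = stats.kstest(statistics,
                                lambda x: stats.chi2.cdf(x, df=n_groups - 1))
print(ks_stat, p_value)    # in the lecture's run the p-value was around 0.85
```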
I run 10,000 tests, and then I can see that even in this setup, where I only have 3 points per group and 3 groups, I get a p-value of 0.85. That means that these test statistics, with 10,000 samples, are not significantly different from the chi-square distribution, and with 10,000 points I would have good statistical power, since the power always gets better as you have more points. And then my expected chi-square, you can see it plotted here in red, and here, in avocado color, is my sampled Bartlett's test statistic, and you can see that they follow one another fairly well. This is approximate, but it's actually a very good approximation, so even when your sample size is small you can trust the approximation and use the Bartlett's test. All right, so far so good. So, two questions by Meghana: if your data is not normally distributed, can I normalize it and use the Bartlett's test? And do my samples have to be independent? For the first question: there is a big question about what you mean by normalization. A lot of the time, when people say normalization, they just mean center and scale, and that does not necessarily make your data normal, it just makes it scaled. So in general I would say, if your data is not normally distributed, then don't use the Bartlett's test, but instead the Levene test. I leave it to you to click on the little link here to go and look at how to use the Levene test; it's an alternative to the Bartlett's that is robust to non-normality. And then, second question, do my samples have to be independent? In that case, I don't think it's strictly necessary, although if you have something like a time series, then it does matter. Does that answer your question? Okay, superb.
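For completeness, a minimal sketch of the Levene call just mentioned, on made-up numbers; like Bartlett's test, it takes one array per group.

```python
from scipy import stats

group1 = [2.1, 3.5, 2.8, 4.0, 3.1]      # illustrative data only
group2 = [1.9, 2.2, 5.5, 0.8, 3.0]
group3 = [2.5, 2.6, 2.4, 2.7, 2.5]

stat, p_value = stats.levene(group1, group2, group3)
print(stat, p_value)
```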
So that's the Bartlett's test to check equality of variances, and here there is no exercise, because we'll get to apply it in the next one, since the next part is the one-way ANOVA. I'm sure that you've already heard about ANOVA before, at least a little bit, but let's go back to it. ANOVA means analysis of variance, and we'll see why it has its name in a moment. So here we'll discuss the one-way ANOVA, and to make it simple, it's basically like a t-test but with more than two groups: it's a t-test where you have 3, 4, 5, doesn't matter how many groups. The assumptions are that the population distributions are each normal, so we are under the strong normality assumption there, that the samples have equal variances, so for that we have our Bartlett's test, and that the observations are independent from one another. Our null hypothesis: all the means of the different populations are the same. And the alternative hypothesis: at least one of the means differs from the others. Now, the way that it works is that it uses not so much variances as sums of squares; you have to remember that a variance is a sum of squares divided by the size of the group or the population. So what we do here is that we decompose this sum of squares into the variability within each group and the variability between the groups. And you can actually write that the sum of squares total, so the sum of squares for the entirety of the data, is equal to the sum of squares within each group plus the sum of squares between the groups. Now, if the groups matter, if the groups actually cause a shift in mean, then the sum of squares between groups should be somewhat big; and the bigger the sum of squares between groups, the more the groups explain a large part of the variance, if you will, and so the less likely it is that they all have the same mean. That's what the ANOVA relies on. So the sum of squares total is just the sum of squares of the full data: you take the mean of the data, compare each point to that mean, and square this difference each time. For the sum of squares within, you do exactly the same but within each group, and then you sum across all groups. And for the sum of squares between, you compare the mean of each group to the grand mean of the entire data. And then we make a ratio of the sum of squares between and the sum of squares within, each scaled by the number of points that we have to consider: when you have five groups, you divide the between part by five minus one, and if you have a total of 100 points and five groups, you see here that we divide the within part by 95. These little minus one and minus m are degrees-of-freedom corrections; they are there to account for the uncertainty that comes from the fact that you estimate both variances and means at the same time, again, all of these small corrections. And then that should be compared with the F statistic; this is an F distribution, something specific to this test that we don't have to go into the details of, we just take what is given to us by the literature, and it has degrees of freedom given, again, by the number of groups minus one and the number of points minus the number of groups. So let's have a look with a little data set of plants, where we have different subspecies of a plant and the sizes that were recorded. We have our three groups; you see on one side the entire data set, and then it split among the three groups. And I think here it would be legitimate to say, well, this looks like one big group, but maybe there is an actual difference; so it could be worth it to actually do the test and check which way it goes. So if we do it a bit manually, we need to compute three quantities: the sum of squares total, the sum of squares within each group, and the sum of squares between
each group. So to do that, I compute the grand mean, the total mean of everything, and then I use groupby to group the plants by subspecies, and I aggregate the plant size by mean and count. And then, you see, within this groupby I apply a lambda function again, a little custom function where I compute the sum of squared differences between each group's values and its mean. This is, let's say, slightly more advanced pandas usage; I will not go too much into the detail, but if you are interested, you can go back to this example later on and use it as a template for more advanced ways of doing computations automatically across groups in a data frame. So here: three subspecies, our total mean, the mean for each group; you can see that these are indeed close, but this one is maybe a bit different. There are 10 individuals in each subgroup, and the sums of squares, so each plant minus its group mean, squared, then summed, are these three values. Those are the elements of our sum of squares within. Then our sum of squares total is just the plant size minus the grand mean, squared, summed; the sum of squares between is the mean of each group minus the grand mean, squared, weighted and summed; and the sum of squares within is just the sum of those three things. So I have my sum of squares total, sum of squares within each group, and sum of squares between the groups, and then here I just test the claim that the sum of squares total is equal to the sum of between and within. You can see here that the difference between the two, the total versus between plus within, is actually 10 to the power of minus 14; it's small enough that, as we said, this is indistinguishable from zero from the standpoint of a computer, in general this is numerically equivalent to zero. All right, so far so good? Yes? All right, so we have these sums of squares, and we have checked that indeed we can decompose the total into between and within. So then we use the ratio of between to within to gauge the actual importance of the subgroups in explaining our total variance; if this importance is large, then it's unlikely that they have the same mean, and if it's small, then it's a bit more likely that they have the same mean. So I apply here the formula that we have seen before: I normalize the sums of squares by the degrees of freedom, and then I test that against an F distribution with the degrees of freedom given there, and what I obtain is a p-value of 0.009, so actually something quite small, slightly under 1%. Of course, in practice you don't do that yourself; you use stats.f_oneway, F because it's an F statistic and one-way because it's a one-way ANOVA, and I give it the plants of group 1, the plants of group 2, the plants of group 3, and I get my F statistic and my p-value. And you can see that here I did not mess up too much, because I actually get the same thing, up to maybe a small numerical error down there; otherwise my computation is fine. And so in that particular case I would reject the null hypothesis and say that at least one of these groups differs from the others.
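Here is a hedged sketch pulling that manual computation together; the data frame and column names (plants, "subspecies", "size") are assumptions and will need to be adapted to the notebook.

```python
import pandas as pd
from scipy import stats

grand_mean = plants["size"].mean()
groups = plants.groupby("subspecies")["size"]

m = groups.ngroups            # number of groups
n = len(plants)               # total number of observations

# Sum-of-squares decomposition: total = within + between (up to rounding)
ss_within = groups.apply(lambda x: ((x - x.mean()) ** 2).sum()).sum()
ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
ss_total = ((plants["size"] - grand_mean) ** 2).sum()

# F statistic: between-group mean square over within-group mean square
f_stat = (ss_between / (m - 1)) / (ss_within / (n - m))
p_value = stats.f.sf(f_stat, dfn=m - 1, dfd=n - m)
print(f_stat, p_value)

# The same test through scipy, one array per group
print(stats.f_oneway(*[values for _, values in groups]))
```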
All good? Yes, Rosario? I have just a question about this: how do we identify which group differs in the end? I mean, we can have several groups, with some of them differing, but then we need to track down which. And another thing I was thinking: this sum-of-squares computation is very similar to the variance, and if I understand right, it's not sensitive to the number of samples, because we don't have the division by the number of observations, I guess. So I was thinking, if we assume that the variance should be the same among these sub-samples, and we take again the numerator of the variance calculation, does that mean that this is sensitive to the number of samples that we have per sub-population? Or maybe I got totally lost with the math. Sure, so I think the question is legitimate. In just the sum-of-squares part of it, indeed, you don't really take into account the size of the group, the number of individuals in each group, except in the sum of squares between, because, you see, we scale, we multiply the sum of squares between groups by the size of each group, so it is already present there. And then here, you see that we renormalize again by the number of elements in each category, so it appears here too; and furthermore, the number of individuals appears in the degrees of freedom that we give to this F distribution, whose variance and shape will change as a function of that. So it doesn't appear very directly in the computation of the F statistic, but with respect to the computation of the p-value it matters a lot. Okay, thank you. And then the first question, so I did the second question first: the first question was about what we do afterward to see which are the significant differences, because we know that there is at least one group differing, but which ones are they? For that we use what we call a post-hoc test; the one recommended here is Tukey's honestly significant difference test. That's something that you only do after you have had a significant ANOVA, and I gave you this little link here; you can go and read the documentation, it's fairly simple, there is a nice little example of code and so on that will explain how to use it. Otherwise it goes slightly beyond the scope of the course, but indeed, it's a very common thing, to test afterward which groups are different and which are not. One thing that remains, and which is quite important: as you have seen, the ANOVA has very, very stringent assumptions, normality of the data in each group and equality of variances, and if these fail you cannot apply the ANOVA, in which case you have different alternatives. First off, if the normality is okay but the variances are unequal, you can use either the Alexander-Govern test, which is in scipy, or Welch's ANOVA, which is not in scipy but can be found in the pingouin package, which implements it. So if you are in this case, you can go and use these links; these are well-documented functions. And if your data is not even normal, then you have to go non-parametric, and there you have the Kruskal-Wallis H test, which is non-parametric and doesn't check equality of means but rather equality of medians; so it's slightly different, but it's not uninteresting. You have here again a little link. Okay, so then we can go on to our exercise. There is a question by Meghana: what can I apply if my data also fails the independence-of-observations assumption, along with normality and homoscedasticity? So it depends how it fails independence; do you mean, for instance, that you have a time series, is that what you imagine? No, so for instance, I have about 10 categories in my data set, and in each category I have 10 samples, and I have thousands of observations per sample. So what can I use to determine whether these 10 categories
are different from each other or not. So you have 10 categories, okay; for each category you have 10 samples; and then you say that for each sample you have, what, thousands of observations? What do you mean by observation there? For instance, it's metabolite abundance levels: there are thousands of metabolites present per sample, and they have different abundance levels. I see, okay. So the data can maybe be put into one table with 1,000 columns, each column being one metabolite, and then 100 rows, each row being a sample, and these different samples are split among 10 categories, if I understand correctly. Yeah. So there, what you could do is test each metabolite independently, each using either an ANOVA, if you have normality and homoscedasticity for that particular metabolite, or a non-parametric alternative otherwise; if you look at so much data, it's likely that at least one metabolite will breach the assumptions. And then you will get, well, 1,000 p-values, one for each of your tests, and you will need to correct these p-values for multiple testing. Is there an alternative to doing it individually for each metabolite? It depends a little bit on what your goal is in the end. An alternative could be to build a multivariate model: if we switch the focus from asking what are the metabolites where there is a shift in mean between different categories, we could switch to another question, such as how we could use these metabolites to predict that a sample is in one category or the other; we could use again some sort of multinomial generalized linear model to pool all of your metabolites in a single big multivariate model. Aha, okay. That would add a layer of complexity, but let's ignore that for now; it would also, if you will, shift the perspective of what you could say from this result. Right, okay, thank you. You're welcome; if you want to ask more questions on that specifically, you can use the Google Doc, because I think this touches upon more advanced techniques, which I would be able to discuss, but it goes a bit beyond the scope. Okay, I will do so, thank you very much. Thank you. All right, so then, unless there is any other question, we are going to go on to our third exercise, where I give you some data about weight differences after six weeks of a program; you have different types of diet, and you want to use an ANOVA to see if there is a difference between the effects of these diets. So here, do a little bit of plotting first, I would say, then check the assumptions of the test, and then, if the assumptions are respected, do the test. We are going to do this up until 12 sharp, then we are going to stop for our lunch break, and when we come back from the lunch break I am going to ask you whether you want maybe five more minutes to complete the exercise or whether you want to go to the correction right away. Does that work for everyone? Okay. So then, as usual... So our goal is to try and apply an ANOVA-like test on this diet data. Here we will just focus on the weight column and the diet column. Maybe one thing we could do is start looking a little bit at this diet column, and I think value_counts is a good first approach: there are 3 groups, 1, 2 and 3, with group 1 having 24 people, group 2 having 27, and group 3 as well. Then I will actually just play with the correction, so that you don't have to
Okay, so we start with maybe a little bit of plotting. Here I use the same plot as the one I proposed earlier; I think it works fairly well for this particular case. So we have the weight difference after 6 weeks of diet, and we have our 3 diets. It might be that they are indeed different from one another, but there are also not so many people per group, so testing might be warranted here.
We then want to test our assumptions. Remember, we have two main testable assumptions: the first is normality and the second is equality of variance, so homoscedasticity. Dennis is asking if I could post the command used to get the number of people in each diet: yes, of course, it is df.diet, dot, and then there is a very useful function, value_counts. Okay, so there you go.
All right, so we want to check all that. First, let's do our test of normality. The way I do that here: I create axes, one row, three columns, and I say that they share their y-axis so that everything is on the same scale. Then I say there are three levels of diet, 1, 2 and 3, so I go through all three levels, and for each of these I call the probplot function on the weight difference where the diet is equal to the current level, and set a title. Then I do the same thing, but with the Shapiro test. Here I use groupby again: I group by diet, and on the weight column I apply stats.shapiro. If I did that without the groupby, the Shapiro test would apply to the entire weight column; because I group by diet, it is applied to each group separately. And then, last but not least, we want to test for homoscedasticity, so stats.bartlett with group 1, group 2 and group 3, and then we have our results here.
So we have our three QQ plots; as you can see, they look fairly good. Maybe there is a little bit of under-dispersion here, might be, but nothing too large. And what we can see from the tests of normality is that our p-values are all above 0.05. So there might still be a deviation from normality, but if that deviation is not so large that it is detected with our twenty-something samples per group, that does not necessarily mean the data are normal, only that the deviation is not too big. So I would accept these as okay with respect to the normality assumption. And then, last but not least, we have our test of homoscedasticity with the Bartlett result and a very high p-value there, so again I would be fairly confident in accepting my null hypothesis. Of course, I have to keep in mind that this p-value is not the risk I am actually interested in, because it represents the type 1 error risk, and when I am accepting H0 I am exposed to the type 2 error risk. So this time I am advancing somewhat blinder than I would like, but with such a high p-value I would say it is quite okay. We could also, of course, have a look at the actual observed standard deviations to help convince ourselves; in the same way that it is good to have the plots on top of the tests, it can be good to just look at the standard deviations. So here that would be df.groupby: you group by diet, you look at the weight column, and then .std().
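Put together, the assumption checks just described could look roughly like this. This is a sketch rather than the exact notebook code: the DataFrame below is made-up data standing in for the actual exercise dataset, and the column names ("Diet", "weight_diff") are assumptions that may differ from the real file.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# Made-up data standing in for the exercise dataset (real names/values may differ)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Diet": np.repeat([1, 2, 3], [24, 27, 27]),
    "weight_diff": rng.normal(loc=[-3.3] * 24 + [-3.0] * 27 + [-5.1] * 27, scale=2.4),
})

print(df["Diet"].value_counts())               # group sizes (the question above)

# QQ plots: one panel per diet, shared y-axis so everything is on the same scale
fig, axes = plt.subplots(1, 3, sharey=True)
for ax, level in zip(axes, [1, 2, 3]):
    stats.probplot(df.loc[df["Diet"] == level, "weight_diff"], plot=ax)
    ax.set_title(f"Diet {level}")

# Shapiro-Wilk normality test, applied to each diet group separately
print(df.groupby("Diet")["weight_diff"].apply(stats.shapiro))

# Bartlett's test of equal variances across the three groups
groups = [g for _, g in df.groupby("Diet")["weight_diff"]]
print(stats.bartlett(*groups))

# the observed standard deviations, to eyeball homoscedasticity
print(df.groupby("Diet")["weight_diff"].std())
```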
And you see that the standard deviations are respectively 2.24, 2.5 and 2.40. It stands to reason that, if they are different at all, they are not really different, right? So it stands to reason that we would accept the assumption here as well.
Okay, so now that we have satisfied our assumptions of normality and homoscedasticity, we go on with our ANOVA test. Here I do the one-way ANOVA with the weight differences of the different categories, and then, if the p-value is below our threshold, I report the different means. Okay, so here we have our one-way ANOVA; it says that the p-value is 0.003, so we have a fairly good level of confidence toward rejecting H0 there, and then I report the mean for each diet: for diet 1, minus 3.3; for diet 2, minus 3.02; and for diet 3, minus 5.14.
Okay, so that is how we would do the ANOVA. If we want to go slightly further, I can actually do that and try the Tukey HSD test. The idea of how you use it: you give the data for the different groups, so the syntax is exactly the same as for your one-way test. Let's try it out and see what we get. I can just copy-paste here; of course I don't write scipy.stats because I have imported stats as stats, so I need to adapt that, and there I go. I get this not-very-useful result object, so let me see if I can just print it and have something a bit more readable. Okay, so what you get is this kind of large table, and what it does is compare the groups together in pairs and give you the associated p-value and a confidence interval around the differences. The way you have to interpret it is that the first group you gave will be group 0, the second will be group 1, and the third will be group 2. So don't be confused by the numbers there: they don't correspond to the diet numbers exactly, but to the order in which you have given the groups; that is kind of the big thing. So between 0 and 1 the p-value is quite large, so these would not be different. Between 0 and 2 we have a p-value of about 2%, so between that one and this one you have a significant, though not overwhelming, difference. And then between 1 and 2 you get a fairly small p-value, where you would accept that they are different, so between the means of minus 3.02 and minus 5.14. The rest, you can see, is just the converse: each test is done twice, because you have 1 versus 0 and then 0 versus 1, just with the confidence interval flipped. All right, so that is how you would perform your ANOVA, and also how you would do the post-hoc test to see which groups were actually different, provided of course that the ANOVA p-value is significant; if that p-value is not significant, I would not do my post-hoc test.
Okay, any questions? Everything good? Has anyone among you tried one of the alternatives to the ANOVA test, one of those which do not presume normality or anything else, maybe the Kruskal-Wallis test? Maybe we could say that this group 1 here is not normal and that we would prefer to go to a Kruskal-Wallis test. Anyone tried that?
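For reference, a compact sketch of the ANOVA and post-hoc steps above, continuing from the made-up df in the previous sketch (stats.tukey_hsd needs SciPy 1.8 or later):

```python
from scipy import stats

# df comes from the assumption-check sketch above (made-up data)
g1 = df.loc[df["Diet"] == 1, "weight_diff"]
g2 = df.loc[df["Diet"] == 2, "weight_diff"]
g3 = df.loc[df["Diet"] == 3, "weight_diff"]

# one-way ANOVA across the three diets
anova = stats.f_oneway(g1, g2, g3)
print(anova.pvalue)

if anova.pvalue < 0.05:
    # report the group means only when the ANOVA is significant
    print(df.groupby("Diet")["weight_diff"].mean())
    # Tukey HSD post-hoc test: same call structure, one argument per group.
    # In the printed table the groups are labelled 0, 1, 2 in the order they
    # were passed, not by their diet number.
    print(stats.tukey_hsd(g1, g2, g3))

# the non-parametric alternative discussed next has the same call structure
print(stats.kruskal(g1, g2, g3))
```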
That's fine, it was just to see if anyone tried things out. Again, for scipy.stats I encourage you to spend a bit of time in the documentation; it is very insightful, because they have really spent time explaining exactly where each test comes from, what the assumptions are, some implementation details, some bibliography too (all of these tests are the result of research, so you can go and read the actual references if you are interested), and also some usage examples. So for this Kruskal-Wallis function you should have the same structure. If we actually try it, just out of curiosity, because we are here to learn: that's stats.kruskal, and then we give the weight differences with the same structure, and... yep, okay. And what we get out of that, we can compare: we do get a p-value which is also significant and almost the same. We see that the p-value is slightly higher, because we make fewer assumptions, so of course we lose a little bit of statistical power, but we are still perfectly able to detect the difference between at least one of these groups and the rest using the non-parametric test. So I would say, if you have any doubt with respect to the assumptions, it is also perfectly okay to go with the non-parametric alternative. It is far from a bad test; it is less powerful, yes, but that may also mean you focus more on the differences which are more important. And in a context where you have, I don't know, thousands of metabolites or tens of thousands of genes to test, spending a bit more time on the more important differences and less time on the less important ones is not useless. All right, do we have questions? No? Okay.
So if there is no question, I've got two things before we switch to the next notebook. The first is just a small discussion that is maybe more for outside the course, as a reference. As you will have guessed by now, this whole course material is something I've created so that you could follow along during the course and play around with the examples, of course, but I would suggest you could also come back to it later on and use it as a reference. I try to include all of the relevant theory in the text, so that you may come back to it a few months from now, and so on and so forth. And that is especially true for this part: if you ever encounter a different type of distribution and wonder what it is about and what it can be used for, you can come back here to understand what it is and what its parameters are.
Some of the most well-known ones: of course you have the normal; by now I think you understand what a normal distribution is. But another one you might hear about a lot is the binomial distribution. We've talked about it a little bit: it models the number of successes during a number of trials, each trial having a given probability of success. Typically, you toss a coin and count the number of times you get heads, but it can also be used with a lot of success to model, for example, sensitivity and specificity: typically, when we compute a confidence interval around a sensitivity or a specificity, we use a binomial distribution (see the little example just below).
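As a tiny, made-up illustration of that last point: if a test correctly flags, say, 45 of 50 truly positive samples, the sensitivity estimate and a binomial ("exact") confidence interval around it could be obtained like this; the numbers are invented, only the idea matters.

```python
from scipy import stats

k, n = 45, 50                 # hypothetical: 45 true positives detected out of 50
res = stats.binomtest(k, n)
print(k / n)                                      # point estimate of the sensitivity: 0.9
print(res.proportion_ci(confidence_level=0.95))   # Clopper-Pearson (exact) interval
```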
And then you have the multinomial distribution, which is the same thing but with more than two possible outcomes: not just heads or tails, but more categories. If you have ever heard about logistic regression: the logistic regression model does model a binomial distribution, but with n equal to 1 most of the time, and you can sometimes have a multinomial logistic regression, which gives the probability of being in one of more than two categories. So these are quite widely used and it is good to know about them.
The Poisson distribution is also one you will hear about a lot. It is the distribution of the number of times something happens within a fixed amount of time or space, typically events which are relatively rare, for instance the number of plane crashes. That is not necessarily a huge assumption, you can relax it quite a lot, but each event should be somewhat independent, and you should each time measure the number of events that occur in the same fixed amount of time or space.
And those are some of the most important ones. Maybe, if you are interested in sequencing technologies, in particular RNA-seq or single-cell RNA-seq, you may have heard about the negative binomial distribution, which can model the number of failures until you reach a given number of successes in binomial trials. But most of the time, in the field of RNA-seq, it is used as a form of Poisson distribution with an extra layer of variance, so it is an overdispersed Poisson (there is a tiny numerical sketch of this at the end of this section), and it has been shown that this models fairly well the number of reads per gene or per transcript that we get in this sort of experiment. That is why a lot of software that looks at this kind of data uses a negative binomial distribution under the hood to model the data. All right, so those are just some of the, I would say, most important ones to have heard about at least a little bit. Okay, so that is the first point I wanted to talk to you about.
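To make the "overdispersed Poisson" point concrete, here is the small numerical sketch mentioned above. The mean and dispersion values are made up, and the conversion to scipy's (n, p) parameterisation simply encodes the relation var = mu + alpha * mu**2.

```python
from scipy import stats

mu, alpha = 100.0, 0.3        # hypothetical mean and dispersion of read counts
n = 1.0 / alpha               # scipy's nbinom uses an (n, p) parameterisation
p = n / (n + mu)

poisson = stats.poisson(mu)
negbin = stats.nbinom(n, p)

print(poisson.mean(), poisson.var())   # Poisson: the variance equals the mean
print(negbin.mean(), negbin.var())     # same mean, variance mu + alpha * mu**2
```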