So now we understand a bit more of the background. We have discussed null hypothesis significance testing, or NHST, which is the acronym you will see. The idea of null hypothesis significance testing is that we start from a null hypothesis and then test it against our data. The null hypothesis, or H0, is the claim that there is no effect or no difference in the population. We look at the distribution the test statistic would follow in that situation, and we can say that this tail area here is the p-value. It is the probability of obtaining this kind of test statistic if the null hypothesis is true. Nowadays this comes from the computer, so we don't need to calculate these areas ourselves, but it is still useful to understand what is actually happening.

The simplest test, perhaps, that uses null hypothesis significance testing is the t-test. The idea of a t-test is that it assumes that the estimates are normally distributed over repeated samples, which is the case, at least approximately, when the sample is large. So instead of just looking at how large the estimate is, we look at how far the estimate is from zero, and we compare that against Student's t-distribution.

The idea of a t-test, or this estimate divided by its standard error, is that we standardize the estimate. Remember, standardization means subtracting the mean of the estimates; here we take the mean to be the null hypothesis value, so we subtract zero, which doesn't really make a difference. Then we divide by the standard deviation, which in this case is estimated by the standard error. So the t-statistic tells us how far from zero the estimate is on a standardized metric. If it is more than two standard deviations from zero, we conclude that that kind of observation would be unlikely to occur by chance only, because 95% of the observations fall within plus or minus two standard deviations when we have a normally distributed statistic.

So we compare this area. In practice it often makes sense to compare both tails here, so we calculate this other area as well, the logic being that it would also be an important finding if the difference went in the other direction. This relates to what is referred to as one-tailed and two-tailed tests: which area do we compare? If we only compare one end of the normal distribution, that is called a one-tailed test. If we compare the areas that together make up 5%, so 2.5% in this tail and 2.5% in that tail, summing to 5%, then that is called a two-tailed test. Normally, when your statistical software gives you a p-value from a t-test or some other test that uses something that looks like a normal distribution, for example a z-test, it is two-tailed, so you compare both ends.
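To make the arithmetic concrete, here is a minimal sketch of the t-statistic calculation described above. The estimate, standard error, and degrees of freedom are made-up numbers for illustration, not values from the lecture. The sketch standardizes the estimate by subtracting the null value and dividing by the standard error, then compares the result against Student's t-distribution to get one- and two-tailed p-values.

```python
# Minimal sketch of a t-test on a single estimate (illustrative numbers only).
from scipy import stats

estimate = 0.42    # hypothetical coefficient estimate
std_error = 0.15   # hypothetical standard error of that estimate
null_value = 0.0   # value of the estimate under the null hypothesis H0
df = 98            # hypothetical degrees of freedom

# Standardize: subtract the null value (zero, so it changes nothing)
# and divide by the standard error.
t_stat = (estimate - null_value) / std_error

# Two-tailed p-value: the area in both tails of Student's t-distribution.
p_two_tail = 2 * stats.t.sf(abs(t_stat), df)
# A one-tailed p-value is exactly half of the two-tailed one.
p_one_tail = p_two_tail / 2

print(f"t = {t_stat:.2f}, two-tailed p = {p_two_tail:.4f}, one-tailed p = {p_one_tail:.4f}")
```

With these particular numbers the t-statistic is 2.8, so the estimate sits almost three standard errors from zero and the two-tailed p-value comes out well below 0.05.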
And it's considered cheating to use the one-tailed test, because what the one-tailed test basically does is give you a p-value that is exactly half of the p-value of the two-tailed test. Because there are two tails, the probability of the observation falling in either tail is twice the probability of it falling in just one of them, so the one-tailed probability is half of what the two-tailed probability would be. The problem with one-tailed tests is that the standard is to use two tails, and when we see a p-value in a research paper, we assume it comes from a two-tailed test. Sometimes, if the p-value is 0.06 and a researcher wants it to be less than 0.05, they switch to a one-tailed test, which allows them to cut the p-value in half, and then present it as if it came from a two-tailed test. That misleads readers, and it's unethical. There are basically no good reasons ever to use one-tailed tests, because the two-tailed test is the more commonly accepted convention, and if someone wants the one-tailed result instead, they can simply divide your p-values by two. That's the only difference.

P-values are very commonly used in research papers. You see them, for example, in this excerpt from Heckmann's paper: there are p-values behind the statistics, so you see a regression estimate, then a p-value of less than 0.01, which is statistically significant. You also see "ns", which means non-significant, or a p-value greater than 0.05. For some reason we have decided that the 5% level is the gold standard: less than 5% is a good thing, more than 5% is a bad thing. That is an arbitrary threshold. A paper can easily contain hundreds of p-values, so they are very commonly used in research papers.

The p-value relates to two different kinds of errors. In statistical analysis we have two things: the population and the sample. We want to make an inference that something exists in the population using the sample data. So we calculate a test statistic, and if the test statistic rejects the null hypothesis in the sample, we conclude that the null doesn't hold in the population. But that's not actually always the case. When you get a small p-value, it is also possible that it is a false positive finding. p < 0.05 means that if there were no effect, the probability of getting the kind of result you just got would be 5%. So in about 1 out of 20 samples from the population you would get a false positive even when the null hypothesis holds. It's possible that the result is a false positive, but it's also possible that it's a true positive; the problem is that we don't know which. We have evidence that it would be unlikely to get such an effect estimate by chance only, so we conclude that maybe it wasn't chance only, but we can't know for sure. That is the type 1 error.

Then we have the type 2 error, which is a false negative. Let's say that the null hypothesis does not hold in the population; say women-led companies really are more profitable than men-led companies, but for some reason our study couldn't find the difference. That would be a false negative. There is also the case where we say that we can't reject the null hypothesis, we can't reject the claim that there is no difference, and there really is no difference; that is also a valid finding. So we want to be sure that we have either true positives or true negatives.
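As a rough illustration of what a false positive means here, the following simulation sketch (the sample sizes and random seed are assumptions for illustration, not values from the lecture) draws two groups from the same population, so the null hypothesis is true by construction, and counts how often a two-tailed t-test still reports p < 0.05. A valid test should do so for roughly 1 sample in 20.

```python
# Sketch: false positive rate of a two-tailed t-test when the null is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_obs, alpha = 10_000, 50, 0.05

false_positives = 0
for _ in range(n_samples):
    # Both groups come from the same population, so any "difference" is noise.
    a = rng.normal(loc=0.0, scale=1.0, size=n_obs)
    b = rng.normal(loc=0.0, scale=1.0, size=n_obs)
    _, p = stats.ttest_ind(a, b)   # two-tailed two-sample t-test
    false_positives += p < alpha

print(f"False positive rate: {false_positives / n_samples:.3f}")  # about 0.05
```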
For the probability of false positives under the null hypothesis, we consider 5% or less acceptable. If we say that a p-value is valid, it should behave as expected. It is fine if the p-value falls below 0.05 only, say, 3% of the time when the null hypothesis holds in the population; that means we have a conservative test, and that's okay, because if we make errors we want to err on the side of being too cautious. But if our p-value fell below 0.05, let's say, 7% of the time, then we would say the test is too liberal and the p-value is not valid for that particular test, because it doesn't follow the reference distribution. It's important that when the null hypothesis holds, our p-values don't indicate support too often.

Then we have another concept called statistical power. The 5% above is a false positive rate; statistical power means that, once we have a test whose false positive rate doesn't exceed the nominal level, we want the test to identify an effect, when one actually exists, as frequently as possible. Typically we are okay with 80%, although there are studies with far less power. 80% power means that when there is an effect in the population, then in four out of five studies we would actually detect it.

The question is which error is more important. We are not okay with a false positive rate of more than 5%, but we are okay with a 20% false negative rate, because 80% power means 20% false negatives. The reason we are so much more worried about false positives is that positive effects typically have some kind of policy implications. If we find out that a medicine doesn't do us any good, then no one is going to take the medicine and we continue research. If we find out that the medicine helps people, then people will start taking it; if that is a false positive finding, people will take a medicine that is useless or could even be harmful for them. So false positives have policy implications much more often than false negatives, and that's the reason we want to avoid them. We have agreed that a 5% rate is okay, hence p < 0.05, but not more. Of course, in some scenarios, if you are dealing with a really life-critical question, you could use a p-value threshold of 0.001, for example. So 0.05 is not the one correct value; it's just the convention in many fields. Some other fields use smaller values, and you can use smaller values in an individual study as well.
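To show what 80% power looks like in the same simulation style, the sketch below counts how often a real effect is detected across repeated studies. The effect size of 0.5 standard deviations and the 64 observations per group are illustrative assumptions, chosen only because they happen to give roughly 80% power for a two-tailed t-test at the 5% level.

```python
# Sketch: statistical power as the share of repeated studies that detect a real effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_obs, alpha, true_effect = 10_000, 64, 0.05, 0.5

detections = 0
for _ in range(n_studies):
    control = rng.normal(loc=0.0, scale=1.0, size=n_obs)
    treated = rng.normal(loc=true_effect, scale=1.0, size=n_obs)  # the effect really exists
    _, p = stats.ttest_ind(treated, control)
    detections += p < alpha

print(f"Estimated power: {detections / n_studies:.2f}")  # roughly 0.80 with these numbers
```

With these settings the estimate should land close to 0.80, meaning about four out of five such studies would detect the effect, which matches the 80% benchmark discussed above.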