Our example comes from the Talouselämä 500 magazine that I covered in a previous video. This is a Finnish business magazine that follows the 500 largest Finnish companies, and in one particular year, 2005, there were big headlines in Finnish newspapers because on this list the return on assets of women-led companies was 4.7 percentage points higher than in men-led companies. So our question now is: we have an observed return on assets difference of 4.7 percentage points, and is it a big deal? Does it matter? What do the data tell us, and what kind of inferences can we make from this sample?

4.7 percentage points is a pretty big difference, so what does it mean? What the data tell us directly is that at one point in time, in one sample, the firms led by women were more profitable. That's what the data tell us, and now the question is: can we generalize? Can we say something beyond that particular sample? Can we say that this generalizes to other years, or is it just one year? If it's just one year in which women-led companies happened to be more profitable, and it wouldn't generalize to other years, then it's not a big deal. If it generalizes to other years, then it probably is a big deal. The second question is: does it generalize to other firms? Is it just these 500 companies in which the women-led companies are more profitable? Or does it generalize to the 1,000 largest companies, or all companies in Finland, or all companies in all countries? How widely can we generalize? That is the first question we need to ask when we start discussing the generalizability of a sample statistic. So this is a sample statistic.
It's a number calculated from a sample. Does it generalize to the population? We have to ask: could this be by chance only? Is it plausible that, because of sampling variation, the companies led by women just happened to have a better year than the companies led by men? Could it be just a random occurrence, or is it evidence of a systematic difference? We have to ask two important questions to answer whether it could be by chance only.

The first is: is 4.7 percentage points a large difference? Large differences rarely occur by chance only; small differences occur by chance only frequently. When we calculate something from a sample, the sample estimate is hardly ever exactly the population value. It's somewhere close. So is the estimate far enough from zero that this kind of result would be improbable by chance only? Or is it close enough to the no-difference value that it makes no real difference? Then we have to ask: is it a large effect? The mean return on assets is about 10% in this sample, and a 4.7 percentage point difference would mean that if the men-led companies have, let's say, an 8% ROA, then the women-led companies have a 13% ROA. So they would be more than 50% more profitable than the men-led companies. That's a big thing. That's a big difference.

The second important question relates to sample size. We know that the full sample is 500 companies, but that's not the full story. We also have to consider how many women-led companies there are. If there were just five, or if there were 250, those two situations would lead to very different conclusions. It happens that there were 22 women in the sample, so that's a fairly small number of observations. Now comes the question of statistical inference: is this 4.7 percentage point difference in return on assets large enough that we can conclude there probably is a systematic difference, and that it is not due to sampling fluctuations only?
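The back-of-the-envelope effect size above can be checked with simple arithmetic. The 8% and 13% figures are the rounded stand-ins used in the example, not exact values from the magazine:

```python
# Rough effect-size check using the transcript's round numbers:
# overall mean ROA is about 10%, with a gap of roughly 5 points.
roa_men = 8.0     # assumed men-led mean ROA (%), a rounded stand-in
roa_women = 13.0  # assumed women-led mean ROA (%), a rounded stand-in

# Relative gap: how much more profitable women-led firms would be.
relative_gap = (roa_women - roa_men) / roa_men
print(relative_gap)  # 0.625, i.e. over 60% more profitable in relative terms
```

This is why a 4.7 percentage point gap around a 10% mean is a substantively large effect, not just a statistically detectable one.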
We have to ask: what would be the probability of getting this kind of difference by chance only? You watched the video by John Rauser, so what would John Rauser do in this scenario? We have 500 companies, and we want to know whether the difference between the women-led and the men-led companies could occur by chance only. One strategy for answering that question is to do a permutation analysis, or permutation test, which is a fairly intuitive way of understanding statistical testing.

What we do is take the list of the largest companies. I got the data from a database, so this may not be the exact same 500 companies, but that doesn't matter for the example. We choose 22 companies at random, compare them against the remaining 478, and calculate the difference: the mean of the 22 companies against the mean of the 478 companies. We repeat this 10,000 times and look at the differences. What is the probability of getting at least a 4.7 percentage point difference in these comparisons?

Let's take a look at the results. I did the analysis; here are the first 200 comparisons. We can see that quite often, when we take 22 companies at random and compare them against the 478 remaining companies, the difference is very close to zero. Sometimes we get a negative difference. There is no systematic difference, and there cannot be, because I chose the companies randomly, and two random samples are always comparable. But we do get differences larger than 4.7 points in nine out of the 200 comparisons using this permutation testing strategy. So the probability of getting a difference of 4.7 percentage points or larger in this test is 0.045 for the first 200 replications. Is that enough evidence to conclude that the 4.7 percentage point difference is unlikely to be by chance only? Let's take a look at the bigger picture.
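The permutation procedure described above can be sketched in a few lines. The ROA figures here are simulated with no built-in gender effect, since the actual Talouselämä 500 data are not reproduced in this lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: ROA (%) for 500 firms, centered near 10%.
# Simulated stand-in, not the actual Talouselämä 500 data.
roa = rng.normal(loc=10.0, scale=15.0, size=500)

n_women = 22          # number of women-led firms in the sample
observed_diff = 4.7   # the difference reported in the magazine

# Permutation test: repeatedly relabel 22 firms at random as
# "women-led" and record the mean difference against the other 478.
diffs = []
for _ in range(10_000):
    perm = rng.permutation(roa)
    women, men = perm[:n_women], perm[n_women:]
    diffs.append(women.mean() - men.mean())
diffs = np.array(diffs)

# One-sided p-value: share of random splits whose difference is
# at least as large as the observed 4.7 percentage points.
p_value = (diffs >= observed_diff).mean()
```

Because the labels are assigned at random, the differences pile up around zero, and the p-value is simply the fraction of the 10,000 relabelings that beat the observed gap.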
So we have the distribution of the estimates over 10,000 repeated samples. Sometimes we get a large negative estimate, sometimes a large positive estimate, and typically we get an estimate close to zero, because there should not be any difference: we are taking a random sample from the population and comparing it against another random sample, and because of randomization there shouldn't be any systematic difference. The probability of getting a difference of 4.7 percentage points or higher is 0.0347 over the 10,000 replications. This probability is called the p-value: the probability of observing an effect equally large or larger under the assumption that there is no effect.

We don't actually have to do the permutation testing or the random sampling, because this shape looks familiar: it's the normal distribution. The differences are normally distributed, and many things in statistics follow a normal distribution. So instead of approximating this distribution by taking random samples, we only need to find out what the right normal distribution is, where to draw it, and then compare against that normal distribution.

Here's the normal distribution overlaid on the observed distribution of estimates. The mean of the normal distribution is 0; that's our base case of no difference. For the normal distribution we also need to know the dispersion, the standard deviation, and this is estimated using the standard error, which the statistical software will print out for us. So we draw a normal distribution with its mean at 0, the null hypothesis value of no difference, and with its dispersion quantified by the standard error. Then we compare: what is the size of this tail area here? That is, how probable is it to get an estimate of 4.7 percentage points or higher, given the null hypothesis of no effect?
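The normal-approximation step can be sketched like this. The standard deviation and group sizes below are illustrative stand-ins for what statistical software would report, not the actual magazine figures:

```python
import math

# Hypothetical summary numbers (illustrative, not the actual data):
sd = 15.0                 # assumed standard deviation of ROA across firms
n_women, n_men = 22, 478  # group sizes from the example
observed_diff = 4.7       # observed difference in mean ROA (%-points)

# Standard error of the difference between two group means.
se = sd * math.sqrt(1 / n_women + 1 / n_men)

# One-sided p-value: tail area of a normal distribution with
# mean 0 (the null hypothesis of no difference) and spread se.
z = observed_diff / se
p_value = 0.5 * math.erfc(z / math.sqrt(2))
```

Note how the standard error shrinks as the smaller group grows: with only 22 women-led firms, the `1 / n_women` term dominates, which is why the small group size matters so much more than the total of 500.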
It is about 0.04, which is less than 0.05, the conventional criterion for statistical significance. So could it be by chance only? The p-value is less than 0.05. If this were a research paper, we would conclude that there is a statistically significant difference, we would write a paper, and we would hopefully get it published somewhere, because we have a statistically significant result. Of course, we have to consider that in this particular scenario there are probably reporters who want to say something positive about women. So they could do multiple comparisons: comparisons of growth, profitability, and other important statistics. And if they happen to find one statistic that makes women-led companies look better, they write a newspaper article about it. P-values work well when you do just one comparison, not when you do multiple comparisons. Because of the nature of the test, we will eventually get large effects by chance only. If we repeat this study every year, checking profitability, liquidity, and growth over 10 years, we have 30 comparisons, and one of those comparisons will almost certainly give us p less than 0.05 by chance only. So p less than 0.05 is not very strong evidence. It is some evidence if it is just one comparison, but if we do multiple comparisons, we can do this kind of data mining and always get something with p less than 0.05. If we had p less than 0.01, then I would buy the claim that there is actually an effect in the population.
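The "30 comparisons" point can be made concrete with a quick calculation. Treating the tests as independent is a simplifying assumption, but it shows the scale of the problem:

```python
# If each test has a 5% false-positive rate and the 30 tests are
# independent (a simplifying assumption), the chance that at least
# one comes out "significant" by chance alone is:
p_single = 0.05
n_tests = 30  # e.g. 3 metrics checked over 10 years, as in the example
p_at_least_one = 1 - (1 - p_single) ** n_tests
print(p_at_least_one)  # roughly 0.79
```

So with 30 looks at the data, a lone p < 0.05 is close to what we would expect even if no real effect exists anywhere, which is why a single significant result from many comparisons is weak evidence.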