In research we're often interested in a large number of people or a large number of organizations. For example, in political polling it's usually interesting to know how popular a political party is. If we consider popularity at the national level, then measuring everybody's opinion would involve calling millions of people, and that's in most cases impractical. Instead, what we do is take a smaller number of people, called a sample. For example, we call 300 or 1,000 people, ask their opinions about political parties, and use that sample to calculate an estimate of the party's popularity in the population. If the sample is well chosen and large enough, the popularity computed from the sample gets very close to the actual population popularity. Another thing we do in polling is tell the readers of our poll how certain we are about the result. To do so, we present a margin of error: let's say the party's popularity is 21% plus or minus 1 percentage point. That degree of uncertainty is quantified by the standard error. I'm going to go through these four concepts next.

So let's take an example. Assume we have a university with, let's say, 10,000 students and staff, and we want to estimate the mean height of the people affiliated with the university, students and staff included. There are different ways of doing that, but first we have to understand the basic concepts. Here our population is everyone affiliated with the university. The actual list of people who have been admitted, together with the list of people who are employed, available from the university administration, forms our sampling frame: the operational definition, the concrete list of people that we think belong to our population. Then we take a random sample, and from that random sample we hope to learn something about the population. We could of course take other kinds of samples as well, but for now we'll just talk about random samples because that simplifies things a lot.

So let's go to the example. We have different strategies for estimating the mean height of people at the university. One obvious strategy is to take a small sample of people, measure everybody's height, take the sum, and divide it by the number of people, which gives us the average height of the sample, or the sample mean of the height. Here is some data: a hypothetical university where the population mean height is 169.96 cm, and five samples, each with a sample size of 10 people. We have their measured heights, and we can see that some people are shorter than average, some are a bit taller than average, and some are very tall. In the first sample, the sample mean is 161, so we underestimate the population value by about 8 cm. The second sample gives 169.56, which is very close to the actual population value. The third random sample gives us 173, which overestimates the population value. Then we have 163, which underestimates again, and 168, which is close to the true population mean.

Now the question is, why do these values differ? Why do we get a different estimate from each sample? That is because in a random sample it sometimes happens that tall people get selected more often than short people, and sometimes we randomly select more short people than tall people. So the estimate varies from sample to sample.
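To make the sampling variability concrete, here is a minimal simulation sketch in Python. The population is made up for illustration (10,000 normally distributed heights with a mean of 170 cm and a standard deviation of 10 cm; the seed and these parameters are my assumptions, not the lecture's actual data), but the pattern is the same: each random sample of 10 people produces a different sample mean.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population of 10,000 heights in cm (assumed values,
# not the lecture's data).
population = rng.normal(loc=170, scale=10, size=10_000)
print(f"population mean: {population.mean():.2f} cm")

# Draw five independent random samples of 10 people each and compute
# the sample mean of each sample; every draw gives a different estimate.
for i in range(5):
    sample = rng.choice(population, size=10, replace=False)
    print(f"sample {i + 1}: mean = {sample.mean():.2f} cm")
```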
This sample-to-sample variation is called the sampling variance of an estimator. An estimator here means any strategy that we can apply to data to calculate an estimate. So these estimates vary from sample to sample. Now two questions arise. First, how do we make the estimates more precise? Can we improve them? Because if the population value is about 169 and our estimates vary between 161 and 173, that's quite imprecise. Second, how do we quantify the uncertainty? If we just say that we estimate the mean height to be 161, that's quite an irresponsible thing to do, because we are not telling our audience that our sample size is so small that the estimates are very imprecise. Recall my example from political polling: when you see a poll number, there is always a margin of error attached to that point estimate of popularity.

Let's take a look at the effect of sample size. One obvious strategy for making estimates calculated from a sample better is to increase the sample size. Here is the distribution of 10,000 random samples from our population using a sample size of 10: typically we get estimates that are close to the correct population value, but sometimes we get estimates that are way too small and sometimes estimates that are way too large. Once we increase the sample size to 50, the red line here, we can see that the estimates from repeated samples are now distributed within about plus or minus 7 cm of the population value, so the estimates are more precise than what we got from 10 observations. If we further increase the sample size to 200, we get within about plus or minus 3 cm of the population mean, so our precision increases. And if we have the full population, then we get the exact population value. So our estimates typically improve as the sample size increases; that is referred to as the consistency property of an estimator, which I'll talk about in the next slide.

The other thing is that we have to quantify the uncertainty. Uncertainty quantification refers to the question: if we were to repeat the study over and over again, how much would the estimates vary from sample to sample? In other words, we want to quantify the sampling variance of the estimate, how widely the different estimates are dispersed. Remember that we have two statistics that quantify dispersion: the standard deviation and the variance. For estimates we are typically interested in the standard deviation, because it is in the same metric as the estimate itself. So if the estimate is 160 cm, we can say that the standard error is 5 cm. The standard error is an estimate of what the standard deviation of the estimates would be over repeated samples from the same population. Of course, we would ideally want to calculate the actual standard deviation of those 10,000 replications, but consider political polling: if you were asked to provide the standard deviation of the same poll repeated 10,000 times, you would actually have to do the 10,000 replications to calculate that standard deviation, which is not a practical thing to do. Therefore we use the standard error, which is an estimate of that standard deviation. So in the same way that the sample mean is an estimate of the population mean, the standard error is an estimate of the standard deviation of the sample mean over repeated samples. How the standard error is calculated is not relevant at this point.
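Even though the formula is not the point here, a small simulation makes the idea tangible. The sketch below (same made-up normal population as before; the sample sizes 10, 50, and 200 follow the lecture, everything else is my assumption) shows how the dispersion of the sample mean over repeated samples shrinks as the sample size grows, and compares it with the standard error computed from a single sample using the textbook formula for a simple random sample, the sample standard deviation divided by the square root of the sample size.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
population = rng.normal(loc=170, scale=10, size=10_000)

# Dispersion of the sample mean over repeated samples, for the three
# sample sizes mentioned in the lecture.
for n in (10, 50, 200):
    means = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(10_000)]
    print(f"n = {n:3d}: sd of repeated sample means = {np.std(means):.2f} cm")

# In practice we only have one sample, so we estimate that dispersion
# with the standard error (here sd / sqrt(n), the usual estimate for a
# simple random sample).
sample = rng.choice(population, size=50, replace=False)
print(f"standard error from one sample of 50: "
      f"{sample.std(ddof=1) / np.sqrt(50):.2f} cm")
```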
You just have to understand that it quantifies how dispersed the results of the same study would be if it were repeated over independent random samples.

Let's get back to our task. So far we have only discussed the sample mean. Taking the mean of a sample is an obvious strategy if you want to estimate the population mean, but it's not the only strategy. If you take a sample of, let's say, 30 people in a class and measure everybody's height, that takes some time, and sometimes the time or effort needed to do the calculation is an issue. So we could, for example, just take one person from the class and measure their height. If we get 160 centimeters, that's a ballpark estimate. It's not very precise, but it's an estimate nevertheless; it's valid in some sense and it's easy to calculate. The downside, of course, is that we would be leaving out most of the people from our sample of 30, so it's not a good strategy. Another quick strategy for estimating the height is to let people self-organize into a line: we tell the shortest person to go to the back of the class and the tallest person to go to the front, and everyone else lines up in between, ordered by height. People can self-organize that way pretty quickly. Then we just measure the height of the person in the middle. That's the sample median, and it's an okay strategy for estimating the population mean under certain conditions.

So there are different ways of calculating an estimate of the population mean: we could use the sample mean, we could use the height of the first person we see in the class, or we could use the median of the people in the class. Which strategy should we use? In this case, the sample mean is the best. But to make an informed choice of which one is preferable, we first have to define what "best" means. Every time we say that something is the best, we have some kind of criterion: the best ice hockey team is the one that won the most matches, the best runner is the one with the smallest time, and the best student in the class is the one with the highest grade point average. So when we say that something is the best, we have to have criteria, and that means we have to talk about the different properties that these estimation strategies can have when we decide which one is the best.

So estimators can have certain properties. An estimator, again, is any strategy or calculation that you apply to your sample to get one value that serves as an estimate of the population value. One minimum quality that every useful estimator must have is consistency. Consistency means that if we increase the sample size, our estimates get better; the sample mean is a consistent estimator because it improves with sample size. Consistency also requires that if we had the full population and applied our calculation strategy to it, we would get the correct population value. So consistency guarantees that a study gets better as the sample size increases. Of course, in reality we usually can't study full populations because of cost, so we have to rely on samples, and therefore there are other things to consider besides consistency.
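A quick way to see the difference between a consistent and an inconsistent estimator is to watch how the spread of repeated estimates behaves as the sample size grows. In the sketch below (same assumed population as in the earlier snippets), the sample mean tightens around the population value as the sample grows, while the "height of the first person" estimator does not improve at all.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
population = rng.normal(loc=170, scale=10, size=10_000)

def spread(estimator, n, reps=2_000):
    """Standard deviation of an estimator's value over repeated samples of size n."""
    estimates = [estimator(rng.choice(population, size=n, replace=False))
                 for _ in range(reps)]
    return np.std(estimates)

for n in (10, 100, 1_000):
    sd_mean = spread(np.mean, n)          # shrinks as n grows: consistent
    sd_first = spread(lambda s: s[0], n)  # stays the same: inconsistent
    print(f"n = {n:5d}: sd of sample mean = {sd_mean:5.2f} cm, "
          f"sd of first observation = {sd_first:5.2f} cm")
```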
The second important property is unbiasedness. If an estimator is unbiased, it means that it is free of systematic error. For example, a biased estimate of the mean height would result if our measuring tape were actually shorter than what its scale claims: the numbers read off the tape would be systematically incorrect, so that would be a biased estimator. The definition of unbiasedness is that if we repeat a study many, many times, then even though an individual study can be far off, on average those studies give the correct result. That matters because of how science works. The idea of science and research is that we accumulate knowledge: studies are added to the body of knowledge, and at some point someone looks at 100 studies and asks, what is the average effect of one thing on another? If those studies are unbiased, free of systematic error, then the average over multiple repeated studies of the same issue gives a pretty good estimate of the population value.

In reality we often have to work with estimators that are slightly biased but still consistent. Sometimes we have multiple unbiased estimators and we have to make a choice: which one do we choose? The sample median and the sample mean are both unbiased in this particular scenario, so which one do we use? Which one is the best? For that we have to consider efficiency. Efficiency is a property that compares two or more estimation strategies: the one whose estimates vary the least over repeated samples, that is, the most precise one, whose individual estimates are expected to be closest to the population value, is called the efficient estimator, and the property is called efficiency. Then, finally, we have normality: it is useful for statistical inference if the estimates are normally distributed over repeated samples, or at least follow some other known distribution. Why that is important will be discussed a bit later.

Okay, so this is a bit of statistical theory, concepts and terms that you may not encounter in empirical articles. So why is knowing this important, or is it just nice-to-know stuff? It is important for two reasons. One reason is that if you study a good book about statistical analysis or research methods, you will see these terms, and unless you know what they refer to, it's difficult to understand what you're reading. The second reason is that in regression analysis, which is a pretty basic tool that we'll talk about later, the choice of estimation approach is obvious in certain scenarios, but in other scenarios you have competing options to choose from, and there are trade-offs: you could use an estimator that is very inefficient but unbiased, or one that is slightly biased but efficient. Which one do you choose? You have to understand these concepts to make such choices.

Let's take a look at an example. Here is the height example again, and we have six estimation strategies. We have the sample mean, which we discussed. We have the sample median, which is an okay strategy: take the person in the middle and measure their height. We have the height of the first observation, which, if you're really in a hurry, is a fast way of estimating things. Then we have three completely made-up strategies. One is the absolute value of the sample mean around the population value; I'm just using that to get that particular shape. Another is the sample mean plus 100 divided by the sample size, which is an unreasonable strategy as well.
And the last one is just a random guess between 140 and 200.

So, consistency: do these estimators get better as the sample size increases? For the sample mean, obviously yes, so it is consistent; we can see that the estimates get closer and closer to the population value as the sample size increases. The absolute value of the sample mean around the population value is not a very good estimator, because its estimates are systematically too large, but if you increase the sample size they do get better: still pretty bad, still systematically too large, but better, so it is consistent. The first observation is inconsistent, because the sample size has no effect; consistency is about whether things improve as the sample size increases, and the number of observations in our sample doesn't influence the height of the first person in the class. The sample median is consistent. The sample mean plus 100 divided by the sample size is consistent if the population is infinitely large: when the sample size can grow very, very large, the 100 divided by the sample size goes to zero and the values approach the actual population value. Then we have the guess between 140 and 200, which is inconsistent because it doesn't depend on the sample size. So we have four consistent estimators and two inconsistent ones.

The next property was unbiasedness. The sample mean is unbiased: the estimates are spread evenly around the population value, so if we take their mean, regardless of the sample size, we get the correct population value. The absolute value of the sample mean is biased: its estimates are systematically too large, and that is the definition of biasedness, systematic error in the estimates. The first observation is actually unbiased: even though it is a really bad way of estimating things, because it doesn't improve with sample size, on average the repeated estimates are centered on the population value. In reality it is very difficult to come up with a scenario where an estimator is unbiased but inconsistent and still useful, because typically an unbiased estimator is also consistent. The sample median is unbiased, correct on average. The sample mean plus 100 divided by the sample size is biased, systematically too large, and this one is only slightly biased, so you can't really see it by eye.
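These bias judgments are easy to check by simulation. Below is a sketch that applies all six strategies to repeated samples from the same assumed population as before (my reading of the made-up strategies: the "absolute value" estimator is taken as the population mean plus the absolute deviation of the sample mean from it, and the guess is uniform between 140 and 200 cm). The bias column is the average deviation from the population mean; the standard deviation column shows how dispersed the repeated estimates are, which is what efficiency, discussed next, compares.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
population = rng.normal(loc=170, scale=10, size=10_000)
mu = population.mean()
n = 50

# The six strategies; the last three are deliberately bad. The "absolute
# value" estimator peeks at the true mean, which you could never do in
# practice -- it exists only to reproduce the shape from the slides.
estimators = {
    "sample mean":         lambda s: s.mean(),
    "sample median":       lambda s: np.median(s),
    "first observation":   lambda s: s[0],
    "abs. value of mean":  lambda s: mu + abs(s.mean() - mu),
    "mean + 100/n":        lambda s: s.mean() + 100 / len(s),
    "guess in [140, 200]": lambda s: rng.uniform(140, 200),
}

for name, est in estimators.items():
    estimates = np.array([est(rng.choice(population, size=n, replace=False))
                          for _ in range(2_000)])
    bias = estimates.mean() - mu   # systematic error (unbiasedness)
    sd = estimates.std()           # dispersion of repeated estimates
    print(f"{name:>20s}: bias = {bias:6.2f} cm, sd = {sd:5.2f} cm")
```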
Then we have efficiency. The sample mean is efficient, and we compare its efficiency against the other unbiased estimators. The sample mean is certainly more precise than taking the first observation; that difference is very clear, because the first-observation estimates are spread widely while the sample means are much closer to the population value. The sample mean is also more precise than the sample median. You can't see it with the plain eye, because the difference is very small, but it has been proven that in this particular case, and generally, the mean is more efficient. The absolute value of the sample mean is actually slightly more precise than the sample mean, so those estimates are less dispersed, but because it is a biased estimator, considering its efficiency doesn't make much sense: it is efficient, but it doesn't really count. The first observation is inefficient, because its estimates are spread widely compared to the others. The sample median is inefficient, because the sample mean is better. The sample mean plus 100 divided by n is as efficient as the sample mean, because the dispersion is the same, but it is biased, so comparing its efficiency doesn't really make sense either. If we compared the efficiency of these two biased estimators with each other, this one would be inefficient, but again, comparing the efficiency of biased estimators doesn't make much sense. And the random guess is inefficient, because its estimates are spread out quite widely.
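The mean-versus-median comparison mentioned above is worth a closer look, because the difference really is invisible in a plot. For a normal population, the variance of the sample median over repeated samples approaches roughly pi/2, about 1.57 times the variance of the sample mean, which is why the mean is called the more efficient of the two. A short simulation (same assumed population and a sample size of 50, both my choices) makes the ratio visible:

```python
import numpy as np

rng = np.random.default_rng(seed=6)
population = rng.normal(loc=170, scale=10, size=10_000)
n = 50

means, medians = [], []
for _ in range(10_000):
    sample = rng.choice(population, size=n, replace=False)
    means.append(sample.mean())
    medians.append(np.median(sample))

# Relative efficiency: for a normal population this ratio approaches
# pi/2 (about 1.57) as n grows, i.e. the median needs roughly 57% more
# observations to match the precision of the mean.
print(f"var(median) / var(mean) = {np.var(medians) / np.var(means):.2f}")
```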