Hello everyone, this week we're going to be talking about confidence intervals. To give you an example right out of the gate, suppose you're trying to determine the mean rent of a two-bedroom apartment in your town. You might look in the classified section of the newspaper, write down several rent values, and average them together. You would then have what we call a point estimate of the true mean. What we mean by that is that the mean you've calculated does not reflect the actual mean of all apartments in your town; you've only taken a sample of the rent values. So the average you've calculated will be a little bit off, or sometimes a lot off, from the true mean of all the rent values in your city combined. There is a true mean, and the mean we've calculated will be slightly off. So we want to know: how far off are we? How close are we to the true mean? We use these calculations in pretty much any study where we run a survey or take a sample of a population. We want to not only calculate, for example, the mean of the sample, but also figure out how confident we are that our value reflects the actual population. That's really what we're trying to find out: how confident are we that we are correct? We're talking about confidence intervals, and if our confidence interval is very big, that means there's a huge range where the true value might lie, which means we're not very precise, which means we can't be very confident in our guess.
If our confidence interval is very small, then we have a narrower range in which we can say our value falls, and our estimate, hopefully, is closer to the truth. So today we're talking about confidence intervals, and we can use them to quantify how close our point estimates are to the true mean, because we're always calculating point estimates whenever we can't sample every object in a population. How close are we to the true mean? What's the likelihood that we fall a certain distance from it? It's very similar to using distributions like we did before. We use sample data, a small portion of the population we're looking into, to make generalizations about an unknown population. We want to know some attribute of that population. We call this inferential statistics. So far we've been doing a lot of descriptive statistics, just describing the data that we have; now we're actually trying to assign attributes to a population we don't fully know. We have to infer what those attributes are. So the sample data help us estimate a population parameter. The point estimate is most likely not the exact value of the population parameter, but close to it. Just like we said before: if we calculate the mean of our sample, it's not going to be the exact mean of the entire population, it's just the mean of our sample. But it should be close. If our sample is large enough, it should reflect the mean of the overall population quite accurately; if our sample size is small, it will possibly be further away from the true mean.
So the point estimate is most likely not the exact value of the population parameter, but very close to it. After calculating point estimates, we construct interval estimates called confidence intervals. Once we've analyzed our sample and made some estimate, for example the mean of the sample, we want to say how close this sample mean is to the true population value. We construct an interval estimate, a confidence interval, to say: we are confident that the true parameter lies within a certain range of our point estimate. First, some terminology. The margin of error, which defines the width of our confidence interval, is an amount, usually small, that's allowed for in case of miscalculation or change of circumstances. So starting from the value we've calculated, our point estimate, we allow a little bit of error to the left and to the right on the graph, to cover miscalculation, changed circumstances, or unknowns in our measurements that we haven't thought of. We allow a little bit of error because our calculation is not going to be perfect; there's going to be a little bit of error pretty much any time you measure something. The confidence level is the probability that the value of a parameter falls within a specified range of values. This should sound familiar, because it's essentially what we were doing with the normal distribution: what is the probability that a value falls within a certain range of the mean?
We can use the normal distribution to calculate the probability that values will fall within a specified range. Our confidence level is basically how confident we are that the true value will be within a certain range. For real studies, the confidence level is normally greater than 90%, usually something like 90%, 95%, or 99%. Most of the studies I've seen use 95%, but it really depends on your field and what you're measuring. The empirical rule says that in approximately 95% of samples, the sample mean will be within two standard deviations of the population mean. This should also look familiar; it's essentially the normal distribution again. So here we go: calculating confidence intervals. First, the case where the population standard deviation is known, which is rare in real studies, but if you do happen to know it, we take the point estimate, the value we've calculated that is hopefully close to the true parameter, subtract the error bound, and add the error bound, and that gives us the interval. This should look familiar, because we did something like this when we took the mean of a sample and counted how many standard deviations away a value was. Here we have our point estimate, we subtract and add the error bound, and we get a range. In the example on the screen, x-bar equals 10, so that's our mean, and notice the distribution here is normal, or looks normal at least. And we have an error bound, calculated from the standard deviation.
Then x-bar minus the error bound equals 5, because the error bound is 5: 10 minus 5 is 5, and 10 plus 5 is 15. So that's our confidence interval when the population standard deviation is known. In this case our confidence level is 90%, written as a decimal, 0.90. Notice that if our confidence level were, say, 95%, the interval would actually be a little bit wider. What we're saying here is that we are 90% confident that the true parameter will fall somewhere within this range. We've calculated a point estimate of 10, and we can say, with 90% confidence, that the true value falls within the interval. Depending on what exactly we're measuring, plus or minus 5 may be really horrible or very good. So that's calculating a confidence interval where the population standard deviation is known. I won't talk much more about that case, because we rarely know the population standard deviation in practice. Next, calculating the confidence interval where the population standard deviation is unknown, which is more like, well, can we say real life? In this case, we first write down the phenomenon we'd like to test. For example, the average weight of a male student at Hallam University. Now, as soon as I wrote this, I already had problems with this statement, because it's a little too vague for what I'm used to. I'd have a couple of questions about it. How are we measuring weight? With clothes and shoes on, or without? What is our standard for weight here? How are we weighing people?
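To make the known-standard-deviation case concrete, here's a short Python sketch. The formula is the one just described, point estimate plus or minus the error bound z·σ/√n; the sample mean of 10, the σ of 12, and the sample size of 16 are hypothetical numbers for illustration, not values from the slide:

```python
from math import sqrt
from statistics import NormalDist

def ci_known_sigma(xbar, sigma, n, confidence):
    """Confidence interval when the population standard deviation is known:
    point estimate +/- z * sigma / sqrt(n)."""
    # two-sided interval: split the leftover probability between the two tails
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ebm = z * sigma / sqrt(n)  # error bound for the mean
    return xbar - ebm, xbar + ebm

# hypothetical numbers: sample mean 10, known sigma 12, n = 16, 90% confidence
low, high = ci_known_sigma(10, 12, 16, 0.90)
```

At 90% confidence the z multiplier is about 1.645, so with these made-up inputs the error bound works out to roughly 4.93, giving an interval of roughly 10 ± 4.93, much like the slide's 10 ± 5.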
Is it just what people say they weigh, or are we actually going to weigh them? Then "male student" seems very straightforward, but what's our standard for male here? Are we just taking people who say they're male? How do we actually find male students? And "student at Hallam University" is also a problem: are we taking any student currently at Hallam, which could potentially include high-school students as well as college students? What about college students from another university taking classes at Hallam? So this whole phenomenon that I would like to test is, in my opinion, stated too vaguely, but hopefully it works for this example. My point, and we will come back to this when we talk about hypothesis testing, is that we're making a statement about some population here, and we could be much, much more specific about the population itself. So: write down the phenomenon you'd like to test, in this case the average weight of a male student at Hallam University, and then select a sample from your chosen population. We need to think carefully about how we select that sample. A truly random sample would be best, but how do we get one? If we stand in front of the student union, we're likely to get students from certain majors or schools. If we stand in front of the library, we're likely to get a different kind of sample. And since we're measuring weight, we also don't want to just stand in front of the gym, because muscle is heavy.
Very muscular people will weigh, on average, more than people who don't work out, so where we stand and where we take our sample will really affect our results. Select a sample from your chosen population, and be careful about how you select it; you might need to choose several different locations on campus, or different times, things like that. Then we calculate the sample mean, just like before: add up all the weights and divide by the number of samples. That's relatively straightforward. Once we have the mean, we calculate the sample standard deviation. Remember, we don't know the population standard deviation here, but we can assume the weights will be more or less normally distributed if the sample is large enough. To get the sample standard deviation, we take the mean we've already calculated, find the variance, which is the average of the squared differences from the mean (we'll talk more about variance later), and then take the square root of the variance. That gives us the sample standard deviation. Now we know the value of one standard deviation, and we can add and subtract it from the calculated mean to mark off one standard deviation, two standard deviations, and so on. Then we select a confidence level: how confident do we want to be in our study? One standard deviation, remember, corresponds to about 68% confidence, two standard deviations to about 95%, and three standard deviations to about 99.7%.
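The steps above can be sketched in Python. The weights below are made-up illustration data, not a real Hallam survey; and strictly speaking a small sample like this would call for a t multiplier, but following the lecture's normal-distribution approach we use a z value:

```python
from math import sqrt
from statistics import mean, stdev, NormalDist

weights = [70, 82, 75, 68, 90, 77, 73, 85, 79, 81]  # hypothetical weights in kg

xbar = mean(weights)   # step 1: the sample mean
s = stdev(weights)     # step 2: sample standard deviation (square root of the variance)

confidence = 0.95      # step 3: chosen confidence level (~two standard deviations)
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.96 for 95%

ebm = z * s / sqrt(len(weights))     # error bound for the mean
interval = (xbar - ebm, xbar + ebm)  # the confidence interval
```

For these particular numbers the sample mean is 78.0 kg and the sample standard deviation is about 6.8 kg, so the 95% interval comes out to roughly 78 ± 4.2 kg.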
If we go back one slide, notice that our confidence level there was 90%, and, I don't know if you can tell, but the interval looks like a little more than one standard deviation, not quite two. That makes complete sense, because two standard deviations corresponds to about 95%, so a 90% confidence level should be a bit less than two standard deviations. At three standard deviations, 99.7%, we can be very confident that the true parameter is going to fall somewhere within three standard deviations of the calculated mean. At two standard deviations, we can be 95% confident that the true value falls within two standard deviations of the calculated mean, and that's actually pretty high. Like I said, most studies I've seen use about two standard deviations, a 95% confidence level. Now, for determining sample size: lots of studies, and lots of people, ask the question, how do I know how many people to survey, or how many samples I need to collect if I'm calculating some value? We have to calculate that from a couple of different inputs. Before, I gave the rough rule of thumb that you need at least 35 samples to be somewhat significant (we'll talk about significance later), but that really depends on the population size.
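You can check those 68/95/99.7 numbers yourself from the standard normal distribution; Python's standard library has a `NormalDist` class that makes this a one-liner per level:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# probability of falling within k standard deviations of the mean
for k in (1, 2, 3):
    coverage = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within {k} sd: {coverage:.4f}")
# prints 0.6827, 0.9545, 0.9973 -- the empirical rule's 68%, 95%, 99.7%
```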
For example, if your population size is five and you get one sample, that one sample is not great, but it's also not bad, because with only five things in the population, one sample has a decent chance of being representative, and you can't get 35 anyway. So, determining the sample size. We need to know, or at least estimate, the population size. In the case of our earlier example: how many male Hallam University students are there? I think it's something close to 5,000, maybe a little less, but we can say our population size for males at Hallam is probably about 5,000. We need to know our margin of error, which defines our confidence interval, the interval in which the true value will lie with some probability, and we need to choose our confidence level. With the confidence level, the lower the value, the more lenient you can be in collecting data, and the narrower the resulting interval ends up being. You'll see an example of this when you use the calculator in a second. For determining the sample size, we can use the formula here. The numerator starts with the z-score squared, where the z-score corresponds to our desired confidence level. Like I said, confidence levels are normally about 95% or 99%; the lowest I've actually seen in a real study was 90%. So let's say we chose 95%; that means our z-score is 1.96.
We'll talk about z-scores a little more later, and the book also describes how to calculate them, but for now we can just use the chart. So z here is our z-score. We take the z-score squared, times the estimated proportion p times 1 minus p, divided by the margin of error e squared; that gives us the top of the formula. Then we divide by 1 plus that same quantity, z squared times p times 1 minus p, divided by the margin of error squared times the population size. That gives us our required sample size. So with the population size, the margin of error, and the confidence level, we can calculate how big a sample we need. There's also a handy little chart, because, like I said, this is used in almost every study that looks at some population; the values have already been calculated for confidence levels of 95% and 99%, and you can get these charts online. Looking at the chart, we have confidence levels of 95% and 99%, and margins of error of 5%, 2.5%, and 1%, which are relatively common margins of error depending on what the study actually is. Let's take the strictest combination: a 99% confidence level, so we're extremely confident that the true value is within our range, and a margin of error of only 1%. If our population size is 100, that means we need a sample size of 99. Why? Because we want to be almost certain that the value falls within our estimate, and our accepted margin of error is very, very low.
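Here's that formula written out in Python. This is the standard proportion-based sample-size formula with a finite population correction, with p taken as 0.5 (the conservative, worst-case proportion estimate); note that published charts round the result slightly differently, so a chart value can differ from this by one:

```python
from math import ceil

def required_sample_size(population, z, margin_of_error, p=0.5):
    """n = [z^2 * p(1-p) / e^2] / [1 + z^2 * p(1-p) / (e^2 * N)]
    where N is the population size and p = 0.5 is the worst case."""
    numerator = z**2 * p * (1 - p) / margin_of_error**2
    return ceil(numerator / (1 + numerator / population))

# z = 1.96 corresponds to a 95% confidence level
n = required_sample_size(100, 1.96, 0.05)  # 80 samples, matching the chart
```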
So we have to get as much data as possible: 99 samples out of 100 to get a 99% confidence level with a 1% margin of error. If we instead accept a 5% margin of error, we "only" have to get 87 samples out of a population of 100, still at 99% confidence. Now look at the 95% confidence level, where we're relaxing a little bit how confident we are, which means we're reducing the range of values that we accept. With a 1% margin of error and a population of 100, we still need 99 samples, because 95% confidence is still quite high and we're accepting almost no error in our measurement. If we accept a 5% margin of error instead, we only need 80 samples. So you can see how the confidence level and the margin of error affect each other. With a population of 100, we need to take relatively many samples; with a larger population, we can get the same confidence and the same margin of error from proportionally fewer samples. For example, at 95% confidence and a 5% margin of error, we need 80 samples, 80% of the entire population, when the population is 100, but only 217 samples when the population is 500. So we're taking well over half for a population of 100 and less than half for a population of 500, and this continues as we keep going up.
If the population gets very, very big, say 1 million, we only need a sample of 384 to get a 5% margin of error at a 95% confidence level. This is why phone surveys can interview what seems like a relatively small number of people when the population is very large. If somebody is doing a telephone survey of the people of Seoul and wants to describe something about that population, they can still get a relatively high confidence level and a relatively low margin of error without interviewing an enormous number of people. Now, 384 is still a difficult number of people to contact and get replies from, but statistically we can be fairly confident of the answer even with such comparatively low numbers. At a 99% confidence level and a 5% margin of error, we need almost double, well, not quite double, the number of samples. And notice that if we want a very low margin of error, we need a very large sample: most surveys sampling entire populations are not looking for a 1% margin of error, so they don't sample anywhere near, in this case, 10,000 people for a population of 1 million. So keep this in mind; these charts can help you a little. A lot of people take 30 samples from a population of 100, and if that's all you're getting, either your margin of error is going to be very large or your confidence level is going to be very low.
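To see that flattening-out effect, here's a quick sketch that evaluates the sample-size formula from the previous slide (p = 0.5, z = 1.96 for 95% confidence, 5% margin of error) for growing populations; as noted before, exact values can differ by one from published charts because of rounding:

```python
from math import ceil

z, e, p = 1.96, 0.05, 0.5          # 95% confidence, 5% margin of error
num = z**2 * p * (1 - p) / e**2    # about 384.16, the infinite-population answer

for population in (100, 500, 5_000, 1_000_000):
    n = ceil(num / (1 + num / population))
    print(population, n)
```

The required sample grows with the population at first, but it can never exceed that ~385 ceiling no matter how large the population gets, which is exactly why a few hundred phone interviews can describe a city of millions.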
If your confidence level is very low, even with a small margin of error, that says you're not very confident about your prediction. If your margin of error is very large, even with a high confidence level, that says you're accepting a huge range of error: you know the value is going to be somewhere in that wide range, so the result has little predictive power. What a proper confidence level and margin of error do is reduce the error to a relatively small, specified range while letting you say: I'm very confident that the true value falls within that range. If you keep increasing the range, you're not really making a prediction anymore; you're just saying the value is somewhere in there. So real studies tend to use confidence levels of about 95% and margins of error of at most 5%. All of this can be a bit complicated, especially figuring out sample size, so I'd like you to go look at the sample size calculator at SurveyMonkey.com. It's a very nice calculator: you put in the size of the population, the confidence level, and the margin of error you want, and it tells you how big your sample actually needs to be; it does all these calculations for you. So that's it. This is confidence intervals, and we will be working with them more this week. Thank you very much.