Hello everyone. Today we're going to be talking about the normal distribution. This presentation isn't very long, but it is quite important, because so many things that we find in nature are normally distributed, so it's important to know about the distribution and how we use it. Some people consider it the most important of all distributions, because this is what we tend to see in nature; this is what tends to happen for most data sets that we have. The things that we measure in nature tend to approach a normal distribution the more times we measure them. So, just to describe it a little bit — you can see it here — the normal distribution is a bell-shaped curve, and you might have seen this distribution used before, for example with grades. Grades in a class tend to be normally distributed, with the majority of people landing somewhere in the center, so it's very probable that someone will land in the center, and then very few people getting the best grades and very few getting the worst grades. If we think about heights: if we measured everyone's height in this class, the distribution of heights would also be approximately normal, with the average somewhere in the center, very few people who we would consider extremely tall, very few who we would consider extremely short, and most people landing somewhere in the middle. So, most things that we measure tend to end up with some sort of bell-shaped curve, or normal distribution, and that's why it is very important to understand it and know how to use it.
We will use it more in this class after this lecture, and we will also definitely use it next semester in advanced statistics. So, the normal distribution has two parameters. First, the mean, which we expect to find in the middle: it's the average, the center of this arc. If we drew a line down the exact center, that's where the mean is located, and the mean is the most average point. Think again about heights: there's going to be an average height for the class, an average grade for the class, and the mean is exactly that average. Then we have the standard deviation, which is a measurement of distance from the mean. Remember last time, when we were talking about distributions, everything underneath the curve adds up to a probability of one: if we add up the probabilities of landing anywhere within this curve, the total is one. So with the standard deviation we can say how far away from the mean some measurement is, and we can then start to ask questions like: what is the likelihood of a value landing at some distance from the mean? We'll talk more about that shortly. So, there are two parameters for normal distributions: the mean and the standard deviation. The curve is symmetrical about a vertical line through the mean: if we put a line down the center, the two halves mirror each other. There are very few values at the high end, very few at the low end, and an average in the center, which is what we tend to see with most types of data. Like other distributions, the total area underneath the curve equals a probability of one.
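As a quick sketch of those two parameters — a hypothetical example assuming heights with a mean of 170 cm and a standard deviation of 10 cm (made-up numbers, using only Python's standard library):

```python
import random
import statistics

# Hypothetical heights: mean 170 cm, standard deviation 10 cm.
# These two numbers are the two parameters of the normal distribution.
random.seed(0)
heights = [random.gauss(mu=170, sigma=10) for _ in range(100_000)]

# A large sample recovers both parameters.
print(round(statistics.mean(heights)))   # close to 170
print(round(statistics.stdev(heights)))  # close to 10
```

The mean locates the center of the bell, and the standard deviation controls how wide the bell spreads around that center.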
So, if we added up all the probabilities — it's a probability distribution — all of the probabilities underneath the curve sum to one. Now think about that for a second. Just like with the other distributions, the probability of falling below some value plus the probability of falling above that value must equal one, because together they cover the whole area under the curve. For the normal distribution, we're almost always dealing with areas underneath the curve, okay? The standard normal distribution is a normal distribution of standardized values called z-scores. A z-score tells us how many standard deviations the value x is above or below the mean, and that's really what we use normal distributions for: to evaluate how far above or below the mean a certain value is. What is the probability that a value is going to fall above or below the mean? What's the probability that a randomly selected grade will be above or below the mean? So z-scores can tell us a lot about where things fall relative to the mean. A z-score also allows comparison of data that is scaled differently. Think about that for a moment. We might have, for example, the weights of people in Korea — most people that I see in Korea are not overweight or obese by any means — and the weights of people in the U.S., where there's a huge weight problem. So, think about the average weight of a Korean and the average weight of an American — and I really should have looked this up.
I'm just assuming that the average weight of Americans is at least a little bit higher. So, whenever we're measuring, we might actually have different scales. Because of the difference in average height, the weight we would consider obese in the U.S. might sit a bit higher on the scale, whereas in Korea the cutoff would be slightly lower, because the average height is a little bit lower — although Korea has nearly caught up, and the average heights are almost the same now. I'm making this more complicated than it should be. The point is that we want to be able to compare those two weights, even though they're scaled slightly differently. So, what's the chance that the next person we measure weighs 200 pounds or more — and I'm sorry, I don't know that in kilograms off the top of my head? Well, on the Korean scale, maybe 200 pounds for an American is equal to something like 175 pounds for a Korean. They're actually the same relative distance from the mean, and that's the whole point. We have two different scales, but we want to know the relative distance of a value from its mean. In this case, we can use z-scores to compare two things that are scaled slightly differently. And how can we do that? Very basically — and if you like, forget the rough example I just gave — the z-score tells us how many standard deviations a value is away from the mean.
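A minimal sketch of that idea, with made-up means and standard deviations for the two populations (these numbers are purely illustrative, not real statistics):

```python
def z_score(x, mean, stdev):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / stdev

# Hypothetical populations with different average weights.
us_z = z_score(200, mean=180, stdev=20)      # a 200 lb American
korean_z = z_score(175, mean=155, stdev=20)  # a 175 lb Korean

# Both come out to 1.0: the same relative distance above their own mean,
# so the two weights are comparable even though the raw values differ.
print(us_z, korean_z)
```

The raw values (200 vs. 175) are on different scales, but the z-scores put both on the same standardized scale.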
So, if we have two differently scaled measurements, we can still compare them, because we know they're both normally distributed. We know the mean of each of them — and the means can be different values, no problem — and then for each of these normal distributions we calculate how many standard deviations a value is from its mean. What that does is, I can have a z-score of, for example, three for both of those distributions. If both values give me a z-score of three, then I know they're actually equivalent: even if they're not the same weights, they are both the same distance away from their mean, and therefore, relative to the mean, they're exactly the same, even if they're different values. So basically, z-scores allow comparison of data that is scaled differently. Think not about the actual values, but about how far we are from the mean itself; we don't really care about the raw values when we're comparing z-scores. The z-score is a measurement in standard deviations relative to the mean: calculate the mean, look at how far away we are from that mean, and we can compare regardless of the value scale. Now, the normal distribution is interesting because we can calculate the probability that something will fall within a certain range, or at a certain value. We want to know, on average, what's the likelihood that something we measure will be a certain value? For that we can use something called the empirical rule. If x is a random variable with a normal distribution with some mean and standard deviation, about 68% of the x-values lie within one standard deviation of the mean.
So, within one standard deviation of the mean — above and below, to the left and to the right. About 68% of the x-values lie within one standard deviation of the mean, about 95% lie within two standard deviations, and about 99.7% lie within three standard deviations. I'll show you an example of this in a second. What this means is that within three standard deviations we have almost all of the values that are likely to occur. So once we understand what our curve is for the normal distribution, we can confidently say that 99.7% of all the values lie within three standard deviations of the mean. That's very good for predicting things. It's also very useful for detecting outliers — things that don't really conform with the data set. There can be a lot of different reasons for outliers, which we'll probably talk about next semester. So notice that almost all the x-values lie within three standard deviations of the mean; very few lie outside of that. Now let's take a look at this. Here we have our normal distribution, split up by standard deviation. Here is our mean, right in the center, and it's symmetrical on both sides. One standard deviation above is this right side here, and it covers quite a large portion of our probability: there's a very high probability that something will fall within this positive one standard deviation. And because it's symmetrical, one standard deviation below the mean has the same probability. So if we combine the two, we have a very high likelihood — well above a 50% chance — of a value falling within one standard deviation of the mean, because there's so much area underneath this part of the curve. Okay, so now let's look at two standard deviations.
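The 68–95–99.7 figures above can be checked directly from the standard normal curve; here's a short sketch using Python's built-in `statistics.NormalDist`:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the curve within k standard deviations of the mean:
# P(-k < Z < k) = cdf(k) - cdf(-k)
for k in (1, 2, 3):
    area = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} sd: {area:.3f}")
# within 1 sd: 0.683
# within 2 sd: 0.954
# within 3 sd: 0.997
```

So the empirical rule's percentages are just the areas under the bell between the standard-deviation lines.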
We already have our one standard deviation. If I go to two standard deviations, I'm adding all of this area as well, on both the negative and positive sides. So, if I could fill this in, I would have all of the area up to two standard deviations above and below the mean. The first part was about 68% of the overall values, and 95% will fall within two standard deviations. And if I add the third — notice the line for three standard deviations is here — only a very small portion of values will fall beyond three standard deviations, so it's 99.7% total when I go out to three. The majority of my values are going to fall somewhere within here. Now, this is very useful, because we now have some properties that we always see with normal distributions. If we know that our data should have a normal distribution, then it must have certain properties: for example, we know that 68% of the values will be within one standard deviation, and 99.7% will be within three standard deviations, and we can do a lot of probability calculations with that. What's interesting is that if we expect a normal distribution and we see something very different, we can start to ask why it's different. Imagine grades, for example. If I look at everyone's grade and I see too many up at the top — because I'm expecting a normal distribution — then I might think either my quiz or test was too easy, or people were cheating, because everyone did well. If I see it skewed to the right, I would first think that it was too easy.
And then if everyone was to the right, I might think there was cheating or something. If it was skewed to the left, that would tell me the exam was basically too hard, because everyone did way lower than I would expect. So the mean should be around a certain area. If the distribution is skewed too far to the left, that tells me it's too hard; if it's skewed too far to the right, that tells me it's too easy, or something else is going on. So, if we understand what properties a normal distribution should have, we can use those probabilities to calculate the chance that somebody will have a value falling within a certain range. Or, if the data deviates from the normal distribution when it should be normal, we can start to make some predictions about why it deviates: why are we not seeing the normal distribution that we would expect? So it's quite useful for a lot of different things, like I said, because this is a very common distribution. Again, think about heights, think about weights, or, let's say, the average salary in an area. For the salaries people make in Chuncheon, there's going to be an average, people on the low end, and people on the high end, and people making a lot of money will be less probable than people making the average salary. So salaries, or the value of property in Chuncheon, things like that will also be roughly normally distributed. We see the normal distribution a lot, so it's good to be able to work with data in a normal distribution. The normal distribution can be used to calculate the probability of a value falling within a range.
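One way to sketch that "does the data have the properties it should have?" idea is to compare the fraction of values within one standard deviation against the 68% we expect. This is simulated data with arbitrary parameters (the seed, sample sizes, and the mean-75 / SD-8 "grades" are all made up for illustration):

```python
import random
import statistics

random.seed(1)

def frac_within_one_sd(data):
    """Fraction of values lying within one standard deviation of the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return sum(mean - sd < x < mean + sd for x in data) / len(data)

# Bell-shaped grades vs. flat (uniform) grades.
normal_scores = [random.gauss(75, 8) for _ in range(50_000)]
uniform_scores = [random.uniform(0, 100) for _ in range(50_000)]

print(round(frac_within_one_sd(normal_scores), 2))   # close to 0.68, as expected
print(round(frac_within_one_sd(uniform_scores), 2))  # close to 0.58: not bell-shaped
```

If the fraction is far from 68%, the data is deviating from the normal shape, and we can start asking why.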
So, just like before, whenever we're dealing with distributions, we want to know the probability of a value falling within some particular range. Here, 68% of the values will be within one standard deviation: we start at the mean and go one standard deviation to the right and to the left, and 68% of the values will fall within there. 95% of the values fall within two standard deviations, and 99.7% within three standard deviations; beyond that we have the outliers, the very unlikely values. Okay. Now, whenever we're trying to calculate the probability of a value falling within a range, here we have an example where the shaded area is the probability P(X < x). Out of these two x's, the lowercase x at the bottom is a particular value, and the capital X, over here on the right-hand side, is the random variable. So the shaded area is P(X < x): here we have the small x, and we're calculating the probability of this shaded purple area — the probability that our value will fall below x. I wish they had used different variables here, but that's it. The white area is P(X > x) = 1 − P(X < x). The way we calculate the probability of the white area is by calculating the probability of the shaded area and subtracting it from one, because the overall probability has to be one. So if we know the probability of the shaded area is, let's say, 0.2, then we know the white area is 0.8 — which isn't quite what the picture shows, but that's the idea. So here we can calculate the probability for some value: what is the probability that the next value we look at lands somewhere within here?
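A small sketch of that calculation, assuming hypothetical exam scores with a mean of 75 and a standard deviation of 8 (made-up parameters):

```python
from statistics import NormalDist

# Hypothetical exam scores: mean 75, standard deviation 8.
scores = NormalDist(mu=75, sigma=8)

p_below = scores.cdf(83)   # shaded area: P(X < 83)
p_above = 1 - p_below      # white area:  P(X > 83) = 1 - P(X < 83)

print(round(p_below, 3))   # 0.841
print(round(p_above, 3))   # 0.159
```

Note that 83 is exactly one standard deviation above the mean (z-score of 1), which is why the area below it is about 0.84: the 0.5 below the mean plus half of the 68% within one standard deviation.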
And it will actually be pretty low, right? Okay. So that's pretty much it for normal distributions. We are going to do a lot more in terms of normal distribution calculations, especially z-scores. Again, like I said, this isn't a very long lecture, because normal distributions are pretty straightforward; we just have to figure out how to use them and manipulate them properly. So that's it for today. Thank you very much.