Hi, I'm Zor. Welcome to Unisor Education. Let's continue the course of advanced mathematics for teenagers and high school students, presented on Unisor.com. That's where I suggest you watch this lecture, because there you have notes and exams for registered students and other goodies, and it's free. Today we'll talk about one particular aspect of statistics related to the normal distribution, that is, random variables distributed according to the normal distribution law. You know what the normal distribution is: it's the bell curve that usually represents this particular distribution. Now, in statistics it's very convenient to deal with a random variable which is distributed normally, because then it's sufficient to evaluate its mean value, the mathematical expectation, and its standard deviation to basically know everything about the distribution. So the distribution is defined by only two parameters, for which we usually use mu for the mean, or mathematical expectation, and sigma for the standard deviation. Knowing these two from whatever our empirical data are, we can basically recreate the entire distribution and find the probability of our random variable falling within any given boundaries, which is basically what a distribution is all about, right? So in statistical research it's very convenient for a distribution to be normal, because it's easier to deal with. And then we always have all these nice rules, like the two-sigma rule for the normal distribution, which says that the probability of the random variable being within two sigma of its mean is about 95%, 0.95. So we know these nice things about normal distributions, and that's why they're easier to deal with. Also, let's not forget that in many, many statistical studies we are dealing with random variables which actually do behave like normally distributed variables. And let me explain why. Let's take one simple natural phenomenon: the body temperature of a healthy person.
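The sigma rules mentioned above can be computed directly, since for a normal random variable the probability of staying within k standard deviations of the mean equals erf(k / sqrt(2)). A minimal sketch in Python, using only the standard library:

```python
# Sketch: the k-sigma rules of the normal distribution.
# For X ~ N(mu, sigma^2), P(|X - mu| <= k*sigma) = erf(k / sqrt(2)),
# independent of mu and sigma.
import math

def prob_within_k_sigma(k):
    """Probability that a normal random variable lies within k
    standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"P(|X - mu| <= {k} sigma) = {prob_within_k_sigma(k):.4f}")
# 1 sigma -> about 0.6827, 2 sigma -> about 0.9545, 3 sigma -> about 0.9973
```

This is where the 68%, 95%, and 99.7% figures used throughout the lecture come from.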
Well, we know that this temperature is somewhere around 98.6 degrees Fahrenheit, or about 37 degrees Celsius, right? That's kind of an average temperature. Mathematically speaking, the temperature of a randomly chosen healthy person is a random variable which has this particular value as its mean, its mathematical expectation, and some relatively narrow spread, a small standard deviation around this mean. Now, why do we consider this particular random variable, the temperature of the human body, and actually many other phenomena in nature, to be distributed normally? Here's a very simple explanation. We should all recall the central limit theorem from probability theory, which in its simplest form states that if you have random variables which are independent and identically distributed, their average behaves more and more normally as n goes to infinity. This is the central limit theorem: the more random variables we mix together, the more the result looks like a normally distributed random variable. I do encourage you to go to the lecture about the central limit theorem on Unisor.com to basically feel that this is true. Now, the central limit theorem is not only true for independent, identically distributed random variables. Actually, it holds for a much broader category of random variables that we mix together: they're not necessarily completely independent, and they're not necessarily identically distributed. The central limit theorem has certain very, very broad conditions under which, the more of these variables we mix together, the closer the distribution is to normal. And now let's go back to the temperature of the body. How is this particular random variable, the temperature of a randomly chosen person, actually formed? Well, let's just think about it. The human body is an extremely complex organism.
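To feel the central limit theorem numerically, here is a small simulation sketch. The sample sizes and the uniform distribution are illustrative choices of mine, not anything specific from the lecture: we average many independent uniform variables and check that the averages obey the two-sigma rule, as a normal variable would.

```python
# Sketch of the central limit theorem: averages of many i.i.d. uniform
# variables look more and more normal. n and the trial count are
# illustrative choices.
import math
import random

random.seed(42)

n = 50              # variables averaged together
trials = 10_000     # number of averages we draw

# For U ~ Uniform(0, 1): mean 1/2, variance 1/12, so the average of n
# of them has mean 1/2 and standard deviation sqrt(1/(12*n)).
mu = 0.5
sigma = math.sqrt(1 / (12 * n))

averages = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

# If the averages are roughly normal, about 95% should fall within 2 sigma.
within_2sigma = sum(abs(x - mu) <= 2 * sigma for x in averages) / trials
print(f"fraction within 2 sigma: {within_2sigma:.3f}")
```

Increasing n makes the histogram of the averages ever closer to the bell curve, even though each individual uniform variable is flat, not bell-shaped at all.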
It has millions, billions, trillions of different cells, each of which is doing something, probably emitting a certain amount of heat. And all of this is basically added up together. All these minuscule amounts of heat emitted by every cell of the body are added together. So the temperature is the result of a mixture of a huge number of random variables, each of them associated with some little cell in our organism. And that's why it's reasonable to assume that the distribution of this random variable, the temperature, is normal, with a certain mean and a certain standard deviation. And we actually do know the mean for a healthy person, these numbers, because that's the result of statistical research across all of humanity. We know the temperatures of healthy people, and they're all somewhere around these numbers. Not exactly: a person is still healthy at, say, 98.7 Fahrenheit, or 36.6 Celsius. But anyway, temperatures are distributed quite normally around these mean values. So suppose you don't know whether the particular random variable you are measuring is distributed normally or not, and you would prefer it to be normal, because then you can use all these nice properties with two sigma, etc. So it's very useful to check that the distribution of the random variable you are sampling really is normal. And now I'll talk about this very quickly. I mean, this has been a very long introduction to a very short story about how you can basically make a judgment on whether a particular random variable is normal or not based on sampled data. I'm suggesting two different ways. There are actually much more precise and much more interesting, but also much more complex, ways to support or reject normality.
What I'm going to show is something really, really simple, and most likely it will be sufficient for rough experiments. Number one is related to the histogram. We all know that a normally distributed random variable has a bell-shaped distribution: its probability density, if you wish. So what can we do? We have a certain amount of data. From this data, we find the maximum and minimum we have obtained. Let's say we have a thousand different data points. We can choose something like 25-30 intervals between the minimum and maximum, and then we just count how many values fall into each interval. And you see, I'm drawing my histogram more or less in accordance with this bell shape. If your histogram looks like this, well, that's a very good indication that the random variable you are measuring does have a normal distribution. If, however, using these samples, you get something like this instead, it doesn't look like a normally distributed random variable. So what properties of a histogram should we actually pay attention to? Well, the first property which indicates normality of the distribution is symmetry: it should be symmetric relative to the middle between the maximum and minimum. So symmetry is important. What else is important? We should have a hump at the top here. So you will probably have one tallest rectangle in the histogram, and then the others descending symmetrically on both sides. That would be a histogram consistent with a normal distribution. It would be very nice if you could also identify the concavity, if you wish, at the top: concave downwards near the center, gradually turning into concave upwards as you go toward the periphery.
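The histogram procedure described above is easy to sketch in code. The sample below is simulated with made-up parameters just for illustration; with real measurements you would pass in your own data.

```python
# Rough histogram-based normality check: split the range between the
# minimum and maximum into equal intervals and count how many values
# land in each. The sample here is simulated; the mean 100 and standard
# deviation 10 are hypothetical.
import random

random.seed(0)
data = [random.gauss(100, 10) for _ in range(1000)]  # hypothetical sample

def histogram_counts(sample, bins=25):
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        i = min(int((x - lo) / width), bins - 1)  # clamp max value into last bin
        counts[i] += 1
    return counts

counts = histogram_counts(data)
print(counts)
# For a bell shape we expect one hump near the middle and roughly
# symmetric tails; a flat or two-humped pattern suggests non-normality.
```

In practice you would plot these counts as bars and eyeball the symmetry, the single hump, and the change of concavity toward the tails, exactly as described above.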
So if you have this symmetry, a hump at the top, and concavity directed upwards on the periphery, then it looks like a bell shape, and there is a very good chance that your distribution might be a normal distribution. So here is my first methodology, if you wish: just look at the histogram; it should look normal. Okay, now what's the next one? The next one is more related to calculations. Let's think about it this way. If we know that a particular distribution is normal, then there is a rule, if you remember: the sigma rule, the two-sigma rule, and the three-sigma rule. What is the sigma rule? Well, the probability that our random variable deviates from its mean by no more than sigma is about 0.68, 68%, for a normal distribution with mean mu and standard deviation sigma. Similarly, the probability that the deviation of the random variable from its mean is less than two sigma is 0.95. That's the most frequently used evaluation. And the probability that our variable deviates from its mean by no more than three sigma is 0.997. So we know these rules. What does this mean? It means that from our sample we can calculate the sample mean, the sample mathematical expectation, and the sample standard deviation, and after calculating these, we can construct the interval from mu minus sigma to mu plus sigma, right? Then we see how many values fall into this interval and compare that with the total number of values. If we don't get something close to 0.68 as the ratio, well, that means the distribution is probably not as normal as we expected. I mean, if it's 0.70, it's probably okay. But if it's 0.1, that's definitely not okay; that's definitely different from what we expect from the normal distribution. So after we have calculated the sample mean and sample standard deviation, we can actually evaluate this. Here is my mu, and here are my sigma, two-sigma, and three-sigma intervals.
This is from mu minus three sigma to mu plus three sigma. This is my interval from mu minus two sigma to mu plus two sigma. This is my mu. And finally, this smaller one is from mu minus sigma to mu plus sigma. So after we have calculated mu and sigma based on the sample we have, we just count how many values out of the entire sample fall into the smaller interval, how many into the bigger interval, and how many into the biggest interval. If our ratios are not around these values, then it's probably not a normally distributed random variable. If they are, then the chances are it is. So if you have a histogram which looks like a bell shape, and you have sample data from which you can calculate the frequencies of falling into these intervals, and these frequencies are around these values, well, then the chances are you can really regard this particular random variable as normally distributed. And for the future, you can actually predict certain probabilities based on this, since you know the distribution is normal with these parameters. Because what's the purpose of probability? The purpose of probability is to predict. And what's the purpose of statistics? The purpose of statistics is to find the probabilities. So you found these two values, the mean and the standard deviation. Using them, you have good evidence that your distribution is normal. And now, since you know the normality of this distribution and its main parameters, mu and sigma, you can basically say what the probability is of a new experiment producing data in a particular interval, whatever the interval is; it may be even from here to here, it doesn't matter which one. Since you know mu and sigma, you can evaluate any probability using the corresponding formulas; in the case of the normal distribution it's an integral, but that doesn't really matter. You can do it, and that's the point.
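The interval-counting check and the prediction step can both be sketched in a few lines. The sample below is simulated body temperatures with hypothetical parameters (mean 36.6, standard deviation 0.4); the interval probability at the end uses the normal CDF expressed through the error function rather than evaluating the integral by hand.

```python
# Sketch of the sigma-rule check: estimate mu and sigma from the sample,
# then compare the observed fractions inside the 1-, 2-, and 3-sigma
# intervals with the theoretical 0.68, 0.95, and 0.997.
# The sample is simulated; 36.6 and 0.4 are hypothetical parameters.
import math
import random

random.seed(1)
sample = [random.gauss(36.6, 0.4) for _ in range(1000)]

n = len(sample)
mu = sum(sample) / n                                            # sample mean
sigma = math.sqrt(sum((x - mu) ** 2 for x in sample) / (n - 1)) # sample std dev

for k, expected in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    fraction = sum(abs(x - mu) <= k * sigma for x in sample) / n
    print(f"{k} sigma: observed {fraction:.3f}, expected about {expected}")

# Once normality looks plausible, any interval probability follows from
# the normal CDF, Phi(z) = (1 + erf(z / sqrt(2))) / 2:
def normal_interval_prob(a, b, mu, sigma):
    phi = lambda z: (1 + math.erf(z / math.sqrt(2))) / 2
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

print(normal_interval_prob(36.0, 37.0, mu, sigma))  # P(36.0 < X < 37.0)
```

The last function is exactly the prediction described above: once you trust mu and sigma and the normality assumption, you can evaluate the probability of any interval without doing the integral yourself.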
So you can predict the probability of your random variable falling within a certain range. And that basically concludes this problem of identifying a normal distribution. Again, there are much more precise, much more scientific ways to determine the normality of a distribution. And don't forget that you cannot really determine normality with 100% certainty. As with everything else in statistics, whatever you state about any random variable has a certain probability of being true; it has a certain certainty level. So there are very important approaches, methodologies, which can tell you that this particular distribution is normal, and this is the certainty level of that particular statement. I'm not going into those details; they are really very, very advanced, and they're probably only for professionals who really specialize in this area. But this would probably be the most important and most precise characterization you can give to a random variable observed through sampled data: that this particular random variable is normal with such-and-such parameters, and the certainty level of this statement is not 100%, but significantly less. All right, so that's it for today. Thank you very much. Try to go to Unisor.com again and go through the lecture and the notes. It's always helpful. There are a couple of pictures of normal and non-normal distributions. And that's it for today. Thank you very much and good luck.