Hi, I'm Zor. Welcome to a new Zor education lecture. Let's continue this course of advanced mathematics for teenagers and high school students presented on unizor.com. That's where I suggest you watch this lecture from, because there are very important, very detailed notes for each lecture. Plus, if you are a registered student, you can take exams, and it's all free.

Today's lecture is about a particular example of a test for the distribution of a certain random variable to be normal. First of all, why is it important for a variable to be normal, and in general, why is it important to know its distribution? Let me start with the second part. The theory of probabilities allows us to predict, with a certain level of certainty, the values of random variables that we are not 100% sure about. So if we know the distribution of probabilities, we can say that with a certain probability level the value will fall within a certain range, right? That's the general purpose.

Why is normality of the distribution important? Well, the normal distribution is very well researched. There are many very nice theorems about normal distributions, which allow us to predict the future results of certain experiments if we know beforehand that these distributions are normal. There are only two parameters we need to define a normal distribution: the mathematical expectation, or mean, and the variance (or the square root of the variance, which is the standard deviation). These two parameters are sufficient to completely define all the probabilities related to the random variable. What's also very important is that a lot of random variables we deal with in real life are actually normal. The reason for this is certain natural phenomena, but mathematicians would love to say that the reason is the central limit theorem. Well, the central limit theorem just reflects the way nature works.
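As a small sketch of the point that two parameters completely determine the distribution, here is the normal density written out in code; `normal_pdf` is a hypothetical helper name, not something from the lecture.

```python
import math

# The normal density is fully determined by just two parameters:
# the mean mu and the standard deviation sigma.
def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# The density peaks at the mean, at height 1 / (sigma * sqrt(2*pi)).
print(round(normal_pdf(0.0, 0.0, 1.0), 4))
```

Once mu and sigma are known, every probability about the variable can be computed from this one formula.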
And it basically says that if you have a lot of different factors contributing to the value of a random variable, then the more factors contribute, the closer the distribution will be to normal. And the conditions are very broad. So it's no wonder that lots and lots of different random variables we meet in real life are almost normally distributed. Nothing is absolute, obviously, but it's as close as it's theoretically possible to prove; let's put it this way.

One of the examples I presented to you in another lecture was the temperature of a normal, healthy human body. The temperature is not always exact; there is a certain range, and within this range we have a normally distributed random variable with its maximum at a certain point, like 37 Celsius or 98.6 Fahrenheit. And there is a bell-shaped curve around this mean value, which reflects how all the different normal temperatures are distributed.

Now, in this particular lecture, I would like to touch on yet another natural phenomenon, and I will try to determine whether its distribution is normal or not. So that's the purpose: a test for normality, if you wish. Normality not in the medical sense, not in the psychiatric sense, just normality in the theory-of-probability sense.

All right. So what exactly am I going to research? Well, I wanted to research the sea level at a certain point on Earth, and I needed some statistics. So I found a website; in my notes on unizor.com I present information about the website, which is the source. It's the University of Hawaii, and in particular, they have accumulated hourly sea levels at many different points.
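The central limit theorem claim above can be sketched in a few lines: each observation is the sum of many small independent "factors", and the sum's distribution comes out close to normal. The number of factors and the sample size here are assumptions for illustration only.

```python
import random

# Each observation is a sum of many small independent factors,
# here uniform on (-1, 1); the sum is approximately normal.
random.seed(1)
N_FACTORS = 50      # contributing factors per observation (assumed)
N_OBS = 10000       # number of observations (assumed)

sums = [sum(random.uniform(-1, 1) for _ in range(N_FACTORS))
        for _ in range(N_OBS)]

mean = sum(sums) / N_OBS
var = sum((s - mean) ** 2 for s in sums) / N_OBS
# Theory: mean 0, variance N_FACTORS * Var(U(-1,1)) = 50 * (1/3) ~ 16.7
print(round(mean, 2), round(var, 1))
```

Plotting a histogram of `sums` would show the same bell shape discussed below, even though each individual factor is flat, not bell-shaped.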
I chose the point which is Midway Island in the Pacific Ocean, and the data they accumulated and I took from that site was the hourly sea level during three years: 2012, 2013 and 2014. So altogether I have hourly readings, meaning every hour of every day of the year, three years in a row, which amounts to 26,304 entries, to be exact. That's a lot of data.

Now, my approach to suggesting whether this particular distribution of the sea level is or is not normal is based on two criteria, which I presented in the previous lecture about methods of testing for normality. The first is the histogram: it should be bell-shaped. The second is to check the one-sigma, two-sigma and three-sigma rules. So that's what I did.

I got the raw data from this website of the University of Hawaii: every hour, on the hour, 26,000-plus different measurements. I used a spreadsheet to construct the histogram of these data and to do some calculations as well. So my first step in the research is to get some general sample statistics from this whole set: minimum, maximum, mean, sample variance and sample standard deviation. Here is what happened. My range, from minimum to maximum, was from 660 to 1859. I think it's millimeters above a certain reference level; I don't remember exactly what it is, and it doesn't really matter. For us, they are just numbers, right? So the sea level measurements ran from a minimum of 660 to a maximum of 1859. The mean value was 1125, and the standard deviation was 142. And as I said, the total number of points was 26,304. So I took the raw data and calculated the minimum, maximum, mean and standard deviation using a spreadsheet. By the way, I'm not sure what your particular preference is for the tools you use to calculate these things.
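The first step, the summary statistics, can be sketched outside a spreadsheet too. Since the real Hawaii file isn't reproduced here, the sketch generates a synthetic stand-in sample with the lecture's parameters (26,304 readings, mean 1125 mm, standard deviation 142 mm); with the real data you would just load `levels` from the file instead.

```python
import random
import statistics

# Synthetic stand-in for the 26,304 hourly sea-level readings;
# the mean (1125) and standard deviation (142) match the lecture.
random.seed(2012)
levels = [random.gauss(1125, 142) for _ in range(26304)]

summary = {
    "min": min(levels),
    "max": max(levels),
    "mean": statistics.fmean(levels),
    "stdev": statistics.stdev(levels),  # sample standard deviation
}
print({k: round(v) for k, v in summary.items()})
```

These four numbers are all the raw material the normality checks below need.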
If you have 26,000-something data elements, I don't think you can avoid using a spreadsheet. And I understand that some of you might not be used to this. So just take what I'm saying for granted; trust me, that's what I did. However, if you would like to do it yourself, that would be a great exercise. Anyway, that's my first step.

Now, my second step is to build the histogram. How do we build a histogram? If you remember, the best way to approach this is to take the range from minimum to maximum, divide it into a certain number of bins, and count how many data elements fall into each bin. So what I did was this: I had 26,000-plus data elements, and I decided to use 30 bins, which gives a reasonably representative histogram. Now, with 30 bins, the range from 660 to 1859 is about 1200, plus or minus; divided by 30, the width of each interval is 40. So the bins run from 660 to 700, from 700 to 740, and so on, up to 1860, which covers my maximum.

Now I count how many measurements fall into each bin. The counts looked something like this: one or two in the first bins, then increasing, with the largest count somewhere in the middle at 2000-something, and then diminishing again to one or two at the end. Which means that if I build a chart based on these counts and intervals, I can actually see my histogram, because the histogram shows the frequencies. By the way, I used the FREQUENCY function in the spreadsheet to calculate these counts: as soon as I have the intervals, I can calculate the frequencies, and once I have the frequencies, I can build the chart. And the chart ran from my minimum of 660 to my maximum of 1859.
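The binning step above (what the spreadsheet FREQUENCY function does) can be sketched directly; again a synthetic stand-in sample plays the role of the real file, so the exact counts are illustrative, not the lecture's.

```python
import random

# Synthetic stand-in for the hourly readings.
random.seed(2013)
levels = [random.gauss(1125, 142) for _ in range(26304)]

# Split [min, max] into 30 equal bins and count entries per bin.
N_BINS = 30
lo, hi = min(levels), max(levels)
width = (hi - lo) / N_BINS

counts = [0] * N_BINS
for x in levels:
    i = min(int((x - lo) / width), N_BINS - 1)  # clamp the maximum into the last bin
    counts[i] += 1

# A bell shape shows up as tiny counts at the ends, a peak in the middle.
print(counts[0], max(counts), counts[-1])
```

Charting `counts` as bars against the bin boundaries reproduces the histogram described in the lecture.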
And my chart really looked like a bell. Obviously it was drawn as bars, but it clearly resembles the bell shape. So my histogram test was positive: the distribution really looks like a normal distribution.

My next test was this. For a normally distributed variable, the probability of deviating from the mean value by no more than 3 sigma is 0.997, which is 99.7%. So within 3 sigma of my mathematical expectation, the random variable will fall in 997 cases out of each 1000, on average. Now, in this particular case, 2 sigma is 284 (or 285, because there was some rounding), and 3 sigma is 426 or 427, including rounding. My mean value is 1125, so I look at the interval from the mean minus 3 sigma to the mean plus 3 sigma. If I subtract, I get 1125 minus 427, which is 698; and plus 3 sigma would be 1552. This is a very, very wide interval, considering my actual minimum and maximum: the lower end, 698, is very close to the minimum of 660, and the upper end, 1552, maybe doesn't seem so close to the maximum of 1859. But as far as frequencies are concerned, there were really very, very few measurements above 1552; most of the data lies well inside. So I calculated how many of my 26,304 measurements fall within this interval, and the number was 26,223. You see? Almost everything went within this interval. There were cases above it and below it, because 660 and 1859 are the real minimum and maximum, but very, very few: only about 80 measurements were outside. And 26,223 relative to 26,304 is 0.9969.
As you see, it's very close to the theoretical value, as it's supposed to be. Okay, three sigma is fine; let's check the two sigma. For a normally distributed random variable, the two-sigma probability is about 0.95 in theory. Now what do I have? My mean minus two sigma would be 1125 minus 285, which is 840, and plus two sigma would be 1410. This is a narrower interval, only two sigma around my mean value, so fewer data elements should fall into it, because some are outside. The number of data elements within this interval was 25,265, and if you divide it by the total number of data elements, you get 0.9605. So my theoretical number is 0.95 and my actual is 0.9605, which is relatively close. We are okay on this one.

And finally, a single sigma, which has the theoretical value 0.68. In this case the interval is even narrower: 1125 minus 142 is 983, and the plus would be 1267. So it's an even narrower range around my mathematical expectation. And in this case, the number of data elements that fall within this range was 17,651, which relative to 26,304 is 0.6710. Again, the theoretical value is 0.68 and my sample value is relatively close; a difference in the hundredths is really quite normal.

Which means that all four tests passed. The first one is that my histogram resembles the bell-shaped curve of the normal distribution, with a single maximum: there is a hump, concave down in the middle and concave up on the sides. And then my three tests related to the standard deviation: single, double and triple standard deviation, supposed to give 0.68, 0.95 and 0.997.
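All three sigma checks can be run in one loop; on a synthetic stand-in sample (the real file isn't reproduced here) the fractions come out close to the theoretical 0.683, 0.954 and 0.997, just as they did for the real sea-level data.

```python
import random

# Synthetic stand-in for the hourly readings.
random.seed(2014)
levels = [random.gauss(1125, 142) for _ in range(26304)]

n = len(levels)
mu = sum(levels) / n
sigma = (sum((x - mu) ** 2 for x in levels) / (n - 1)) ** 0.5  # sample st. dev.

# Fraction of observations within k*sigma of the mean, for k = 1, 2, 3.
fractions = {}
for k in (1, 2, 3):
    inside = sum(1 for x in levels if abs(x - mu) <= k * sigma)
    fractions[k] = inside / n
    print(k, round(fractions[k], 4))
```

A sample that badly misses any of these three fractions would be evidence against normality; matching all three, together with the bell-shaped histogram, is the informal test the lecture uses.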
And they all actually gave more or less the same numbers, approximately. Which means this is a very good indication that the sea level is a random variable which has this mathematical expectation and this standard deviation, and is normally distributed.

Why do we need this? Well, for instance, using these data, we can predict the probability of a flood, depending on where our structure or a city street level is located. So we can definitely obtain a certain probability. And having the probability, we can evaluate our risk, so to speak. We take a certain amount of damage, for instance the damage done by a flood, multiply it by the probability, and we get something like the mathematical expectation of the damage. And then we can somehow handle this situation, for example allocate a certain amount of funds to build a wall, or whatever it is. So these calculations are very important: number one, to determine the distribution of the random variable, and in this case to convince ourselves that it is a normally distributed random variable. And since we know that it is normally distributed, the determined mathematical expectation and standard deviation allow us to make certain predictions about the future behavior of this variable, that is, what the sea level will be in the future.

Now, of course, things are changing. The wind is a contributing factor; the ocean currents are contributing factors; rain or no rain, or whatever else. There are many different factors which influence the level of the water, the sea level in this particular case. And these factors are extremely numerous: the whole climate of the Earth is a major contributing factor, and think how many factors really make up the climate. Even a volcano eruption might actually change something.
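The risk calculation described above (probability of exceeding a threshold, times the damage) can be sketched with the lecture's fitted parameters; the flood threshold and damage figure here are purely hypothetical numbers chosen for illustration.

```python
import math

# Fitted parameters from the lecture's sample (mm).
MU, SIGMA = 1125.0, 142.0
THRESHOLD = 1700.0    # hypothetical flood level, mm (assumed)
DAMAGE = 1_000_000.0  # hypothetical cost of one flood event (assumed)

# P(X > threshold) for X ~ N(mu, sigma^2), via the complementary
# error function: P(X > t) = 0.5 * erfc((t - mu) / (sigma * sqrt(2))).
z = (THRESHOLD - MU) / SIGMA
p_flood = 0.5 * math.erfc(z / math.sqrt(2))

# Mathematical expectation of the damage, as described in the lecture.
expected_damage = p_flood * DAMAGE
print(f"P(level > {THRESHOLD:.0f}) = {p_flood:.2e}, "
      f"expected damage = {expected_damage:.2f}")
```

With a threshold about four sigma above the mean, the per-reading exceedance probability is tiny, which is exactly the kind of quantitative statement that knowing the distribution makes possible.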
So there is a huge number of factors, and precisely because of that, and because of the central limit theorem, we are not surprised that many natural random variables occurring in real life are actually normally distributed. And this is just one of the examples: a real example, with real data, and it really fits a normal distribution with certain parameters quite well.

All right. What I think will be very interesting is to read the notes for this lecture. There are references to the raw data, a picture which presents the histogram, and another spreadsheet which presents the calculations based on these 30 bins and frequencies, everything that goes into the histogram. Maybe you can first read through what I have calculated, and then try to do something similar yourself as very good practice, with something else; it doesn't really matter what. You can take, for instance, data from the same site: there are many, many points, not only Midway Island where I took mine from. That would be very, very good. I think it's very educational; it will really make you feel what statistical research is about. That's it for today. Thank you very much and good luck.