Hi, I'm Zor. Welcome to a new Zor education. We continue talking about statistical regression, specifically linear regression, and I would like to consider one very practical problem. It just came to my mind; that doesn't mean it is a real problem that people are facing. I needed some statistics, so I found some and decided that maybe there is a correlation that might lead to a linear regression. Let me tell you in advance: it was an unsuccessful attempt. The linear regression can be constructed, but the validity of the resulting formula, which relates the two variables I found and will be talking about, is basically zero because the error is too big. However, I think it has some educational value in any case, so I'm going to present it to you.
I would suggest you go to Unisor.com to watch this lecture and also to read the notes that come with every lecture. In this particular case the notes explain the problem in more detail, so that's probably very useful. In addition, Unisor.com allows you to organize the educational process by enrolling in classes, taking exams, etc. And it's free.
So let's go into the problem, and let me explain what statistics I found and how I'm going to use these data. What I found was the total number of miles traveled in the United States during each year, from sometime in the 1920s up until 2014, and how many accident fatalities happened during the same years. My initial thought was that, obviously, the automobile industry was developing. In the 1920s there was a very small number of cars, and the mileage they covered was very small. Well, this number is actually for an entire decade.
Now, as the industry developed, the number of cars and the number of miles traveled obviously increased, which, in my view, was supposed to increase the number of fatalities on the roads. And as we see, these are increasing numbers. First of all, I combined every 10 years together and took the sum, starting from 1925 to 1934. The raw data and the reference to its source are in the notes for this lecture on the website. So I have combined the raw data into decades, and for each 10 years I have the number of miles traveled and the number of fatalities, which is actually a lot. As we see, it's increasing, and my thought was that some kind of linear regression might be observed here.
Let me tell you again, right at the beginning, that this is not entirely true. As we see, the miles are increasing. The fatalities are increasing up to the decade from 1965 to 1974, and starting from 1975 they are decreasing. Obviously, this distorts the linear relationship between the two numbers. Why are fatalities decreasing while the number of miles is increasing? I think it's related to certain safety devices, like seatbelts and airbags, which probably started to appear around 1975. Obviously, this decreases the number of fatalities. And as more and more new cars were equipped with these devices, the number of fatalities, as you see, went down. So that distorts my linear regression model from a logical standpoint.
However, being a stubborn person, I decided: okay, let's investigate whether there is any kind of linear dependency, and what the error is. If the error is not very big, then it still makes sense. Again, I felt in advance that it doesn't make much sense and that the error would be very large, primarily because the fatalities go down in the later decades. I might suggest, as an exercise, that you do this only on the data from 1925 to 1974. That's where you might actually observe a linear regression with a relatively small error.
But I have decided to do it on the entire data set, again, for educational purposes. What I am trying to establish is a dependency that looks like this: y = a·x + b + ε. Here y is a random variable, the number of fatalities; x is the mileage covered; and ε is the error, the deviation of y from a·x + b. My purpose is to establish a and b based on the data I have, and then, using these, find the error and analyze it statistically.
When I was deriving all these formulas in the previous lectures, I assumed that ε is a normally distributed random variable with, and that's very important, a mathematical expectation of zero. That's why we have this b as a constant. It also has some standard deviation, which we have to establish after a and b are established. Since we know the real, observed data for x and y, once we know a and b we can compute the data for ε and find the sample deviation. All right, so that's the plan.
Now, I refer you back to the previous lectures, where I derived the formula that I'm going to use right now: a is the sum of X_i·Y_i divided by the sum of X_i squared, where i runs from 1 to n in both sums. What are X_i and Y_i in this particular case? They are not the raw observed values x and y. They are the observed values centered around their mathematical expectation. So capital X_i is equal to lowercase x_i minus the average of all x's, and capital Y_i is equal to lowercase y_i minus the observed average of all y's. Here x with a bar on top is the average of the x's, x_i is each individual value, and the difference is a new value, a new random variable with mathematical expectation zero. The same goes for y. These two numbers on the board are the averages: this is x with the bar and this is y with the bar. By subtracting them, we get the new variables capital X_i and capital Y_i. And obviously, the average of the x's minus x-bar is equal to zero, and the average of the y's minus y-bar is also equal to zero.
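Written out, the model and the slope formula described above look as follows. The symbols are the ones used in the spoken description (lowercase for observed values, capital letters for centered values, a bar for an average); only the typesetting is added here.

```latex
\begin{aligned}
y &= a\,x + b + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0,\\[4pt]
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,\\[4pt]
X_i &= x_i - \bar{x}, \qquad Y_i = y_i - \bar{y},\\[4pt]
a &= \frac{\sum_{i=1}^{n} X_i\,Y_i}{\sum_{i=1}^{n} X_i^{2}}.
\end{aligned}
```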
That's why we have centralized the values. Now, with these centralized values, I can calculate a, and this is my result. Once I know a, I can calculate b. If I take the average of both sides of the equation, we have agreed that the average of ε is zero: ε is an error that has a mathematical expectation of zero and some standard deviation. That's exactly why we put in the constant b. So now we can find b, which is equal to the average of the y's minus a times the average of the x's, right? I have calculated it, and this is b. So this, minus three point something times this, is equal to this.
Now I know a and b, so I can get the values of ε_i. Each ε_i is equal to y_i minus a·x_i minus b. So it's this, minus three point whatever times this, minus 343,000 whatever. If my calculations are correct, then the sum of all ε_i should be approximately equal to zero, right? Because that's how we calculated a and b from the very beginning. And it is. I have actually calculated this: I added up all the ε_i, which I did on a spreadsheet, and the result is something like minus one. That is a very, very small number relative to these big numbers.
But the problem is this. When I calculated the variance of ε from these values (each ε_i minus zero, squared, so just ε_i squared) and took the square root of it, it was 68,508. Is that good or bad? Well, it's terrible. Why? Because we are trying to evaluate the value of y. The values of y_i, as I have explained, are these guys, right? Average is 3,389,000 something. Now, if my standard deviation is 68,000, suppose I would like to predict the value with a two-sigma interval to the left and to the right. This is my average, right? Two sigma to the left is minus something like 135 thousand. So it goes down to, what, two fifty or something like this, thousand.
And 130 thousand to the right, so that would be more than 400 thousand. So I have a range from something like 250 thousand to 400 thousand. This is obviously too big a range. The whole range is, what, about half of the average. That means my estimate is something like fifty percent off, and that's definitely wrong. If the deviation were a couple of percentage points relative to the average, it would be okay. You have a certain average, and then a certain deviation from this average which is no bigger than, let's say, five percent. Then it makes some sense. But if the deviation is like fifty percent, that's absolutely not good.
In any case, my point was that this is just an exercise in how to do the calculations. Let me repeat what I did. First, I centralized all my x values using their average: I subtracted the average from each value, so some of them are negative and some are positive. Then I did exactly the same with y, my dependent variable. So this is the independent variable, mileage, and this is the dependent variable, and again I centralized it. Then I used the formula, which I have already wiped from the board, that gives me the coefficient a. Knowing a, I can calculate b using the averages, right? Because if you average y, a·x, and ε, then ε averages to zero, and you have an equation that you can solve for b, knowing a and the averages of x and y. Knowing a and b, you can now calculate the real values of your error. Your regression, a·x + b, with a and b known, approximates y for each value of x. By taking the difference between y and a·x + b, you get sample data for ε, your error. And using these data for the error, you can calculate its average and standard deviation.
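The steps repeated above (center, compute a, compute b, compute the errors, look at their standard deviation) can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the numbers are made up and are not the mileage and fatality data, the variance of the errors is divided by n (the lecture does not say whether n or n − 1 was used), and the "total width of the two-sigma band relative to the average" is one possible reading of the rule of thumb given above.

```python
import math

def fit_line(xs, ys):
    """Return slope a and intercept b of y = a*x + b using centered values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    X = [x - x_bar for x in xs]          # centered x values
    Y = [y - y_bar for y in ys]          # centered y values
    a = sum(Xi * Yi for Xi, Yi in zip(X, Y)) / sum(Xi * Xi for Xi in X)
    b = y_bar - a * x_bar                # because the error averages to zero
    return a, b

def errors(xs, ys, a, b):
    """Sample values of the error: eps_i = y_i - a*x_i - b."""
    return [y - a * x - b for x, y in zip(xs, ys)]

def error_sigma(eps):
    """Standard deviation of the errors around zero (assumption: divide by n)."""
    return math.sqrt(sum(e * e for e in eps) / len(eps))

def two_sigma_relative_width(y_bar, sigma):
    """Total width of the band y_bar +/- 2*sigma as a fraction of y_bar."""
    return 4 * sigma / y_bar

# Made-up illustration data, NOT the lecture's statistics.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
eps = errors(xs, ys, a, b)
sigma = error_sigma(eps)
y_bar = sum(ys) / len(ys)
rel_width = two_sigma_relative_width(y_bar, sigma)

print(f"a = {a:.4f}, b = {b:.4f}")
print(f"sum of errors = {sum(eps):.2e}, sigma = {sigma:.4f}")
print(f"two-sigma band width relative to the average = {rel_width:.1%}")
```

With these made-up numbers the sum of the errors comes out at essentially zero, as it should, and the two-sigma band is roughly 10% of the average in total width. A result like the 50% described above would signal that the linear model is not usable.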
The average, because of the way we did the calculations, should be close to zero, and it was. But then you have to evaluate how big your deviation is. So that's the short plan of how to attack this problem of linear regression between two variables.
Now, as you understand, we had a problem here. The problem is that I did not take into account technological progress, the safety devices that have been introduced into cars. So maybe I should use, instead of one independent variable x, three variables, adding things like the number of airbag-equipped cars and the number of seatbelt-equipped cars. Well, I'm not sure about seatbelt-equipped cars; they were always equipped. But at some point there was legislation in the United States that made wearing seatbelts mandatory if you are driving. So maybe I should somehow take that time into account as well. If I take into account more independent variables, definitely the number of airbags installed, then my approximation, also linear, would be, let's say, a1·x1 + a2·x2, plus maybe a3·x3 if seatbelts are involved, plus some constant b. In this case my approximation might be better and my error smaller.
But this multivariate regression analysis is much more complex, and I really don't want to complicate your life even more than it is right now. This is something people study when they specialize in statistics. For you, again, my purpose was just to explain the simplest case, when one dependent and one independent variable are involved in a supposedly observed relationship, and how to calculate such a relationship. And, again, this is a hypothesis. There is a hypothesis, and we checked it by analyzing the error. The hypothesis may be wrong from the very beginning, so maybe the equation is supposed to be somehow different.
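The multi-variable form a1·x1 + a2·x2 + b mentioned above can be sketched as well. The lecture does not derive it, so what follows is a generic least-squares fit through the normal equations, a standard technique swapped in for illustration. The function names are hypothetical, and the data are made up (generated exactly from y = 2·x1 − 3·x2 + 5), not airbag or seatbelt statistics.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        # Pick the row with the largest pivot to reduce rounding problems.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def fit_multilinear(rows, ys):
    """Fit y = a1*x1 + ... + ak*xk + b; returns [a1, ..., ak, b]."""
    design = [list(r) + [1.0] for r in rows]       # last column is for b
    k = len(design[0])
    AtA = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    Aty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(AtA, Aty)

# Made-up data generated exactly from y = 2*x1 - 3*x2 + 5.
rows = [[1, 0], [0, 1], [1, 1], [2, 1], [3, 5]]
ys = [2 * x1 - 3 * x2 + 5 for x1, x2 in rows]

a1, a2, b = fit_multilinear(rows, ys)
print(f"a1 = {a1:.4f}, a2 = {a2:.4f}, b = {b:.4f}")
```

Because the made-up data contain no error at all, the fit recovers the coefficients used to generate them. With real data the same error analysis as in the one-variable case would follow.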
Maybe it's x squared, or 2 to the power of x, or something like this. Who knows what the real dependency is. Linear is just the easiest one, and it makes sense to check whether it works, right? So we do the calculations and we check the standard deviation of the error. If it's reasonable, then you can use the formula.
Now, how can you use it? Most likely it's used for forecasting. You see how the mileage is increasing. It obviously depends on the number of cars manufactured, and that number is planned in advance. So you understand how many cars will be sold, and that's how you can estimate, approximately, what the total mileage will be for the next year. That would allow you to calculate, in this particular case, the unfortunate number of accidents with fatalities. In many other cases these types of calculations have much more practical usage, but the procedure is exactly the same.
I might recommend that you go to the raw data for each year instead of summarizing by decade, but only for the period up to 1975. Then you might find a more precise linear regression between the number of fatalities and the number of miles traveled during the year, because at that point we did not have so much safety equipment installed in cars, primarily airbags. Well, okay, that's it for today. If you engage in this regression analysis for the early days of the automobile industry, up until 1975, let's say, I'll be more than happy if you send it to me, and I will be more than happy to put it on the website with proper attribution. That's it. Thanks very much and good luck.
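As a follow-up to the remark that the real dependency might be x squared or 2 to the power of x rather than x itself: one simple way to try such forms, sketched here as an assumption rather than as the lecture's method, is to transform x first and then run the same one-variable procedure, comparing the standard deviation of the error across the candidates. The function names are hypothetical, and the data are made up (generated exactly from y = 3·x² + 1).

```python
import math

def fit_line(ts, ys):
    """Slope a and intercept b of y = a*t + b using centered values."""
    n = len(ts)
    t_bar = sum(ts) / n
    y_bar = sum(ys) / n
    T = [t - t_bar for t in ts]
    Y = [y - y_bar for y in ys]
    a = sum(Ti * Yi for Ti, Yi in zip(T, Y)) / sum(Ti * Ti for Ti in T)
    return a, y_bar - a * t_bar

def error_sigma_for(xs, ys, transform):
    """Fit y = a*transform(x) + b and return the error's standard deviation.

    Assumption: the variance is computed around zero and divided by n.
    """
    ts = [transform(x) for x in xs]
    a, b = fit_line(ts, ys)
    eps = [y - a * t - b for t, y in zip(ts, ys)]
    return math.sqrt(sum(e * e for e in eps) / len(eps))

# Made-up data generated exactly from y = 3*x^2 + 1.
xs = [1, 2, 3, 4, 5]
ys = [3 * x * x + 1 for x in xs]

candidates = {
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "2^x": lambda x: 2.0 ** x,
}
sigmas = {name: error_sigma_for(xs, ys, f) for name, f in candidates.items()}
best = min(sigmas, key=sigmas.get)

for name, s in sigmas.items():
    print(f"{name:>4}: sigma of error = {s:.4f}")
print("smallest error:", best)
```

Here the quadratic form wins because the data were built that way. On real data none of the candidates may give a reasonable error, which is itself a useful answer.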